great_expectations.datasource.data_connector.util

Module Contents

Functions

batch_definition_matches_batch_request(batch_definition: BatchDefinition, batch_request: BatchRequestBase)

map_data_reference_string_to_batch_definition_list_using_regex(datasource_name: str, data_connector_name: str, data_reference: str, regex_pattern: str, group_names: List[str], data_asset_name: Optional[str] = None)

convert_data_reference_string_to_batch_identifiers_using_regex(data_reference: str, regex_pattern: str, group_names: List[str])

_determine_batch_identifiers_using_named_groups(match_dict: dict, group_names: List[str])

map_batch_definition_to_data_reference_string_using_regex(batch_definition: BatchDefinition, regex_pattern: str, group_names: List[str])

convert_batch_identifiers_to_data_reference_string_using_regex(batch_identifiers: IDDict, regex_pattern: str, group_names: List[str], data_asset_name: Optional[str] = None)

_invert_regex_to_data_reference_template(regex_pattern: str, group_names: List[str])

Create a string template based on a regex and corresponding list of group names.

normalize_directory_path(dir_path: str, root_directory_path: Optional[str] = None)

get_filesystem_one_level_directory_glob_path_list(base_directory_path: str, glob_directive: str)

List file names, relative to base_directory_path one level deep, with expansion specified by glob_directive.

list_azure_keys(azure, query_options: dict, recursive: bool = False)

Utilizes the Azure Blob Storage connection object to retrieve blob names based on user-provided criteria.

list_gcs_keys(gcs, query_options: dict, recursive: bool = False)

Utilizes the GCS connection object to retrieve blob names based on user-provided criteria.

list_s3_keys(s3, query_options: dict, iterator_dict: dict, recursive: bool = False)

For InferredAssetS3DataConnector, we take bucket and prefix and search for files using RegEx at and below the level specified by that bucket and prefix.

build_sorters_from_config(config_list: List[Dict[str, Any]])

_build_sorter_from_config(sorter_config: Dict[str, Any])

Build a Sorter using the provided configuration and return the newly-built Sorter.

great_expectations.datasource.data_connector.util.logger
great_expectations.datasource.data_connector.util.BlobPrefix
great_expectations.datasource.data_connector.util.storage
great_expectations.datasource.data_connector.util.pyspark
great_expectations.datasource.data_connector.util.DEFAULT_DATA_ASSET_NAME :str = DEFAULT_ASSET_NAME
great_expectations.datasource.data_connector.util.batch_definition_matches_batch_request(batch_definition: BatchDefinition, batch_request: BatchRequestBase) → bool
great_expectations.datasource.data_connector.util.map_data_reference_string_to_batch_definition_list_using_regex(datasource_name: str, data_connector_name: str, data_reference: str, regex_pattern: str, group_names: List[str], data_asset_name: Optional[str] = None) → Optional[List[BatchDefinition]]
great_expectations.datasource.data_connector.util.convert_data_reference_string_to_batch_identifiers_using_regex(data_reference: str, regex_pattern: str, group_names: List[str]) → Optional[Tuple[str, IDDict]]
great_expectations.datasource.data_connector.util._determine_batch_identifiers_using_named_groups(match_dict: dict, group_names: List[str]) → IDDict
great_expectations.datasource.data_connector.util.map_batch_definition_to_data_reference_string_using_regex(batch_definition: BatchDefinition, regex_pattern: str, group_names: List[str]) → str
great_expectations.datasource.data_connector.util.convert_batch_identifiers_to_data_reference_string_using_regex(batch_identifiers: IDDict, regex_pattern: str, group_names: List[str], data_asset_name: Optional[str] = None) → str
great_expectations.datasource.data_connector.util._invert_regex_to_data_reference_template(regex_pattern: str, group_names: List[str]) → str

Create a string template based on a regex and corresponding list of group names.

For example:

filepath_template = _invert_regex_to_data_reference_template(
    regex_pattern=r"^(.+)_(\d+)_(\d+)\.csv$",
    group_names=["name", "timestamp", "price"],
)
filepath_template
>> "{name}_{timestamp}_{price}.csv"

Such templates are useful because they can be populated using string substitution:

filepath_template.format(**{
    "name": "user_logs",
    "timestamp": "20200101",
    "price": "250",
})
>> "user_logs_20200101_250.csv"

NOTE Abe 20201017: This method is almost certainly still brittle. I haven’t exhaustively mapped the OPCODES in sre_constants
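The forward direction of this mapping, matching a data reference string against the regex to recover batch identifiers, can be sketched with the standard library re module. The helper name below is illustrative, not the library's own API:

```python
import re
from typing import List, Optional

def batch_identifiers_from_reference(
    data_reference: str, regex_pattern: str, group_names: List[str]
) -> Optional[dict]:
    # Match the data reference (e.g. a file name) against the pattern;
    # return None when the reference does not belong to this asset.
    match = re.match(regex_pattern, data_reference)
    if match is None:
        return None
    # Pair each captured group with its configured group name.
    return dict(zip(group_names, match.groups()))

identifiers = batch_identifiers_from_reference(
    data_reference="user_logs_20200101_250.csv",
    regex_pattern=r"^(.+)_(\d+)_(\d+)\.csv$",
    group_names=["name", "timestamp", "price"],
)
# identifiers == {"name": "user_logs", "timestamp": "20200101", "price": "250"}
```

Applying this match and then populating the inverted template with the resulting identifiers round-trips back to the original data reference string.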

great_expectations.datasource.data_connector.util.normalize_directory_path(dir_path: str, root_directory_path: Optional[str] = None) → str
great_expectations.datasource.data_connector.util.get_filesystem_one_level_directory_glob_path_list(base_directory_path: str, glob_directive: str) → List[str]

List file names, relative to base_directory_path one level deep, with expansion specified by glob_directive.

Parameters
  • base_directory_path – base directory path, relative to which file paths will be collected

  • glob_directive – glob expansion directive

Returns

List of relative file paths
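The one-level glob behavior can be sketched with the standard library. This is a simplified, hypothetical re-implementation for illustration only; the real function also interacts with normalize_directory_path and root_directory_path handling:

```python
import glob
import os
import tempfile
from typing import List

def one_level_glob(base_directory_path: str, glob_directive: str) -> List[str]:
    # Expand the glob directive relative to the base directory and
    # return matching paths relative to that base, descending only as
    # far as the directive itself specifies.
    query = os.path.join(base_directory_path, glob_directive)
    return sorted(
        os.path.relpath(path, start=base_directory_path)
        for path in glob.glob(query)
    )

# Demonstrate against a throwaway directory tree.
with tempfile.TemporaryDirectory() as base:
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(base, name), "w").close()
    print(one_level_glob(base, "*.csv"))  # ['a.csv', 'b.csv']
```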

great_expectations.datasource.data_connector.util.list_azure_keys(azure, query_options: dict, recursive: bool = False) → List[str]

Utilizes the Azure Blob Storage connection object to retrieve blob names based on user-provided criteria.

For InferredAssetAzureDataConnector, we take container and name_starts_with and search for files using RegEx at and below the level specified by those parameters. However, for ConfiguredAssetAzureDataConnector, we take container and name_starts_with and search for files using RegEx only at the level specified by that bucket and prefix.

This restriction for the ConfiguredAssetAzureDataConnector is needed because paths on Azure comprise not only the leaf file name but the full path, which includes both the prefix and the file name. Otherwise, in situations where multiple data assets share levels of a directory tree, matching files to data assets would not be possible, due to the path ambiguity.

Parameters
  • azure (BlobServiceClient) – Azure connection object responsible for accessing container

  • query_options (dict) – Azure query attributes (“container”, “name_starts_with”, “delimiter”)

  • recursive (bool) – True for InferredAssetAzureDataConnector and False for ConfiguredAssetAzureDataConnector (see above)

Returns

List of keys representing Azure file paths (as filtered by the query_options dict)

great_expectations.datasource.data_connector.util.list_gcs_keys(gcs, query_options: dict, recursive: bool = False) → List[str]

Utilizes the GCS connection object to retrieve blob names based on user-provided criteria.

For InferredAssetGCSDataConnector, we take bucket_or_name and prefix and search for files using RegEx at and below the level specified by those parameters. However, for ConfiguredAssetGCSDataConnector, we take bucket_or_name and prefix and search for files using RegEx only at the level specified by that bucket and prefix.

This restriction for the ConfiguredAssetGCSDataConnector is needed because paths on GCS comprise not only the leaf file name but the full path, which includes both the prefix and the file name. Otherwise, in situations where multiple data assets share levels of a directory tree, matching files to data assets would not be possible, due to the path ambiguity.

Please note that the SDK’s list_blobs method takes in a delimiter key that drastically alters the traversal of a given bucket:
  • If a delimiter is not set (default), the traversal is recursive and the output will contain all blobs in the current directory as well as those in any nested directories.

  • If a delimiter is set, the traversal will continue until that value is seen; as the default is “/”, traversal will be scoped within the current directory and end before visiting nested directories.

In order to provide users with finer control of their config while also ensuring output that is in line with the recursive arg, we deem it appropriate to manually override the value of the delimiter only in cases where it is absolutely necessary.

Parameters
  • gcs (storage.Client) – GCS connection object responsible for accessing bucket

  • query_options (dict) – GCS query attributes (“bucket_or_name”, “prefix”, “delimiter”, “max_results”)

  • recursive (bool) – True for InferredAssetGCSDataConnector and False for ConfiguredAssetGCSDataConnector (see above)

Returns

List of keys representing GCS file paths (as filtered by the query_options dict)
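The delimiter override described above can be sketched as a small pre-processing step on the query options. This is an illustrative reconstruction of the policy, not the library's exact code:

```python
from typing import Optional

def resolve_delimiter(user_delimiter: Optional[str], recursive: bool) -> Optional[str]:
    # A recursive listing requires the delimiter to be unset: with a
    # delimiter present, list_blobs stops at directory boundaries.
    if recursive:
        return None
    # A non-recursive listing requires some delimiter; fall back to the
    # conventional "/" when the user configured none.
    return user_delimiter if user_delimiter is not None else "/"

query_options = {"bucket_or_name": "my-bucket", "prefix": "data/", "delimiter": "/"}
query_options["delimiter"] = resolve_delimiter(query_options["delimiter"], recursive=True)
# The delimiter is now None, so the listing traverses nested directories.
```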

great_expectations.datasource.data_connector.util.list_s3_keys(s3, query_options: dict, iterator_dict: dict, recursive: bool = False) → str

For InferredAssetS3DataConnector, we take bucket and prefix and search for files using RegEx at and below the level specified by that bucket and prefix. However, for ConfiguredAssetS3DataConnector, we take bucket and prefix and search for files using RegEx only at the level specified by that bucket and prefix.

This restriction for the ConfiguredAssetS3DataConnector is needed because paths on S3 comprise not only the leaf file name but the full path, which includes both the prefix and the file name. Otherwise, in situations where multiple data assets share levels of a directory tree, matching files to data assets would not be possible, due to the path ambiguity.

Parameters
  • s3 – S3 client connection

  • query_options (dict) – S3 query attributes (“Bucket”, “Prefix”, “Delimiter”, “MaxKeys”)

  • iterator_dict (dict) – dictionary used to manage “NextContinuationToken” (if “IsTruncated” is returned from S3)

  • recursive (bool) – True for InferredAssetS3DataConnector and False for ConfiguredAssetS3DataConnector (see above)

Returns

String-valued key representing a file path on S3 (full prefix and leaf file name)
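The continuation-token bookkeeping that iterator_dict manages can be sketched against boto3's list_objects_v2 response shape. The loop below is a self-contained illustration exercised with a stubbed client, not the library's own implementation:

```python
from typing import Iterator

def iter_s3_keys(s3_client, bucket: str, prefix: str) -> Iterator[str]:
    # Page through list_objects_v2 responses, following
    # NextContinuationToken for as long as IsTruncated is True.
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        for item in response.get("Contents", []):
            yield item["Key"]
        if not response.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = response["NextContinuationToken"]

class _StubS3:
    # Minimal stand-in returning two pages, mimicking the AWS response shape.
    def __init__(self):
        self._pages = [
            {"Contents": [{"Key": "data/a.csv"}], "IsTruncated": True,
             "NextContinuationToken": "t1"},
            {"Contents": [{"Key": "data/b.csv"}], "IsTruncated": False},
        ]

    def list_objects_v2(self, **kwargs):
        return self._pages.pop(0)

keys = list(iter_s3_keys(_StubS3(), bucket="my-bucket", prefix="data/"))
# keys == ["data/a.csv", "data/b.csv"]
```

With a real client, the same loop would be driven by boto3's actual responses; the stub only stands in for the paged shape of those responses.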

great_expectations.datasource.data_connector.util.build_sorters_from_config(config_list: List[Dict[str, Any]]) → Optional[dict]
great_expectations.datasource.data_connector.util._build_sorter_from_config(sorter_config: Dict[str, Any]) → Sorter

Build a Sorter using the provided configuration and return the newly-built Sorter.