great_expectations.datasource.data_connector¶
Subpackages¶
great_expectations.datasource.data_connector.asset
great_expectations.datasource.data_connector.sorter
great_expectations.datasource.data_connector.sorter.custom_list_sorter
great_expectations.datasource.data_connector.sorter.date_time_sorter
great_expectations.datasource.data_connector.sorter.lexicographic_sorter
great_expectations.datasource.data_connector.sorter.numeric_sorter
great_expectations.datasource.data_connector.sorter.sorter
Submodules¶
great_expectations.datasource.data_connector.batch_filter
great_expectations.datasource.data_connector.configured_asset_azure_data_connector
great_expectations.datasource.data_connector.configured_asset_dbfs_data_connector
great_expectations.datasource.data_connector.configured_asset_file_path_data_connector
great_expectations.datasource.data_connector.configured_asset_filesystem_data_connector
great_expectations.datasource.data_connector.configured_asset_gcs_data_connector
great_expectations.datasource.data_connector.configured_asset_s3_data_connector
great_expectations.datasource.data_connector.configured_asset_sql_data_connector
great_expectations.datasource.data_connector.data_connector
great_expectations.datasource.data_connector.file_path_data_connector
great_expectations.datasource.data_connector.inferred_asset_azure_data_connector
great_expectations.datasource.data_connector.inferred_asset_dbfs_data_connector
great_expectations.datasource.data_connector.inferred_asset_file_path_data_connector
great_expectations.datasource.data_connector.inferred_asset_filesystem_data_connector
great_expectations.datasource.data_connector.inferred_asset_gcs_data_connector
great_expectations.datasource.data_connector.inferred_asset_s3_data_connector
great_expectations.datasource.data_connector.inferred_asset_sql_data_connector
great_expectations.datasource.data_connector.runtime_data_connector
great_expectations.datasource.data_connector.util
Package Contents¶
Classes¶
DataConnector: DataConnectors produce identifying information, called "batch_spec", that ExecutionEngines can use to get individual batches of data.
RuntimeDataConnector: A DataConnector that allows users to specify a Batch's data directly using a RuntimeBatchRequest that contains either an in-memory Pandas or Spark DataFrame, a filesystem or S3 path, or an arbitrary SQL query.
FilePathDataConnector: Base class for DataConnectors that are designed for connecting to filesystem-like data, which can include files on disk, but also S3 and GCS.
ConfiguredAssetFilePathDataConnector: One of two classes (InferredAssetFilePathDataConnector being the other) designed for connecting to filesystem-like data; requires an explicit listing of each DataAsset.
InferredAssetFilePathDataConnector: One of two classes (ConfiguredAssetFilePathDataConnector being the other) designed for connecting to filesystem-like data; determines the data_asset_name implicitly.
ConfiguredAssetFilesystemDataConnector: Extension of ConfiguredAssetFilePathDataConnector used to connect to data on a filesystem.
InferredAssetFilesystemDataConnector: Extension of InferredAssetFilePathDataConnector used to connect to data on a filesystem.
ConfiguredAssetS3DataConnector: Extension of ConfiguredAssetFilePathDataConnector used to connect to S3.
InferredAssetS3DataConnector: Extension of InferredAssetFilePathDataConnector used to connect to S3.
ConfiguredAssetAzureDataConnector: Extension of ConfiguredAssetFilePathDataConnector used to connect to Azure Blob Storage.
InferredAssetAzureDataConnector: Extension of InferredAssetFilePathDataConnector used to connect to Azure Blob Storage.
ConfiguredAssetGCSDataConnector: Extension of ConfiguredAssetFilePathDataConnector used to connect to GCS.
InferredAssetGCSDataConnector: Extension of InferredAssetFilePathDataConnector used to connect to GCS.
ConfiguredAssetSqlDataConnector: A DataConnector that requires explicit listing of the SQL tables you want to connect to.
InferredAssetSqlDataConnector: A DataConnector that infers data_asset names by introspecting a SQL database.
ConfiguredAssetDBFSDataConnector: Extension of ConfiguredAssetFilesystemDataConnector used to connect to the Databricks File System (DBFS). Note: this works for the current implementation of DBFS; if DBFS diverges from a filesystem-like implementation in the future, this class should instead inherit from ConfiguredAssetFilePathDataConnector or another DataConnector.
InferredAssetDBFSDataConnector: Extension of InferredAssetFilesystemDataConnector used to connect to data on a DBFS filesystem.
-
class
great_expectations.datasource.data_connector.
DataConnector
(name: str, datasource_name: str, execution_engine: ExecutionEngine, batch_spec_passthrough: Optional[dict] = None)¶ DataConnectors produce identifying information, called “batch_spec” that ExecutionEngines can use to get individual batches of data. They add flexibility in how to obtain data such as with time-based partitioning, downsampling, or other techniques appropriate for the Datasource.
For example, a DataConnector could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
A batch is a sample from a data asset, sliced according to a particular rule. For example, an hourly slice of the Events table or the "most recent user records."
A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same "batch_spec" assembled by the data connector. While not every Datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.
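As a concrete illustration of the "batch_spec" idea, a file-path-oriented connector might hand its ExecutionEngine a small dict like the one below. The keys and values here are hypothetical, chosen for illustration; the exact schema depends on the DataConnector/ExecutionEngine pair.

```python
# A hypothetical batch_spec for a Pandas-backed, file-path-oriented
# connector. These keys are illustrative, not an authoritative schema.
batch_spec = {
    "data_asset_name": "events",
    "path": "/data/events/2012-02-07.csv",  # which slice of data to load
    "reader_method": "read_csv",            # how the engine should load it
    "reader_options": {"sep": ","},         # passed through to the reader
}
```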
-
property
batch_spec_passthrough
(self)¶
-
property
name
(self)¶
-
property
datasource_name
(self)¶
-
property
execution_engine
(self)¶
-
property
data_context_root_directory
(self)¶
-
get_batch_data_and_metadata
(self, batch_definition: BatchDefinition)¶ Uses batch_definition to retrieve batch_data and batch_markers by building a batch_spec from batch_definition, then using execution_engine to return batch_data and batch_markers
- Parameters
batch_definition (BatchDefinition) – required batch_definition parameter for retrieval
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Builds batch_spec from batch_definition by generating batch_spec params and adding any pass_through params
- Parameters
batch_definition (BatchDefinition) – required batch_definition parameter for retrieval
- Returns
BatchSpec object built from BatchDefinition
-
abstract
_refresh_data_references_cache
(self)¶
-
abstract
_get_data_reference_list
(self, data_asset_name: Optional[str] = None)¶ List objects in the underlying data store to create a list of data_references. This method is used to refresh the cache by classes that extend this base DataConnector class
- Parameters
data_asset_name (str) – optional data_asset_name to retrieve more specific results
-
abstract
_get_data_reference_list_from_cache_by_data_asset_name
(self, data_asset_name: str)¶ Fetch data_references corresponding to data_asset_name from the cache.
-
abstract
get_data_reference_list_count
(self)¶
-
abstract
get_unmatched_data_references
(self)¶
-
abstract
get_available_data_asset_names
(self)¶ Return the list of asset names known by this data connector.
- Returns
A list of available names
-
abstract
get_available_data_asset_names_and_types
(self)¶ Return the list of asset names and types known by this DataConnector.
- Returns
A list of tuples consisting of available names and types
-
abstract
get_batch_definition_list_from_batch_request
(self, batch_request: BatchRequestBase)¶
-
abstract
_map_data_reference_to_batch_definition_list
(self, data_reference: Any, data_asset_name: Optional[str] = None)¶
-
abstract
_map_batch_definition_to_data_reference
(self, batch_definition: BatchDefinition)¶
-
abstract
_generate_batch_spec_parameters_from_batch_definition
(self, batch_definition: BatchDefinition)¶
-
self_check
(self, pretty_print=True, max_examples=3)¶ Checks the configuration of the current DataConnector by doing the following:
refresh or create the data_reference cache
print batch_definition_count and example_data_references for each data_asset_name
print unmatched data_references, and allow the user to modify the regex or glob configuration if necessary
select a random data_reference and attempt to retrieve and print the first few rows to the user
When used as part of the test_yaml_config() workflow, the user will be able to know if the data_connector is properly configured, and if the associated execution_engine can properly retrieve data using the configuration.
- Parameters
pretty_print (bool) – should the output be printed?
max_examples (int) – how many data_references should be printed?
- Returns
dictionary containing self_check output
- Return type
report_obj (dict)
-
_self_check_fetch_batch
(self, pretty_print: bool, example_data_reference: Any, data_asset_name: str)¶ Helper function for self_check() to retrieve batch using example_data_reference and data_asset_name, all while printing helpful messages. First 5 rows of batch_data are printed by default.
- Parameters
pretty_print (bool) – print to console?
example_data_reference (Any) – data_reference to retrieve
data_asset_name (str) – data_asset_name to retrieve
-
_validate_batch_request
(self, batch_request: BatchRequestBase)¶ - Validate batch_request by checking:
if configured datasource_name matches batch_request’s datasource_name
if current data_connector_name matches batch_request’s data_connector_name
- Parameters
batch_request (BatchRequestBase) – batch_request object to validate
-
-
class
great_expectations.datasource.data_connector.
RuntimeDataConnector
(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None, batch_identifiers: Optional[list] = None, batch_spec_passthrough: Optional[dict] = None)¶ Bases:
great_expectations.datasource.data_connector.data_connector.DataConnector
A DataConnector that allows users to specify a Batch’s data directly using a RuntimeBatchRequest that contains either an in-memory Pandas or Spark DataFrame, a filesystem or S3 path, or an arbitrary SQL query
- Parameters
name (str) – The name of this DataConnector
datasource_name (str) – The name of the Datasource that contains it
execution_engine (ExecutionEngine) – An ExecutionEngine
batch_identifiers (list) – a list of keys that must be defined in the batch_identifiers dict of RuntimeBatchRequest
batch_spec_passthrough (dict) – dictionary with keys that will be added directly to batch_spec
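The batch_identifiers contract above can be sketched as a simple key check: every key configured on the connector must be supplied in the request's batch_identifiers dict. This is an illustrative stand-in, not the library's actual validation code, and the function name is hypothetical.

```python
def check_batch_identifiers(configured_keys: list, batch_identifiers: dict) -> bool:
    # Every key configured on the connector must appear in the
    # RuntimeBatchRequest's batch_identifiers dict.
    missing = set(configured_keys) - set(batch_identifiers)
    if missing:
        raise ValueError(f"missing batch_identifiers: {sorted(missing)}")
    return True

check_batch_identifiers(["run_id"], {"run_id": "2021-03-01"})  # passes
```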
-
_refresh_data_references_cache
(self)¶
-
_get_data_reference_list
(self, data_asset_name: Optional[str] = None)¶ List objects in the cache to create a list of data_references. If data_asset_name is passed in, method will return all data_references for the named data_asset. If no data_asset_name is passed in, will return a list of all data_references for all data_assets in the cache.
-
_get_data_reference_list_from_cache_by_data_asset_name
(self, data_asset_name: str)¶ Fetch data_references corresponding to data_asset_name from the cache.
-
get_data_reference_list_count
(self)¶ Get number of data_references corresponding to all data_asset_names in cache. In cases where the RuntimeDataConnector has been passed a BatchRequest with the same data_asset_name but different batch_identifiers, it is possible to have more than one data_reference for a data_asset.
-
get_unmatched_data_references
(self)¶
-
get_available_data_asset_names
(self)¶ Please see the note in _get_batch_definition_list_from_batch_request().
-
get_batch_data_and_metadata
(self, batch_definition: BatchDefinition, runtime_parameters: dict)¶ Uses batch_definition to retrieve batch_data and batch_markers by building a batch_spec from batch_definition, then using execution_engine to return batch_data and batch_markers
- Parameters
batch_definition (BatchDefinition) – required batch_definition parameter for retrieval
-
get_batch_definition_list_from_batch_request
(self, batch_request: RuntimeBatchRequest)¶
-
_get_batch_definition_list_from_batch_request
(self, batch_request: RuntimeBatchRequest)¶ <Will> 202103. The following behavior of the _data_references_cache follows a pattern that we are using for other data_connectors, including variations of FilePathDataConnector. When a BatchRequest contains batch_data that is passed in as an in-memory dataframe, the cache will contain the names of all data_assets (and data_references) that have been passed into the RuntimeDataConnector in this session, even though technically only the most recent batch_data is available. This can be misleading. However, allowing the RuntimeDataConnector to keep a record of all data_assets (and data_references) that have been passed in will allow for the proposed behavior of RuntimeBatchRequest, which will allow for paths and queries to be passed in as part of the BatchRequest. Therefore this behavior will be revisited when the design of RuntimeBatchRequest and related classes is complete.
-
_update_data_references_cache
(self, data_asset_name: str, batch_definition_list: List, batch_identifiers: IDDict)¶
-
_self_check_fetch_batch
(self, pretty_print, example_data_reference, data_asset_name)¶ Helper function for self_check() to retrieve batch using example_data_reference and data_asset_name, all while printing helpful messages. First 5 rows of batch_data are printed by default.
- Parameters
pretty_print (bool) – print to console?
example_data_reference (Any) – data_reference to retrieve
data_asset_name (str) – data_asset_name to retrieve
-
_generate_batch_spec_parameters_from_batch_definition
(self, batch_definition: BatchDefinition)¶
-
build_batch_spec
(self, batch_definition: BatchDefinition, runtime_parameters: dict)¶ Builds batch_spec from batch_definition by generating batch_spec params and adding any pass_through params
- Parameters
batch_definition (BatchDefinition) – required batch_definition parameter for retrieval
- Returns
BatchSpec object built from BatchDefinition
-
static
_get_data_reference_name
(batch_identifiers: IDDict)¶
-
static
_validate_runtime_parameters
(runtime_parameters: Union[dict, type(None)])¶
-
_validate_batch_request
(self, batch_request: RuntimeBatchRequest)¶ - Validate batch_request by checking:
if configured datasource_name matches batch_request’s datasource_name
if current data_connector_name matches batch_request’s data_connector_name
- Parameters
batch_request (BatchRequestBase) – batch_request object to validate
-
_validate_batch_identifiers
(self, batch_identifiers: dict)¶
-
_validate_batch_identifiers_configuration
(self, batch_identifiers: List[str])¶
-
self_check
(self, pretty_print=True, max_examples=3)¶ Overrides the self_check method for RuntimeDataConnector. Normally the self_check() method will check the configuration of the DataConnector by doing the following:
refresh or create the data_reference cache
print batch_definition_count and example_data_references for each data_asset_name
print unmatched data_references, and allow the user to modify the regex or glob configuration if necessary
However, in the case of the RuntimeDataConnector there are no example data_asset_names until the data is passed in through the RuntimeBatchRequest. Therefore, a note will be displayed to the user saying that the RuntimeDataConnector will not have data_asset_names until they are passed in through a RuntimeBatchRequest.
- Parameters
pretty_print (bool) – should the output be printed?
max_examples (int) – how many data_references should be printed?
- Returns
dictionary containing self_check output
- Return type
report_obj (dict)
-
class
great_expectations.datasource.data_connector.
FilePathDataConnector
(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, batch_spec_passthrough: Optional[dict] = None)¶ Bases:
great_expectations.datasource.data_connector.data_connector.DataConnector
Base class for DataConnectors that are designed for connecting to filesystem-like data, which can include files on disk, but also S3 and GCS.
Note: FilePathDataConnector is not meant to be used on its own, but extended. Currently ConfiguredAssetFilePathDataConnector and InferredAssetFilePathDataConnector are subclasses of FilePathDataConnector.
-
property
sorters
(self)¶
-
_get_data_reference_list_from_cache_by_data_asset_name
(self, data_asset_name: str)¶ Fetch data_references corresponding to data_asset_name from the cache.
-
get_batch_definition_list_from_batch_request
(self, batch_request: BatchRequest)¶ Retrieve batch_definitions that match batch_request.
- First retrieves all batch_definitions that match batch_request
if batch_request also has a batch_filter, then select batch_definitions that match batch_filter.
if data_connector has sorters configured, then sort the batch_definition list before returning.
- Parameters
batch_request (BatchRequest) – BatchRequest (containing previously validated attributes) to process
- Returns
A list of BatchDefinition objects that match BatchRequest
-
_get_batch_definition_list_from_batch_request
(self, batch_request: BatchRequestBase)¶ Retrieve batch_definitions that match batch_request.
- First retrieves all batch_definitions that match batch_request
if batch_request also has a batch_filter, then select batch_definitions that match batch_filter.
if data_connector has sorters configured, then sort the batch_definition list before returning.
- Parameters
batch_request (BatchRequestBase) – BatchRequestBase (BatchRequest without attribute validation) to process
- Returns
A list of BatchDefinition objects that match BatchRequest
-
_sort_batch_definition_list
(self, batch_definition_list: List[BatchDefinition])¶ Use configured sorters to sort the batch_definition list
- Parameters
batch_definition_list (list) – list of batch_definitions to sort
- Returns
sorted list of batch_definitions
-
_map_data_reference_to_batch_definition_list
(self, data_reference: str, data_asset_name: str = None)¶
-
_map_batch_definition_to_data_reference
(self, batch_definition: BatchDefinition)¶
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
static
sanitize_prefix
(text: str)¶ Takes in a given user-prefix and cleans it to work with file-system traversal methods (i.e. add ‘/’ to the end of a string meant to represent a directory)
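One plausible reading of the described behavior, sketched for illustration rather than as the library's actual implementation:

```python
import os

def sanitize_prefix(text: str) -> str:
    # Ensure the prefix ends with the host OS path separator, so it is
    # treated as a directory by filesystem traversal methods.
    return text if text.endswith(os.sep) else text + os.sep
```

For example, on a POSIX system sanitize_prefix("data") would yield "data/", while an already-terminated prefix is returned unchanged.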
-
_generate_batch_spec_parameters_from_batch_definition
(self, batch_definition: BatchDefinition)¶
-
_validate_batch_request
(self, batch_request: BatchRequestBase)¶ - Validate batch_request by checking:
if configured datasource_name matches batch_request’s datasource_name
if current data_connector_name matches batch_request’s data_connector_name
- Parameters
batch_request (BatchRequestBase) – batch_request object to validate
-
_validate_sorters_configuration
(self, data_asset_name: Optional[str] = None)¶
-
abstract
_get_batch_definition_list_from_cache
(self)¶
-
abstract
_get_regex_config
(self, data_asset_name: Optional[str] = None)¶
-
abstract
_get_full_file_path
(self, path: str, data_asset_name: Optional[str] = None)¶
-
-
class
great_expectations.datasource.data_connector.
ConfiguredAssetFilePathDataConnector
(name: str, datasource_name: str, assets: dict, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, batch_spec_passthrough: Optional[dict] = None)¶ Bases:
great_expectations.datasource.data_connector.file_path_data_connector.FilePathDataConnector
The ConfiguredAssetFilePathDataConnector is one of two classes (InferredAssetFilePathDataConnector being the other) designed for connecting to filesystem-like data. This includes files on disk, but also things like S3 object stores.
A ConfiguredAssetFilePathDataConnector requires an explicit listing of each DataAsset you want to connect to. This allows more fine-tuning, but also requires more setup.
Note: ConfiguredAssetFilePathDataConnector is not meant to be used on its own, but extended. Currently ConfiguredAssetFilesystemDataConnector, ConfiguredAssetS3DataConnector, ConfiguredAssetAzureDataConnector, and ConfiguredAssetGCSDataConnector are subclasses of ConfiguredAssetFilePathDataConnector.
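To make "explicit listing of each DataAsset" concrete, an assets mapping typically pairs each asset name with its own pattern and group names. The sketch below is illustrative; the asset name and file naming scheme are hypothetical.

```python
import re

# Hypothetical `assets` mapping: one entry per DataAsset, each with a
# regex pattern and the group names that parameterize its batches.
assets = {
    "yellow_tripdata": {
        "pattern": r"yellow_tripdata_(\d{4})-(\d{2})\.csv",
        "group_names": ["year", "month"],
    },
}

m = re.match(assets["yellow_tripdata"]["pattern"], "yellow_tripdata_2020-01.csv")
print(dict(zip(assets["yellow_tripdata"]["group_names"], m.groups())))
# -> {'year': '2020', 'month': '01'}
```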
-
property
assets
(self)¶
-
_build_assets_from_config
(self, config: Dict[str, dict])¶
-
_build_asset_from_config
(self, config: dict)¶
-
get_available_data_asset_names
(self)¶ Return the list of asset names known by this DataConnector.
- Returns
A list of available names
-
_refresh_data_references_cache
(self)¶
-
_get_data_reference_list
(self, data_asset_name: Optional[str] = None)¶ List objects in the underlying data store to create a list of data_references. This method is used to refresh the cache.
-
get_data_reference_list_count
(self)¶ Returns the list of data_references known by this DataConnector by looping over all data_asset_names in _data_references_cache
- Returns
number of data_references known by this DataConnector.
-
get_unmatched_data_references
(self)¶ Returns the list of data_references unmatched by configuration by looping through items in _data_references_cache and returning data_reference that do not have an associated data_asset.
- Returns
list of data_references that are not matched by configuration.
-
_get_batch_definition_list_from_cache
(self)¶
-
_get_full_file_path
(self, path: str, data_asset_name: Optional[str] = None)¶
-
_get_regex_config
(self, data_asset_name: Optional[str] = None)¶
-
_get_asset
(self, data_asset_name: str)¶
-
abstract
_get_data_reference_list_for_asset
(self, asset: Optional[Asset])¶
-
abstract
_get_full_file_path_for_asset
(self, path: str, asset: Optional[Asset])¶
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
-
class
great_expectations.datasource.data_connector.
InferredAssetFilePathDataConnector
(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, batch_spec_passthrough: Optional[dict] = None)¶ Bases:
great_expectations.datasource.data_connector.file_path_data_connector.FilePathDataConnector
The InferredAssetFilePathDataConnector is one of two classes (ConfiguredAssetFilePathDataConnector being the other) designed for connecting to filesystem-like data. This includes files on disk, but also things like S3 object stores.
InferredAssetFilePathDataConnector is a base class that operates on file paths and determines the data_asset_name implicitly (e.g., through the combination of the regular expression pattern and group names)
Note: InferredAssetFilePathDataConnector is not meant to be used on its own, but extended. Currently InferredAssetFilesystemDataConnector, InferredAssetS3DataConnector, InferredAssetAzureDataConnector, and InferredAssetGCSDataConnector are subclasses of InferredAssetFilePathDataConnector.
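The implicit inference described above can be sketched with a named regex group that captures the data_asset_name directly from the path. The pattern and path below are illustrative only.

```python
import re

# Illustrative: infer the data_asset_name from the first path segment.
pattern = re.compile(r"(?P<data_asset_name>[^/]+)/(?P<year>\d{4})\.csv")
match = pattern.match("events/2012.csv")
print(match.group("data_asset_name"))  # -> events
```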
-
_refresh_data_references_cache
(self)¶ refreshes data_reference cache
-
get_data_reference_list_count
(self)¶ Returns the list of data_references known by this DataConnector by looping over all data_asset_names in _data_references_cache
- Returns
number of data_references known by this DataConnector
-
get_unmatched_data_references
(self)¶ Returns the list of data_references unmatched by configuration by looping through items in _data_references_cache and returning data_references that do not have an associated data_asset.
- Returns
list of data_references that are not matched by configuration.
-
get_available_data_asset_names
(self)¶ Return the list of asset names known by this DataConnector
- Returns
A list of available names
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
_get_batch_definition_list_from_cache
(self)¶
-
_get_regex_config
(self, data_asset_name: Optional[str] = None)¶
-
-
class
great_expectations.datasource.data_connector.
ConfiguredAssetFilesystemDataConnector
(name: str, datasource_name: str, base_directory: str, assets: dict, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, glob_directive: str = '**/*', sorters: Optional[list] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
Extension of ConfiguredAssetFilePathDataConnector used to connect to Filesystem
The ConfiguredAssetFilesystemDataConnector is one of two classes (InferredAssetFilesystemDataConnector being the other one) designed for connecting to data on a filesystem. It connects to assets defined by the assets configuration.
A ConfiguredAssetFilesystemDataConnector requires an explicit listing of each DataAsset you want to connect to. This allows more fine-tuning, but also requires more setup.
-
_get_data_reference_list_for_asset
(self, asset: Optional[Asset])¶
-
_get_full_file_path_for_asset
(self, path: str, asset: Optional[Asset] = None)¶
-
property
base_directory
(self)¶ Accessor method for base_directory. If directory is a relative path, interpret it as relative to the root directory. If it is absolute, then keep as-is.
-
-
class
great_expectations.datasource.data_connector.
InferredAssetFilesystemDataConnector
(name: str, datasource_name: str, base_directory: str, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, glob_directive: str = '*', sorters: Optional[list] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
Extension of InferredAssetFilePathDataConnector used to connect to data on a filesystem.
The InferredAssetFilesystemDataConnector is one of two classes (ConfiguredAssetFilesystemDataConnector being the other one) designed for connecting to data on a filesystem. It connects to assets inferred from directory and file name by default_regex and glob_directive.
The InferredAssetFilesystemDataConnector operates on file paths and determines the data_asset_name implicitly (e.g., through the combination of the regular expression pattern and group names)
-
_get_data_reference_list
(self, data_asset_name: Optional[str] = None)¶ List objects in the underlying data store to create a list of data_references.
This method is used to refresh the cache.
-
_get_full_file_path
(self, path: str, data_asset_name: Optional[str] = None)¶
-
property
base_directory
(self)¶ Accessor method for base_directory. If directory is a relative path, interpret it as relative to the root directory. If it is absolute, then keep as-is.
-
-
class
great_expectations.datasource.data_connector.
ConfiguredAssetS3DataConnector
(name: str, datasource_name: str, bucket: str, assets: dict, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, prefix: str = '', delimiter: str = '/', max_keys: int = 1000, boto3_options: Optional[dict] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
Extension of ConfiguredAssetFilePathDataConnector used to connect to S3
DataConnectors produce identifying information, called “batch_spec” that ExecutionEngines can use to get individual batches of data. They add flexibility in how to obtain data such as with time-based partitioning, downsampling, or other techniques appropriate for the Datasource.
The ConfiguredAssetS3DataConnector is one of two classes (InferredAssetS3DataConnector being the other one) designed for connecting to data on S3.
A ConfiguredAssetS3DataConnector requires an explicit listing of each DataAsset you want to connect to. This allows more fine-tuning, but also requires more setup.
-
static
sanitize_prefix_for_s3
(text: str)¶ Takes in a given user-prefix and cleans it to work with file-system traversal methods (i.e. add ‘/’ to the end of a string meant to represent a directory)
Customized for S3 paths, ignoring the path separator used by the host OS
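A sketch of the S3 variant, hedged as an illustration rather than the actual implementation: because S3 object keys always use '/', the separator is hard-coded instead of taken from the host OS.

```python
def sanitize_prefix_for_s3(text: str) -> str:
    # S3 object keys always use '/' as the delimiter, so append '/'
    # unconditionally rather than using the host OS separator.
    return text if text.endswith("/") else text + "/"

print(sanitize_prefix_for_s3("my/prefix"))  # -> my/prefix/
```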
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
_get_data_reference_list_for_asset
(self, asset: Optional[Asset])¶
-
_get_full_file_path_for_asset
(self, path: str, asset: Optional[Asset] = None)¶
-
-
class
great_expectations.datasource.data_connector.
InferredAssetS3DataConnector
(name: str, datasource_name: str, bucket: str, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, prefix: str = '', delimiter: str = '/', max_keys: int = 1000, boto3_options: Optional[dict] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
Extension of InferredAssetFilePathDataConnector used to connect to S3
The InferredAssetS3DataConnector is one of two classes (ConfiguredAssetS3DataConnector being the other one) designed for connecting to filesystem-like data, more specifically files on S3. It connects to assets inferred from bucket, prefix, and file name by default_regex.
The InferredAssetS3DataConnector operates on S3 buckets and determines the data_asset_name implicitly (e.g., through the combination of the regular expression pattern and group names)
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
_get_data_reference_list
(self, data_asset_name: Optional[str] = None)¶ List objects in the underlying data store to create a list of data_references.
This method is used to refresh the cache.
-
_get_full_file_path
(self, path: str, data_asset_name: Optional[str] = None)¶
-
-
class
great_expectations.datasource.data_connector.
ConfiguredAssetAzureDataConnector
(name: str, datasource_name: str, container: str, assets: dict, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, name_starts_with: str = '', delimiter: str = '/', azure_options: Optional[dict] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
Extension of ConfiguredAssetFilePathDataConnector used to connect to Azure
DataConnectors produce identifying information, called “batch_spec” that ExecutionEngines can use to get individual batches of data. They add flexibility in how to obtain data such as with time-based partitioning, splitting and sampling, or other techniques appropriate for obtaining batches of data.
The ConfiguredAssetAzureDataConnector is one of two classes (InferredAssetAzureDataConnector being the other one) designed for connecting to data on Azure.
A ConfiguredAssetAzureDataConnector requires an explicit specification of each DataAsset you want to connect to. This allows more fine-tuning, but also requires more setup. Please note that in order to maintain consistency with Azure’s official SDK, we utilize terms like “container” and “name_starts_with”.
As much of the interaction with the SDK is done through a BlobServiceClient, please refer to the official docs if a greater understanding of the supported authentication methods and general functionality is desired. Source: https://docs.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobserviceclient?view=azure-python
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
_get_data_reference_list_for_asset
(self, asset: Optional[Asset])¶
-
_get_full_file_path_for_asset
(self, path: str, asset: Optional[Asset] = None)¶
-
-
class
great_expectations.datasource.data_connector.
InferredAssetAzureDataConnector
(name: str, datasource_name: str, container: str, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, name_starts_with: str = '', delimiter: str = '/', azure_options: Optional[dict] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
Extension of InferredAssetFilePathDataConnector used to connect to Azure Blob Storage
The InferredAssetAzureDataConnector is one of two classes (ConfiguredAssetAzureDataConnector being the other one) designed for connecting to filesystem-like data, more specifically files on Azure Blob Storage. It connects to assets inferred from container, name_starts_with, and file name by default_regex.
As much of the interaction with the SDK is done through a BlobServiceClient, please refer to the official docs if a greater understanding of the supported authentication methods and general functionality is desired. Source: https://docs.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobserviceclient?view=azure-python
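Conceptually, the inferred connector applies default_regex to each blob name to derive a data_asset_name and batch identifiers. A minimal, stand-alone sketch of that matching logic, using a hypothetical pattern and blob name (this is not the connector's actual implementation):

```python
import re

# Hypothetical default_regex: the first group names the asset,
# the remaining groups become batch identifiers.
pattern = re.compile(r"(\w+)/(\d{4})-(\d{2})\.csv")
group_names = ["data_asset_name", "year", "month"]

def infer_batch_identifiers(blob_name: str):
    """Return (asset_name, identifiers) for a matching blob name, else None."""
    match = pattern.match(blob_name)
    if match is None:
        return None
    values = dict(zip(group_names, match.groups()))
    asset_name = values.pop("data_asset_name")
    return asset_name, values

result = infer_batch_identifiers("taxi/2019-01.csv")
```

Here a blob named `taxi/2019-01.csv` yields the asset name `taxi` with batch identifiers for year and month, which is the essence of "assets inferred from container, name_starts_with, and file name by default_regex".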
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
_get_data_reference_list
(self, data_asset_name: Optional[str] = None)¶ List objects in the underlying data store to create a list of data_references.
This method is used to refresh the cache.
-
_get_full_file_path
(self, path: str, data_asset_name: Optional[str] = None)¶
-
-
class
great_expectations.datasource.data_connector.
ConfiguredAssetGCSDataConnector
(name: str, datasource_name: str, bucket_or_name: str, assets: dict, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, prefix: Optional[str] = None, delimiter: Optional[str] = None, max_results: Optional[int] = None, gcs_options: Optional[dict] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
Extension of ConfiguredAssetFilePathDataConnector used to connect to GCS
DataConnectors produce identifying information, called a “batch_spec”, that ExecutionEngines can use to get individual batches of data. They add flexibility in how to obtain data, such as time-based partitioning, splitting and sampling, or other techniques appropriate for obtaining batches of data.
The ConfiguredAssetGCSDataConnector is one of two classes (InferredAssetGCSDataConnector being the other one) designed for connecting to data on GCS.
A ConfiguredAssetGCSDataConnector requires an explicit specification of each DataAsset you want to connect to. This allows more fine-tuning, but also requires more setup. Please note that in order to maintain consistency with Google’s official SDK, we utilize terms like “bucket_or_name” and “max_results”. Since we convert these keys from YAML to Python and directly pass them in to the GCS connection object, maintaining consistency is necessary for proper usage.
- This DataConnector supports the following methods of authentication:
Standard gcloud auth / GOOGLE_APPLICATION_CREDENTIALS environment variable workflow
Manual creation of credentials from google.oauth2.service_account.Credentials.from_service_account_file
Manual creation of credentials from google.oauth2.service_account.Credentials.from_service_account_info
As much of the interaction with the SDK is done through a GCS Storage Client, please refer to the official docs if a greater understanding of the supported authentication methods and general functionality is desired. Source: https://googleapis.dev/python/google-api-core/latest/auth.html
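The three authentication methods above correspond to different shapes of the gcs_options dictionary. The key names below follow the pattern commonly used when constructing a GCS storage client from service-account credentials, but treat them as an assumption and confirm against the linked docs; the paths and project id are placeholders:

```python
# 1. Default credentials: rely on the standard gcloud auth /
#    GOOGLE_APPLICATION_CREDENTIALS workflow, so gcs_options stays empty.
gcs_options_default = {}

# 2. Service-account key file: corresponds to
#    google.oauth2.service_account.Credentials.from_service_account_file
gcs_options_file = {"filename": "/path/to/service_account_key.json"}  # placeholder path

# 3. In-memory service-account info: corresponds to
#    google.oauth2.service_account.Credentials.from_service_account_info
gcs_options_info = {
    "info": {"type": "service_account", "project_id": "my-project"}  # placeholder
}
```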
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
_get_data_reference_list_for_asset
(self, asset: Optional[Asset])¶
-
_get_full_file_path_for_asset
(self, path: str, asset: Optional[Asset] = None)¶
-
class
great_expectations.datasource.data_connector.
InferredAssetGCSDataConnector
(name: str, datasource_name: str, bucket_or_name: str, execution_engine: Optional[ExecutionEngine] = None, default_regex: Optional[dict] = None, sorters: Optional[list] = None, prefix: Optional[str] = None, delimiter: Optional[str] = None, max_results: Optional[int] = None, gcs_options: Optional[dict] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
Extension of InferredAssetFilePathDataConnector used to connect to GCS
DataConnectors produce identifying information, called a “batch_spec”, that ExecutionEngines can use to get individual batches of data. They add flexibility in how to obtain data, such as time-based partitioning, splitting and sampling, or other techniques appropriate for obtaining batches of data.
The InferredAssetGCSDataConnector is one of two classes (ConfiguredAssetGCSDataConnector being the other one) designed for connecting to data on GCS.
An InferredAssetGCSDataConnector uses regular expressions to traverse through GCS buckets and implicitly determine data_asset_names. Please note that in order to maintain consistency with Google’s official SDK, we utilize terms like “bucket_or_name” and “max_results”. Since we convert these keys from YAML to Python and directly pass them in to the GCS connection object, maintaining consistency is necessary for proper usage.
- This DataConnector supports the following methods of authentication:
Standard gcloud auth / GOOGLE_APPLICATION_CREDENTIALS environment variable workflow
Manual creation of credentials from google.oauth2.service_account.Credentials.from_service_account_file
Manual creation of credentials from google.oauth2.service_account.Credentials.from_service_account_info
As much of the interaction with the SDK is done through a GCS Storage Client, please refer to the official docs if a greater understanding of the supported authentication methods and general functionality is desired. Source: https://googleapis.dev/python/google-api-core/latest/auth.html
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
_get_data_reference_list
(self, data_asset_name: Optional[str] = None)¶ List objects in the underlying data store to create a list of data_references. This method is used to refresh the cache by classes that extend this base DataConnector class
- Parameters
data_asset_name (str) – optional data_asset_name to retrieve more specific results
-
_get_full_file_path
(self, path: str, data_asset_name: Optional[str] = None)¶
-
class
great_expectations.datasource.data_connector.
ConfiguredAssetSqlDataConnector
(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None, assets: Optional[Dict[str, dict]] = None, batch_spec_passthrough: Optional[dict] = None)¶ Bases:
great_expectations.datasource.data_connector.data_connector.DataConnector
A DataConnector that requires explicit listing of SQL tables you want to connect to.
- Parameters
name (str) – The name of this DataConnector
datasource_name (str) – The name of the Datasource that contains it
execution_engine (ExecutionEngine) – An ExecutionEngine
assets (dict) – mapping of data_asset names to their configurations
batch_spec_passthrough (dict) – dictionary with keys that will be added directly to batch_spec
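A sketch of what the `assets` argument might look like, pairing each explicitly listed table with a splitter. The table and column names are hypothetical; the splitter method name matches the `_split_on_converted_datetime` method documented below:

```python
# Hypothetical assets mapping for a ConfiguredAssetSqlDataConnector:
# each key is a data_asset name, each value its configuration.
assets = {
    "yellow_tripdata": {
        "table_name": "yellow_tripdata",  # placeholder table name
        # Split batches by month, using the converted-datetime splitter.
        "splitter_method": "_split_on_converted_datetime",
        "splitter_kwargs": {
            "column_name": "pickup_datetime",  # placeholder column
            "date_format_string": "%Y-%m",
        },
    },
}
```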
-
property
assets
(self)¶
-
add_data_asset
(self, name: str, config: dict)¶ Add data_asset to DataConnector using data_asset name as key, and data_asset config as value.
-
_update_data_asset_name_from_config
(self, data_asset_name: str, data_asset_config: dict)¶
-
_get_batch_identifiers_list_from_data_asset_config
(self, data_asset_name, data_asset_config)¶
-
_refresh_data_references_cache
(self)¶
-
_get_column_names_from_splitter_kwargs
(self, splitter_kwargs)¶
-
get_available_data_asset_names
(self)¶ Return the list of asset names known by this DataConnector.
- Returns
A list of available names
-
get_unmatched_data_references
(self)¶ Returns the list of data_references unmatched by configuration by looping through items in _data_references_cache and returning data_reference that do not have an associated data_asset.
- Returns
list of data_references that are not matched by configuration.
-
get_batch_definition_list_from_batch_request
(self, batch_request: BatchRequest)¶
-
_get_data_reference_list_from_cache_by_data_asset_name
(self, data_asset_name: str)¶ Fetch data_references corresponding to data_asset_name from the cache.
-
_map_data_reference_to_batch_definition_list
(self, data_reference, data_asset_name: Optional[str] = None)¶
-
build_batch_spec
(self, batch_definition: BatchDefinition)¶ Build BatchSpec from batch_definition by calling DataConnector’s build_batch_spec function.
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
BatchSpec built from batch_definition
-
_generate_batch_spec_parameters_from_batch_definition
(self, batch_definition: BatchDefinition)¶ - Build BatchSpec parameters from batch_definition with the following components:
data_asset_name from batch_definition
batch_identifiers from batch_definition
data_asset from data_connector
- Parameters
batch_definition (BatchDefinition) – to be used to build batch_spec
- Returns
dict built from batch_definition
-
_get_table_name_from_batch_definition
(self, batch_definition: BatchDefinition)¶ - Helper method called by _get_batch_identifiers_list_from_data_asset_config() to parse table_name from data_asset_name in cases
where a schema is included.
In those cases, data_asset_name takes the form [schema].[table_name].
This method splits data_asset_name on “[schema].” and returns the resulting table_name.
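The described split can be sketched in plain Python (the schema and table names are made up; this mirrors the documented behavior rather than reproducing the method's exact implementation):

```python
def table_name_from_asset_name(data_asset_name: str) -> str:
    """Strip a leading '[schema].' component from a data_asset_name,
    mirroring the behavior described for _get_table_name_from_batch_definition."""
    if "." in data_asset_name:
        # Split on the first '.' only, so the table name itself survives intact.
        return data_asset_name.split(".", 1)[1]
    return data_asset_name

table = table_name_from_asset_name("public.my_table")  # hypothetical asset name
```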
-
_split_on_whole_table
(self, table_name: str)¶ ‘Split’ by returning the whole table
Note: the table_name parameter is required to keep the signature of this method consistent with other methods.
-
_split_on_column_value
(self, table_name: str, column_name: str)¶ Split using the values in the named column
-
_split_on_converted_datetime
(self, table_name: str, column_name: str, date_format_string: str = '%Y-%m-%d')¶ Convert the values in the named column to the given date_format, and split on that
-
_split_on_divided_integer
(self, table_name: str, column_name: str, divisor: int)¶ Divide the values in the named column by divisor, and split on that
-
_split_on_mod_integer
(self, table_name: str, column_name: str, mod: int)¶ Take the values in the named column modulo mod, and split on that
-
_split_on_multi_column_values
(self, table_name: str, column_names: List[str])¶ Split on the joint values in the named columns
-
_split_on_hashed_column
(self, table_name: str, column_name: str, hash_digits: int)¶ Note: this method is experimental. It does not work with all SQL dialects.
-
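To illustrate the splitting semantics outside of SQL, here is a plain-Python sketch of mod-based splitting over a column of integer values. The connector performs this grouping in SQL, not in Python; the data is made up:

```python
from collections import defaultdict

def split_on_mod_integer(values, mod):
    """Group values by value % mod, mirroring the idea behind
    _split_on_mod_integer (which expresses the same grouping in SQL)."""
    batches = defaultdict(list)
    for value in values:
        batches[value % mod].append(value)
    return dict(batches)

batches = split_on_mod_integer([0, 1, 2, 3, 4, 5], mod=3)
# → {0: [0, 3], 1: [1, 4], 2: [2, 5]}
```

Each remainder class becomes one batch, so `mod=3` yields at most three batches regardless of the number of rows.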
class
great_expectations.datasource.data_connector.
InferredAssetSqlDataConnector
(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None, data_asset_name_prefix: str = '', data_asset_name_suffix: str = '', include_schema_name: bool = False, splitter_method: Optional[str] = None, splitter_kwargs: Optional[dict] = None, sampling_method: Optional[str] = None, sampling_kwargs: Optional[dict] = None, excluded_tables: Optional[list] = None, included_tables: Optional[list] = None, skip_inapplicable_tables: bool = True, introspection_directives: Optional[dict] = None, batch_spec_passthrough: Optional[dict] = None)¶ -
A DataConnector that infers data_asset names by introspecting a SQL database
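A sketch of a configuration for this connector, exercising the prefix, schema, and table-filtering parameters from the signature above (the prefix and table names are hypothetical):

```python
# Hypothetical configuration for an InferredAssetSqlDataConnector;
# the prefix and excluded table names are illustrative placeholders.
inferred_sql_config = {
    "class_name": "InferredAssetSqlDataConnector",
    "data_asset_name_prefix": "raw__",     # prepended to every inferred asset name
    "include_schema_name": True,           # asset names become [schema].[table]
    "excluded_tables": ["alembic_version"],  # skip migration bookkeeping tables
    "skip_inapplicable_tables": True,
}
```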
-
property
assets
(self)¶
-
_refresh_data_references_cache
(self)¶
-
_refresh_introspected_assets_cache
(self, data_asset_name_prefix: str = None, data_asset_name_suffix: str = None, include_schema_name: bool = False, splitter_method: str = None, splitter_kwargs: dict = None, sampling_method: str = None, sampling_kwargs: dict = None, excluded_tables: List = None, included_tables: List = None, skip_inapplicable_tables: bool = True)¶
-
_introspect_db
(self, schema_name: str = None, ignore_information_schemas_and_system_tables: bool = True, information_schemas: List[str] = ['INFORMATION_SCHEMA', 'information_schema', 'performance_schema', 'sys', 'mysql'], system_tables: List[str] = ['sqlite_master'], include_views=True)¶
-
get_available_data_asset_names_and_types
(self)¶ Return the list of asset names and types known by this DataConnector.
- Returns
A list of tuples consisting of available names and types
-
-
class
great_expectations.datasource.data_connector.
ConfiguredAssetDBFSDataConnector
(name: str, datasource_name: str, base_directory: str, assets: dict, execution_engine: ExecutionEngine, default_regex: Optional[dict] = None, glob_directive: str = '**/*', sorters: Optional[list] = None, batch_spec_passthrough: Optional[dict] = None)¶ Bases:
great_expectations.datasource.data_connector.ConfiguredAssetFilesystemDataConnector
Extension of ConfiguredAssetFilesystemDataConnector used to connect to the DataBricks File System (DBFS). Note: This works for the current implementation of DBFS. If in the future DBFS diverges from a Filesystem-like implementation, we should instead inherit from ConfiguredAssetFilePathDataConnector or another DataConnector.
DataConnectors produce identifying information, called a “batch_spec”, that ExecutionEngines can use to get individual batches of data. They add flexibility in how to obtain data, such as time-based partitioning, splitting and sampling, or other techniques appropriate for obtaining batches of data.
The ConfiguredAssetDBFSDataConnector is one of two classes (InferredAssetDBFSDataConnector being the other one) designed for connecting to data on DBFS.
A ConfiguredAssetDBFSDataConnector requires an explicit specification of each DataAsset you want to connect to. This allows more fine-tuning, but also requires more setup.
-
_get_full_file_path_for_asset
(self, path: str, asset: Optional[Asset] = None)¶
-
-
class
great_expectations.datasource.data_connector.
InferredAssetDBFSDataConnector
(name: str, datasource_name: str, base_directory: str, execution_engine: ExecutionEngine, default_regex: Optional[dict] = None, glob_directive: str = '*', sorters: Optional[list] = None, batch_spec_passthrough: Optional[dict] = None)¶ Bases:
great_expectations.datasource.data_connector.InferredAssetFilesystemDataConnector
Extension of InferredAssetFilesystemDataConnector used to connect to data on a DBFS filesystem. Note: This works for the current implementation of DBFS. If in the future DBFS diverges from a Filesystem-like implementation, we should instead inherit from InferredAssetFilePathDataConnector or another DataConnector.
The InferredAssetDBFSDataConnector is one of two classes (ConfiguredAssetDBFSDataConnector being the other one) designed for connecting to data on a DBFS filesystem. It connects to assets inferred from directory and file name by default_regex and glob_directive.
The InferredAssetDBFSDataConnector operates on file paths and determines the data_asset_name implicitly (e.g., through the combination of the regular expression pattern and group names).
-
_get_full_file_path
(self, path: str, data_asset_name: Optional[str] = None)¶
-