great_expectations.datasource.data_connector.data_connector
Module Contents
Classes
DataConnector: DataConnectors produce identifying information, called “batch_spec”, that ExecutionEngines can use to get individual batches of data.
-
great_expectations.datasource.data_connector.data_connector.logger
-
class great_expectations.datasource.data_connector.data_connector.DataConnector(name: str, datasource_name: str, execution_engine: ExecutionEngine, batch_spec_passthrough: Optional[dict] = None)
DataConnectors produce identifying information, called “batch_spec”, that ExecutionEngines can use to get individual batches of data. They add flexibility in how data is obtained, for example through time-based partitioning, downsampling, or other techniques appropriate for the Datasource.
For example, a DataConnector could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
A batch is a sample from a data asset, sliced according to a particular rule: for example, an hourly slice of the Events table or “most recent users records.”
A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_spec” assembled by the data connector. While not every Datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.
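To make the Events example concrete, here is a minimal sketch of the kind of identifying information a connector might assemble for a daily batch. The function name `daily_batch_spec`, the `created_at` column, and the dictionary keys are hypothetical and are not part of the Great Expectations API.

```python
from datetime import date, timedelta


def daily_batch_spec(table: str, timestamp_column: str, day: date) -> dict:
    """Describe one day's rows of a table as a plain dictionary (illustrative only)."""
    next_day = day + timedelta(days=1)
    # A half-open range [day, next_day) selects every timestamp that falls on `day`.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {timestamp_column} >= '{day.isoformat()}' "
        f"AND {timestamp_column} < '{next_day.isoformat()}'"
    )
    return {"table": table, "day": day.isoformat(), "query": query}


spec = daily_batch_spec("events", "created_at", date(2012, 2, 7))
print(spec["query"])
```

The query is built with string formatting only to keep the sketch short; production code would normally pass the dates as bound parameters.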
-
property batch_spec_passthrough(self)
-
property name(self)
-
property datasource_name(self)
-
property execution_engine(self)
-
property data_context_root_directory(self)
-
get_batch_data_and_metadata(self, batch_definition: BatchDefinition)
Uses batch_definition to retrieve batch_data and batch_markers: builds a batch_spec from batch_definition, then uses execution_engine to return batch_data and batch_markers.
- Parameters
batch_definition (BatchDefinition) – required batch_definition parameter for retrieval
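The two steps described for this method can be sketched with stand-in classes. Everything here is an assumption made for illustration: `StubExecutionEngine`, `StubDataConnector`, the dictionary-based batch definition, the engine method name, and the three-part return value are not taken from the library.

```python
from typing import Any, Dict, List, Tuple


class StubExecutionEngine:
    """Hypothetical engine that returns canned rows and markers for a batch_spec."""

    def get_batch_data_and_markers(
        self, batch_spec: Dict[str, Any]
    ) -> Tuple[List[dict], Dict[str, Any]]:
        rows = [{"path": batch_spec["path"], "value": 1}]
        markers = {"loaded_from": batch_spec["path"]}
        return rows, markers


class StubDataConnector:
    """Hypothetical connector mirroring the described flow."""

    def __init__(self, execution_engine: StubExecutionEngine) -> None:
        self._execution_engine = execution_engine

    def build_batch_spec(self, batch_definition: Dict[str, Any]) -> Dict[str, Any]:
        # Step 1: translate the definition into something the engine can load.
        return {"path": f"{batch_definition['data_asset_name']}.csv"}

    def get_batch_data_and_metadata(self, batch_definition: Dict[str, Any]):
        batch_spec = self.build_batch_spec(batch_definition)
        # Step 2: let the engine turn the spec into data plus markers.
        batch_data, batch_markers = self._execution_engine.get_batch_data_and_markers(
            batch_spec
        )
        return batch_data, batch_spec, batch_markers


connector = StubDataConnector(StubExecutionEngine())
data, spec, markers = connector.get_batch_data_and_metadata(
    {"data_asset_name": "events"}
)
```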
-
build_batch_spec(self, batch_definition: BatchDefinition)
Builds batch_spec from batch_definition by generating batch_spec params and adding any pass_through params.
- Parameters
batch_definition (BatchDefinition) – required batch_definition parameter for retrieval
- Returns
BatchSpec object built from BatchDefinition
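As a rough illustration of "generating params and adding any pass_through params", the sketch below merges two plain dictionaries. The helper name `merge_batch_spec_params` is hypothetical, a plain dict stands in for the BatchSpec object, and the rule that pass-through values win on a key collision is an assumption.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def merge_batch_spec_params(
    generated: Dict[str, Any], passthrough: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Combine generated parameters with optional pass-through parameters."""
    # Copy first so neither input dictionary is mutated.
    merged = deepcopy(generated)
    # Assumption: pass-through values take precedence on duplicate keys.
    merged.update(deepcopy(passthrough or {}))
    return merged


params = merge_batch_spec_params(
    {"path": "events/2012-02-07.csv", "reader_method": "read_csv"},
    {"reader_options": {"sep": "|"}},
)
```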
-
abstract _refresh_data_references_cache(self)
-
abstract _get_data_reference_list(self, data_asset_name: Optional[str] = None)
List objects in the underlying data store to create a list of data_references. Classes that extend this base DataConnector class use this method to refresh the cache.
- Parameters
data_asset_name (str) – optional data_asset_name to retrieve more specific results
-
abstract _get_data_reference_list_from_cache_by_data_asset_name(self, data_asset_name: str)
Fetch data_references corresponding to data_asset_name from the cache.
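How the three cache-related methods might fit together is sketched below with an in-memory dictionary standing in for the data store. The class `InMemoryReferenceCache` and its storage layout are hypothetical; a real subclass would list files, tables, or other objects instead.

```python
from typing import Dict, List, Optional


class InMemoryReferenceCache:
    """Hypothetical stand-in showing one way the cache methods could cooperate."""

    def __init__(self, store: Dict[str, List[str]]) -> None:
        self._store = store  # data asset name -> data references
        self._cache: Dict[str, List[str]] = {}

    def _get_data_reference_list(
        self, data_asset_name: Optional[str] = None
    ) -> List[str]:
        # With a name, list only that asset; without one, list everything.
        if data_asset_name is not None:
            return sorted(self._store.get(data_asset_name, []))
        return sorted(ref for refs in self._store.values() for ref in refs)

    def _refresh_data_references_cache(self) -> None:
        # Rebuild the cache from the current contents of the store.
        self._cache = {
            name: self._get_data_reference_list(name) for name in self._store
        }

    def _get_data_reference_list_from_cache_by_data_asset_name(
        self, data_asset_name: str
    ) -> List[str]:
        return self._cache.get(data_asset_name, [])


cache = InMemoryReferenceCache({"events": ["events/b.csv", "events/a.csv"]})
cache._refresh_data_references_cache()
```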
-
abstract get_data_reference_list_count(self)
-
abstract get_unmatched_data_references(self)
-
abstract get_available_data_asset_names(self)
Return the list of asset names known by this DataConnector.
- Returns
A list of available names
-
abstract get_available_data_asset_names_and_types(self)
Return the list of asset names and types known by this DataConnector.
- Returns
A list of tuples consisting of available names and types
-
abstract get_batch_definition_list_from_batch_request(self, batch_request: BatchRequestBase)
-
abstract _map_data_reference_to_batch_definition_list(self, data_reference: Any, data_asset_name: Optional[str] = None)
-
abstract _map_batch_definition_to_data_reference(self, batch_definition: BatchDefinition)
-
abstract _generate_batch_spec_parameters_from_batch_definition(self, batch_definition: BatchDefinition)
-
self_check(self, pretty_print=True, max_examples=3)
Checks the configuration of the current DataConnector by doing the following:
refresh or create the data_reference_cache
print batch_definition_count and example_data_references for each data_asset_name
print unmatched data_references, so that the user can modify the regex or glob configuration if necessary
select a random data_reference and attempt to retrieve and print the first few rows to the user
When used as part of the test_yaml_config() workflow, self_check lets the user see whether the data_connector is properly configured and whether the associated execution_engine can retrieve data using the configuration.
- Parameters
pretty_print (bool) – should the output be printed?
max_examples (int) – how many data_references should be printed?
- Returns
dictionary containing self_check output
- Return type
report_obj (dict)
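The shape of such a report can be sketched as follows. The helper `build_report` is hypothetical, the key names are illustrative rather than guaranteed to match the library's output, and the sketch assumes one batch definition per data reference.

```python
from typing import Dict, List


def build_report(
    references_by_asset: Dict[str, List[str]],
    unmatched: List[str],
    max_examples: int = 3,
) -> dict:
    """Summarize data references per asset, truncating examples to max_examples."""
    asset_names = sorted(references_by_asset)
    report = {
        "data_asset_count": len(asset_names),
        "example_data_asset_names": asset_names[:max_examples],
        "data_assets": {},
        "unmatched_data_reference_count": len(unmatched),
        "example_unmatched_data_references": unmatched[:max_examples],
    }
    for name in asset_names[:max_examples]:
        refs = references_by_asset[name]
        report["data_assets"][name] = {
            # Assumption: each data reference maps to exactly one batch definition.
            "batch_definition_count": len(refs),
            "example_data_references": refs[:max_examples],
        }
    return report


report = build_report(
    {"events": ["a.csv", "b.csv", "c.csv", "d.csv"], "users": ["u.csv"]},
    unmatched=["README.md"],
    max_examples=2,
)
```

A non-zero unmatched count in a report like this is the signal that the regex or glob configuration may need adjusting.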
-
_self_check_fetch_batch(self, pretty_print: bool, example_data_reference: Any, data_asset_name: str)
Helper function for self_check() that retrieves a batch using example_data_reference and data_asset_name while printing helpful messages. The first 5 rows of batch_data are printed by default.
- Parameters
pretty_print (bool) – print to console?
example_data_reference (Any) – data_reference to retrieve
data_asset_name (str) – data_asset_name to retrieve
-
_validate_batch_request(self, batch_request: BatchRequestBase)
Validate batch_request by checking:
whether the configured datasource_name matches batch_request’s datasource_name
whether the current data_connector_name matches batch_request’s data_connector_name
- Parameters
batch_request (BatchRequestBase) – batch_request object to validate
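The two checks can be sketched as a standalone function. This is an illustration only: the function `validate_batch_request` is hypothetical, a plain dictionary stands in for the BatchRequestBase object, and ValueError stands in for whatever exception the library raises.

```python
def validate_batch_request(
    batch_request: dict, datasource_name: str, data_connector_name: str
) -> None:
    """Raise ValueError if the request targets a different datasource or connector."""
    requested_datasource = batch_request.get("datasource_name")
    if requested_datasource != datasource_name:
        raise ValueError(
            f"datasource_name '{requested_datasource}' does not match "
            f"configured datasource '{datasource_name}'"
        )
    requested_connector = batch_request.get("data_connector_name")
    if requested_connector != data_connector_name:
        raise ValueError(
            f"data_connector_name '{requested_connector}' does not match "
            f"configured data connector '{data_connector_name}'"
        )


# A matching request passes silently.
validate_batch_request(
    {"datasource_name": "my_datasource", "data_connector_name": "my_connector"},
    "my_datasource",
    "my_connector",
)
```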
-
property