great_expectations.execution_engine¶
Submodules¶
great_expectations.execution_engine.execution_engine
great_expectations.execution_engine.pandas_batch_data
great_expectations.execution_engine.pandas_execution_engine
great_expectations.execution_engine.sparkdf_batch_data
great_expectations.execution_engine.sparkdf_execution_engine
great_expectations.execution_engine.sqlalchemy_batch_data
great_expectations.execution_engine.sqlalchemy_execution_engine
great_expectations.execution_engine.util
Package Contents¶
Classes¶
ExecutionEngine: Helper class that provides a standard way to create an ABC using inheritance.
PandasExecutionEngine: PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.
SparkDFExecutionEngine: This class holds an attribute spark_df which is a spark.sql.DataFrame.
SqlAlchemyExecutionEngine: Helper class that provides a standard way to create an ABC using inheritance.
-
class great_expectations.execution_engine.ExecutionEngine(name=None, caching=True, batch_spec_defaults=None, batch_data_dict=None, validator=None)¶
Bases: abc.ABC
Helper class that provides a standard way to create an ABC using inheritance.
-
recognized_batch_spec_defaults
¶
-
configure_validator
(self, validator)¶ Optionally configure the validator as appropriate for the execution engine.
-
property
active_batch_data_id
(self)¶ The batch id for the default batch data.
When an execution engine is asked to process a compute domain that does not include a specific batch_id, then the data associated with the active_batch_data_id will be used as the default.
-
property
active_batch_data
(self)¶ The data from the currently-active batch.
-
property
loaded_batch_data_dict
(self)¶ The current dictionary of batches.
-
property
loaded_batch_data_ids
(self)¶
-
property
config
(self)¶
-
property
dialect
(self)¶
-
get_batch_data
(self, batch_spec: BatchSpec)¶ Interprets batch_data and returns the appropriate data.
This method is primarily useful for utility cases (e.g. testing) where data is being fetched without a DataConnector and metadata such as batch_markers is unwanted.
Note: this method is currently a thin wrapper for get_batch_data_and_markers. It simply suppresses the batch_markers.
-
abstract
get_batch_data_and_markers
(self, batch_spec)¶
-
load_batch_data
(self, batch_id: str, batch_data: Any)¶ Loads the specified batch_data into the execution engine
-
_load_batch_data_from_dict
(self, batch_data_dict)¶ Loads all data in batch_data_dict into the execution engine via load_batch_data
-
resolve_metrics
(self, metrics_to_resolve: Iterable[MetricConfiguration], metrics: Optional[Dict[Tuple[str, str, str], MetricConfiguration]] = None, runtime_configuration: Optional[dict] = None)¶ resolve_metrics is the main entrypoint for an execution engine. The execution engine will compute the value of the provided metrics.
- Parameters
metrics_to_resolve – the metrics to evaluate
metrics – already-computed metrics currently available to the engine
runtime_configuration – runtime configuration information
- Returns
a dictionary with the values for the metrics that have just been resolved.
- Return type
resolved_metrics (Dict)
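To make the entrypoint concrete, here is a minimal sketch using the PandasExecutionEngine documented later on this page. The MetricConfiguration import path, the exact metric name, and whether that metric needs dependency metrics all vary across Great Expectations versions, so treat those details as assumptions to verify against your install.

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine
    # NOTE: in some releases MetricConfiguration lives in
    # great_expectations.validator.metric_configuration instead.
    from great_expectations.validator.validation_graph import MetricConfiguration

    engine = PandasExecutionEngine()
    # Register the data under a batch id so it becomes the active batch.
    engine.load_batch_data(batch_id="my_batch", batch_data=pd.DataFrame({"a": [1, 2, 3, None]}))

    # Describe the metric to compute over the active batch.
    row_count = MetricConfiguration(
        metric_name="table.row_count",  # assumed registered metric name
        metric_domain_kwargs={"batch_id": "my_batch"},
        metric_value_kwargs=None,
    )

    # resolve_metrics returns a dict keyed by each metric's id tuple.
    resolved = engine.resolve_metrics(metrics_to_resolve=(row_count,))
    print(resolved[row_count.id])  # 4, assuming the metric needs no dependency metrics here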
-
abstract
resolve_metric_bundle
(self, metric_fn_bundle)¶ Resolve a bundle of metrics with the same compute domain as part of a single trip to the compute engine.
-
abstract
get_domain_records
(self, domain_kwargs: dict)¶ get_domain_records computes the full-access data (dataframe or selectable) for computing metrics based on the given domain_kwargs and specific engine semantics.
- Returns
data corresponding to the compute domain
-
abstract
get_compute_domain
(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes])¶ get_compute_domain computes the optimal domain_kwargs for computing metrics based on the given domain_kwargs and specific engine semantics.
- Returns
1. data corresponding to the compute domain;
2. a modified copy of domain_kwargs describing the domain of the data returned in (1);
3. a dictionary describing the access instructions for data elements included in the compute domain (e.g. a specific column name).
In general, the union of the compute_domain_kwargs and accessor_domain_kwargs will be the same as the domain_kwargs provided to this method.
- Return type
A tuple consisting of three elements
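The three-element contract is easiest to see by unpacking the return value on a concrete engine; a minimal sketch using the PandasExecutionEngine documented below (the column names and printed values are illustrative):

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()
    engine.load_batch_data(
        batch_id="demo", batch_data=pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    )

    # Request a single-column domain: the engine hands back the data to compute on,
    # the kwargs describing the compute domain, and the kwargs describing how to
    # access the requested elements within it (here, the column name).
    data, compute_domain_kwargs, accessor_domain_kwargs = engine.get_compute_domain(
        domain_kwargs={"column": "a"},
        domain_type="column",
    )
    print(accessor_domain_kwargs)  # e.g. {'column': 'a'}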
-
add_column_row_condition
(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)¶ EXPERIMENTAL
Add a row condition for filtering null (and optionally NaN) values.
- Parameters
domain_kwargs – the domain kwargs to use as the base and to which to add the condition
column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs
filter_null – if true, add a filter for null values
filter_nan – if true, add a filter for nan values
-
resolve_data_reference
(self, data_connector_name: str, template_arguments: dict)¶ Resolve file path for a (data_connector_name, execution_engine_name) combination.
-
class great_expectations.execution_engine.PandasExecutionEngine(*args, **kwargs)¶
Bases: great_expectations.execution_engine.ExecutionEngine
PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.
For the full API reference, please see Dataset.
Notes
Samples and subsets of PandasDataset have ALL the expectations of the original data frame unless the user specifies the discard_subset_failing_expectations = True property on the original data frame.
Concatenations, joins, and merges of PandasDatasets contain NO expectations (since no autoinspection is performed by default).
Validation Engine - Pandas - How-to Guide
Use Pandas DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low
Expectation Completeness: Complete
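As a quick orientation before the member listing, here is a hedged sketch of handing an in-memory DataFrame to this engine through a batch spec; RuntimeDataBatchSpec is assumed to be the in-memory batch spec type in your version, and the "ge_load_time" marker name is likewise an assumption.

    import pandas as pd

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()

    # Wrap an in-memory DataFrame in a batch spec; the engine returns the wrapped
    # batch data plus batch markers (a fingerprint such as the load timestamp).
    batch_data, batch_markers = engine.get_batch_data_and_markers(
        batch_spec=RuntimeDataBatchSpec(batch_data=pd.DataFrame({"x": [1, 2, 3]}))
    )
    print(type(batch_data).__name__)          # PandasBatchData
    print(batch_markers.get("ge_load_time"))  # load-time marker, key name assumed
-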
recognized_batch_spec_defaults
¶
-
_instantiate_azure_client
(self)¶
-
_instantiate_s3_client
(self)¶
-
_instantiate_gcs_client
(self)¶ Helper method for instantiating GCS client when GCSBatchSpec is passed in.
- The method accounts for 3 ways that a GCS connection can be configured:
setting an environment variable, which is typically GOOGLE_APPLICATION_CREDENTIALS
passing in explicit credentials via gcs_options
running Great Expectations from within a GCP container, in which case a Client can be created without passing in an additional environment variable or explicit credentials
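A hedged sketch of the second route, passing explicit credentials through gcs_options when constructing the engine; the accepted keys (e.g. a service-account key file path) depend on the google-cloud-storage client and your Great Expectations version, so the "filename" key below is an assumption.

    from great_expectations.execution_engine import PandasExecutionEngine

    # Route 2: explicit credentials via gcs_options (key names assumed; verify
    # against your version). Routes 1 and 3 need no gcs_options at all, since the
    # GOOGLE_APPLICATION_CREDENTIALS variable or the GCP runtime supplies defaults.
    engine = PandasExecutionEngine(
        gcs_options={"filename": "/path/to/service_account_key.json"}
    )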
-
configure_validator
(self, validator)¶ Optionally configure the validator as appropriate for the execution engine.
-
load_batch_data
(self, batch_id: str, batch_data: Any)¶ Loads the specified batch_data into the execution engine
-
get_batch_data_and_markers
(self, batch_spec: BatchSpec)¶
-
_apply_splitting_and_sampling_methods
(self, batch_spec, batch_data)¶
-
property
dataframe
(self)¶ Tests whether or not a Batch has been loaded. If no batch has been loaded, raises a ValueError.
-
static
guess_reader_method_from_path
(path)¶ Helper method for deciding which reader to use to read in a certain path.
- Parameters
path (str) – the path to use to guess the reader method
- Returns
ReaderMethod to use for the filepath
-
_get_reader_fn
(self, reader_method=None, path=None)¶ Static helper for parsing reader types. If reader_method is not provided, path will be used to guess the correct reader_method.
- Parameters
reader_method (str) – the name of the reader method to use, if available.
path (str) – the path used to guess
- Returns
ReaderMethod to use for the filepath
-
get_domain_records
(self, domain_kwargs: dict)¶ Uses the given domain kwargs (which include row_condition, condition_parser, and ignore_row_if directives) to obtain and/or query a batch. Returns in the format of a Pandas DataFrame.
- Parameters
domain_kwargs (dict) –
- Returns
A DataFrame (the data on which to compute)
-
get_compute_domain
(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)¶ Uses the given domain kwargs (which include row_condition, condition_parser, and ignore_row_if directives) to obtain and/or query a batch. Returns in the format of a Pandas DataFrame. If the domain is a single column, this is added to ‘accessor domain kwargs’ and used for later access
- Parameters
domain_kwargs (dict) –
domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "column", "column_pair", "table", and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain and simply transferred with their associated values into accessor_domain_kwargs.
- Returns
a DataFrame (the data on which to compute)
a dictionary of compute_domain_kwargs, describing the DataFrame
a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain
- Return type
A tuple including
-
static
_split_on_whole_table
(df)¶
-
static
_split_on_column_value
(df, column_name: str, batch_identifiers: dict)¶
-
static
_split_on_converted_datetime
(df, column_name: str, batch_identifiers: dict, date_format_string: str = '%Y-%m-%d')¶ Convert the values in the named column to the given date_format, and split on that
-
static
_split_on_divided_integer
(df, column_name: str, divisor: int, batch_identifiers: dict)¶ Divide the values in the named column by divisor, and split on that
-
static
_split_on_mod_integer
(df, column_name: str, mod: int, batch_identifiers: dict)¶ Take the mod of the values in the named column, and split on that
-
static
_split_on_multi_column_values
(df, column_names: List[str], batch_identifiers: dict)¶ Split on the joint values in the named columns
-
static
_split_on_hashed_column
(df, column_name: str, hash_digits: int, batch_identifiers: dict, hash_function_name: str = 'md5')¶ Split on the hashed value of the named column
-
static
_sample_using_random
(df, p: float = 0.1)¶ Take a random sample of rows, retaining proportion p
-
static
_sample_using_mod
(df, column_name: str, mod: int, value: int)¶ Take the mod of named column, and only keep rows that match the given value
-
static
_sample_using_a_list
(df, column_name: str, value_list: list)¶ Match the values in the named column against value_list, and only keep the matches
-
static
_sample_using_hash
(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')¶ Hash the values in the named column, and only keep rows that match the given hash_value
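These splitter and sampler helpers are private static methods; in normal use they are selected by name through the batch spec (via keys such as splitter_method / sampling_method consumed by _apply_splitting_and_sampling_methods above, which is an assumption to verify for your version) rather than called directly. Calling two of them directly, purely to illustrate their behavior:

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine

    df = pd.DataFrame({"group": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})

    # Keep only the rows whose "group" value matches the batch identifier.
    split_df = PandasExecutionEngine._split_on_column_value(
        df, column_name="group", batch_identifiers={"group": "a"}
    )

    # Keep roughly half of the rows at random (the proportion p defaults to 0.1).
    sampled_df = PandasExecutionEngine._sample_using_random(df, p=0.5)

    print(len(split_df), len(sampled_df))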
-
class great_expectations.execution_engine.SparkDFExecutionEngine(*args, persist=True, spark_config=None, force_reuse_spark_context=False, **kwargs)¶
Bases: great_expectations.execution_engine.ExecutionEngine
This class holds an attribute spark_df which is a spark.sql.DataFrame.
Validation Engine - pyspark - Self-Managed - How-to Guide
Use Spark DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Moderate
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Databricks - How-to Guide
Use Spark DataFrame in a Databricks cluster to validate data
Maturity: Beta
Details:
API Stability: Stable
Implementation Completeness: Low (dbfs-specific handling)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Minimal (we’ve tested a bit, know others have used it)
Documentation Completeness: Moderate (need docs on managing project configuration via dbfs/etc.)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - EMR - Spark - How-to Guide
Use Spark DataFrame in an EMR cluster to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Low (need to provide guidance on “known good” paths, and we know there are many “knobs” to tune that we have not explored/tested)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Spark - Other - How-to Guide
Use Spark DataFrame to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Other (we haven’t tested possibility, known glue deployment)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
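A hedged sketch of constructing the engine with the arguments shown above and registering an in-memory Spark DataFrame as the active batch (the spark_config key is an ordinary Spark setting, used here only for illustration):

    from pyspark.sql import SparkSession

    from great_expectations.execution_engine import SparkDFExecutionEngine

    # persist=True (the default) is intended to persist/cache loaded batch data;
    # force_reuse_spark_context=True reuses an already-running SparkSession.
    engine = SparkDFExecutionEngine(
        spark_config={"spark.app.name": "ge_validation"},
        force_reuse_spark_context=True,
    )

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Register the DataFrame under a batch id so it becomes the active batch.
    engine.load_batch_data(batch_id="spark_batch", batch_data=df)
    print(engine.dataframe.count())  # 2
-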
recognized_batch_definition_keys
¶
-
recognized_batch_spec_defaults
¶
-
property
dataframe
(self)¶ If a batch has been loaded, returns a Spark Dataframe containing the data within the loaded batch
-
load_batch_data
(self, batch_id: str, batch_data: Any)¶ Loads the specified batch_data into the execution engine
-
get_batch_data_and_markers
(self, batch_spec: BatchSpec)¶
-
_apply_splitting_and_sampling_methods
(self, batch_spec, batch_data)¶
-
static
guess_reader_method_from_path
(path)¶ Based on a given filepath, decides a reader method. Currently supports tsv, csv, and parquet. If none of these file extensions are used, raises an ExecutionEngineError stating that it is unable to determine the reader method from the given path.
- Parameters
path – a given file path
- Returns
A dictionary entry of format {'reader_method': reader_method}
-
_get_reader_fn
(self, reader, reader_method=None, path=None)¶ Static helper for providing reader_fn
- Parameters
reader – the base spark reader to use; this should have had reader_options applied already
reader_method – the name of the reader_method to use, if specified
path (str) – the path to use to guess reader_method if it was not specified
- Returns
ReaderMethod to use for the filepath
-
get_domain_records
(self, domain_kwargs: dict)¶ Uses the given domain kwargs (which include row_condition, condition_parser, and ignore_row_if directives) to obtain and/or query a batch. Returns in the format of a Spark DataFrame.
- Parameters
domain_kwargs (dict) –
- Returns
A DataFrame (the data on which to compute)
-
get_compute_domain
(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)¶ Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns in the format of a Spark DataFrame.
- Parameters
domain_kwargs (dict) –
domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain and simply transferred with their associated values into accessor_domain_kwargs.
- Returns
a DataFrame (the data on which to compute)
a dictionary of compute_domain_kwargs, describing the DataFrame
a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain
- Return type
A tuple including
-
add_column_row_condition
(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)¶ EXPERIMENTAL
Add a row condition for filtering null (and optionally NaN) values.
- Parameters
domain_kwargs – the domain kwargs to use as the base and to which to add the condition
column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs
filter_null – if true, add a filter for null values
filter_nan – if true, add a filter for nan values
-
resolve_metric_bundle
(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Callable, dict]])¶ For each metric name in the given metric_fn_bundle, finds the domain of the metric and calculates it using a metric function from the given provider class.
- Args:
metric_fn_bundle - A batch containing MetricEdgeKeys and their corresponding functions
- Returns:
A dictionary of the collected metrics over their respective domains
-
head
(self, n=5)¶ Returns dataframe head. Default is 5
-
static
_split_on_whole_table
(df)¶
-
static
_split_on_column_value
(df, column_name: str, batch_identifiers: dict)¶
-
static
_split_on_converted_datetime
(df, column_name: str, batch_identifiers: dict, date_format_string: str = 'yyyy-MM-dd')¶
-
static
_split_on_divided_integer
(df, column_name: str, divisor: int, batch_identifiers: dict)¶ Divide the values in the named column by divisor, and split on that
-
static
_split_on_mod_integer
(df, column_name: str, mod: int, batch_identifiers: dict)¶ Take the mod of the values in the named column, and split on that
-
static
_split_on_multi_column_values
(df, column_names: list, batch_identifiers: dict)¶ Split on the joint values in the named columns
-
static
_split_on_hashed_column
(df, column_name: str, hash_digits: int, batch_identifiers: dict, hash_function_name: str = 'sha256')¶ Split on the hashed value of the named column
-
static
_sample_using_random
(df, p: float = 0.1, seed: int = 1)¶ Take a random sample of rows, retaining proportion p
-
static
_sample_using_mod
(df, column_name: str, mod: int, value: int)¶ Take the mod of named column, and only keep rows that match the given value
-
static
_sample_using_a_list
(df, column_name: str, value_list: list)¶ Match the values in the named column against value_list, and only keep the matches
-
static
_sample_using_hash
(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')¶
-
class great_expectations.execution_engine.SqlAlchemyExecutionEngine(name: Optional[str] = None, credentials: Optional[dict] = None, data_context: Optional[Any] = None, engine=None, connection_string: Optional[str] = None, url: Optional[str] = None, batch_data_dict: Optional[dict] = None, create_temp_table: bool = True, concurrency: Optional[ConcurrencyConfig] = None, **kwargs)¶
Bases: great_expectations.execution_engine.ExecutionEngine
Helper class that provides a standard way to create an ABC using inheritance.
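A hedged sketch of constructing the engine directly from a SQLAlchemy connection string; an in-memory SQLite database is used purely for illustration, and credentials= or url= based construction follows the same pattern via the arguments shown above.

    from great_expectations.execution_engine import SqlAlchemyExecutionEngine

    # Build the engine from a connection string; create_temp_table is left at its
    # default, which controls whether intermediate results may be staged in
    # temporary tables.
    engine = SqlAlchemyExecutionEngine(connection_string="sqlite:///:memory:")

    print(engine.connection_string)  # the string used to build the underlying engine
    print(engine.dialect)            # the SQLAlchemy dialect in use (sqlite here)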
-
property
credentials
(self)¶
-
property
connection_string
(self)¶
-
property
url
(self)¶
-
_build_engine
(self, credentials: dict, **kwargs)¶ Using a set of given credentials, constructs an Execution Engine, connecting to a database using a URL or a private key path.
-
_get_sqlalchemy_key_pair_auth_url
(self, drivername: str, credentials: dict)¶ Utilizing a private key path and a passphrase in a given credentials dictionary, attempts to encode the provided values into a private key. If passphrase is incorrect, this will fail and an exception is raised.
- Parameters
drivername (str) –
credentials (dict) –
- Returns
a tuple consisting of a url with the serialized key-pair authentication, and a dictionary of engine kwargs.
-
get_domain_records
(self, domain_kwargs: Dict)¶ Uses the given domain kwargs (which include row_condition, condition_parser, and ignore_row_if directives) to obtain and/or query a batch. Returns in the format of an SqlAlchemy table/column(s) object.
- Parameters
domain_kwargs (dict) –
- Returns
An SqlAlchemy table/column(s) (the selectable object for obtaining data on which to compute)
-
get_compute_domain
(self, domain_kwargs: Dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)¶ Uses a given batch dictionary and domain kwargs to obtain a SqlAlchemy column object.
- Parameters
domain_kwargs (dict) –
domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain and simply transferred with their associated values into accessor_domain_kwargs.
- Returns
SqlAlchemy column
-
resolve_metric_bundle
(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Any, dict, dict]])¶ For every metric in a set of Metrics to resolve, obtains necessary metric keyword arguments and builds bundles of the metrics into one large query dictionary so that they are all executed simultaneously. Will fail if bundling the metrics together is not possible.
- Args:
metric_fn_bundle (Iterable[Tuple[MetricConfiguration, Callable, dict]]) – a bundle containing, for each metric, a MetricProvider’s MetricConfiguration (its unique identifier), its metric provider function (the function that actually executes the metric), and the arguments to pass to the metric provider function (i.e. a dictionary of metrics defined in the registry and corresponding arguments).
- Returns:
A dictionary of metric names and their corresponding now-queried values.
-
close
(self)¶ Note (Will, 2021-07-29):
This is a helper function that will close and dispose Sqlalchemy objects that are used to connect to a database. Databases like Snowflake require the connection and engine to be instantiated and closed separately, and not doing so has caused problems with hanging connections.
Currently the ExecutionEngine does not support handling the connection and engine separately, and will actually override the engine with a connection in some cases, obscuring which object is actually used by the ExecutionEngine to connect to the external database. This will be handled in an upcoming refactor, which will allow this function to eventually become:
self.connection.close()
self.engine.dispose()
More background can be found here: https://github.com/great-expectations/great_expectations/pull/3104/
-
_split_on_whole_table
(self, table_name: str, batch_identifiers: dict)¶ ‘Split’ by returning the whole table
-
_split_on_column_value
(self, table_name: str, column_name: str, batch_identifiers: dict)¶ Split using the values in the named column
-
_split_on_converted_datetime
(self, table_name: str, column_name: str, batch_identifiers: dict, date_format_string: str = '%Y-%m-%d')¶ Convert the values in the named column to the given date_format, and split on that
-
_split_on_divided_integer
(self, table_name: str, column_name: str, divisor: int, batch_identifiers: dict)¶ Divide the values in the named column by divisor, and split on that
-
_split_on_mod_integer
(self, table_name: str, column_name: str, mod: int, batch_identifiers: dict)¶ Take the mod of the values in the named column, and split on that
-
_split_on_multi_column_values
(self, table_name: str, column_names: List[str], batch_identifiers: dict)¶ Split on the joint values in the named columns
-
_split_on_hashed_column
(self, table_name: str, column_name: str, hash_digits: int, batch_identifiers: dict)¶ Split on the hashed value of the named column
-
_sample_using_mod
(self, column_name: str, mod: int, value: int)¶ Take the mod of named column, and only keep rows that match the given value
-
_sample_using_a_list
(self, column_name: str, value_list: list)¶ Match the values in the named column against value_list, and only keep the matches
-
_sample_using_md5
(self, column_name: str, hash_digits: int = 1, hash_value: str = 'f')¶ Hash the values in the named column, and only keep rows that match the given hash_value
-
_build_selectable_from_batch_spec
(self, batch_spec: BatchSpec)¶
-
get_batch_data_and_markers
(self, batch_spec: BatchSpec)¶
-
property