great_expectations.execution_engine.pandas_execution_engine

Module Contents

Classes

PandasExecutionEngine(*args, **kwargs)

PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

Functions

hash_pandas_dataframe(df)

great_expectations.execution_engine.pandas_execution_engine.logger
great_expectations.execution_engine.pandas_execution_engine.boto3
great_expectations.execution_engine.pandas_execution_engine.BlobServiceClient
great_expectations.execution_engine.pandas_execution_engine.storage
great_expectations.execution_engine.pandas_execution_engine.HASH_THRESHOLD = 1000000000.0
class great_expectations.execution_engine.pandas_execution_engine.PandasExecutionEngine(*args, **kwargs)

Bases: great_expectations.execution_engine.ExecutionEngine

PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

For the full API reference, please see Dataset

Notes

  1. Samples and subsets of a PandasDataset have ALL the expectations of the original data frame unless the user sets the discard_subset_failing_expectations = True property on the original data frame.

  2. Concatenations, joins, and merges of PandasDatasets contain NO expectations (since no autoinspection is performed by default).

Feature Maturity

Validation Engine - Pandas - How-to Guide
Use Pandas DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low
Expectation Completeness: Complete
recognized_batch_spec_defaults
_instantiate_azure_client(self)
_instantiate_s3_client(self)
_instantiate_gcs_client(self)

Helper method for instantiating GCS client when GCSBatchSpec is passed in.

The method accounts for 3 ways that a GCS connection can be configured:
  1. setting an environment variable, which is typically GOOGLE_APPLICATION_CREDENTIALS

  2. passing in explicit credentials via gcs_options

  3. running Great Expectations from within a GCP container, in which case a Client can be created without passing an additional environment variable or explicit credentials
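The fallback order described above can be sketched as a small resolver. The function name and the gcs_options keys ("filename", "info") are illustrative assumptions, not the library's exact internals:

```python
import os

def resolve_gcs_credentials(gcs_options: dict):
    """Hypothetical helper sketching the three configuration routes above."""
    if "filename" in gcs_options:
        # 2. explicit credentials: path to a service-account key file
        return ("service_account_file", gcs_options["filename"])
    if "info" in gcs_options:
        # 2. explicit credentials: in-memory service-account info
        return ("service_account_info", gcs_options["info"])
    if os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        # 1. environment variable pointing at a key file
        return ("environment", os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
    # 3. ambient credentials, e.g. when running inside a GCP container
    return ("default", None)
```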

configure_validator(self, validator)

Optionally configure the validator as appropriate for the execution engine.

load_batch_data(self, batch_id: str, batch_data: Any)

Loads the specified batch_data into the execution engine

get_batch_data_and_markers(self, batch_spec: BatchSpec)
_apply_splitting_and_sampling_methods(self, batch_spec, batch_data)
property dataframe(self)

Tests whether a Batch has been loaded. If no batch has been loaded, raises a ValueError.

static guess_reader_method_from_path(path)

Helper method for deciding which reader to use to read in a certain path.

Parameters

path (str) – the path to use to guess the reader method

Returns

ReaderMethod to use for the filepath
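The guessing logic amounts to mapping a file extension to a pandas reader name. This is an illustrative sketch only; the real mapping lives in the library and covers more formats and compressed variants:

```python
def guess_reader_method_from_path(path: str) -> str:
    # Illustrative extension -> pandas reader mapping (assumed, not exhaustive).
    extension_to_reader = {
        ".csv": "read_csv",
        ".tsv": "read_csv",
        ".parquet": "read_parquet",
        ".xlsx": "read_excel",
        ".json": "read_json",
        ".pkl": "read_pickle",
    }
    for extension, reader_method in extension_to_reader.items():
        if path.lower().endswith(extension):
            return reader_method
    raise ValueError(f"Unable to determine reader method from path: {path}")
```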

_get_reader_fn(self, reader_method=None, path=None)

Helper for parsing reader types. If reader_method is not provided, path will be used to guess the correct reader_method.

Parameters
  • reader_method (str) – the name of the reader method to use, if available.

  • path (str) – the path used to guess the reader method if reader_method is not provided

Returns

ReaderMethod to use for the filepath

get_domain_records(self, domain_kwargs: dict)

Uses the given domain kwargs (which include row_condition, condition_parser, and ignore_row_if directives) to obtain and/or query a batch, and returns the result as a pandas DataFrame.

Parameters

domain_kwargs (dict) –

Returns

A DataFrame (the data on which to compute)
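With the "pandas" condition_parser, a row_condition is effectively applied as a DataFrame.query filter. The following is a sketch of that effect, not the engine's exact code path:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]})

# A row_condition under the "pandas" condition_parser behaves like a
# DataFrame.query expression over the batch.
domain_kwargs = {"row_condition": "a > 2", "condition_parser": "pandas"}
filtered = df.query(domain_kwargs["row_condition"])
```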

get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)

Uses the given domain kwargs (which include row_condition, condition_parser, and ignore_row_if directives) to obtain and/or query a batch, and returns the result as a pandas DataFrame. If the domain is a single column, it is added to the accessor domain kwargs for later access.

Parameters
  • domain_kwargs (dict) –

  • domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "column", "column_pair", "table", and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys describing the domain, simply transferred with their associated values into accessor_domain_kwargs.

Returns

  • a DataFrame (the data on which to compute)

  • a dictionary of compute_domain_kwargs, describing the DataFrame

  • a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain

Return type

A tuple including

static _split_on_whole_table(df)
static _split_on_column_value(df, column_name: str, batch_identifiers: dict)
static _split_on_converted_datetime(df, column_name: str, batch_identifiers: dict, date_format_string: str = '%Y-%m-%d')

Convert the values in the named column to the given date_format, and split on that
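Equivalent logic, sketched with plain pandas (the batch_identifiers value shown is an assumed example): format the column with date_format_string and keep rows whose formatted value matches the batch identifier.

```python
import pandas as pd

df = pd.DataFrame(
    {"ts": pd.to_datetime(["2021-01-03", "2021-01-15", "2021-02-01"])}
)

# Keep rows whose formatted date matches the identifier for this batch.
batch_identifiers = {"ts": "2021-01"}
batch = df[df["ts"].dt.strftime("%Y-%m") == batch_identifiers["ts"]]
```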

static _split_on_divided_integer(df, column_name: str, divisor: int, batch_identifiers: dict)

Divide the values in the named column by divisor, and split on that

static _split_on_mod_integer(df, column_name: str, mod: int, batch_identifiers: dict)

Take the mod of the values in the named column, and split on that
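The two integer splits above can be sketched with plain pandas; the divisor, mod, and identifier values here are assumed examples:

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 5, 10, 15, 20]})

# _split_on_divided_integer: rows whose id // divisor equals the identifier.
divided_batch = df[df["id"] // 10 == 1]

# _split_on_mod_integer: rows whose id % mod equals the identifier.
mod_batch = df[df["id"] % 10 == 5]
```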

static _split_on_multi_column_values(df, column_names: List[str], batch_identifiers: dict)

Split on the joint values in the named columns

static _split_on_hashed_column(df, column_name: str, hash_digits: int, batch_identifiers: dict, hash_function_name: str = 'md5')

Split on the hashed value of the named column
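A minimal sketch of hash-based splitting using hashlib; which end of the digest is kept is an implementation detail, and this sketch keeps the last hash_digits characters:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"user": ["alice", "bob", "carol"]})

def bucket(value, hash_digits=2, hash_function_name="md5"):
    # Hash the stringified value and keep the last hash_digits hex chars.
    digest = hashlib.new(hash_function_name, str(value).encode()).hexdigest()
    return digest[-hash_digits:]

df["bucket"] = df["user"].map(bucket)
# A batch is the set of rows whose bucket matches a given identifier.
batch = df[df["bucket"] == bucket("alice")]
```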

static _sample_using_random(df, p: float = 0.1)

Take a random sample of rows, retaining proportion p
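One way to retain roughly proportion p of rows is pandas' built-in sampler; this is a sketch, and the engine's per-row implementation may differ in detail:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Retain roughly proportion p of the rows.
p = 0.1
sample = df.sample(frac=p, random_state=0)
```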

static _sample_using_mod(df, column_name: str, mod: int, value: int)

Take the mod of the named column, and only keep rows that match the given value

static _sample_using_a_list(df, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches
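The list-based sampler is equivalent to a pandas isin filter; the column and value_list below are assumed examples:

```python
import pandas as pd

df = pd.DataFrame({"region": ["us", "eu", "ap", "us"]})

# Keep only rows whose value appears in value_list.
value_list = ["us", "eu"]
kept = df[df["region"].isin(value_list)]
```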

static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')

Hash the values in the named column, and only keep rows that match the given hash_value

great_expectations.execution_engine.pandas_execution_engine.hash_pandas_dataframe(df)