great_expectations.execution_engine.pandas_execution_engine

Module Contents

Classes

PandasExecutionEngine(*args, **kwargs)

PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

Functions

hash_pandas_dataframe(df)

great_expectations.execution_engine.pandas_execution_engine.boto3
great_expectations.execution_engine.pandas_execution_engine.logger
great_expectations.execution_engine.pandas_execution_engine.HASH_THRESHOLD = 1000000000.0
class great_expectations.execution_engine.pandas_execution_engine.PandasExecutionEngine(*args, **kwargs)

Bases: great_expectations.execution_engine.ExecutionEngine

PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

For the full API reference, please see Dataset

Notes

  1. Samples and subsets of a PandasDataset have ALL the expectations of the original data frame unless the user sets the discard_subset_failing_expectations = True property on the original data frame.

  2. Concatenations, joins, and merges of PandasDatasets contain NO expectations (since no autoinspection is performed by default).

Feature Maturity

Validation Engine - Pandas - How-to Guide
Use Pandas DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low
Expectation Completeness: Complete
recognized_batch_spec_defaults
configure_validator(self, validator)

Optionally configure the validator as appropriate for the execution engine.

load_batch_data(self, batch_id: str, batch_data: Any)

Loads the specified batch_data into the execution engine

get_batch_data_and_markers(self, batch_spec: BatchSpec)
_apply_splitting_and_sampling_methods(self, batch_spec, batch_data)
property dataframe(self)

Checks whether a Batch has been loaded. If no batch has been loaded, raises a ValueError.
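The guard on this property can be pictured with a minimal sketch. TinyEngine and its attributes are hypothetical stand-ins used only for illustration; they are not the library's implementation.

```python
class TinyEngine:
    """Hypothetical stand-in for an execution engine holding one batch."""

    def __init__(self):
        self._batch_data = None  # nothing has been loaded yet

    def load_batch_data(self, batch_data):
        self._batch_data = batch_data

    @property
    def dataframe(self):
        # Refuse to hand out data until a batch has been loaded.
        if self._batch_data is None:
            raise ValueError("Batch has not been loaded; call load_batch_data() first.")
        return self._batch_data
```

Accessing the property before loading raises ValueError; after load_batch_data() it returns the loaded data.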

_get_reader_fn(self, reader_method=None, path=None)

Static helper for parsing reader types. If reader_method is not provided, path will be used to guess the correct reader_method.

Parameters
  • reader_method (str) – the name of the reader method to use, if available.

  • path (str) – the path used to guess the reader method

Returns

ReaderMethod to use for the filepath

static guess_reader_method_from_path(path)

Helper method for deciding which reader to use to read in a certain path.

Parameters

path (str) – the path to use to guess the reader method

Returns

ReaderMethod to use for the filepath
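As a rough illustration of guessing a reader from a path, the sketch below maps file extensions to the names of pandas reader functions. The helper guess_reader_name and its extension table are assumptions made for this example; the mapping the library actually uses may differ.

```python
def guess_reader_name(path: str) -> str:
    """Hypothetical helper: map a file path to the name of a pandas reader."""
    lowered = path.lower()
    # Assumed extension table; the library's real mapping may differ.
    suffix_to_reader = {
        ".csv": "read_csv",
        ".parquet": "read_parquet",
        ".xlsx": "read_excel",
        ".xls": "read_excel",
        ".json": "read_json",
        ".pkl": "read_pickle",
    }
    for suffix, reader in suffix_to_reader.items():
        if lowered.endswith(suffix):
            return reader
    # No known extension matched, so the caller must name a reader explicitly.
    raise ValueError(f"Unable to determine reader method from path: {path}")
```

A caller could then look the reader up on the pandas module, for example with getattr(pandas, guess_reader_name(path)).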

get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)

Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch, and returns the result as a Pandas DataFrame. If the domain is a single column, the column is added to accessor_domain_kwargs and used for later access.

Parameters
  • domain_kwargs (dict) – the domain kwargs specifying which data to obtain

  • domain_type (str or MetricDomainTypes) – the metric domain to use, given as an Enum value or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys that are ignored when describing the domain and simply transferred with their associated values into accessor_domain_kwargs.

Returns

A tuple including:

  • a DataFrame (the data on which to compute)

  • a dictionary of compute_domain_kwargs, describing the DataFrame

  • a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain
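The three-part return value can be illustrated for a single-column domain with a small sketch. The function sketch_compute_domain is hypothetical, and it assumes the row condition is written in pandas query syntax; it does not reproduce the library's logic.

```python
import pandas as pd


def sketch_compute_domain(df, domain_kwargs):
    """Hypothetical illustration of the (data, compute kwargs, accessor kwargs) tuple."""
    compute_kwargs = dict(domain_kwargs)
    accessor_kwargs = {}

    # Apply the optional row condition (pandas query syntax assumed here).
    row_condition = compute_kwargs.get("row_condition")
    data = df.query(row_condition) if row_condition else df

    # For a single-column domain, move the column name into the accessor kwargs.
    if "column" in compute_kwargs:
        accessor_kwargs["column"] = compute_kwargs.pop("column")

    return data, compute_kwargs, accessor_kwargs


df = pd.DataFrame({"age": [15, 30, 45], "name": ["a", "b", "c"]})
data, compute_kwargs, accessor_kwargs = sketch_compute_domain(
    df, {"column": "age", "row_condition": "age > 20"}
)
```

Here data holds the filtered rows, compute_kwargs keeps the row condition that describes them, and accessor_kwargs carries the column name needed to pick the domain out of that data.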

static _split_on_whole_table(df)
static _split_on_column_value(df, column_name: str, batch_identifiers: dict)
static _split_on_converted_datetime(df, column_name: str, batch_identifiers: dict, date_format_string: str = '%Y-%m-%d')

Convert the values in the named column to the format given by date_format_string, and split on that
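A minimal sketch of a datetime-based split, assuming the column is rendered with strftime and compared with a requested batch identifier. The column names and sample data are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ts": ["2021-01-05 08:00", "2021-01-05 17:30", "2021-02-01 09:15"],
        "value": [1, 2, 3],
    }
)

# Render each timestamp with the format string, then keep the rows whose
# rendered value equals the requested batch identifier.
rendered = pd.to_datetime(df["ts"]).dt.strftime("%Y-%m-%d")
batch = df[rendered == "2021-01-05"]
```

With a coarser format string such as '%Y-%m', the same comparison would group rows by month instead of by day.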

static _split_on_divided_integer(df, column_name: str, divisor: int, batch_identifiers: dict)

Divide the values in the named column by divisor, and split on that

static _split_on_mod_integer(df, column_name: str, mod: int, batch_identifiers: dict)

Take the values in the named column modulo mod, and split on that
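The difference between _split_on_divided_integer and _split_on_mod_integer can be seen side by side in a short sketch. It assumes floor division for the divided case and uses made-up data; it is not the library's code.

```python
import pandas as pd

df = pd.DataFrame({"id": [3, 12, 17, 25, 31]})

# Divided integer: rows whose id // 10 equals 1 (ids 10 through 19).
by_division = df[(df["id"] // 10) == 1]

# Mod integer: rows whose id % 10 equals 1 (ids ending in 1).
by_mod = df[(df["id"] % 10) == 1]
```

Division groups contiguous ranges of values into one batch, while the modulo spreads neighbouring values across batches.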

static _split_on_multi_column_values(df, column_names: List[str], batch_identifiers: dict)

Split on the joint values in the named columns

static _split_on_hashed_column(df, column_name: str, hash_digits: int, batch_identifiers: dict, hash_function_name: str = 'md5')

Split on the hashed value of the named column
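One way to picture a hash-based split is to hash each value's string form and group on the last few hex digits. The helper hashed_suffix is hypothetical, and the use of trailing digits is an assumption of this sketch.

```python
import hashlib


def hashed_suffix(value, hash_digits: int = 1, hash_function_name: str = "md5") -> str:
    """Hypothetical helper: trailing hex digits of the hash of str(value)."""
    hash_func = getattr(hashlib, hash_function_name)
    return hash_func(str(value).encode()).hexdigest()[-hash_digits:]


values = ["alice", "bob", "carol", "dave"]

# Group the values by their one-digit hash suffix; each group is one split.
splits = {}
for v in values:
    splits.setdefault(hashed_suffix(v), []).append(v)
```

Because the hash is deterministic, a given value always lands in the same split, and more hash digits give more, smaller splits.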

static _sample_using_random(df, p: float = 0.1)

Take a random sample of rows, retaining proportion p

Note: the Random function behaves differently on different dialects of SQL
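Random sampling with proportion p can be sketched as an independent draw per row. This is an assumption about the approach rather than the library's code, and the seed is set only to make the sketch repeatable.

```python
import random

import pandas as pd

random.seed(0)  # seeded only so the sketch is repeatable

df = pd.DataFrame({"id": range(1000)})
p = 0.1

# Keep each row independently with probability p.
keep = [random.random() < p for _ in range(len(df))]
sample = df.loc[keep]
```

The sample size is only approximately p * len(df), since each row is kept or dropped independently.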

static _sample_using_mod(df, column_name: str, mod: int, value: int)

Take the mod of named column, and only keep rows that match the given value

static _sample_using_a_list(df, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches
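The filters behind _sample_using_mod and _sample_using_a_list can both be written as boolean masks in pandas. The sketch below uses made-up data and is not the library's implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "color": ["red", "blue", "red", "green", "blue", "red"],
    }
)

# Mod-based sample: keep rows where id % 3 == 0.
mod_sample = df[df["id"] % 3 == 0]

# List-based sample: keep rows whose color appears in the value list.
list_sample = df[df["color"].isin(["red", "green"])]
```

Both filters are deterministic, so repeated runs over the same data select the same rows.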

static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')

Hash the values in the named column, and only keep rows whose hash matches hash_value

great_expectations.execution_engine.pandas_execution_engine.hash_pandas_dataframe(df)