great_expectations.execution_engine.pandas_execution_engine¶

Module Contents¶

Classes¶

- PandasExecutionEngine — instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

Functions¶
- great_expectations.execution_engine.pandas_execution_engine.boto3¶

- great_expectations.execution_engine.pandas_execution_engine.logger¶

- great_expectations.execution_engine.pandas_execution_engine.HASH_THRESHOLD = 1000000000.0¶
- class great_expectations.execution_engine.pandas_execution_engine.PandasExecutionEngine(*args, **kwargs)¶

  Bases: great_expectations.execution_engine.ExecutionEngine

  PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

  For the full API reference, please see Dataset.

  Notes

  Samples and subsets of PandasDataset have ALL the expectations of the original data frame unless the user specifies the discard_subset_failing_expectations = True property on the original data frame. Concatenations, joins, and merges of PandasDatasets contain NO expectations (since no autoinspection is performed by default).

  Validation Engine - Pandas - How-to Guide
  Use Pandas DataFrame to validate data.
  Maturity: Production
  Details:
    API Stability: Stable
    Implementation Completeness: Complete
    Unit Test Coverage: Complete
    Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
    Documentation Completeness: Complete
    Bug Risk: Low
    Expectation Completeness: Complete
- recognized_batch_spec_defaults¶

- configure_validator(self, validator)¶
  Optionally configure the validator as appropriate for the execution engine.

- load_batch_data(self, batch_id: str, batch_data: Any)¶
  Loads the specified batch_data into the execution engine.

- get_batch_data_and_markers(self, batch_spec: BatchSpec)¶

- _apply_splitting_and_sampling_methods(self, batch_spec, batch_data)¶
- property dataframe(self)¶
  Tests whether or not a Batch has been loaded; if no batch is loaded, raises a ValueError.
- _get_reader_fn(self, reader_method=None, path=None)¶
  Static helper for parsing reader types. If reader_method is not provided, path will be used to guess the correct reader_method.

  Parameters:
    reader_method (str) – the name of the reader method to use, if available.
    path (str) – the path used to guess the reader method.

  Returns:
    ReaderMethod to use for the filepath.
- static guess_reader_method_from_path(path)¶
  Helper method for deciding which reader to use to read in a certain path.

  Parameters:
    path (str) – the path to use to guess the reader method.

  Returns:
    ReaderMethod to use for the filepath.
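The extension-based guessing can be sketched as follows. The mapping and helper name here are illustrative assumptions; the real method covers more formats and returns the library's own ReaderMethod structure:

```python
# Illustrative sketch: map a file extension to a pandas reader-method name.
# The mapping and function name are assumptions, not the library's actual code.
_READER_BY_EXTENSION = {
    ".csv": "read_csv",
    ".tsv": "read_csv",
    ".parquet": "read_parquet",
    ".json": "read_json",
    ".xlsx": "read_excel",
    ".pkl": "read_pickle",
}


def guess_reader_method_sketch(path: str) -> str:
    """Return the pandas reader-method name matching the path's extension."""
    for extension, reader_method in _READER_BY_EXTENSION.items():
        if path.lower().endswith(extension):
            return reader_method
    raise ValueError(f"Unable to determine reader method from path: {path}")
```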
- get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)¶
  Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns in the format of a Pandas DataFrame. If the domain is a single column, this is added to 'accessor domain kwargs' and used for later access.

  Parameters:
    domain_kwargs (dict) – a dictionary consisting of the domain kwargs specifying which data to obtain.
    domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain to use, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
    accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain, and simply transferred with their associated values into accessor_domain_kwargs.

  Returns:
    a DataFrame (the data on which to compute)
    a dictionary of compute_domain_kwargs, describing the DataFrame
    a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain

  Return type:
    a tuple of the three items above.
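The row-condition handling and the documented 3-tuple can be sketched with plain pandas. This is a simplified illustration, not the engine's actual code; it assumes the row condition is expressed in DataFrame.query syntax:

```python
import pandas as pd


def compute_domain_sketch(df: pd.DataFrame, domain_kwargs: dict):
    """Apply an optional row condition, then return the data plus the
    compute/accessor kwargs, mirroring the documented 3-tuple (a sketch)."""
    data = df
    row_condition = domain_kwargs.get("row_condition")
    if row_condition:
        # Simplified: treat the condition as a DataFrame.query expression.
        data = df.query(row_condition)

    # Everything except the column accessor describes the compute domain.
    compute_domain_kwargs = {k: v for k, v in domain_kwargs.items() if k != "column"}
    accessor_domain_kwargs = {}
    if "column" in domain_kwargs:
        # A single-column domain is passed along via accessor kwargs.
        accessor_domain_kwargs["column"] = domain_kwargs["column"]
    return data, compute_domain_kwargs, accessor_domain_kwargs


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
data, compute_kwargs, accessor_kwargs = compute_domain_sketch(
    df, {"column": "a", "row_condition": "b > 15"}
)
```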
- static _split_on_whole_table(df)¶

- static _split_on_column_value(df, column_name: str, batch_identifiers: dict)¶

- static _split_on_converted_datetime(df, column_name: str, batch_identifiers: dict, date_format_string: str = '%Y-%m-%d')¶
  Convert the values in the named column to the given date_format, and split on that.
- static _split_on_divided_integer(df, column_name: str, divisor: int, batch_identifiers: dict)¶
  Divide the values in the named column by divisor, and split on that.

- static _split_on_mod_integer(df, column_name: str, mod: int, batch_identifiers: dict)¶
  Take the mod of the values in the named column, and split on that.

- static _split_on_multi_column_values(df, column_names: List[str], batch_identifiers: dict)¶
  Split on the joint values in the named columns.

- static _split_on_hashed_column(df, column_name: str, hash_digits: int, batch_identifiers: dict, hash_function_name: str = 'md5')¶
  Split on the hashed value of the named column.
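The splitter semantics can be illustrated with plain pandas. The helper names below are hypothetical, and unlike the real static methods these take the partition value directly rather than a batch_identifiers dict:

```python
import hashlib

import pandas as pd


def split_on_mod_integer_sketch(
    df: pd.DataFrame, column_name: str, mod: int, value: int
) -> pd.DataFrame:
    """Keep rows whose column value is congruent to `value` modulo `mod`."""
    return df[df[column_name] % mod == value]


def split_on_hashed_column_sketch(
    df: pd.DataFrame, column_name: str, hash_digits: int, value: str
) -> pd.DataFrame:
    """Keep rows whose md5 digest (last `hash_digits` hex digits) equals `value`."""
    hashes = df[column_name].map(
        lambda x: hashlib.md5(str(x).encode()).hexdigest()[-hash_digits:]
    )
    return df[hashes == value]


df = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5]})
partition = split_on_mod_integer_sketch(df, "id", mod=3, value=1)
```

Each distinct partition value selects a disjoint slice of the original frame, so iterating over all values recovers every row exactly once.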
- static _sample_using_random(df, p: float = 0.1)¶
  Take a random sample of rows, retaining proportion p.
  Note: the Random function behaves differently on different dialects of SQL.

- static _sample_using_mod(df, column_name: str, mod: int, value: int)¶
  Take the mod of the named column, and only keep rows that match the given value.

- static _sample_using_a_list(df, column_name: str, value_list: list)¶
  Match the values in the named column against value_list, and only keep the matches.

- static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')¶
  Hash the values in the named column, and only keep rows whose hashed value matches the given hash_value.
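The samplers follow the same keep-matching-rows pattern as the splitters. A sketch with plain pandas, using hypothetical helper names rather than the private static methods themselves:

```python
import pandas as pd


def sample_using_mod_sketch(
    df: pd.DataFrame, column_name: str, mod: int, value: int
) -> pd.DataFrame:
    """Keep rows where column % mod == value: a deterministic ~1/mod sample."""
    return df[df[column_name] % mod == value]


def sample_using_a_list_sketch(
    df: pd.DataFrame, column_name: str, value_list: list
) -> pd.DataFrame:
    """Keep rows whose column value appears in value_list."""
    return df[df[column_name].isin(value_list)]


df = pd.DataFrame({"user_id": [10, 11, 12, 13, 14], "score": [1, 2, 3, 4, 5]})
mod_sample = sample_using_mod_sketch(df, "user_id", mod=5, value=0)
list_sample = sample_using_a_list_sketch(df, "user_id", [11, 14])
```

Mod- and hash-based sampling are deterministic, so repeated validation runs see the same sampled rows, unlike `_sample_using_random`.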
- great_expectations.execution_engine.pandas_execution_engine.hash_pandas_dataframe(df)¶