great_expectations.execution_engine.sparkdf_execution_engine

Module Contents

Classes

SparkDFBatchData(df)

SparkDFExecutionEngine(*args, **kwargs)

This class holds an attribute spark_df which is a spark.sql.DataFrame.

great_expectations.execution_engine.sparkdf_execution_engine.logger
class great_expectations.execution_engine.sparkdf_execution_engine.SparkDFBatchData(df)

Bases: pyspark.sql.DataFrame

row_count(self)
class great_expectations.execution_engine.sparkdf_execution_engine.SparkDFExecutionEngine(*args, **kwargs)

Bases: great_expectations.execution_engine.execution_engine.ExecutionEngine

This class holds an attribute spark_df which is a spark.sql.DataFrame.
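For orientation, a minimal sketch of loading an in-memory Spark DataFrame as a batch and reading it back through the dataframe property. The RuntimeDataBatchSpec import path and the load_batch_data call come from the broader ExecutionEngine API and may differ between versions; treat them as assumptions.

    from pyspark.sql import SparkSession

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import SparkDFExecutionEngine

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    engine = SparkDFExecutionEngine()

    # Wrap the in-memory DataFrame in a runtime batch spec and hand it to the engine.
    batch_data, batch_markers = engine.get_batch_data_and_markers(
        batch_spec=RuntimeDataBatchSpec(batch_data=df)
    )
    engine.load_batch_data(batch_id="my_runtime_batch", batch_data=batch_data)

    # With a batch loaded, the engine exposes its data as a Spark DataFrame.
    engine.dataframe.show()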

Feature Maturity

Validation Engine - pyspark - Self-Managed - How-to Guide
Use Spark DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Moderate
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Databricks - How-to Guide
Use Spark DataFrame in a Databricks cluster to validate data
Maturity: Beta
Details:
API Stability: Stable
Implementation Completeness: Low (dbfs-specific handling)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Minimal (we’ve tested a bit, know others have used it)
Documentation Completeness: Moderate (need docs on managing project configuration via dbfs/etc.)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - EMR - Spark - How-to Guide
Use Spark DataFrame in an EMR cluster to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Low (need to provide guidance on “known good” paths, and we know there are many “knobs” to tune that we have not explored/tested)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Spark - Other - How-to Guide
Use Spark DataFrame to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Other (we haven’t tested this possibility ourselves; known Glue deployments exist)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
recognized_batch_definition_keys
recognized_batch_spec_defaults
property dataframe(self)

If a batch has been loaded, returns a Spark Dataframe containing the data within the loaded batch

get_batch_data_and_markers(self, batch_spec: BatchSpec)
_apply_splitting_and_sampling_methods(self, batch_spec, batch_data)
static guess_reader_method_from_path(path)

Based on a given filepath, decides a reader method. Currently supports tsv, csv, and parquet. If the path uses none of these file extensions, raises a BatchKwargsError stating that it is unable to determine a reader method from the path.

Parameters

- path – a given file path

Returns

A dictionary entry of the format {'reader_method': reader_method}
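As an illustration, the reader method is inferred from the file extension; the exact return values in the comments below are assumptions and may vary by version.

    from great_expectations.execution_engine import SparkDFExecutionEngine

    SparkDFExecutionEngine.guess_reader_method_from_path("data/trips.csv")
    # -> expected to be something like {'reader_method': 'csv'}

    SparkDFExecutionEngine.guess_reader_method_from_path("data/trips.parquet")
    # -> expected to be something like {'reader_method': 'parquet'}

    # An unrecognized extension raises a BatchKwargsError.
    SparkDFExecutionEngine.guess_reader_method_from_path("data/notes.txt")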

_get_reader_fn(self, reader, reader_method=None, path=None)

Static helper for providing reader_fn

Parameters
  • reader – the base spark reader to use; this should have had reader_options applied already

  • reader_method – the name of the reader_method to use, if specified

  • path (str) – the path to use to guess reader_method if it was not specified

Returns

ReaderMethod to use for the filepath
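A hedged sketch of how the resolved reader function can be used with a SparkSession's DataFrameReader; the bound-method behaviour described in the comments is an assumption about the implementation.

    from pyspark.sql import SparkSession

    from great_expectations.execution_engine import SparkDFExecutionEngine

    spark = SparkSession.builder.getOrCreate()
    engine = SparkDFExecutionEngine()

    # reader_method is not passed, so it is guessed from the .parquet extension.
    reader_fn = engine._get_reader_fn(reader=spark.read, path="data/events.parquet")

    # The returned callable is expected to be a bound DataFrameReader method,
    # e.g. spark.read.parquet, which yields a DataFrame when called with the path.
    df = reader_fn("data/events.parquet")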

get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, 'MetricDomainTypes'], accessor_keys: Optional[Iterable[str]] = [])

Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns the data as a Spark DataFrame, along with the compute domain and accessor domain kwargs describing it.

Parameters
  • domain_kwargs (dict) – a dictionary consisting of the domain kwargs specifying which data to obtain

  • domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain and simply transferred with their associated values into accessor_domain_kwargs.

Returns

  • a DataFrame (the data on which to compute)

  • a dictionary of compute_domain_kwargs, describing the DataFrame

  • a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain

Return type

A tuple including
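A brief sketch of a column-domain request (the column name is illustrative), assuming an engine with a batch already loaded as in the earlier sketch:

    # `engine` already has a batch loaded, as in the earlier sketch.
    data, compute_domain_kwargs, accessor_domain_kwargs = engine.get_compute_domain(
        domain_kwargs={"column": "label"},
        domain_type="column",
    )

    # `data` is the Spark DataFrame to compute on; accessor_domain_kwargs carries the
    # keys (here, the column name) needed to locate the domain within that DataFrame.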

add_column_row_condition(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)

EXPERIMENTAL

Add a row condition for handling null filter.

Parameters
  • domain_kwargs – the domain kwargs to use as the base and to which to add the condition

  • column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs

  • filter_null – if true, add a filter for null values

  • filter_nan – if true, add a filter for nan values
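For example (column name illustrative), the returned domain kwargs carry the added null filter and can be passed on to get_compute_domain; this pairing is a sketch, not a prescribed workflow:

    # Start from plain column domain kwargs and add a null filter for that column.
    filtered_domain_kwargs = engine.add_column_row_condition(
        {"column": "label"},
        filter_null=True,
        filter_nan=False,
    )

    data, _, _ = engine.get_compute_domain(
        domain_kwargs=filtered_domain_kwargs,
        domain_type="column",
    )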

resolve_metric_bundle(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Callable, dict]])

For each metric name in the given metric_fn_bundle, finds the domain of the metric and calculates it using a metric function from the given provider class.

Args:

metric_fn_bundle – A batch containing MetricEdgeKeys and their corresponding functions

metrics (dict) – A dictionary containing metrics and corresponding parameters

Returns:

A dictionary of the collected metrics over their respective domains

head(self, n=5)

Returns dataframe head. Default is 5

static _split_on_whole_table(df)
static _split_on_column_value(df, column_name: str, partition_definition: dict)
static _split_on_converted_datetime(df, column_name: str, partition_definition: dict, date_format_string: str = 'yyyy-MM-dd')
static _split_on_divided_integer(df, column_name: str, divisor: int, partition_definition: dict)

Divide the values in the named column by divisor, and split on that

static _split_on_mod_integer(df, column_name: str, mod: int, partition_definition: dict)

Take the mod of the values in the named column, and split on that

static _split_on_multi_column_values(df, column_names: list, partition_definition: dict)

Split on the joint values in the named columns

static _split_on_hashed_column(df, column_name: str, hash_digits: int, partition_definition: dict, hash_function_name: str = 'sha256')

Split on the hashed value of the named column
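A hedged sketch of how the splitters select one partition: the partition_definition is assumed to be keyed by column name, matching the convention suggested by the signatures above.

    from pyspark.sql import SparkSession

    from great_expectations.execution_engine import SparkDFExecutionEngine

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)  # a single long column named "id"

    # Keep only the rows whose id satisfies id % 10 == 3 (3, 13, ..., 93).
    partition_df = SparkDFExecutionEngine._split_on_mod_integer(
        df,
        column_name="id",
        mod=10,
        partition_definition={"id": 3},
    )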

static _sample_using_random(df, p: float = 0.1, seed: int = 1)

Take a random sample of rows, retaining proportion p

static _sample_using_mod(df, column_name: str, mod: int, value: int)

Take the mod of named column, and only keep rows that match the given value

static _sample_using_a_list(df, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches

static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')
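Analogously, a minimal mod-based sampling sketch on the same kind of integer column (illustrative values):

    from pyspark.sql import SparkSession

    from great_expectations.execution_engine import SparkDFExecutionEngine

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)

    # Keep only the rows where id % 10 == 0, i.e. roughly a 10% sample here.
    sampled_df = SparkDFExecutionEngine._sample_using_mod(
        df,
        column_name="id",
        mod=10,
        value=0,
    )
    sampled_df.count()  # 10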