great_expectations.execution_engine.sparkdf_execution_engine

Module Contents

Classes

SparkDFBatchData
SparkDFExecutionEngine – This class holds an attribute spark_df which is a spark.sql.DataFrame.
great_expectations.execution_engine.sparkdf_execution_engine.logger
class great_expectations.execution_engine.sparkdf_execution_engine.SparkDFBatchData(df)
Bases: pyspark.sql.DataFrame
row_count(self)
class great_expectations.execution_engine.sparkdf_execution_engine.SparkDFExecutionEngine(*args, **kwargs)
Bases: great_expectations.execution_engine.execution_engine.ExecutionEngine

This class holds an attribute spark_df which is a spark.sql.DataFrame.
Feature Maturity

Validation Engine - pyspark - Self-Managed (How-to Guide)
Use Spark DataFrame to validate data.
Maturity: Production
Details:
  API Stability: Stable
  Implementation Completeness: Moderate
  Unit Test Coverage: Complete
  Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
  Documentation Completeness: Complete
  Bug Risk: Low/Moderate
  Expectation Completeness: Moderate

Validation Engine - Databricks (How-to Guide)
Use Spark DataFrame in a Databricks cluster to validate data.
Maturity: Beta
Details:
  API Stability: Stable
  Implementation Completeness: Low (dbfs-specific handling)
  Unit Test Coverage: N/A -> implementation not different
  Integration Infrastructure/Test Coverage: Minimal (we've tested a bit, know others have used it)
  Documentation Completeness: Moderate (need docs on managing project configuration via dbfs/etc.)
  Bug Risk: Low/Moderate
  Expectation Completeness: Moderate

Validation Engine - EMR - Spark (How-to Guide)
Use Spark DataFrame in an EMR cluster to validate data.
Maturity: Experimental
Details:
  API Stability: Stable
  Implementation Completeness: Low (need to provide guidance on "known good" paths, and we know there are many "knobs" to tune that we have not explored/tested)
  Unit Test Coverage: N/A -> implementation not different
  Integration Infrastructure/Test Coverage: Unknown
  Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
  Bug Risk: Low/Moderate
  Expectation Completeness: Moderate

Validation Engine - Spark - Other (How-to Guide)
Use Spark DataFrame to validate data.
Maturity: Experimental
Details:
  API Stability: Stable
  Implementation Completeness: Other (we haven't tested possibility, known glue deployment)
  Unit Test Coverage: N/A -> implementation not different
  Integration Infrastructure/Test Coverage: Unknown
  Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
  Bug Risk: Low/Moderate
  Expectation Completeness: Moderate
recognized_batch_definition_keys

recognized_batch_spec_defaults
property dataframe(self)
If a batch has been loaded, returns a Spark DataFrame containing the data within the loaded batch; see the sketch after get_batch_data_and_markers below.
get_batch_data_and_markers(self, batch_spec: BatchSpec)
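As a rough sketch of how these pieces fit together, the following loads an in-memory Spark DataFrame through the engine and reads it back via the dataframe property. It assumes a local Spark installation, that RuntimeDataBatchSpec (from great_expectations.core.batch_spec) accepts a batch_data argument, and that load_batch_data is available from the ExecutionEngine base class; the batch_id is illustrative.

    from pyspark.sql import SparkSession
    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine.sparkdf_execution_engine import (
        SparkDFExecutionEngine,
    )

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, None)], ["id", "label"])

    engine = SparkDFExecutionEngine()
    batch_data, batch_markers = engine.get_batch_data_and_markers(
        batch_spec=RuntimeDataBatchSpec(batch_data=df)
    )

    # Loading the batch makes it the active batch, exposed via engine.dataframe.
    engine.load_batch_data(batch_id="runtime_batch", batch_data=batch_data)
    print(engine.dataframe.count())  # 3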
_apply_splitting_and_sampling_methods(self, batch_spec, batch_data)
static guess_reader_method_from_path(path)
Based on a given file path, decides a reader method. Currently supports tsv, csv, and parquet. If none of these file extensions is used, a BatchKwargsError is returned stating that the reader method cannot be determined from the path.

Parameters
path – a given file path

Returns
A dictionary entry of the format {'reader_method': reader_method}
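Because this is a static method, it can be exercised without an engine instance or a Spark session; the path below is illustrative:

    from great_expectations.execution_engine.sparkdf_execution_engine import (
        SparkDFExecutionEngine,
    )

    result = SparkDFExecutionEngine.guess_reader_method_from_path("data/events.parquet")
    print(result)  # per the documented format: {'reader_method': 'parquet'}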
_get_reader_fn(self, reader, reader_method=None, path=None)
Static helper for providing reader_fn.

Parameters
reader – the base Spark reader to use; this should have had reader_options applied already
reader_method – the name of the reader_method to use, if specified
path (str) – the path to use to guess reader_method if it was not specified

Returns
ReaderMethod to use for the filepath
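A sketch of this helper in isolation (the underscore prefix marks it as internal); it assumes a plain spark.read reader with no options applied, and the path is illustrative:

    from pyspark.sql import SparkSession
    from great_expectations.execution_engine.sparkdf_execution_engine import (
        SparkDFExecutionEngine,
    )

    spark = SparkSession.builder.getOrCreate()
    engine = SparkDFExecutionEngine()

    # With no reader_method given, the helper falls back to guessing from the path.
    reader_fn = engine._get_reader_fn(reader=spark.read, path="data/events.csv")
    # reader_fn is expected to be a bound reader method such as spark.read.csv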
get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, 'MetricDomainTypes'], accessor_keys: Optional[Iterable[str]] = [])
Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch, returning the data as a Spark DataFrame along with kwargs describing the compute domain.

Parameters
domain_kwargs (dict) – the domain kwargs to use, specifying which data to obtain
domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain, and are simply transferred with their associated values into accessor_domain_kwargs

Returns
A tuple including:
- a DataFrame (the data on which to compute)
- a dictionary of compute_domain_kwargs, describing the DataFrame
- a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain
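A sketch of requesting a column domain against a loaded batch, under the same assumptions as the earlier loading example (RuntimeDataBatchSpec and load_batch_data; the batch_id and column names are illustrative):

    from pyspark.sql import SparkSession
    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine.sparkdf_execution_engine import (
        SparkDFExecutionEngine,
    )

    spark = SparkSession.builder.getOrCreate()
    engine = SparkDFExecutionEngine()
    batch_data, _ = engine.get_batch_data_and_markers(
        batch_spec=RuntimeDataBatchSpec(
            batch_data=spark.createDataFrame([(1,), (2,)], ["id"])
        )
    )
    engine.load_batch_data(batch_id="example_batch", batch_data=batch_data)

    data, compute_kwargs, accessor_kwargs = engine.get_compute_domain(
        domain_kwargs={"column": "id"}, domain_type="column"
    )
    # data is the Spark DataFrame to compute on; the two dictionaries describe
    # the compute domain and any accessor keys within it.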
add_column_row_condition(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)
EXPERIMENTAL

Add a row condition for handling null filtering.

Parameters
domain_kwargs – the domain kwargs to use as the base and to which to add the condition
column_name – if provided, use this name to add the condition; otherwise, the "column" key from table_domain_kwargs will be used
filter_null – if True, add a filter for null values
filter_nan – if True, add a filter for NaN values
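A sketch of extending column domain kwargs with a null filter, assuming the method returns the updated domain kwargs (the column name is illustrative):

    from great_expectations.execution_engine.sparkdf_execution_engine import (
        SparkDFExecutionEngine,
    )

    engine = SparkDFExecutionEngine()
    filtered_kwargs = engine.add_column_row_condition(
        {"column": "id"}, filter_null=True, filter_nan=False
    )
    # filtered_kwargs should now carry a row condition that excludes null
    # values in the "id" column.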
resolve_metric_bundle(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Callable, dict]])
For each metric name in the given metric_fn_bundle, finds the domain of the metric and calculates it using a metric function from the given provider class.

Parameters
metric_fn_bundle – a batch containing MetricEdgeKeys and their corresponding functions
metrics (dict) – a dictionary containing metrics and corresponding parameters

Returns
A dictionary of the collected metrics over their respective domains
head(self, n=5)
Returns the first n rows of the loaded DataFrame; n defaults to 5.
static _split_on_whole_table(df)

static _split_on_column_value(df, column_name: str, partition_definition: dict)

static _split_on_converted_datetime(df, column_name: str, partition_definition: dict, date_format_string: str = 'yyyy-MM-dd')

static _split_on_divided_integer(df, column_name: str, divisor: int, partition_definition: dict)
Divide the values in the named column by divisor, and split on that.
static _split_on_mod_integer(df, column_name: str, mod: int, partition_definition: dict)
Take the mod of the values in the named column, and split on that.
static _split_on_multi_column_values(df, column_names: list, partition_definition: dict)
Split on the joint values in the named columns.

static _split_on_hashed_column(df, column_name: str, hash_digits: int, partition_definition: dict, hash_function_name: str = 'sha256')
Split on the hashed value of the named column.
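A sketch of the static splitters applied to a plain Spark DataFrame, assuming partition_definition maps the column name to the value (or remainder) that selects the partition; column names and values are illustrative:

    from pyspark.sql import SparkSession
    from great_expectations.execution_engine.sparkdf_execution_engine import (
        SparkDFExecutionEngine,
    )

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(i, i % 3) for i in range(9)], ["id", "bucket"])

    # Keep only rows whose "bucket" value matches the partition definition.
    by_value = SparkDFExecutionEngine._split_on_column_value(
        df, column_name="bucket", partition_definition={"bucket": 1}
    )

    # Keep only rows where id % 3 == 2.
    by_mod = SparkDFExecutionEngine._split_on_mod_integer(
        df, column_name="id", mod=3, partition_definition={"id": 2}
    )
    print(by_value.count(), by_mod.count())  # 3 and 3, under these assumptions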
static _sample_using_random(df, p: float = 0.1, seed: int = 1)
Take a random sample of rows, retaining proportion p.

static _sample_using_mod(df, column_name: str, mod: int, value: int)
Take the mod of the named column, and only keep rows that match the given value.

static _sample_using_a_list(df, column_name: str, value_list: list)
Match the values in the named column against value_list, and only keep the matches.

static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')
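The samplers follow the same static pattern; a sketch using the documented signatures on a plain Spark DataFrame (column names and values are illustrative):

    from pyspark.sql import SparkSession
    from great_expectations.execution_engine.sparkdf_execution_engine import (
        SparkDFExecutionEngine,
    )

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(i,) for i in range(100)], ["id"])

    # Retain roughly 10% of rows; the seed keeps the sample reproducible.
    random_sample = SparkDFExecutionEngine._sample_using_random(df, p=0.1, seed=1)

    # Keep only rows where id % 10 == 3.
    mod_sample = SparkDFExecutionEngine._sample_using_mod(
        df, column_name="id", mod=10, value=3
    )
    print(random_sample.count(), mod_sample.count())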