great_expectations.expectations.core.expect_column_kl_divergence_to_be_less_than

Module Contents

Classes

ExpectColumnKlDivergenceToBeLessThan(configuration: Optional[ExpectationConfiguration] = None)

Expect the Kullback-Leibler (KL) divergence (relative entropy) of the specified column with respect to the partition object to be lower than the provided threshold.

class great_expectations.expectations.core.expect_column_kl_divergence_to_be_less_than.ExpectColumnKlDivergenceToBeLessThan(configuration: Optional[ExpectationConfiguration] = None)

Bases: great_expectations.expectations.expectation.ColumnExpectation

Expect the Kullback-Leibler (KL) divergence (relative entropy) of the specified column with respect to the partition object to be lower than the provided threshold.

KL divergence compares two distributions. The higher the divergence value (relative entropy), the larger the difference between the two distributions. A relative entropy of zero indicates that the data are distributed identically, when binned according to the provided partition.
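As a rough illustration of the quantity being compared against the threshold, the sketch below computes relative entropy for two lists of bin weights. It assumes the observed weights play the role of p and the partition weights the role of q in sum(p * log(p / q)), with natural logarithms; the function name kl_divergence is hypothetical and is not part of the library.

```python
import math


def kl_divergence(observed, expected):
    """Relative entropy sum(p * log(p / q)) in nats (hypothetical helper)."""
    total = 0.0
    for p, q in zip(observed, expected):
        if p == 0:
            continue  # a bin with no observed weight contributes nothing
        if q == 0:
            return math.inf  # observed data where the partition expects none
        total += p * math.log(p / q)
    return total


expected = [0.25, 0.5, 0.25]
identical = kl_divergence([0.25, 0.5, 0.25], expected)  # same distribution
shifted = kl_divergence([0.5, 0.25, 0.25], expected)    # weight moved between bins
unexpected = kl_divergence([0.5, 0.5], [1.0, 0.0])      # weight in a zero-weight bin
```

Here identical is zero, shifted is a small positive number, and unexpected is infinite, which is the case the weight holdout parameters are meant to soften.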

In many practical contexts, choosing a threshold between 0.5 and 1 will provide a useful test.

This expectation works on both categorical and continuous partitions. See notes below for details.

expect_column_kl_divergence_to_be_less_than is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • partition_object (dict) – The expected partition object (see Partition Objects).

  • threshold (float) – The maximum KL divergence for which to return success=True. If KL divergence is larger than the provided threshold, the test will return success=False.

Keyword Arguments
  • internal_weight_holdout (float between 0 and 1 or None) – The amount of weight to split uniformly among zero-weighted partition bins. internal_weight_holdout provides a mechanism to make the test less strict by assigning positive weights to values observed in the data for which the partition explicitly expected zero weight. With no internal_weight_holdout, any value observed in such a region will cause KL divergence to rise to +Infinity. Defaults to 0.

  • tail_weight_holdout (float between 0 and 1 or None) – The amount of weight to add to the tails of the histogram. Tail weight holdout is split evenly between (-Infinity, min(partition_object['bins'])) and (max(partition_object['bins']), +Infinity). tail_weight_holdout provides a mechanism to make the test less strict by assigning positive weights to values observed in the data that are not present in the partition. With no tail_weight_holdout, any value observed outside the provided partition_object will cause KL divergence to rise to +Infinity. Defaults to 0.

  • bucketize_data (boolean) – If True, then continuous data will be bucketized before evaluation. Setting this parameter to False allows evaluation of KL divergence with a None partition object for profiling against discrete data.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.
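The parameters above might be collected into keyword arguments along the following lines. This is only a sketch: the column name "age", the weights, and the threshold are made up for illustration, and the partition object layouts ("bins" plus "weights" for continuous data, "values" plus "weights" for categorical data) are assumed to follow the Partition Objects documentation.

```python
# Continuous partition: N intervals are described by N + 1 bin edges.
continuous_partition = {
    "bins": [0, 10, 20, 30],
    "weights": [0.2, 0.5, 0.3],  # one weight per interval, summing to 1
}

# Categorical partition: one weight per listed value.
categorical_partition = {
    "values": ["a", "b", "c"],
    "weights": [0.6, 0.3, 0.1],
}

# Hypothetical keyword arguments for this expectation.
kwargs = {
    "column": "age",
    "partition_object": continuous_partition,
    "threshold": 0.6,
    "tail_weight_holdout": 0.01,
    "internal_weight_holdout": 0.0,
}
```

How these keyword arguments are passed to a validator or stored in an expectation suite depends on the installed version of the library.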

Returns

An ExpectationSuiteValidationResult

Exact fields vary depending on the values passed to result_format, include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
  "observed_value": (float) The true KL divergence (relative entropy), or None if the value is
                    calculated as infinity, -infinity, or NaN
  "details": {
    "observed_partition": (dict) The partition observed in the data
    "expected_partition": (dict) The partition against which the data were compared,
                            after applying specified weight holdouts.
  }
}

If the partition_object is categorical, this expectation will expect the values in column to also be categorical.

  • If the column includes values that are not present in the partition, the tail_weight_holdout will be equally split among those values, providing a mechanism to weaken the strictness of the expectation (otherwise, relative entropy would immediately go to infinity).

  • If the partition includes values that are not present in the column, the test will simply include zero weight for that value.
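The categorical case can be sketched as follows. The helper apply_categorical_holdout is hypothetical, and it assumes that the partition's own weights are scaled by 1 - tail_weight_holdout so that the expected weights still sum to 1; the library's exact arithmetic may differ.

```python
def apply_categorical_holdout(partition, observed_values, tail_weight_holdout):
    """Return expected weights per value after the holdout (hypothetical helper)."""
    known = dict(zip(partition["values"], partition["weights"]))
    unseen = sorted(set(observed_values) - set(known))
    if not unseen or tail_weight_holdout == 0:
        # Nothing to redistribute; values missing from the partition keep zero weight.
        return known
    # Assumption: shrink the known weights to make room for the holdout.
    expected = {value: w * (1 - tail_weight_holdout) for value, w in known.items()}
    share = tail_weight_holdout / len(unseen)  # equal split among unseen values
    for value in unseen:
        expected[value] = share
    return expected


partition = {"values": ["a", "b"], "weights": [0.5, 0.5]}
expected = apply_categorical_holdout(partition, ["a", "b", "c", "d"], 0.1)
```

With two unseen values and a holdout of 0.1, each unseen value receives an expected weight of 0.05, so observing them no longer drives relative entropy to infinity.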

If the partition_object is continuous, this expectation will discretize the values in the column according to the bins specified in the partition_object, and apply the test to the resulting distribution.

  • The internal_weight_holdout and tail_weight_holdout parameters provide a mechanism to weaken the expectation, since an expected weight of zero would drive relative entropy to be infinite if any data are observed in that interval.

  • If internal_weight_holdout is specified, that value will be distributed equally among any intervals with weight zero in the partition_object.

  • If tail_weight_holdout is specified, that value will be appended to the tails of the bins: (-Infinity, min(bins)) and (max(bins), Infinity).
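A minimal sketch of the continuous case is shown below. Both helpers are hypothetical. The sketch assumes half-open intervals with a closed final interval, and it assumes the partition weights are scaled by 1 - tail_weight_holdout - internal_weight_holdout so that the expected weights sum to 1; the library's implementation may differ in these details.

```python
from bisect import bisect_right


def observed_weights(values, bins):
    """Fractions of values in (-inf, min), each interval, and (max, +inf)."""
    counts = [0] * (len(bins) + 1)  # index 0: lower tail, last index: upper tail
    for v in values:
        if v < bins[0]:
            counts[0] += 1
        elif v > bins[-1]:
            counts[-1] += 1
        else:
            # The min() keeps v == max(bins) inside the final interval.
            i = min(bisect_right(bins, v) - 1, len(bins) - 2)
            counts[i + 1] += 1
    return [c / len(values) for c in counts]


def expected_weights(weights, tail_weight_holdout=0.0, internal_weight_holdout=0.0):
    """Expected weights with both tails added and zero bins filled in."""
    zero_bins = [i for i, w in enumerate(weights) if w == 0]
    if not zero_bins:
        internal_weight_holdout = 0.0  # assumption: nothing to distribute
    scale = 1 - tail_weight_holdout - internal_weight_holdout
    interior = [w * scale for w in weights]
    for i in zero_bins:
        interior[i] = internal_weight_holdout / len(zero_bins)
    tail = tail_weight_holdout / 2  # split evenly between the two tails
    return [tail] + interior + [tail]


bins = [0, 10, 20, 30]
observed = observed_weights([-5, 1, 9, 10, 15, 25, 30, 31], bins)
expected = expected_weights([0.5, 0.0, 0.5], tail_weight_holdout=0.1,
                            internal_weight_holdout=0.1)
```

After the holdouts are applied, every entry of expected is positive, so data observed in the tails or in the zero-weight interval yields a finite relative entropy.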

If relative entropy (KL divergence) goes to infinity for any of the reasons mentioned above, the observed value will be set to None. This is because inf, -inf, and NaN are not JSON serializable and cause some JSON parsers to crash when encountered. The Python None token will be serialized to null in JSON.
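The serialization concern can be reproduced with the standard library json module. The helper json_safe_observed_value is a hypothetical stand-in for the replacement step described above, not the library's own function.

```python
import json
import math


def json_safe_observed_value(kl):
    """Replace inf, -inf, and NaN with None (hypothetical helper)."""
    return None if (math.isinf(kl) or math.isnan(kl)) else kl


# By default, Python emits the non-standard token Infinity.
non_standard = json.dumps({"observed_value": math.inf})

# In strict mode, the same value is rejected outright.
try:
    json.dumps({"observed_value": math.inf}, allow_nan=False)
    strict_rejected = False
except ValueError:
    strict_rejected = True

# None serializes to the valid JSON literal null.
safe = json.dumps({"observed_value": json_safe_observed_value(math.inf)})
```

The string in non_standard contains a bare Infinity token that strict JSON parsers reject, while safe contains null.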

library_metadata
success_keys = ['partition_object', 'threshold', 'tail_weight_holdout', 'internal_weight_holdout', 'bucketize_data']
default_kwarg_values
args_keys = ['column', 'partition_object', 'threshold']
get_validation_dependencies(self, configuration: Optional[ExpectationConfiguration] = None, execution_engine: Optional[ExecutionEngine] = None, runtime_configuration: Optional[dict] = None)

Returns the result format and metrics required to validate this Expectation using the provided result format.

_validate(self, configuration: ExpectationConfiguration, metrics: Dict, runtime_configuration: dict = None, execution_engine: ExecutionEngine = None)
classmethod _get_kl_divergence_chart(cls, partition_object, header=None)
classmethod _atomic_kl_divergence_chart_template(cls, partition_object: dict)
classmethod _get_kl_divergence_partition_object_table(cls, partition_object, header=None)
classmethod _atomic_partition_object_table_template(cls, partition_object: dict)
classmethod _atomic_prescriptive_template(cls, configuration=None, result=None, language=None, runtime_configuration=None, **kwargs)

Template function that contains the logic that is shared by atomic.prescriptive.summary (GE Cloud) and renderer.prescriptive (OSS GE)

classmethod _prescriptive_summary(cls, configuration=None, result=None, language=None, runtime_configuration=None, **kwargs)

Rendering function that is utilized by GE Cloud Front-end

classmethod _prescriptive_renderer(cls, configuration=None, result=None, language=None, runtime_configuration=None, **kwargs)
classmethod _atomic_diagnostic_observed_value_template(cls, configuration=None, result=None, language=None, runtime_configuration=None, **kwargs)
classmethod _atomic_diagnostic_observed_value(cls, configuration=None, result=None, language=None, runtime_configuration=None, **kwargs)

Rendering function that is utilized by GE Cloud Front-end

classmethod _diagnostic_observed_value_renderer(cls, configuration=None, result=None, language=None, runtime_configuration=None, **kwargs)
classmethod _descriptive_histogram_renderer(cls, configuration=None, result=None, language=None, runtime_configuration=None, **kwargs)