great_expectations.rule_based_profiler.parameter_builder

Package Contents

Classes

ParameterBuilder(name: str, batch_list: Optional[List[Batch]] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None, data_context: Optional['DataContext'] = None)

A ParameterBuilder implementation provides support for building Expectation Configuration Parameters suitable for use in other ParameterBuilders or in ConfigurationBuilders as part of profiling.

RegexPatternStringParameterBuilder(name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, threshold: Union[float, str] = 1.0, candidate_regexes: Optional[Union[Iterable[str], str]] = None, data_context: Optional['DataContext'] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None)

Detects the domain REGEX from a set of candidate REGEX strings by computing the column_values.match_regex_format.unexpected_count metric for each candidate format and returning the format that has the lowest unexpected_count ratio.

SimpleDateFormatStringParameterBuilder(name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, threshold: Union[float, str] = 1.0, candidate_strings: Optional[Union[Iterable[str], str]] = None, data_context: Optional['DataContext'] = None, batch_list: Optional[List[Batch]] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None)

Detects the domain date format from a set of candidate date format strings by computing the column_values.match_strftime_format.unexpected_count metric for each candidate format and returning the format that has the lowest unexpected_count ratio.

MetricMultiBatchParameterBuilder(name: str, metric_name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, enforce_numeric_metric: Union[str, bool] = False, replace_nan_with_zero: Union[str, bool] = False, reduce_scalar_metric: Union[str, bool] = True, data_context: Optional['DataContext'] = None, batch_list: Optional[List[Batch]] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None)

A Single/Multi-Batch implementation for obtaining a resolved (evaluated) metric, using domain_kwargs, value_kwargs, and metric_name as arguments.

NumericMetricRangeMultiBatchParameterBuilder(name: str, metric_name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, sampling_method: str = 'bootstrap', enforce_numeric_metric: Union[str, bool] = True, replace_nan_with_zero: Union[str, bool] = True, reduce_scalar_metric: Union[str, bool] = True, false_positive_rate: Union[str, float] = 0.05, num_bootstrap_samples: Optional[Union[str, int]] = None, round_decimals: Optional[Union[str, int]] = None, truncate_values: Optional[Union[str, Dict[str, Union[Optional[int], Optional[float]]]]] = None, data_context: Optional['DataContext'] = None, batch_list: Optional[List[Batch]] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None)

A Multi-Batch implementation for obtaining the range estimation bounds for a resolved (evaluated) numeric metric, using domain_kwargs, value_kwargs, metric_name, and false_positive_rate (tolerance) as arguments.

class great_expectations.rule_based_profiler.parameter_builder.ParameterBuilder(name: str, batch_list: Optional[List[Batch]] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None, data_context: Optional['DataContext'] = None)

Bases: great_expectations.rule_based_profiler.types.Builder, abc.ABC

A ParameterBuilder implementation provides support for building Expectation Configuration Parameters suitable for use in other ParameterBuilders or in ConfigurationBuilders as part of profiling.

A ParameterBuilder is configured as part of a ProfilerRule. Its primary interface is the build_parameters method.

As part of a ProfilerRule, the following configuration will create a new parameter for each domain returned by the domain_builder, with an associated id.

```
parameter_builders:
  - name: my_parameter_builder
    class_name: MetricMultiBatchParameterBuilder
    metric_name: column.mean
```

exclude_field_names :Set[str]
build_parameters(self, parameter_container: ParameterContainer, domain: Domain, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)
property name(self)
property batch_request(self)
property batch_list(self)
property data_context(self)
abstract _build_parameters(self, parameter_container: ParameterContainer, domain: Domain, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)
get_validator(self, domain: Optional[Domain] = None, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)
get_batch_ids(self, domain: Optional[Domain] = None, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)
get_metrics(self, metric_name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, enforce_numeric_metric: Union[str, bool] = False, replace_nan_with_zero: Union[str, bool] = False, domain: Optional[Domain] = None, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)

General multi-batch metric computation facility.

Computes the specified metric (can be multi-dimensional, numeric, non-numeric, or mixed) and conditions (or “sanitizes”) the result according to two criteria: enforcing metric output to be numeric and handling NaN values.

:param metric_name: Name of metric of interest, being computed.
:param metric_domain_kwargs: Metric Domain Kwargs is an essential parameter of the MetricConfiguration object.
:param metric_value_kwargs: Metric Value Kwargs is an essential parameter of the MetricConfiguration object.
:param enforce_numeric_metric: Flag controlling whether or not metric output must be numerically-valued.
:param replace_nan_with_zero: Directive controlling how NaN metric values, if encountered, should be handled.
:param domain: Domain object scoping “$variable”/“$parameter”-style references in configuration and runtime.
:param variables: Part of the “rule state” available for “$variable”-style references.
:param parameters: Part of the “rule state” available for “$parameter”-style references.
:return: MetricComputationResult object, containing both: data samples in the format “N x R^m”, where “N” (most significant dimension) is the number of measurements (e.g., one per Batch of data), while “R^m” is the multi-dimensional metric, whose values are being estimated, and details (to be used for metadata purposes).

_sanitize_metric_computation(self, metric_name: str, metric_values: np.ndarray, enforce_numeric_metric: Union[str, bool] = False, replace_nan_with_zero: Union[str, bool] = False, domain: Optional[Domain] = None, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)

This method conditions (or “sanitizes”) data samples in the format “N x R^m”, where “N” (most significant dimension) is the number of measurements (e.g., one per Batch of data), while “R^m” is the multi-dimensional metric, whose values are being estimated. The “conditioning” operations are:

  1. If the “enforce_numeric_metric” flag is set, raise an error if a non-numeric value is found in the sample vectors.

  2. Further, if a NaN is encountered in a sample vector and “replace_nan_with_zero” is True, then replace those NaN values with the 0.0 floating point number; if “replace_nan_with_zero” is False, then raise an error.
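The two conditioning operations can be illustrated with a minimal NumPy sketch. The function name `sanitize_metric_values` is hypothetical and is not part of the library. The sketch also assumes that the NaN check applies to any floating-point array, whether or not `enforce_numeric_metric` is set; the docstring above does not settle that point.

```python
import numpy as np


def sanitize_metric_values(
    metric_values,
    enforce_numeric_metric=False,
    replace_nan_with_zero=False,
):
    """Hypothetical stand-in for the conditioning described above."""
    values = np.asarray(metric_values)

    # Operation 1: optionally insist that every sample is numeric.
    if enforce_numeric_metric and not np.issubdtype(values.dtype, np.number):
        raise ValueError("Non-numeric value found in metric samples.")

    # Operation 2: NaN handling (assumed here to apply to any float array).
    if np.issubdtype(values.dtype, np.floating) and np.isnan(values).any():
        if not replace_nan_with_zero:
            raise ValueError("NaN found and replace_nan_with_zero is False.")
        values = np.where(np.isnan(values), 0.0, values)

    return values
```

For example, a 2 x 2 sample array (two Batches, a two-element metric) containing one NaN comes back with that entry set to 0.0 when `replace_nan_with_zero=True`, and raises `ValueError` otherwise.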

class great_expectations.rule_based_profiler.parameter_builder.RegexPatternStringParameterBuilder(name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, threshold: Union[float, str] = 1.0, candidate_regexes: Optional[Union[Iterable[str], str]] = None, data_context: Optional['DataContext'] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None)

Bases: great_expectations.rule_based_profiler.parameter_builder.parameter_builder.ParameterBuilder

Detects the domain REGEX from a set of candidate REGEX strings by computing the column_values.match_regex_format.unexpected_count metric for each candidate format and returning the format that has the lowest unexpected_count ratio.

CANDIDATE_REGEX :Set[str]
property metric_domain_kwargs(self)
property metric_value_kwargs(self)
property threshold(self)
property candidate_regexes(self)
_build_parameters(self, parameter_container: ParameterContainer, domain: Domain, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)

Check the percentage of values matching the REGEX string, and return the best fit, or None if no string exceeds the configured threshold.

Returns

ParameterContainer object that holds ParameterNode objects with attribute name-value pairs and optional details

_get_regex_matched_greater_than_threshold(self, regex_string_success_ratio_dict: dict, threshold: float)

Helper method to calculate which regex_strings match greater than threshold

_get_sorted_regex_and_ratios(self, regex_string_success_ratio_dict: dict)

Helper method to sort all regexes that were evaluated by their success ratio. Returns Tuple(ratio, sorted_strings)
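The selection logic described for this class (compute a success ratio per candidate, keep those that reach the threshold, return the best) can be sketched in plain Python. This is an illustration only: the function names are hypothetical, matching is done locally with `re.match` rather than through the metric computation, and `>=` is assumed for the threshold comparison so that the default threshold of 1.0 can be satisfied.

```python
import re
from typing import Dict, Iterable, Optional


def regex_success_ratios(
    values: Iterable[str], candidate_regexes: Iterable[str]
) -> Dict[str, float]:
    """Return the fraction of values matched by each candidate regex."""
    values = list(values)
    ratios: Dict[str, float] = {}
    for pattern in candidate_regexes:
        compiled = re.compile(pattern)
        matched = sum(1 for value in values if compiled.match(value))
        ratios[pattern] = matched / len(values) if values else 0.0
    return ratios


def best_regex(ratios: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-matching regex, or None if none reaches the threshold."""
    passing = {pattern: ratio for pattern, ratio in ratios.items() if ratio >= threshold}
    if not passing:
        return None
    return max(passing, key=passing.get)
```

With the values `["123", "456", "abc"]` and the candidates `^\d+$` and `^\w+$`, only `^\w+$` matches every value, so it is the one returned at a threshold of 1.0.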

class great_expectations.rule_based_profiler.parameter_builder.SimpleDateFormatStringParameterBuilder(name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, threshold: Union[float, str] = 1.0, candidate_strings: Optional[Union[Iterable[str], str]] = None, data_context: Optional['DataContext'] = None, batch_list: Optional[List[Batch]] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None)

Bases: great_expectations.rule_based_profiler.parameter_builder.parameter_builder.ParameterBuilder

Detects the domain date format from a set of candidate date format strings by computing the column_values.match_strftime_format.unexpected_count metric for each candidate format and returning the format that has the lowest unexpected_count ratio.

CANDIDATE_STRINGS :Set[str]
property metric_domain_kwargs(self)
property metric_value_kwargs(self)
property threshold(self)
property candidate_strings(self)
_build_parameters(self, parameter_container: ParameterContainer, domain: Domain, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)

Check the percentage of values matching each string, and return the best fit, or None if no string exceeds the configured threshold.

Returns

ParameterContainer object that holds ParameterNode objects with attribute name-value pairs and optional details
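As a rough illustration of the idea, the unexpected-count ratio for each candidate format can be approximated locally with `datetime.strptime`. The helper names below are hypothetical, the real builder obtains its counts from the metric rather than by parsing values itself, and `>=` is assumed for the threshold comparison.

```python
from datetime import datetime
from typing import Iterable, Optional


def matches_strftime_format(value: str, fmt: str) -> bool:
    """Return True if value parses under the given strftime-style format."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def best_date_format(
    values: Iterable[str],
    candidate_strings: Iterable[str],
    threshold: float = 1.0,
) -> Optional[str]:
    """Return the candidate with the lowest unexpected ratio, or None."""
    values = list(values)
    if not values:
        return None
    best, best_ratio = None, 0.0
    for fmt in candidate_strings:
        unexpected_count = sum(
            1 for value in values if not matches_strftime_format(value, fmt)
        )
        success_ratio = 1.0 - unexpected_count / len(values)
        # Keep a candidate only if it reaches the threshold and beats the best so far.
        if success_ratio >= threshold and success_ratio > best_ratio:
            best, best_ratio = fmt, success_ratio
    return best
```

Lowering `threshold` below 1.0 lets a format win even when a few values fail to parse, which mirrors the role of the `threshold` argument in the constructor.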

class great_expectations.rule_based_profiler.parameter_builder.MetricMultiBatchParameterBuilder(name: str, metric_name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, enforce_numeric_metric: Union[str, bool] = False, replace_nan_with_zero: Union[str, bool] = False, reduce_scalar_metric: Union[str, bool] = True, data_context: Optional['DataContext'] = None, batch_list: Optional[List[Batch]] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None)

Bases: great_expectations.rule_based_profiler.parameter_builder.parameter_builder.ParameterBuilder

A Single/Multi-Batch implementation for obtaining a resolved (evaluated) metric, using domain_kwargs, value_kwargs, and metric_name as arguments.

property metric_name(self)
property metric_domain_kwargs(self)
property metric_value_kwargs(self)
property enforce_numeric_metric(self)
property replace_nan_with_zero(self)
property reduce_scalar_metric(self)
_build_parameters(self, parameter_container: ParameterContainer, domain: Domain, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)

Builds ParameterContainer object that holds ParameterNode objects with attribute name-value pairs and optional details.

Returns

ParameterContainer object that holds ParameterNode objects with attribute name-value pairs and optional details

class great_expectations.rule_based_profiler.parameter_builder.NumericMetricRangeMultiBatchParameterBuilder(name: str, metric_name: str, metric_domain_kwargs: Optional[Union[str, dict]] = None, metric_value_kwargs: Optional[Union[str, dict]] = None, sampling_method: str = 'bootstrap', enforce_numeric_metric: Union[str, bool] = True, replace_nan_with_zero: Union[str, bool] = True, reduce_scalar_metric: Union[str, bool] = True, false_positive_rate: Union[str, float] = 0.05, num_bootstrap_samples: Optional[Union[str, int]] = None, round_decimals: Optional[Union[str, int]] = None, truncate_values: Optional[Union[str, Dict[str, Union[Optional[int], Optional[float]]]]] = None, data_context: Optional['DataContext'] = None, batch_list: Optional[List[Batch]] = None, batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None)

Bases: great_expectations.rule_based_profiler.parameter_builder.parameter_builder.ParameterBuilder

A Multi-Batch implementation for obtaining the range estimation bounds for a resolved (evaluated) numeric metric, using domain_kwargs, value_kwargs, metric_name, and false_positive_rate (tolerance) as arguments.

This Multi-Batch ParameterBuilder is general in the sense that any metric that computes numbers can be accommodated. On the other hand, it is specific in the sense that the parameter names will always have the semantics of numeric ranges, which will incorporate the requirements, imposed by the configured false_positive_rate tolerances.

The implementation supports two methods of estimating parameter values from data:

  • bootstrapped (default) – a statistical technique (see “https://en.wikipedia.org/wiki/Bootstrapping_(statistics)”)

  • one-shot – assumes that metric values, computed on batch data, are normally distributed and computes the mean and the standard error using the queried batches as the single sample of the distribution (fast, but inaccurate).
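To make the bootstrapped option concrete, here is a generic percentile-bootstrap sketch in NumPy. It is not the library's estimator: the function name, the `seed` argument, the default of 1000 resamples, and the choice to average per-resample quantiles are all assumptions made for illustration.

```python
import numpy as np


def bootstrap_range(
    metric_values,
    false_positive_rate=0.05,
    num_bootstrap_samples=1000,
    seed=0,
):
    """Hypothetical percentile-bootstrap estimate of a [low, high] range."""
    rng = np.random.default_rng(seed)
    values = np.asarray(metric_values, dtype=float)

    # Resample the per-Batch metric values with replacement.
    resamples = rng.choice(
        values, size=(num_bootstrap_samples, values.size), replace=True
    )

    # Split the tolerated false-positive mass evenly between both tails,
    # then average the tail quantiles over all resamples.
    lower = np.quantile(resamples, false_positive_rate / 2.0, axis=1).mean()
    upper = np.quantile(resamples, 1.0 - false_positive_rate / 2.0, axis=1).mean()
    return float(lower), float(upper)
```

Because every resample is drawn from the observed values, the estimated range always lies within the observed minimum and maximum.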

RECOGNIZED_SAMPLING_METHOD_NAMES :set
RECOGNIZED_TRUNCATE_DISTRIBUTION_KEYS :set
property metric_name(self)
property metric_domain_kwargs(self)
property metric_value_kwargs(self)
property sampling_method(self)
property enforce_numeric_metric(self)
property replace_nan_with_zero(self)
property reduce_scalar_metric(self)
property false_positive_rate(self)
property num_bootstrap_samples(self)
property round_decimals(self)
property truncate_values(self)
_build_parameters(self, parameter_container: ParameterContainer, domain: Domain, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)

Builds ParameterContainer object that holds ParameterNode objects with attribute name-value pairs and optional details.

Returns

ParameterContainer object that holds ParameterNode objects with attribute name-value pairs and optional details

The algorithm operates according to the following steps:

  1. Obtain batch IDs of interest using DataContext and BatchRequest (unless passed explicitly as argument). Note that this specific BatchRequest was specified as part of configuration for the present ParameterBuilder class.

  2. Set up metric_domain_kwargs and metric_value_kwargs (using configuration and/or variables and parameters).

  3. Instantiate the Validator object corresponding to BatchRequest (with a temporary expectation_suite_name) in order to have access to all Batch objects, on each of which the specified metric_name will be computed.

  4. Perform metric computations and obtain the result in the array-like form (one metric value per each Batch).

  5. Using the configured directives and heuristics, determine whether or not the ranges should be clipped.

  6. Using the configured directives and heuristics, determine if return values should be rounded to an integer.

  7. Convert the multi-dimensional metric computation results to a numpy array (for further computations).

Steps 8 – 10 are for the “oneshot” sampling method only (the “bootstrap” method achieves the same automatically):

  8. Compute the mean and the standard deviation of the metric (aggregated over all the gathered Batch objects).

  9. Compute the number of standard deviations (as floating point) needed (around the mean) to achieve the specified false_positive_rate (note that a false_positive_rate of 0.0 would result in an infinite number of standard deviations, hence it is “nudged” by a small quantity “epsilon” above 0.0 if a false_positive_rate of 0.0 appears as argument). (Please refer to “https://en.wikipedia.org/wiki/Normal_distribution” and references therein for background.)

  10. Compute the “band” around the mean as the min_value and max_value (to be used in ExpectationConfiguration).

  11. Return [low, high] for the desired metric as estimated by the specified sampling method.

  12. Set up the arguments and call build_parameter_container() to store the parameter as part of “rule state”.
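The “oneshot” steps (mean and standard deviation, number of standard deviations for the given false_positive_rate, then the band around the mean) can be sketched as follows. The function name and the `EPSILON` value are hypothetical, the sample standard deviation (`ddof=1`) is an assumption, and the false-positive mass is assumed to be split evenly between the two tails.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical "nudge" used when a false_positive_rate of 0.0 is supplied.
EPSILON = 1e-9


def oneshot_range(metric_values, false_positive_rate=0.05):
    """Hypothetical normal-theory estimate of [min_value, max_value]."""
    values = np.asarray(metric_values, dtype=float)
    if false_positive_rate <= 0.0:
        # A rate of exactly 0.0 would require infinitely many standard deviations.
        false_positive_rate = EPSILON

    mean = float(values.mean())
    std = float(values.std(ddof=1))  # sample standard deviation (assumption)

    # Number of standard deviations leaving false_positive_rate / 2 in each tail.
    num_stds = NormalDist().inv_cdf(1.0 - false_positive_rate / 2.0)

    # The "band" around the mean.
    return mean - num_stds * std, mean + num_stds * std
```

With the default false_positive_rate of 0.05, `num_stds` is approximately 1.96, which gives the familiar 95% band around the mean.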

_estimate_metric_value_range(self, metric_values: np.ndarray, estimator: Callable, domain: Optional[Domain] = None, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None, **kwargs)

This method accepts an estimator Callable and data samples in the format “N x R^m”, where “N” (most significant dimension) is the number of measurements (e.g., one per Batch of data), while “R^m” is the multi-dimensional metric, whose values are being estimated. Thus, for each element in the “R^m” hypercube, an “N”-dimensional vector of sample measurements is constructed and given to the estimator to apply its specific algorithm for computing the range of values in this vector. Estimator algorithms differ based on their use of data samples.
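A minimal NumPy sketch of this element-wise application is given below. The function name mirrors the method for readability only; the estimator signature (a callable returning a `(low, high)` pair) and the output layout (a trailing axis of size 2) are assumptions, not the library's contract.

```python
import numpy as np


def estimate_metric_value_range(metric_values, estimator, **kwargs):
    """Apply estimator to the N samples of every element of the R^m metric."""
    values = np.asarray(metric_values, dtype=float)
    metric_shape = values.shape[1:]  # the "R^m" part; axis 0 holds the N samples
    result = np.empty(metric_shape + (2,))

    for index in np.ndindex(metric_shape):
        # N-dimensional vector of measurements for this metric element.
        sample = values[(slice(None),) + index]
        result[index] = estimator(sample, **kwargs)

    return result
```

For three Batches and a two-element metric, `estimate_metric_value_range([[1, 10], [2, 20], [3, 30]], lambda s: (s.min(), s.max()))` yields one `[low, high]` pair per metric element. A scalar metric (a plain length-N vector) yields a single pair.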

_get_truncate_values_using_heuristics(self, metric_values: np.ndarray, domain: Domain, *, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)
_get_round_decimals_using_heuristics(self, metric_values: np.ndarray, domain: Domain, *, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None)
_get_bootstrap_estimate(self, metric_values: np.ndarray, domain: Domain, *, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None, **kwargs)
_get_deterministic_estimate(self, metric_values: np.ndarray, domain: Domain, *, variables: Optional[ParameterContainer] = None, parameters: Optional[Dict[str, ParameterContainer]] = None, **kwargs)