great_expectations.dataset.util
Module Contents
Functions
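| is_valid_partition_object(partition_object) | Tests whether a given object is a valid continuous or categorical partition object. |
| is_valid_categorical_partition_object(partition_object) | Tests whether a given object is a valid categorical partition object. |
| is_valid_continuous_partition_object(partition_object) | Tests whether a given object is a valid continuous partition object. See Partition Objects. |
| categorical_partition_data(data) | Convenience method for creating weights from categorical data. |
| kde_partition_data(data, estimate_tails=True) | Convenience method for building a partition and weights using a gaussian Kernel Density Estimate and default bandwidth. |
| partition_data(data, bins='auto', n_bins=10) | |
| continuous_partition_data(data, bins='auto', n_bins=10, **kwargs) | Convenience method for building a partition object on continuous data |
| build_continuous_partition_object(dataset, column, bins='auto', n_bins=10, allow_relative_error=False) | Convenience method for building a partition object on continuous data from a dataset and column |
| build_categorical_partition_object(dataset, column, sort='value') | Convenience method for building a partition object on categorical data from a dataset and column |
| infer_distribution_parameters(data, distribution, params=None) | Convenience method for determining the shape parameters of a given distribution |
| _scipy_distribution_positional_args_from_dict(distribution, params) | Helper function that returns positional arguments for a scipy distribution using a dict of parameters. |
| validate_distribution_parameters(distribution, params) | Ensures that necessary parameters for a distribution are present and that all parameters are sensical. |
| create_multiple_expectations(df, columns, expectation_type, *args, **kwargs) | Creates an identical expectation for each of the given columns with the specified arguments, if any. |
| get_approximate_percentile_disc_sql(selects: List, sql_engine_dialect: Any) → str | |
| check_sql_engine_dialect(actual_sql_engine_dialect: Any, candidate_sql_engine_dialect: Any) → bool | |
| validate_mostly(mostly: Optional[Union[int, float]]) → None | Validate mostly parameter is a number between 0 and 1 or None. |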
- great_expectations.dataset.util.logger
- great_expectations.dataset.util.DefaultDialect
- great_expectations.dataset.util.is_valid_partition_object(partition_object)
Tests whether a given object is a valid continuous or categorical partition object.
- Parameters
partition_object – The partition_object to evaluate
- Returns
Boolean
- great_expectations.dataset.util.is_valid_categorical_partition_object(partition_object)
Tests whether a given object is a valid categorical partition object.
- Parameters
partition_object – The partition_object to evaluate
- Returns
Boolean
- great_expectations.dataset.util.is_valid_continuous_partition_object(partition_object)
Tests whether a given object is a valid continuous partition object. See Partition Objects.
- Parameters
partition_object – The partition_object to evaluate
- Returns
Boolean
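As a rough illustration of what such a check involves, the sketch below tests a continuous partition object in plain Python. The rules it applies (one more bin edge than weights, strictly increasing edges, non-negative weights summing to 1) and the name `looks_like_continuous_partition` are assumptions made for illustration, not the library's implementation.

```python
import math


def looks_like_continuous_partition(partition_object):
    """Hypothetical check; the rules below are assumptions, not the library's code."""
    if not isinstance(partition_object, dict):
        return False
    bins = partition_object.get("bins")
    weights = partition_object.get("weights")
    if bins is None or weights is None:
        return False
    # n bin edges delimit n - 1 intervals, so expect one weight per interval.
    if len(bins) != len(weights) + 1:
        return False
    # Bin edges must be strictly increasing.
    if any(lo >= hi for lo, hi in zip(bins, bins[1:])):
        return False
    # Weights are treated here as non-negative fractions that sum to 1.
    if any(w < 0 for w in weights):
        return False
    return math.isclose(sum(weights), 1.0, abs_tol=1e-9)


print(looks_like_continuous_partition({"bins": [0, 1, 2], "weights": [0.25, 0.75]}))
print(looks_like_continuous_partition({"bins": [0, 1], "weights": [0.5, 0.5]}))
```

The first call passes every rule; the second fails because two weights need three bin edges.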
- great_expectations.dataset.util.categorical_partition_data(data)
Convenience method for creating weights from categorical data.
- Parameters
data (list-like) – The data from which to construct the estimate.
- Returns
A new partition object:
{
    "values": (list) The categorical values present in the data,
    "weights": (list) The weights of the values in the partition.
}
See Partition Objects.
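The idea of turning categorical data into values and weights can be sketched with the standard library alone. The function name `categorical_weights` is hypothetical, and the choice to normalise counts to fractions and to list values in order of first appearance is an assumption; the library function may order or compute them differently.

```python
from collections import Counter


def categorical_weights(data):
    """Hypothetical stand-in: count each value and normalise counts to fractions."""
    counts = Counter(data)
    total = sum(counts.values())
    values = list(counts)  # insertion order: first appearance in the data
    weights = [counts[v] / total for v in values]
    return {"values": values, "weights": weights}


partition = categorical_weights(["a", "b", "a", "c", "a", "b"])
print(partition)
```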
- great_expectations.dataset.util.kde_partition_data(data, estimate_tails=True)
Convenience method for building a partition and weights using a gaussian Kernel Density Estimate and default bandwidth.
- Parameters
data (list-like) – The data from which to construct the estimate
estimate_tails (bool) – Whether to estimate the tails of the distribution to keep the partition object finite
- Returns
A new partition_object:
{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}
See Partition Objects.
- great_expectations.dataset.util.partition_data(data, bins='auto', n_bins=10)
- great_expectations.dataset.util.continuous_partition_data(data, bins='auto', n_bins=10, **kwargs)
Convenience method for building a partition object on continuous data
- Parameters
data (list-like) – The data from which to construct the estimate.
bins (string) – One of ‘uniform’ (for uniformly spaced bins), ‘ntile’ (for percentile-spaced bins), or ‘auto’ (for automatically spaced bins)
n_bins (int) – Ignored if bins is auto.
kwargs (mapping) – Additional keyword arguments to be passed to numpy histogram
- Returns
A new partition_object:
{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}
See Partition Objects.
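To make the ‘uniform’ option concrete, here is a minimal pure-Python sketch of uniformly spaced bins. It does not call numpy histogram, the name `uniform_partition` is hypothetical, and it assumes each weight is the fraction of observations falling in the bin; treat it as an illustration of the shape of the result rather than the library's computation.

```python
def uniform_partition(data, n_bins=10):
    """Hypothetical sketch of uniformly spaced bins; not the library's code."""
    lo, hi = min(data), max(data)
    if lo == hi:
        raise ValueError("data must span more than one distinct value")
    width = (hi - lo) / n_bins
    # n_bins intervals need n_bins + 1 edges; the last edge is the maximum.
    bins = [lo + i * width for i in range(n_bins)] + [hi]
    counts = [0] * n_bins
    for x in data:
        # Clamp so the maximum lands in the last bin (right edge inclusive).
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Assumption: a weight is the fraction of observations in each bin.
    weights = [c / len(data) for c in counts]
    return {"bins": bins, "weights": weights}


partition = uniform_partition([1, 2, 2, 3, 4, 5, 5, 5], n_bins=4)
print(partition)
```

With four bins over the range 1 to 5, the result has five bin edges and four weights, matching the "bins" and "weights" layout described above.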
- great_expectations.dataset.util.build_continuous_partition_object(dataset, column, bins='auto', n_bins=10, allow_relative_error=False)
Convenience method for building a partition object on continuous data from a dataset and column
- Parameters
dataset (GX Dataset) – The dataset for which to compute the partition.
column (string) – The name of the column for which to construct the estimate.
bins (string) – One of ‘uniform’ (for uniformly spaced bins), ‘ntile’ (for percentile-spaced bins), or ‘auto’ (for automatically spaced bins)
n_bins (int) – Ignored if bins is auto.
allow_relative_error – Passed to get_column_quantiles. Set to False for only precise values, to True to allow approximate values on systems that offer only a binary choice (e.g. Redshift), or to a value between zero and one on systems that allow a relative error to be specified (e.g. SparkDFDataset).
- Returns
A new partition_object:
{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}
See Partition Objects.
- great_expectations.dataset.util.build_categorical_partition_object(dataset, column, sort='value')
Convenience method for building a partition object on categorical data from a dataset and column
- Parameters
dataset (GX Dataset) – The dataset for which to compute the partition.
column (string) – The name of the column for which to construct the estimate.
sort (string) – Must be one of “value”, “count”, or “none”.
- If “value”, the values in the resulting partition object are sorted lexicographically.
- If “count”, the values are sorted by descending count (frequency).
- If “none”, the values are not sorted.
- Returns
A new partition_object:
{
    "values": (list) The categorical values to which each weight applies,
    "weights": (list) The densities of the values implied by the partition.
}
See Partition Objects.
- great_expectations.dataset.util.infer_distribution_parameters(data, distribution, params=None)
Convenience method for determining the shape parameters of a given distribution
- Parameters
data (list-like) – The data to build shape parameters from.
distribution (string) – Scipy distribution, determines which parameters to build.
params (dict or None) – The known parameters. Parameters given here will not be altered. Keep as None to infer all necessary parameters from the data.
- Returns
A dictionary of named parameters:
{
    "mean": (float),
    "std_dev": (float),
    "loc": (float),
    "scale": (float),
    "alpha": (float),
    "beta": (float),
    "min": (float),
    "max": (float),
    "df": (float)
}
See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
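The rule that known parameters are kept and only missing ones are inferred can be sketched for a normal distribution as follows. The name `infer_normal_parameters` is hypothetical, and using the sample standard deviation is an assumption; the library may estimate it differently.

```python
import statistics


def infer_normal_parameters(data, params=None):
    """Hypothetical sketch: fill in missing 'mean' and 'std_dev' from the data."""
    inferred = dict(params or {})  # copy so the caller's dict is not mutated
    # setdefault leaves any parameter the caller already supplied untouched.
    inferred.setdefault("mean", statistics.mean(data))
    inferred.setdefault("std_dev", statistics.stdev(data))  # sample std (assumption)
    return inferred


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(infer_normal_parameters(data))
print(infer_normal_parameters(data, {"std_dev": 2.0}))
```

In the second call the supplied "std_dev" is returned unchanged and only "mean" is computed from the data.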
- great_expectations.dataset.util._scipy_distribution_positional_args_from_dict(distribution, params)
Helper function that returns positional arguments for a scipy distribution using a dict of parameters.
See the cdf() function here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html#Methods to see an example of scipy’s positional arguments. This function returns the arguments expected by the cdf() method of the corresponding scipy.stats distribution.
- Parameters
distribution (string) – The scipy distribution name.
params (dict) – A dict of named parameters.
- Raises
AttributeError – If an unsupported distribution is provided.
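A minimal sketch of this kind of mapping is shown below for two distributions. The function name `positional_args_from_dict`, the set of supported distributions, and the defaults of 0 and 1 for ‘loc’ and ‘scale’ are assumptions for illustration; the ordering follows the scipy cdf conventions (`norm.cdf(x, loc, scale)` and `beta.cdf(x, a, b, loc, scale)`).

```python
def positional_args_from_dict(distribution, params):
    """Hypothetical helper: order named parameters as scipy's cdf() expects them."""
    if distribution == "norm":
        # For norm, scipy's loc is the mean and scale is the standard deviation.
        return params["mean"], params["std_dev"]
    if distribution == "beta":
        # Shape parameters a and b come first, then the optional loc and scale.
        return (
            params["alpha"],
            params["beta"],
            params.get("loc", 0),
            params.get("scale", 1),
        )
    raise AttributeError(f"Unsupported distribution: {distribution}")


print(positional_args_from_dict("norm", {"mean": 40, "std_dev": 5}))
print(positional_args_from_dict("beta", {"alpha": 2, "beta": 3}))
```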
- great_expectations.dataset.util.validate_distribution_parameters(distribution, params)
Ensures that necessary parameters for a distribution are present and that all parameters are sensical.
If parameters necessary to construct a distribution are missing or invalid, this function raises ValueError with an informative description. Note that ‘loc’ and ‘scale’ are optional arguments, and that ‘scale’ must be positive.
- Parameters
distribution (string) – The scipy distribution name, e.g. normal distribution is ‘norm’.
params (dict or list) – The distribution shape parameters, either as a named dictionary or as a positional list following the scipy cdf argument scheme. For example, params={‘mean’: 40, ‘std_dev’: 5} or params=[40, 5].
- Raises
ValueError – With an informative description, usually when necessary parameters are omitted or are invalid.
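The kind of validation described here can be sketched for the ‘norm’ case, accepting either the dictionary or the positional form. The name `check_norm_parameters` and the exact rules (both parameters required, positive ‘std_dev’, positive ‘scale’ when given) are assumptions, not the library's checks.

```python
def check_norm_parameters(params):
    """Hypothetical validator for 'norm'; accepts a dict or a positional list."""
    if isinstance(params, dict):
        missing = {"mean", "std_dev"} - params.keys()
        if missing:
            raise ValueError(f"norm is missing parameters: {sorted(missing)}")
        std_dev = params["std_dev"]
        # 'scale' is optional, but must be positive when supplied.
        if params.get("scale", 1) <= 0:
            raise ValueError("scale must be positive")
    else:
        # Positional form: [mean, std_dev]
        if len(params) < 2:
            raise ValueError("norm requires [mean, std_dev]")
        std_dev = params[1]
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")


check_norm_parameters({"mean": 40, "std_dev": 5})  # passes silently
check_norm_parameters([40, 5])  # passes silently
```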
- great_expectations.dataset.util.create_multiple_expectations(df, columns, expectation_type, *args, **kwargs)
Creates an identical expectation for each of the given columns with the specified arguments, if any.
- Parameters
df (great_expectations.dataset) – A great expectations dataset object.
columns (list) – A list of column names represented as strings.
expectation_type (string) – The expectation type.
- Raises
KeyError – If the provided column does not exist.
AttributeError – If the provided expectation type does not exist or df is not a valid great expectations dataset.
- Returns
A list of expectation results.
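The pattern of applying one expectation to several columns can be sketched with a stand-in object. `FakeDataset` and `apply_to_columns` are hypothetical names; the stand-in only mimics a dataset method so the example runs without the library, and the lookup by attribute name is an assumption about how such a helper might work.

```python
class FakeDataset:
    """Hypothetical stand-in for a dataset; it only checks column membership."""

    def __init__(self, columns):
        self.columns = columns

    def expect_column_to_exist(self, column, **kwargs):
        return {"column": column, "success": column in self.columns}


def apply_to_columns(df, columns, expectation_type, *args, **kwargs):
    # getattr raises AttributeError if df has no method with that name.
    expectation = getattr(df, expectation_type)
    # Call the same expectation once per column, forwarding extra arguments.
    return [expectation(column, *args, **kwargs) for column in columns]


df = FakeDataset(["id", "name"])
results = apply_to_columns(df, ["id", "age"], "expect_column_to_exist")
print(results)
```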
- great_expectations.dataset.util.get_approximate_percentile_disc_sql(selects: List, sql_engine_dialect: Any) → str
- great_expectations.dataset.util.check_sql_engine_dialect(actual_sql_engine_dialect: Any, candidate_sql_engine_dialect: Any) → bool
- great_expectations.dataset.util.validate_mostly(mostly: Optional[Union[int, float]]) → None
Validate mostly parameter is a number between 0 and 1 or None.
- Parameters
mostly – The mostly parameter for an expectation configuration.
- Raises
AssertionError – Raised if mostly is defined and is not a number between 0 and 1.
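A check with this behaviour can be sketched in a few lines. The name `check_mostly` and the error messages are hypothetical, and treating the bounds as inclusive is an assumption. Because it relies on assert statements, the check disappears when Python runs with the -O flag.

```python
from typing import Optional, Union


def check_mostly(mostly: Optional[Union[int, float]]) -> None:
    """Hypothetical sketch: None is allowed; otherwise require 0 <= mostly <= 1."""
    if mostly is not None:
        assert isinstance(mostly, (int, float)), "'mostly' must be an int or a float"
        assert 0 <= mostly <= 1, "'mostly' must be between 0 and 1 (inclusive)"


check_mostly(None)  # passes: the parameter is optional
check_mostly(0.95)  # passes: within the allowed range
```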