Generator Module

class great_expectations.datasource.generator.batch_generator.BatchGenerator(name, datasource=None)

Generators produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. They add flexibility in how data is obtained, for example through time-based partitioning, downsampling, or other techniques appropriate for the datasource.

For example, a generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
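As a rough illustration of that example, the sketch below builds query-style batch_kwargs for a single day. The helper name `build_daily_query_batch_kwargs`, the `events` table, the `timestamp` column, and the `"query"` key are assumptions made for illustration; they are not taken from the Great Expectations API.

```python
from datetime import date, timedelta


def build_daily_query_batch_kwargs(table, timestamp_column, day):
    """Hypothetical helper: describe one day of rows as query-style batch_kwargs."""
    next_day = day + timedelta(days=1)
    # Half-open interval [day, next_day) selects every row stamped on `day`.
    # Names are interpolated directly, which is acceptable only in a sketch.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {timestamp_column} >= '{day.isoformat()}' "
        f"AND {timestamp_column} < '{next_day.isoformat()}'"
    )
    return {"query": query}


batch_kwargs = build_daily_query_batch_kwargs("events", "timestamp", date(2012, 2, 7))
print(batch_kwargs["query"])
```

A datasource could then execute the stored query to materialize the batch; how it does so depends on the datasource.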

A batch is a sample from a data asset, sliced according to a particular rule. For example, an hourly slice of the Events table or the “most recent users” records.

A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_kwargs” assembled by the generator. While not every datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.

Example Generator Configurations follow:

my_datasource_1:
  class_name: PandasDatasource
  generators:
    # This generator will provide two data assets, corresponding to the globs defined under the "file_logs"
    # and "data_asset_2" keys. The file_logs asset will be partitioned according to the match group
    # defined in partition_regex
    default:
      class_name: GlobReaderGenerator
      base_directory: /var/logs
      reader_options:
        sep: "
      asset_globs:
        file_logs:
          glob: logs/*.gz
          partition_regex: logs/file_(\d{0,4})_\.log\.gz
        data_asset_2:
          glob: data/*.csv

my_datasource_2:
  class_name: PandasDatasource
  generators:
    # This generator will create one data asset per subdirectory in /data
    # Each asset will have partitions corresponding to the filenames in that subdirectory
    default:
      class_name: SubdirReaderGenerator
      reader_options:
        sep: "
      base_directory: /data

my_datasource_3:
  class_name: SqlAlchemyDatasource
  generators:
    # This generator will look for a file named after the requested generator asset, with the
    # .sql suffix, and use the query it contains to generate data
    default:
      class_name: QueryGenerator

get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

get_config()
reset_iterator(data_asset_name, **kwargs)
get_iterator(data_asset_name, **kwargs)
build_batch_kwargs_from_partition_id(generator_asset, partition_id=None, batch_kwargs=None, **kwargs)

Build batch kwargs for the named generator_asset based on partition_id and optionally existing batch_kwargs.

Parameters

generator_asset – the generator_asset for which to build batch_kwargs.

partition_id – the partition id.

batch_kwargs – any existing batch_kwargs object to use. Will be supplemented with configured information.

**kwargs – any additional kwargs to use. Will be added to the returned batch_kwargs.

Returns

A BatchKwargs object
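The way existing batch_kwargs, configured information, and extra keyword arguments combine can be sketched with plain dictionaries. This is an illustration only: `build_batch_kwargs_sketch` and `configured` are hypothetical names, and the precedence shown (caller-supplied keys win over configured ones, extra kwargs are applied last) is an assumption rather than documented behavior.

```python
def build_batch_kwargs_sketch(configured, batch_kwargs=None, **kwargs):
    """Hypothetical stand-in for the merge performed when building batch_kwargs."""
    # Copy so the caller's dictionary is not mutated.
    result = dict(batch_kwargs or {})
    # Assumption: configured values only fill keys the caller did not supply.
    for key, value in configured.items():
        result.setdefault(key, value)
    # Any additional keyword arguments are added to the returned batch_kwargs.
    result.update(kwargs)
    return result


existing = {"path": "/data/events/2012-02-07.csv"}
merged = build_batch_kwargs_sketch(
    {"reader_options": {"sep": ","}}, batch_kwargs=existing, limit=100
)
print(merged)
```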

yield_batch_kwargs(data_asset_name, **kwargs)

InMemoryGenerator

class great_expectations.datasource.generator.in_memory_generator.InMemoryGenerator(name='default', datasource=None)

Bases: great_expectations.datasource.generator.batch_generator.BatchGenerator

A basic generator that simply captures an existing object.

get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

build_batch_kwargs_from_partition_id(generator_asset, partition_id=None, batch_kwargs=None, **kwargs)

Build batch kwargs for the named generator_asset based on partition_id and optionally existing batch_kwargs.

Parameters

generator_asset – the generator_asset for which to build batch_kwargs.

partition_id – the partition id.

batch_kwargs – any existing batch_kwargs object to use. Will be supplemented with configured information.

**kwargs – any additional kwargs to use. Will be added to the returned batch_kwargs.

Returns

A BatchKwargs object

QueryGenerator

class great_expectations.datasource.generator.query_generator.QueryGenerator(name='default', datasource=None, queries=None)

Bases: great_expectations.datasource.generator.batch_generator.BatchGenerator

Produce query-style batch_kwargs from SQL files stored on disk.
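The lookup described in the example configuration (a file named after the requested asset with a .sql suffix) might be sketched as follows. The helper `load_query_batch_kwargs`, the directory layout, and the `"query"` key are assumptions for illustration and do not show how QueryGenerator is actually implemented.

```python
import os
import tempfile


def load_query_batch_kwargs(query_directory, data_asset_name):
    """Hypothetical helper: read <data_asset_name>.sql and wrap it as batch_kwargs."""
    path = os.path.join(query_directory, data_asset_name + ".sql")
    with open(path) as sql_file:
        return {"query": sql_file.read().strip()}


# Demonstrate with a throwaway directory holding one query file.
with tempfile.TemporaryDirectory() as query_directory:
    with open(os.path.join(query_directory, "recent_users.sql"), "w") as f:
        f.write("SELECT * FROM users ORDER BY created_at DESC LIMIT 100\n")
    batch_kwargs = load_query_batch_kwargs(query_directory, "recent_users")

print(batch_kwargs)
```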

add_query(data_asset_name, query)
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

build_batch_kwargs_from_partition_id(generator_asset, partition_id=None, batch_kwargs=None, **kwargs)

Build batch kwargs from a partition id.

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

SubdirReaderGenerator

class great_expectations.datasource.generator.subdir_reader_generator.SubdirReaderGenerator(name='default', datasource=None, base_directory='/data', reader_options=None)

Bases: great_expectations.datasource.generator.batch_generator.BatchGenerator

The SubdirReaderGenerator inspects a filesystem and produces batch_kwargs with a path and timestamp.

SubdirReaderGenerator recognizes generator_asset using two criteria:
  • for files directly in ‘base_directory’ with recognized extensions (.csv, .tsv, .parquet, .xls, .xlsx, .json), it uses the name of the file without the extension

  • for other files or directories in ‘base_directory’, it uses the file or directory name

SubdirReaderGenerator sees all files inside a directory of base_directory as batches of one data asset.

SubdirReaderGenerator can also include configured reader_options which will be added to batch_kwargs generated by this generator.
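The two naming rules above can be sketched in a few lines of Python. The helper `asset_name_for_entry` is hypothetical, and case-sensitive extension matching is an assumption; only the list of extensions comes from the description above.

```python
import os

# Extensions listed above as recognized by SubdirReaderGenerator.
KNOWN_EXTENSIONS = {".csv", ".tsv", ".parquet", ".xls", ".xlsx", ".json"}


def asset_name_for_entry(entry_name, is_directory):
    """Hypothetical helper mirroring the two naming rules described above."""
    stem, extension = os.path.splitext(entry_name)
    if not is_directory and extension in KNOWN_EXTENSIONS:
        return stem  # recognized file: drop the extension
    return entry_name  # directory or other file: keep the full name


entries = [("users.csv", False), ("events", True), ("notes.txt", False)]
asset_names = [asset_name_for_entry(name, is_dir) for name, is_dir in entries]
print(asset_names)
```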

property reader_options
property base_directory
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

build_batch_kwargs_from_partition_id(generator_asset, partition_id=None, batch_kwargs=None, **kwargs)

Build batch kwargs for the named generator_asset based on partition_id and optionally existing batch_kwargs.

Parameters

generator_asset – the generator_asset for which to build batch_kwargs.

partition_id – the partition id.

batch_kwargs – any existing batch_kwargs object to use. Will be supplemented with configured information.

**kwargs – any additional kwargs to use. Will be added to the returned batch_kwargs.

Returns

A BatchKwargs object

GlobReaderGenerator

class great_expectations.datasource.generator.glob_reader_generator.GlobReaderGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, asset_globs=None)

Bases: great_expectations.datasource.generator.batch_generator.BatchGenerator

GlobReaderGenerator processes files in a directory according to glob patterns to produce batches of data.

An asset_glob that also defines a partition_regex might look like the following:

daily_logs:
  glob: daily_logs/*.csv
  partition_regex: daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv

The “glob” key ensures that every CSV file in the daily_logs directory is considered a batch for this data asset. The “partition_regex” key ensures that files whose basename begins with a date (with components separated by a hyphen, space, forward slash, or period) are identified by a partition_id equal to just the date portion of their name.
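The effect of this partition_regex can be checked with Python's re module. The sketch assumes that the first capture group becomes the partition_id and that non-matching files yield no partition_id; the file names and the helper `partition_id_for` are made up for illustration.

```python
import re

PARTITION_REGEX = re.compile(
    r"daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv"
)


def partition_id_for(path):
    """Return the date portion of a matching path, or None when there is no match."""
    match = PARTITION_REGEX.match(path)
    # Assumption: the first capture group is what becomes the partition_id.
    return match.group(1) if match else None


paths = [
    "daily_logs/2019-08-01_web.csv",
    "daily_logs/2019.08.02_web.csv",
    "daily_logs/summary.csv",
]
partition_ids = [partition_id_for(path) for path in paths]
print(partition_ids)
```

The first two paths yield their date prefix, while the third has no date and is not matched by the expression.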

A fully configured GlobReaderGenerator in yml might look like the following:

my_datasource:
  class_name: PandasDatasource
  generators:
    my_generator:
      class_name: GlobReaderGenerator
      base_directory: /var/log
      reader_options:
        sep: "%"
        header: 0
      asset_globs:
        wifi_logs:
          glob: wifi*.log
          partition_regex: wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log
property reader_options
property asset_globs
property base_directory
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

build_batch_kwargs_from_partition_id(generator_asset, partition_id=None, batch_kwargs=None, **kwargs)

Build batch kwargs from a partition id.

S3Generator

class great_expectations.datasource.generator.s3_generator.S3Generator(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, max_keys=1000)

Bases: great_expectations.datasource.generator.batch_generator.BatchGenerator

S3 Generator provides support for generating batches of data from an S3 bucket. For the S3 generator, assets must be individually defined using a prefix and glob, although several additional configuration parameters are available for assets (see below).

Example configuration:

datasources:
  my_datasource:
    ...
    generators:
      my_s3_generator:
        class_name: S3Generator
        bucket: my_bucket.my_organization.priv
        reader_method: parquet  # This will be automatically inferred from suffix where possible, but can be explicitly specified as well
        reader_options:  # Note that reader options can be specified globally or per-asset
          sep: ","
        delimiter: "/"  # Note that this is the delimiter for the BUCKET KEYS. By default it is "/"
        max_keys: 100  # The maximum number of keys to fetch in a single list_objects request to s3. When accessing batch_kwargs through an iterator, the iterator will silently refetch if more keys were available
        assets:
          my_first_asset:
            prefix: my_first_asset/
            regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
          access_logs:
            prefix: access_logs
            regex_filter: access_logs/2019.*\.csv.gz
            sep: "~"
            max_keys: 100
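The interplay of prefix and regex_filter for the access_logs asset can be sketched without touching S3. Here a plain list stands in for the bucket listing, `filter_keys` is a hypothetical helper, and matching the expression from the start of the key is an assumption about how the filter is applied.

```python
import re


def filter_keys(keys, prefix, regex_filter):
    """Hypothetical helper: keep keys under `prefix` that match `regex_filter`."""
    pattern = re.compile(regex_filter)
    return [
        key for key in keys
        if key.startswith(prefix) and pattern.match(key)
    ]


# Made-up bucket keys standing in for an S3 listing.
keys = [
    "access_logs/2019-01-01.csv.gz",
    "access_logs/2018-12-31.csv.gz",
    "access_logs/2019-01-01.json",
    "my_first_asset/part-0.parquet",
]
matched = filter_keys(keys, "access_logs", r"access_logs/2019.*\.csv.gz")
print(matched)
```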
property reader_options
property assets
property bucket
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

build_batch_kwargs_from_partition_id(generator_asset, partition_id=None, batch_kwargs=None, **kwargs)

Build batch kwargs for the named generator_asset based on partition_id and optionally existing batch_kwargs.

Parameters

generator_asset – the generator_asset for which to build batch_kwargs.

partition_id – the partition id.

batch_kwargs – any existing batch_kwargs object to use. Will be supplemented with configured information.

**kwargs – any additional kwargs to use. Will be added to the returned batch_kwargs.

Returns

A BatchKwargs object

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

DatabricksTableGenerator

class great_expectations.datasource.generator.databricks_generator.DatabricksTableGenerator(name='default', datasource=None, database='default')

Bases: great_expectations.datasource.generator.batch_generator.BatchGenerator

Meant to be used in a Databricks notebook.

get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names