Generator Module

class great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator(name, datasource)

BatchKwargsGenerators produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. They add flexibility in how data is obtained, for example through time-based partitioning, downsampling, or other techniques appropriate for the datasource.

For example, a generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
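As a rough illustration of the example above, batch_kwargs can be pictured as a plain dictionary that carries the query. This is only a sketch, not the library's implementation: the function name build_query_batch_kwargs, the timestamp column name, and the exact dictionary key are assumptions made for illustration.

```python
from datetime import date, timedelta


def build_query_batch_kwargs(table, day):
    """Return a hypothetical batch_kwargs dict selecting one day of rows."""
    next_day = day + timedelta(days=1)
    # The half-open interval [day, next_day) covers every timestamp on `day`.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE timestamp >= '{day.isoformat()}' "
        f"AND timestamp < '{next_day.isoformat()}'"
    )
    # The "query" key is an assumed name for the identifying information.
    return {"query": query}


batch_kwargs = build_query_batch_kwargs("Events", date(2012, 2, 7))
print(batch_kwargs["query"])
```

Because the dictionary fully describes how the batch was selected, it can be stored alongside validation results and, where the datasource allows, reused to fetch the same batch again.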

A batch is a sample from a data asset, sliced according to a particular rule: for example, an hourly slice of the Events table or the “most recent users records.”

A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_kwargs” assembled by the generator. While not every datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.

Example Generator Configurations follow:

my_datasource_1:
  class_name: PandasDatasource
  generators:
    # This generator will provide two data assets, corresponding to the globs defined under the "file_logs"
    # and "data_asset_2" keys. The file_logs asset will be partitioned according to the match group
    # defined in partition_regex
    default:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/logs
      reader_options:
        sep: "
      asset_globs:
        file_logs:
          glob: logs/*.gz
          partition_regex: logs/file_(\d{0,4})_\.log\.gz
        data_asset_2:
          glob: data/*.csv

my_datasource_2:
  class_name: PandasDatasource
  generators:
    # This generator will create one data asset per subdirectory in /data
    # Each asset will have partitions corresponding to the filenames in that subdirectory
    default:
      class_name: SubdirReaderBatchKwargsGenerator
      reader_options:
        sep: "
      base_directory: /data

my_datasource_3:
  class_name: SqlAlchemyDatasource
  generators:
    # This generator will look for a file whose name is the requested generator asset's name plus the
    # .sql suffix, and use the query it contains to generate data
    default:
      class_name: QueryBatchKwargsGenerator
recognized_batch_parameters = {}
property name
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

get_config()
reset_iterator(generator_asset, **kwargs)
get_iterator(generator_asset, **kwargs)
build_batch_kwargs(name=None, partition_id=None, **kwargs)

The key workhorse. Docs forthcoming.

yield_batch_kwargs(generator_asset, **kwargs)

InMemoryGenerator

QueryBatchKwargsGenerator

class great_expectations.datasource.generator.query_generator.QueryBatchKwargsGenerator(name='default', datasource=None, query_store_backend=None, queries=None)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

Produce query-style batch_kwargs from SQL files stored on disk.

recognized_batch_parameters = {'partition_id', 'query_parameters'}
add_query(generator_asset, query)
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

SubdirReaderBatchKwargsGenerator

class great_expectations.datasource.generator.subdir_reader_generator.SubdirReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, known_extensions=None, reader_method=None)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

The SubdirReaderBatchKwargsGenerator inspects a filesystem and produces path-based batch_kwargs.

SubdirReaderBatchKwargsGenerator recognizes generator_assets using two criteria:
  • for files directly in ‘base_directory’ with recognized extensions (.csv, .tsv, .parquet, .xls, .xlsx, .json), it uses the name of the file without the extension

  • for other files or directories in ‘base_directory’, it uses the file or directory name

SubdirReaderBatchKwargsGenerator treats all files inside a subdirectory of base_directory as batches of one data asset.

SubdirReaderBatchKwargsGenerator can also include configured reader_options which will be added to batch_kwargs generated by this generator.
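The naming rule described above can be sketched in a few lines of Python. This is a simplified illustration rather than the generator's actual code: the helper asset_name_for and the sample entry names are hypothetical, and the sketch does not check whether an entry is a file or a directory.

```python
import os

# Extensions listed in the description above.
KNOWN_EXTENSIONS = {".csv", ".tsv", ".parquet", ".xls", ".xlsx", ".json"}


def asset_name_for(entry_name):
    """Map an entry found directly under base_directory to an asset name."""
    stem, extension = os.path.splitext(entry_name)
    if extension in KNOWN_EXTENSIONS:
        # Recognized file type: drop the extension.
        return stem
    # Any other file or directory: keep the name unchanged.
    return entry_name


entries = ["users.csv", "events.parquet", "notes.txt", "daily_logs"]
asset_names = [asset_name_for(entry) for entry in entries]
print(asset_names)
```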

recognized_batch_parameters = {'name', 'partition_id'}
property reader_options
property known_extensions
property reader_method
property base_directory
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

GlobReaderBatchKwargsGenerator

class great_expectations.datasource.generator.glob_reader_generator.GlobReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, asset_globs=None, reader_method=None)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data.

An asset_glob that also defines a partition_regex might look like the following:

daily_logs:
  glob: daily_logs/*.csv
  partition_regex: daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv

The “glob” key ensures that every CSV file in the daily_logs directory is considered a batch for this data asset. The “partition_regex” key ensures that files whose basename begins with a date (with components separated by a hyphen, space, forward slash, or period) will be identified by a partition_id equal to just the date portion of their name.
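To see how the partition_regex above picks out the date, the pattern can be tried directly with Python's re module. This sketch assumes that the first capture group supplies the partition_id, which matches the description above but is not taken from the generator's source; the helper partition_id_for and the sample paths are hypothetical.

```python
import re

# The partition_regex from the daily_logs example above.
PARTITION_REGEX = re.compile(
    r"daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv"
)


def partition_id_for(path):
    """Return the date portion of a matching path, or None if it does not match."""
    match = PARTITION_REGEX.match(path)
    # Group 1 is the outermost parenthesized group: the full date.
    return match.group(1) if match else None


paths = [
    "daily_logs/2019-08-13_events.csv",
    "daily_logs/2020.01.05_errors.csv",
    "daily_logs/summary.csv",
]
partition_ids = [partition_id_for(path) for path in paths]
print(partition_ids)
```

The last path has no leading date, so the pattern does not match and the sketch returns None for it. How the generator itself labels such files is not covered here.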

A fully configured GlobReaderBatchKwargsGenerator in yml might look like the following:

my_datasource:
  class_name: PandasDatasource
  generators:
    my_generator:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/log
      reader_options:
        sep: "%"
        header: 0
      reader_method: csv
      asset_globs:
        wifi_logs:
          glob: wifi*.log
          partition_regex: wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log
          reader_method: csv
recognized_batch_parameters = {'limit', 'name', 'reader_method', 'reader_options'}
property reader_options
property asset_globs
property reader_method
property base_directory
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

S3GlobReaderBatchKwargsGenerator

class great_expectations.datasource.generator.s3_generator.S3GlobReaderBatchKwargsGenerator(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, boto3_options=None, max_keys=1000)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

S3 Generator provides support for generating batches of data from an S3 bucket. For the S3 generator, assets must be individually defined using a prefix and glob, although several additional configuration parameters are available for assets (see below).

Example configuration:

datasources:
  my_datasource:
    ...
    generators:
      my_s3_generator:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: my_bucket.my_organization.priv
        reader_method: parquet  # This will be automatically inferred from suffix where possible, but can be explicitly specified as well
        reader_options:  # Note that reader options can be specified globally or per-asset
          sep: ","
        delimiter: "/"  # Note that this is the delimiter for the BUCKET KEYS. By default it is "/"
        boto3_options:
          endpoint_url: $S3_ENDPOINT  # Use the S3_ENDPOINT environment variable to determine which endpoint to use
        max_keys: 100  # The maximum number of keys to fetch in a single list_objects request to S3. When accessing batch_kwargs through an iterator, the iterator will silently refetch if more keys are available
        assets:
          my_first_asset:
            prefix: my_first_asset/
            regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
            directory_assets: True
          access_logs:
            prefix: access_logs
            regex_filter: access_logs/2019.*\.csv\.gz
            sep: "~"
            max_keys: 100
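The combined effect of an asset's prefix and regex_filter can be sketched without touching S3. The snippet below only illustrates the filtering idea: the helper filter_keys and the sample keys are hypothetical, the pattern is modeled on the access_logs asset above, and matching from the start of the key with re.match is an assumption rather than documented behavior.

```python
import re


def filter_keys(keys, prefix, regex_filter):
    """Keep keys that start with `prefix` and match `regex_filter`."""
    pattern = re.compile(regex_filter)
    return [
        key
        for key in keys
        # S3 would apply the prefix server-side; the regex narrows the result.
        if key.startswith(prefix) and pattern.match(key)
    ]


keys = [
    "access_logs/2019-01-01.csv.gz",
    "access_logs/2018-12-31.csv.gz",
    "access_logs/2019-01-01.json",
    "my_first_asset/part-0.parquet",
]
selected = filter_keys(keys, "access_logs", r"access_logs/2019.*\.csv\.gz")
print(selected)
```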
property reader_options
property assets
property bucket
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

build_batch_kwargs_from_partition_id(generator_asset, partition_id=None, reader_options=None, limit=None)
get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

DatabricksTableBatchKwargsGenerator

class great_expectations.datasource.generator.databricks_generator.DatabricksTableBatchKwargsGenerator(name='default', datasource=None, database='default')

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

Meant to be used in a Databricks notebook

get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

last updated: Aug 13, 2020