great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator

Module Contents

Classes

BatchKwargsGenerator(name, datasource)

great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.logger
class great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator(name, datasource)
BatchKwargsGenerators produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. They add flexibility in how to obtain data, such as with time-based partitioning, downsampling, or other techniques appropriate for the datasource.

For example, a batch kwargs generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
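As a rough illustration of the SQL example above, the sketch below builds a dictionary holding a query for one day of an Events table. The function name, the table and column names, and the use of a "query" key are assumptions made for illustration; the exact keys a given datasource expects may differ.

```python
from datetime import date, timedelta


def events_batch_kwargs(day: date) -> dict:
    """Build hypothetical batch_kwargs selecting one day of an events table."""
    next_day = day + timedelta(days=1)
    # Half-open range [day, next_day) so that every timestamp on `day` is included.
    query = (
        "SELECT * FROM events "
        f"WHERE timestamp >= '{day.isoformat()}' "
        f"AND timestamp < '{next_day.isoformat()}'"
    )
    # The "query" key is an assumption; check what your datasource expects.
    return {"query": query}


batch_kwargs = events_batch_kwargs(date(2012, 2, 7))
print(batch_kwargs["query"])
```

Because the dictionary is plain data, it can be stored alongside validation results to record how the batch was selected.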

A batch is a sample from a data asset, sliced according to a particular rule. For example, a batch could be an hourly slice of the Events table or the “most recent users records.”
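The idea of slicing a data asset by a rule can be sketched with plain Python. The rows and the `slice_by_hour` helper below are hypothetical and stand in for an Events table; this is not how Great Expectations partitions data internally.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows standing in for the Events table.
events = [
    {"id": 1, "timestamp": datetime(2012, 2, 7, 9, 15)},
    {"id": 2, "timestamp": datetime(2012, 2, 7, 9, 45)},
    {"id": 3, "timestamp": datetime(2012, 2, 7, 10, 5)},
]


def slice_by_hour(rows):
    """Group rows into batches keyed by the hour their timestamp falls in."""
    batches = defaultdict(list)
    for row in rows:
        hour = row["timestamp"].replace(minute=0, second=0, microsecond=0)
        batches[hour].append(row)
    return dict(batches)


hourly_batches = slice_by_hour(events)
for hour, rows in sorted(hourly_batches.items()):
    print(hour.isoformat(), [row["id"] for row in rows])
```

Each key identifies one batch, which plays the same role as a partition identifier.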

A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_kwargs” assembled by the batch kwargs generator. While not every datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.

Example Generator Configurations follow:

my_datasource_1:
  class_name: PandasDatasource
  batch_kwargs_generators:
    # This generator will provide two data assets, corresponding to the globs defined under the "file_logs"
    # and "data_asset_2" keys. The file_logs asset will be partitioned according to the match group
    # defined in partition_regex
    default:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/logs
      reader_options:
        sep: "
      globs:
        file_logs:
          glob: logs/*.gz
          partition_regex: logs/file_(\d{0,4})_\.log\.gz
        data_asset_2:
          glob: data/*.csv

my_datasource_2:
  class_name: PandasDatasource
  batch_kwargs_generators:
    # This generator will create one data asset per subdirectory in /data
    # Each asset will have partitions corresponding to the filenames in that subdirectory
    default:
      class_name: SubdirReaderBatchKwargsGenerator
      reader_options:
        sep: "
      base_directory: /data

my_datasource_3:
  class_name: SqlAlchemyDatasource
  batch_kwargs_generators:
    # This generator will search for a file named after the requested data asset, with the
    # .sql suffix, and use the query in that file to generate data
    default:
      class_name: QueryBatchKwargsGenerator
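
To make the role of partition_regex in the first configuration more concrete, the sketch below applies the same pattern with Python's `re` module. It assumes the first match group is used as the partition identifier; the helper name and the file names are hypothetical.

```python
import re

# Pattern copied from the file_logs asset in the configuration above.
PARTITION_REGEX = re.compile(r"logs/file_(\d{0,4})_\.log\.gz")


def partition_id_for(path):
    """Return the first match group as the partition id, or None if there is no match."""
    match = PARTITION_REGEX.search(path)
    return match.group(1) if match else None


# Hypothetical file names, chosen only to exercise the pattern.
paths = ["logs/file_2012_.log.gz", "logs/file_7_.log.gz", "logs/other.log.gz"]
partition_ids = {path: partition_id_for(path) for path in paths}

for path, partition_id in partition_ids.items():
    print(path, "->", partition_id)
```

Note that `\d{0,4}` also accepts zero digits, so a file named `logs/file__.log.gz` would match with an empty partition identifier.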

Feature Maturity

Batch Kwargs Generator - Manual - How-to Guide
Manually configure how files on a filesystem are presented as batches of data
Maturity: Beta
Details:
API Stability: Mostly Stable (key generator functionality will remain but batch API changes still possible)
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A
Documentation Completeness: Minimal
Bug Risk: Moderate
Batch Kwargs Generator - S3 - How-to Guide
Present files on S3 as batches of data for profiling and validation
Maturity: Beta
Details:
API Stability: Mostly Stable (expect changes in partitioning)
Implementation Completeness: Partial
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: Complete
Documentation Completeness: Minimal
Bug Risk: Moderate
Batch Kwargs Generator - Glob Reader - How-to Guide
A configurable way to present files in a directory as batches of data
Maturity: Beta
Details:
API Stability: Mostly Stable (expect changes in partitioning)
Implementation Completeness: Partial
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A
Documentation Completeness: Minimal
Bug Risk: Moderate
Batch Kwargs Generator - Table - How-to Guide
Present database tables as batches of data for validation and profiling
Maturity: Beta
Details:
API Stability: Unstable (no existing native support for “partitioning”)
Implementation Completeness: Minimal
Unit Test Coverage: Partial
Integration Infrastructure/Test Coverage: Minimal
Documentation Completeness: Partial
Bug Risk: Low
Batch Kwargs Generator - Query - How-to Guide
Present the result sets of SQL queries as batches of data for validation and profiling
Maturity: Beta
Details:
API Stability: Unstable (expect changes in query template configuration and query storage)
Implementation Completeness: Complete
Unit Test Coverage: Partial
Integration Infrastructure/Test Coverage: Minimal
Documentation Completeness: Partial
Bug Risk: Moderate
Batch Kwargs Generator - Subdir Reader - How-to Guide
Present the files in a directory as batches of data for profiling and validation.
Maturity: Beta
Details:
API Stability: Mostly Stable (new configuration options likely)
Implementation Completeness: Partial
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A
Documentation Completeness: Minimal
Bug Risk: Low
_batch_kwargs_type
recognized_batch_parameters
property name(self)
abstract _get_iterator(self, data_asset_name, **kwargs)
abstract get_available_data_asset_names(self)

Return the list of asset names known by this batch kwargs generator.

Returns

A list of available names

abstract get_available_partition_ids(self, generator_asset=None, data_asset_name=None)

Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

data_asset_name – the data asset whose partitions should be returned.

Returns

A list of partition_id strings

get_config(self)
reset_iterator(self, generator_asset=None, data_asset_name=None, **kwargs)
get_iterator(self, generator_asset=None, data_asset_name=None, **kwargs)
build_batch_kwargs(self, data_asset_name=None, partition_id=None, **kwargs)
abstract _build_batch_kwargs(self, batch_parameters)
yield_batch_kwargs(self, data_asset_name=None, generator_asset=None, **kwargs)
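
The sketch below shows how the abstract and concrete methods listed above might fit together. It is a standalone stand-in that does not import great_expectations: `SketchGenerator` and `InMemoryFileGenerator` are hypothetical classes, the signatures are simplified, and the way `build_batch_kwargs` delegates to `_build_batch_kwargs` is an assumption based on the method names.

```python
from abc import ABC, abstractmethod


class SketchGenerator(ABC):
    """Simplified stand-in for the interface listed above (not the real base class)."""

    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return self._name

    @abstractmethod
    def get_available_data_asset_names(self):
        """Return the list of asset names known by this generator."""

    @abstractmethod
    def get_available_partition_ids(self, data_asset_name=None):
        """Return the partition_id strings available for one asset."""

    @abstractmethod
    def _build_batch_kwargs(self, batch_parameters):
        """Turn a dict of batch parameters into batch_kwargs."""

    def build_batch_kwargs(self, data_asset_name=None, partition_id=None, **kwargs):
        # Collect the parameters into one dict and delegate to the subclass hook.
        batch_parameters = dict(
            kwargs, data_asset_name=data_asset_name, partition_id=partition_id
        )
        return self._build_batch_kwargs(batch_parameters)


class InMemoryFileGenerator(SketchGenerator):
    """Hypothetical generator backed by a dict of {asset: {partition_id: path}}."""

    def __init__(self, name, files):
        super().__init__(name)
        self._files = files

    def get_available_data_asset_names(self):
        return sorted(self._files)

    def get_available_partition_ids(self, data_asset_name=None):
        return sorted(self._files[data_asset_name])

    def _build_batch_kwargs(self, batch_parameters):
        asset = batch_parameters["data_asset_name"]
        partition_id = batch_parameters["partition_id"]
        return {"path": self._files[asset][partition_id]}


generator = InMemoryFileGenerator(
    "default",
    {"file_logs": {"2012": "/var/logs/logs/file_2012_.log.gz"}},
)
batch_kwargs = generator.build_batch_kwargs("file_logs", partition_id="2012")
print(generator.get_available_partition_ids("file_logs"), batch_kwargs)
```

Keeping the public method thin and putting the asset-specific logic in the abstract hook means each generator only has to describe how its own parameters map to batch_kwargs.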