Generator Module¶
-
class
great_expectations.datasource.generator.batch_kwargs_generator.
BatchKwargsGenerator
(name, datasource)¶ BatchKwargsGenerators produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. They add flexibility in how data is obtained, for example through time-based partitioning, downsampling, or other techniques appropriate for the datasource.
For example, a generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
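As a rough illustration of that idea, the sketch below builds a query-style batch_kwargs dictionary for one day of the Events table. It does not call the Great Expectations API: the function name `build_query_batch_kwargs`, the table and column names, and the `"query"` key are all assumptions made for the example.

```python
from string import Template

# Hypothetical query template; the table and column names are illustrative only.
EVENTS_BY_DAY = Template(
    "SELECT * FROM events WHERE CAST(timestamp AS DATE) = '$day'"
)


def build_query_batch_kwargs(day):
    """Return a dict identifying the batch "rows in events on `day`".

    The "query" key is an assumption for this sketch, not a documented contract.
    """
    # Plain string substitution keeps the sketch short; production code
    # should prefer bound parameters over string interpolation.
    return {"query": EVENTS_BY_DAY.substitute(day=day)}


batch_kwargs = build_query_batch_kwargs("2012-02-07")
print(batch_kwargs["query"])
```

A datasource could then run the stored query to materialize the batch, and the same dictionary can be kept alongside validation results to record how the batch was built.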
A batch is a sample from a data asset, sliced according to a particular rule. For example, an hourly slice of the Events table or “most recent users records.”
A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_kwargs” assembled by the generator. While not every datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.
Example Generator Configurations follow:
my_datasource_1:
  class_name: PandasDatasource
  generators:
    # This generator will provide two data assets, corresponding to the globs defined under the "file_logs"
    # and "data_asset_2" keys. The file_logs asset will be partitioned according to the match group
    # defined in partition_regex
    default:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/logs
      reader_options:
        sep: "
      globs:
        file_logs:
          glob: logs/*.gz
          partition_regex: logs/file_(\d{0,4})_\.log\.gz
        data_asset_2:
          glob: data/*.csv

my_datasource_2:
  class_name: PandasDatasource
  generators:
    # This generator will create one data asset per subdirectory in /data
    # Each asset will have partitions corresponding to the filenames in that subdirectory
    default:
      class_name: SubdirReaderBatchKwargsGenerator
      reader_options:
        sep: "
      base_directory: /data

my_datasource_3:
  class_name: SqlalchemyDatasource
  generators:
    # This generator will search for a file named with the name of the requested generator asset and the
    # .sql suffix to open with a query to use to generate data
    default:
      class_name: QueryBatchKwargsGenerator
-
recognized_batch_parameters
= {}¶
-
property
name
¶
-
get_available_data_asset_names
()¶ Return the list of asset names known by this generator.
- Returns
A list of available names
-
get_available_partition_ids
(generator_asset)¶ Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.
- Parameters
generator_asset – the generator asset whose partitions should be returned.
- Returns
A list of partition_id strings
-
get_config
()¶
-
reset_iterator
(generator_asset, **kwargs)¶
-
get_iterator
(generator_asset, **kwargs)¶
-
build_batch_kwargs
(name=None, partition_id=None, **kwargs)¶ The key workhorse. Docs forthcoming.
-
yield_batch_kwargs
(generator_asset, **kwargs)¶
-
InMemoryGenerator¶
QueryBatchKwargsGenerator¶
-
class
great_expectations.datasource.generator.query_generator.
QueryBatchKwargsGenerator
(name='default', datasource=None, query_store_backend=None, queries=None)¶ Bases:
great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator
Produce query-style batch_kwargs from SQL files stored on disk.
-
recognized_batch_parameters
= {'partition_id', 'query_parameters'}¶
-
add_query
(generator_asset, query)¶
-
get_available_data_asset_names
()¶ Return the list of asset names known by this generator.
- Returns
A list of available names
-
get_available_partition_ids
(generator_asset)¶ Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.
- Parameters
generator_asset – the generator asset whose partitions should be returned.
- Returns
A list of partition_id strings
-
SubdirReaderBatchKwargsGenerator¶
-
class
great_expectations.datasource.generator.subdir_reader_generator.
SubdirReaderBatchKwargsGenerator
(name='default', datasource=None, base_directory='/data', reader_options=None, known_extensions=None, reader_method=None)¶ Bases:
great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator
The SubdirReaderBatchKwargsGenerator inspects a filesystem and produces path-based batch_kwargs.
- SubdirReaderBatchKwargsGenerator recognizes generator_assets using two criteria:
for files directly in ‘base_directory’ with recognized extensions (.csv, .tsv, .parquet, .xls, .xlsx, .json), it uses the name of the file without the extension
for other files or directories in ‘base_directory’, it uses the file or directory name
SubdirReaderBatchKwargsGenerator treats all files inside a subdirectory of base_directory as batches of a single data asset.
SubdirReaderBatchKwargsGenerator can also include configured reader_options which will be added to batch_kwargs generated by this generator.
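The two naming criteria above can be sketched with the standard library alone. This is a minimal illustration, not the generator's actual implementation: the function `asset_names` is hypothetical, and only the extension list is taken from the description above.

```python
import os

# Extensions listed in the description above.
KNOWN_EXTENSIONS = (".csv", ".tsv", ".parquet", ".xls", ".xlsx", ".json")


def asset_names(base_directory, known_extensions=KNOWN_EXTENSIONS):
    """Return sorted asset names derived from the entries of base_directory."""
    names = set()
    for entry in os.listdir(base_directory):
        full_path = os.path.join(base_directory, entry)
        stem, ext = os.path.splitext(entry)
        if os.path.isfile(full_path) and ext.lower() in known_extensions:
            # Criterion 1: recognized file, so use the name without the extension.
            names.add(stem)
        else:
            # Criterion 2: any other file or directory, so use the name as-is.
            names.add(entry)
    return sorted(names)
```

For a base directory containing `users.csv`, a subdirectory `events/`, and a file `README`, this sketch yields the names `README`, `events`, and `users`.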
-
recognized_batch_parameters
= {'name', 'partition_id'}¶
-
property
reader_options
¶
-
property
known_extensions
¶
-
property
reader_method
¶
-
property
base_directory
¶
-
get_available_data_asset_names
()¶ Return the list of asset names known by this generator.
- Returns
A list of available names
-
get_available_partition_ids
(generator_asset)¶ Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.
- Parameters
generator_asset – the generator asset whose partitions should be returned.
- Returns
A list of partition_id strings
GlobReaderBatchKwargsGenerator¶
-
class
great_expectations.datasource.generator.glob_reader_generator.
GlobReaderBatchKwargsGenerator
(name='default', datasource=None, base_directory='/data', reader_options=None, asset_globs=None, reader_method=None)¶ Bases:
great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator
GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data.
A more interesting asset_glob might look like the following:
daily_logs:
  glob: daily_logs/*.csv
  partition_regex: daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv
The “glob” key ensures that every CSV file in the daily_logs directory is considered a batch for this data asset. The “partition_regex” key ensures that files whose basename begins with a date (with components separated by a hyphen, space, forward slash, or period) will be identified by a partition_id equal to just the date portion of their name.
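To see what that regular expression captures, the following sketch applies it to a few paths with Python's `re` module. It assumes the partition_id is taken from the first capture group, which is consistent with the description above but is not confirmed here; the helper `partition_id_for` and the sample file names are hypothetical.

```python
import re

# The partition_regex from the daily_logs example above.
PARTITION_REGEX = re.compile(
    r"daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv"
)


def partition_id_for(path):
    """Return the date portion of path, or None if the regex does not match.

    Assumption: the first capture group is used as the partition_id.
    """
    match = PARTITION_REGEX.match(path)
    return match.group(1) if match else None


for path in (
    "daily_logs/2019-08-13_events.csv",
    "daily_logs/2019.08.13_events.csv",
    "daily_logs/notes.csv",
):
    print(path, "->", partition_id_for(path))
```

Files that match the glob but not the regex, such as `daily_logs/notes.csv` here, produce no date-based partition_id in this sketch.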
A fully configured GlobReaderBatchKwargsGenerator in yml might look like the following:
my_datasource:
  class_name: PandasDatasource
  generators:
    my_generator:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/log
      reader_options:
        sep: "%"
        header: 0
      reader_method: csv
      asset_globs:
        wifi_logs:
          glob: wifi*.log
          partition_regex: wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log
          reader_method: csv
-
recognized_batch_parameters
= {'limit', 'name', 'reader_method', 'reader_options'}¶
-
property
reader_options
¶
-
property
asset_globs
¶
-
property
reader_method
¶
-
property
base_directory
¶
-
get_available_data_asset_names
()¶ Return the list of asset names known by this generator.
- Returns
A list of available names
-
get_available_partition_ids
(generator_asset)¶ Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.
- Parameters
generator_asset – the generator asset whose partitions should be returned.
- Returns
A list of partition_id strings
-
S3GlobReaderBatchKwargsGenerator¶
-
class
great_expectations.datasource.generator.s3_generator.
S3GlobReaderBatchKwargsGenerator
(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, boto3_options=None, max_keys=1000)¶ Bases:
great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator
S3 Generator provides support for generating batches of data from an S3 bucket. For the S3 generator, assets must be individually defined using a prefix and glob, although several additional configuration parameters are available for assets (see below).
Example configuration:
datasources:
  my_datasource:
    ...
    generators:
      my_s3_generator:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: my_bucket.my_organization.priv
        reader_method: parquet  # This will be automatically inferred from suffix where possible, but can be explicitly specified as well
        reader_options:  # Note that reader options can be specified globally or per-asset
          sep: ","
        delimiter: "/"  # Note that this is the delimiter for the BUCKET KEYS. By default it is "/"
        boto3_options:
          endpoint_url: $S3_ENDPOINT  # Use the S3_ENDPOINT environment variable to determine which endpoint to use
        max_keys: 100  # The maximum number of keys to fetch in a single list_objects request to s3. When accessing batch_kwargs through an iterator, the iterator will silently refetch if more keys were available
        assets:
          my_first_asset:
            prefix: my_first_asset/
            regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
            dictionary_assets: True
          access_logs:
            prefix: access_logs
            regex_filter: access_logs/2019.*\.csv.gz
            sep: "~"
            max_keys: 100
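The effect of an asset's prefix and regex_filter can be approximated offline with a plain list of keys. This is only a sketch: `filter_keys` is a hypothetical helper, the sample keys are made up, and matching from the start of the key with `re.match` is an assumption about how the filter is applied.

```python
import re


def filter_keys(keys, prefix, regex_filter):
    """Keep keys that start with prefix and match regex_filter.

    Assumption: the regex is anchored at the start of the key (re.match).
    """
    pattern = re.compile(regex_filter)
    return [key for key in keys if key.startswith(prefix) and pattern.match(key)]


# Made-up bucket keys for illustration.
sample_keys = [
    "access_logs/2019-01-01.csv.gz",
    "access_logs/2018-12-31.csv.gz",
    "my_first_asset/part-0.parquet",
]

# Values taken from the access_logs asset in the configuration above.
print(filter_keys(sample_keys, "access_logs", r"access_logs/2019.*\.csv.gz"))
```

With these sample keys, only the 2019 access log passes both the prefix check and the regex.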
-
property
reader_options
¶
-
property
assets
¶
-
property
bucket
¶
-
get_available_data_asset_names
()¶ Return the list of asset names known by this generator.
- Returns
A list of available names
-
build_batch_kwargs_from_partition_id
(generator_asset, partition_id=None, reader_options=None, limit=None)¶
-
get_available_partition_ids
(generator_asset)¶ Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.
- Parameters
generator_asset – the generator asset whose partitions should be returned.
- Returns
A list of partition_id strings
DatabricksTableBatchKwargsGenerator¶
-
class
great_expectations.datasource.generator.databricks_generator.
DatabricksTableBatchKwargsGenerator
(name='default', datasource=None, database='default')¶ Bases:
great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator
Meant to be used in a Databricks notebook
-
get_available_data_asset_names
()¶ Return the list of asset names known by this generator.
- Returns
A list of available names
-
last updated: Aug 13, 2020