great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator

Module Contents

Classes
| BatchKwargsGenerator | BatchKwargsGenerators produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. |

great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.logger

class great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator(name, datasource)
BatchKwargsGenerators produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. They add flexibility in how to obtain data, such as with time-based partitioning, downsampling, or other techniques appropriate for the datasource.
For example, a batch kwargs generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
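As a rough illustration of the example above, the sketch below builds a batch_kwargs-style dictionary that holds a SQL query for one day of the Events table. The helper name `build_daily_query_batch_kwargs`, the `timestamp` column name, and the use of a `query` key are assumptions made for this example, not confirmed library API.

```python
from datetime import date, timedelta


def build_daily_query_batch_kwargs(table, day, timestamp_column="timestamp"):
    """Return a dict describing the rows of `table` whose timestamp falls on `day`.

    Hypothetical helper for illustration only. The query is built with string
    formatting to keep the sketch short; real code should use bound parameters.
    """
    next_day = day + timedelta(days=1)
    # Half-open interval [day, next_day) selects exactly one calendar day.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {timestamp_column} >= '{day.isoformat()}' "
        f"AND {timestamp_column} < '{next_day.isoformat()}'"
    )
    return {"query": query}


batch_kwargs = build_daily_query_batch_kwargs("Events", date(2012, 2, 7))
print(batch_kwargs["query"])
```

Because the dictionary fully describes the slice, the same batch can in principle be requested again later by replaying the stored query.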
A batch is a sample from a data asset, sliced according to a particular rule. For example, an hourly slice of the Events table or “most recent users records.”
A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_kwargs” assembled by the batch kwargs generator. While not every datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.
Example Generator Configurations follow:
    my_datasource_1:
      class_name: PandasDatasource
      batch_kwargs_generators:
        # This generator will provide two data assets, corresponding to the globs defined under the "file_logs"
        # and "data_asset_2" keys. The file_logs asset will be partitioned according to the match group
        # defined in partition_regex
        default:
          class_name: GlobReaderBatchKwargsGenerator
          base_directory: /var/logs
          reader_options:
            sep: "
          globs:
            file_logs:
              glob: logs/*.gz
              partition_regex: logs/file_(\d{0,4})_\.log\.gz
            data_asset_2:
              glob: data/*.csv

    my_datasource_2:
      class_name: PandasDatasource
      batch_kwargs_generators:
        # This generator will create one data asset per subdirectory in /data
        # Each asset will have partitions corresponding to the filenames in that subdirectory
        default:
          class_name: SubdirReaderBatchKwargsGenerator
          reader_options:
            sep: "
          base_directory: /data

    my_datasource_3:
      class_name: SqlalchemyDatasource
      batch_kwargs_generators:
        # This generator will search for a file named with the name of the requested data asset and the
        # .sql suffix to open with a query to use to generate data
        default:
          class_name: QueryBatchKwargsGenerator
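To make the role of partition_regex in the first configuration more concrete, here is a small sketch that applies that same pattern to a few file paths and reads the partition from the first match group. The file names and the `extract_partition_id` helper are invented for this example, and using `re.search` is an assumption about how such a pattern might be applied, not a statement of the generator's actual implementation.

```python
import re

# Pattern copied from the file_logs asset in the configuration above.
PARTITION_REGEX = re.compile(r"logs/file_(\d{0,4})_\.log\.gz")


def extract_partition_id(path):
    """Return the first match group of PARTITION_REGEX, or None if the path does not match."""
    match = PARTITION_REGEX.search(path)
    return match.group(1) if match else None


# Hypothetical file paths, relative to base_directory.
paths = ["logs/file_2012_.log.gz", "logs/file_7_.log.gz", "logs/other.gz"]
partition_ids = {path: extract_partition_id(path) for path in paths}

for path, partition_id in partition_ids.items():
    print(path, "->", partition_id)
```

Paths that do not match the pattern yield no partition in this sketch; how the real generator treats such files is not covered here.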
Batch Kwargs Generator - Manual - How-to Guide
Manually configure how files on a filesystem are presented as batches of data
Maturity: Beta
Details:
  API Stability: Mostly Stable (key generator functionality will remain but batch API changes still possible)
  Implementation Completeness: Complete
  Unit Test Coverage: Complete
  Integration Infrastructure/Test Coverage: N/A
  Documentation Completeness: Minimal
  Bug Risk: Moderate

Batch Kwargs Generator - S3 - How-to Guide
Present files on S3 as batches of data for profiling and validation
Maturity: Beta
Details:
  API Stability: Mostly Stable (expect changes in partitioning)
  Implementation Completeness: Partial
  Unit Test Coverage: Complete
  Integration Infrastructure/Test Coverage: Complete
  Documentation Completeness: Minimal
  Bug Risk: Moderate

Batch Kwargs Generator - Glob Reader - How-to Guide
A configurable way to present files in a directory as batches of data
Maturity: Beta
Details:
  API Stability: Mostly Stable (expect changes in partitioning)
  Implementation Completeness: Partial
  Unit Test Coverage: Complete
  Integration Infrastructure/Test Coverage: N/A
  Documentation Completeness: Minimal
  Bug Risk: Moderate

Batch Kwargs Generator - Table - How-to Guide
Present database tables as batches of data for validation and profiling
Maturity: Beta
Details:
  API Stability: Unstable (no existing native support for “partitioning”)
  Implementation Completeness: Minimal
  Unit Test Coverage: Partial
  Integration Infrastructure/Test Coverage: Minimal
  Documentation Completeness: Partial
  Bug Risk: Low

Batch Kwargs Generator - Query - How-to Guide
Present the result sets of SQL queries as batches of data for validation and profiling
Maturity: Beta
Details:
  API Stability: Unstable (expect changes in query template configuration and query storage)
  Implementation Completeness: Complete
  Unit Test Coverage: Partial
  Integration Infrastructure/Test Coverage: Minimal
  Documentation Completeness: Partial
  Bug Risk: Moderate

Batch Kwargs Generator - Subdir Reader - How-to Guide
Present the files in a directory as batches of data for profiling and validation.
Maturity: Beta
Details:
  API Stability: Mostly Stable (new configuration options likely)
  Implementation Completeness: Partial
  Unit Test Coverage: Complete
  Integration Infrastructure/Test Coverage: N/A
  Documentation Completeness: Minimal
  Bug Risk: Low

_batch_kwargs_type

recognized_batch_parameters

property name(self)

abstract _get_iterator(self, data_asset_name, **kwargs)

abstract get_available_data_asset_names(self)
Return the list of asset names known by this batch kwargs generator.
Returns: A list of available names

abstract get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
Parameters: data_asset_name – the data asset whose partitions should be returned.
Returns: A list of partition_id strings

get_config(self)

reset_iterator(self, generator_asset=None, data_asset_name=None, **kwargs)

get_iterator(self, generator_asset=None, data_asset_name=None, **kwargs)

build_batch_kwargs(self, data_asset_name=None, partition_id=None, **kwargs)

abstract _build_batch_kwargs(self, batch_parameters)

yield_batch_kwargs(self, data_asset_name=None, generator_asset=None, **kwargs)
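
The shape of the interface listed above can be sketched with a self-contained stand-in. The code below does not import or subclass the real BatchKwargsGenerator: `SketchBatchKwargsGenerator` and `InMemoryFileGenerator` are hypothetical classes that only mirror some of the method names, and the way `build_batch_kwargs` delegates to `_build_batch_kwargs` is an assumption made for illustration.

```python
from abc import ABC, abstractmethod


class SketchBatchKwargsGenerator(ABC):
    """Stand-in mirroring a few method names from the listing; not the library class."""

    def __init__(self, name, datasource=None):
        self._name = name
        self._datasource = datasource

    @property
    def name(self):
        return self._name

    @abstractmethod
    def get_available_data_asset_names(self):
        """Return the asset names this generator knows about."""

    @abstractmethod
    def get_available_partition_ids(self, data_asset_name=None):
        """Return the partition_id strings available for one asset."""

    @abstractmethod
    def _build_batch_kwargs(self, batch_parameters):
        """Turn collected parameters into a batch_kwargs dict."""

    def build_batch_kwargs(self, data_asset_name=None, partition_id=None, **kwargs):
        # Assumed behaviour: gather the parameters, then delegate to the subclass hook.
        batch_parameters = dict(
            kwargs, data_asset_name=data_asset_name, partition_id=partition_id
        )
        return self._build_batch_kwargs(batch_parameters)


class InMemoryFileGenerator(SketchBatchKwargsGenerator):
    """Hypothetical subclass backed by a dict of {asset name: {partition id: path}}."""

    def __init__(self, name, assets):
        super().__init__(name)
        self._assets = assets

    def get_available_data_asset_names(self):
        return sorted(self._assets)

    def get_available_partition_ids(self, data_asset_name=None):
        return sorted(self._assets[data_asset_name])

    def _build_batch_kwargs(self, batch_parameters):
        asset = batch_parameters["data_asset_name"]
        partition = batch_parameters["partition_id"]
        return {"path": self._assets[asset][partition]}


# Hypothetical asset layout, invented for this example.
generator = InMemoryFileGenerator(
    "default",
    {"file_logs": {"2012": "/var/logs/logs/file_2012_.log.gz"}},
)
print(generator.get_available_data_asset_names())
print(generator.build_batch_kwargs("file_logs", "2012"))
```

Keeping the public method thin and pushing the asset-specific logic into the underscore-prefixed hook means a new storage backend only needs to supply the three abstract methods in this sketch.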