great_expectations.datasource.batch_kwargs_generator
Submodules
great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator
great_expectations.datasource.batch_kwargs_generator.databricks_batch_kwargs_generator
great_expectations.datasource.batch_kwargs_generator.glob_reader_batch_kwargs_generator
great_expectations.datasource.batch_kwargs_generator.manual_batch_kwargs_generator
great_expectations.datasource.batch_kwargs_generator.query_batch_kwargs_generator
great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator
great_expectations.datasource.batch_kwargs_generator.s3_subdir_reader_batch_kwargs_generator
great_expectations.datasource.batch_kwargs_generator.subdir_reader_batch_kwargs_generator
great_expectations.datasource.batch_kwargs_generator.table_batch_kwargs_generator
Package Contents
Classes
| DatabricksTableBatchKwargsGenerator | Meant to be used in a Databricks notebook. |
| GlobReaderBatchKwargsGenerator | GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data. |
| ManualBatchKwargsGenerator | ManualBatchKwargsGenerator returns manually-configured batch_kwargs for named data assets. |
| QueryBatchKwargsGenerator | Produce query-style batch_kwargs from sql files or defined queries. |
| S3GlobReaderBatchKwargsGenerator | S3 BatchKwargGenerator provides support for generating batches of data from an S3 bucket. |
| S3SubdirReaderBatchKwargsGenerator | The SubdirReaderBatchKwargsGenerator inspects a filesystem and produces path-based batch_kwargs. |
| SubdirReaderBatchKwargsGenerator | The SubdirReaderBatchKwargsGenerator inspects a filesystem and produces path-based batch_kwargs. |
| TableBatchKwargsGenerator | Provide access to already materialized tables or views in a database. |
class great_expectations.datasource.batch_kwargs_generator.DatabricksTableBatchKwargsGenerator(name='default', datasource=None, database='default')
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
Meant to be used in a Databricks notebook.
- get_available_data_asset_names(self)
  Return the list of asset names known by this batch kwargs generator.
  Returns: A list of available names
- _get_iterator(self, data_asset_name, **kwargs)
class great_expectations.datasource.batch_kwargs_generator.GlobReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, asset_globs=None, reader_method=None)
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data.
A more interesting asset_glob might look like the following:
    daily_logs:
      glob: daily_logs/*.csv
      partition_regex: daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv
The "glob" key ensures that every csv file in the daily_logs directory is considered a batch for this data asset. The "partition_regex" key ensures that files whose basename begins with a date (with components separated by a hyphen, space, forward slash, or period) are identified by a partition_id equal to just the date portion of their name.
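To make the partition_regex behaviour concrete, here is a minimal Python sketch. It assumes the pattern is applied with `re.match` and that the first capture group becomes the partition_id. The helper name `partition_id_for` and the sample file names are hypothetical, and returning `None` for a non-matching path is a simplification rather than a statement about what the library does.

```python
import re

# Pattern copied from the asset_globs example above.
PARTITION_REGEX = (
    r"daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.]"
    r"(0[1-9]|[12][0-9]|3[01]))_(.*)\.csv"
)


def partition_id_for(path):
    """Return the date portion (capture group 1) of a matching path.

    Hypothetical helper: paths that do not match yield None here.
    """
    match = re.match(PARTITION_REGEX, path)
    return match.group(1) if match else None


if __name__ == "__main__":
    for candidate in (
        "daily_logs/2020-01-15_server.csv",
        "daily_logs/2020.01.15_server.csv",
        "daily_logs/20200115_server.csv",  # no separator between date parts
    ):
        print(candidate, "->", partition_id_for(candidate))
```

The character class `[- /.]` requires exactly one separator between the date components, so a file named with an unseparated date such as 20200115 is not matched by this pattern.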
A fully configured GlobReaderBatchKwargsGenerator in yml might look like the following:
    my_datasource:
      class_name: PandasDatasource
      batch_kwargs_generators:
        my_generator:
          class_name: GlobReaderBatchKwargsGenerator
          base_directory: /var/log
          reader_options:
            sep: "%"
            header: 0
          reader_method: csv
          asset_globs:
            wifi_logs:
              glob: wifi*.log
              partition_regex: wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log
              reader_method: csv
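As a rough illustration of what the configuration above asks for, the following Python sketch selects file names with the `wifi*.log` glob and pairs each one with the configured reader settings. It is not the library's implementation: the function `sketch_batch_kwargs`, the sample file names, and the exact keys of the resulting dictionaries are assumptions made for this example.

```python
import fnmatch
import posixpath


def sketch_batch_kwargs(base_directory, filenames, glob, reader_method, reader_options):
    """Build one path-style dict per file name matching the glob (illustrative only)."""
    return [
        {
            "path": posixpath.join(base_directory, name),
            "reader_method": reader_method,
            # Copy so that each batch owns its own options dict.
            "reader_options": dict(reader_options),
        }
        for name in sorted(filenames)
        if fnmatch.fnmatch(name, glob)
    ]


if __name__ == "__main__":
    # Hypothetical directory listing for /var/log.
    listing = ["wifi-02-01-2020.log", "syslog", "wifi-01-15-2020.log"]
    for batch in sketch_batch_kwargs(
        "/var/log", listing, "wifi*.log", "csv", {"sep": "%", "header": 0}
    ):
        print(batch)
```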
- recognized_batch_parameters
- property reader_options(self)
- property asset_globs(self)
- property reader_method(self)
- property base_directory(self)
- get_available_data_asset_names(self)
  Return the list of asset names known by this batch kwargs generator.
  Returns: A list of available names
- get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
  Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
  Parameters: data_asset_name – the data asset whose partitions should be returned.
  Returns: A list of partition_id strings
- _build_batch_kwargs(self, batch_parameters)
- _get_data_asset_paths(self, data_asset_name)
  Returns a list of filepaths associated with the given data_asset_name
  Parameters: data_asset_name
  Returns: paths (list)
- _get_data_asset_config(self, data_asset_name)
- _get_iterator(self, data_asset_name, reader_method=None, reader_options=None, limit=None)
- _build_batch_kwargs_path_iter(self, path_list, glob_config, reader_method=None, reader_options=None, limit=None)
- _build_batch_kwargs_from_path(self, path, glob_config, reader_method=None, reader_options=None, limit=None)
- _partitioner(self, path, glob_config)
class great_expectations.datasource.batch_kwargs_generator.ManualBatchKwargsGenerator(name='default', datasource=None, assets=None)
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
ManualBatchKwargsGenerator returns manually-configured batch_kwargs for named data assets. It provides a convenient way to capture complete batch requests without requiring the configuration of a more fully-featured batch kwargs generator.
A fully configured ManualBatchKwargsGenerator in yml might look like the following:
    my_datasource:
      class_name: PandasDatasource
      batch_kwargs_generators:
        my_generator:
          class_name: ManualBatchKwargsGenerator
          assets:
            asset1:
              - partition_id: 1
                path: /data/file_1.csv
                reader_options:
                  sep: ;
              - partition_id: 2
                path: /data/file_2.csv
                reader_options:
                  header: 0
            logs:
              path: data/log.csv
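The `assets` mapping above can hold either a list of entries (each with its own partition_id) or a single entry. A minimal Python sketch of looking up one entry follows. The function `lookup_batch_kwargs` is hypothetical, and dropping the partition_id key from the returned dictionary is an assumption made for the example, not documented behaviour.

```python
# The assets section of the YAML above, written as a Python dict.
ASSETS = {
    "asset1": [
        {"partition_id": 1, "path": "/data/file_1.csv", "reader_options": {"sep": ";"}},
        {"partition_id": 2, "path": "/data/file_2.csv", "reader_options": {"header": 0}},
    ],
    "logs": {"path": "data/log.csv"},
}


def lookup_batch_kwargs(assets, data_asset_name, partition_id=None):
    """Return the first configured entry for the asset that matches partition_id.

    Hypothetical helper: with partition_id=None the first entry is returned.
    """
    config = assets[data_asset_name]
    # A single-entry asset is normalised to a one-element list.
    candidates = config if isinstance(config, list) else [config]
    for entry in candidates:
        if partition_id is None or entry.get("partition_id") == partition_id:
            return {key: value for key, value in entry.items() if key != "partition_id"}
    return None


if __name__ == "__main__":
    print(lookup_batch_kwargs(ASSETS, "asset1", partition_id=2))
    print(lookup_batch_kwargs(ASSETS, "logs"))
```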
- recognized_batch_parameters
- property assets(self)
- get_available_data_asset_names(self)
  Return the list of asset names known by this batch kwargs generator.
  Returns: A list of available names
- _get_data_asset_config(self, data_asset_name)
- _get_iterator(self, data_asset_name, **kwargs)
- get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
  Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
  Parameters: data_asset_name – the data asset whose partitions should be returned.
  Returns: A list of partition_id strings
- _build_batch_kwargs(self, batch_parameters)
  Build batch kwargs from a partition id.
class great_expectations.datasource.batch_kwargs_generator.QueryBatchKwargsGenerator(name='default', datasource=None, query_store_backend=None, queries=None)
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
Produce query-style batch_kwargs from sql files or defined queries.
By default, a QueryBatchKwargsGenerator will look for queries in the datasources/datasource_name/generators/generator_name directory, and look for files ending in .sql. For example, a file stored in datasources/datasource_name/generators/generator_name/movies_by_date.sql would allow you to access an asset called movies_by_date.
Queries can be parameterized using $substitution.
Example configuration:
    queries:
      class_name: QueryBatchKwargsGenerator
      query_store_backend:
        class_name: TupleFilesystemStoreBackend
        filepath_suffix: .sql
        base_directory: queries
Example query template, to be stored in queries/movies_by_date.sql:
    SELECT * FROM movies where '$start'::date <= release_date AND release_date <= '$end'::date;
Example usage:
    context.build_batch_kwargs(
        "my_db", "query_generator", "movies_by_date",
        query_parameters={"start": "2020-01-01", "end": "2020-02-01"},
    )
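The $substitution step can be pictured with Python's standard `string.Template`. This is a sketch under the assumption that the generator uses this kind of `$name` substitution, with the query text taken from the movies_by_date example. The function name `render_query` is hypothetical.

```python
from string import Template

# Query template from the movies_by_date example.
QUERY_TEMPLATE = (
    "SELECT * FROM movies where '$start'::date <= release_date "
    "AND release_date <= '$end'::date;"
)


def render_query(template_text, query_parameters):
    """Fill $name placeholders; raises KeyError if a parameter is missing."""
    return Template(template_text).substitute(query_parameters)


if __name__ == "__main__":
    print(render_query(QUERY_TEMPLATE, {"start": "2020-01-01", "end": "2020-02-01"}))
```

Note that this is plain text substitution, not database-level parameter binding, so the values should come from a trusted source.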
- recognized_batch_parameters
- _get_raw_query(self, data_asset_name)
- _get_iterator(self, data_asset_name, query_parameters=None)
- add_query(self, generator_asset=None, query=None, data_asset_name=None)
- get_available_data_asset_names(self)
  Return the list of asset names known by this batch kwargs generator.
  Returns: A list of available names
- _build_batch_kwargs(self, batch_parameters)
  Build batch kwargs from a partition id.
- get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
  Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
  Parameters: data_asset_name – the data asset whose partitions should be returned.
  Returns: A list of partition_id strings
class great_expectations.datasource.batch_kwargs_generator.S3GlobReaderBatchKwargsGenerator(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, boto3_options=None, max_keys=1000)
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
S3 BatchKwargGenerator provides support for generating batches of data from an S3 bucket. For the S3 batch kwargs generator, assets must be individually defined using a prefix and glob, although several additional configuration parameters are available for assets (see below).
Example configuration:
    datasources:
      my_datasource:
        ...
        batch_kwargs_generators:
          my_s3_generator:
            class_name: S3GlobReaderBatchKwargsGenerator
            bucket: my_bucket.my_organization.priv
            reader_method: parquet  # This will be automatically inferred from suffix where possible, but can be explicitly specified as well
            reader_options:  # Note that reader options can be specified globally or per-asset
              sep: ","
            delimiter: "/"  # Note that this is the delimiter for the BUCKET KEYS. By default it is "/"
            boto3_options:
              endpoint_url: $S3_ENDPOINT  # Use the S3_ENDPOINT environment variable to determine which endpoint to use
            max_keys: 100  # The maximum number of keys to fetch in a single list_objects request to s3. When accessing batch_kwargs through an iterator, the iterator will silently refetch if more keys were available
            assets:
              my_first_asset:
                prefix: my_first_asset/
                regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
                directory_assets: True  # If True, the contents of the directory will be treated as one batch. Notice that this option does not work with Pandas, since Pandas does not support loading multiple files from an S3 bucket into a data frame.
              access_logs:
                prefix: access_logs
                regex_filter: access_logs/2019.*\.csv.gz
                sep: "~"
                max_keys: 100
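The interplay of `prefix` and `regex_filter` in the asset definitions above can be sketched in Python without touching S3. The key names below are made up, `filter_keys` is a hypothetical helper, and applying the regex with `re.match` from the start of the key is an assumption. In a real setup the keys would come from an S3 listing request rather than a hard-coded list.

```python
import re


def filter_keys(keys, prefix, regex_filter):
    """Keep keys that start with the prefix and match the regex (illustrative only)."""
    pattern = re.compile(regex_filter)
    return [key for key in keys if key.startswith(prefix) and pattern.match(key)]


if __name__ == "__main__":
    # Hypothetical bucket contents.
    bucket_keys = [
        "access_logs/2019-01-01.csv.gz",
        "access_logs/2018-12-31.csv.gz",
        "my_first_asset/part-0000.parquet",
    ]
    print(filter_keys(bucket_keys, "access_logs", r"access_logs/2019.*\.csv.gz"))
    print(filter_keys(bucket_keys, "my_first_asset/", r".*"))
```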
- recognized_batch_parameters
- property reader_options(self)
- property assets(self)
- property bucket(self)
- get_available_data_asset_names(self)
  Return the list of asset names known by this batch kwargs generator.
  Returns: A list of available names
- _get_iterator(self, data_asset_name, reader_method=None, reader_options=None, limit=None)
- _build_batch_kwargs_path_iter(self, path_list, reader_options=None, limit=None)
- _build_batch_kwargs(self, batch_parameters)
- _build_batch_kwargs_from_key(self, key, asset_config=None, reader_method=None, reader_options=None, limit=None)
- _get_asset_options(self, asset_config, iterator_dict)
- _build_asset_iterator(self, asset_config, iterator_dict, reader_method=None, reader_options=None, limit=None)
- get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
  Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
  Parameters: data_asset_name – the data asset whose partitions should be returned.
  Returns: A list of partition_id strings
- _partitioner(self, key, asset_config)
class great_expectations.datasource.batch_kwargs_generator.S3SubdirReaderBatchKwargsGenerator(name='default', datasource=None, bucket=None, boto3_options=None, base_directory='/data', reader_options=None, known_extensions=None, reader_method=None)
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
The SubdirReaderBatchKwargsGenerator inspects a filesystem and produces path-based batch_kwargs.
SubdirReaderBatchKwargsGenerator recognizes data assets using two criteria:
- for files directly in 'base_directory' with recognized extensions (.csv, .tsv, .parquet, .xls, .xlsx, .json, .csv.gz, .tsv.gz, .feather, .pkl), it uses the name of the file without the extension
- for other files or directories in 'base_directory', it uses the file or directory name
SubdirReaderBatchKwargsGenerator sees all files inside a directory of base_directory as batches of one datasource.
SubdirReaderBatchKwargsGenerator can also include configured reader_options which will be added to batch_kwargs generated by this generator.
- _default_reader_options
- recognized_batch_parameters
- property reader_options(self)
- property known_extensions(self)
- property reader_method(self)
- property base_directory(self)
- get_available_data_asset_names(self)
  Return the list of asset names known by this batch kwargs generator.
  Returns: A list of available names
- get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
  Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
  Parameters: data_asset_name – the data asset whose partitions should be returned.
  Returns: A list of partition_id strings
- _build_batch_kwargs(self, batch_parameters)
  Parameters: batch_parameters
  Returns: batch_kwargs
- _get_valid_file_options(self, base_directory=None)
- _get_iterator(self, data_asset_name, reader_options=None, limit=None)
- _build_batch_kwargs_path_iter(self, path_list, reader_options=None, limit=None)
- _build_batch_kwargs_from_path(self, path, reader_method=None, reader_options=None, limit=None)
- _window_to_s3_path(self, path: str)
  Handle Windows "\" path separators: "s3://bucket\prefix" => "s3://bucket/prefix"
  >>> path = os.path.join("s3://bucket", "prefix")
  >>> window_to_s3_path(path)
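The conversion described for `_window_to_s3_path` can be pictured with a small Python sketch. It is a stand-in written for illustration, not the method's actual body: the function name `to_s3_path` is hypothetical, and the Windows-style input is written as a literal so the example behaves the same on every platform.

```python
def to_s3_path(path: str) -> str:
    """Replace Windows backslash separators with the forward slashes S3 keys use."""
    return path.replace("\\", "/")


if __name__ == "__main__":
    # What os.path.join("s3://bucket", "prefix") would typically yield on Windows.
    windows_style = "s3://bucket\\prefix"
    print(to_s3_path(windows_style))
```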
class great_expectations.datasource.batch_kwargs_generator.SubdirReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, known_extensions=None, reader_method=None)
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
The SubdirReaderBatchKwargsGenerator inspects a filesystem and produces path-based batch_kwargs.
SubdirReaderBatchKwargsGenerator recognizes data assets using two criteria:
- for files directly in 'base_directory' with recognized extensions (.csv, .tsv, .parquet, .xls, .xlsx, .json, .csv.gz, .tsv.gz, .feather, .pkl), it uses the name of the file without the extension
- for other files or directories in 'base_directory', it uses the file or directory name
SubdirReaderBatchKwargsGenerator sees all files inside a directory of base_directory as batches of one datasource.
SubdirReaderBatchKwargsGenerator can also include configured reader_options which will be added to batch_kwargs generated by this generator.
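The two naming criteria above can be sketched in Python as follows. This is an illustration under assumptions, not the generator's code: `asset_name_for` is a hypothetical helper, the extension list is copied from the description, and checking longer extensions first (so that .csv.gz wins over a shorter suffix) is a choice made for this example.

```python
# Recognized extensions as listed in the description above.
KNOWN_EXTENSIONS = [
    ".csv", ".tsv", ".parquet", ".xls", ".xlsx",
    ".json", ".csv.gz", ".tsv.gz", ".feather", ".pkl",
]


def asset_name_for(entry_name):
    """Derive a data asset name from a file or directory name in base_directory."""
    # Try longer extensions first so compound suffixes are stripped whole.
    for extension in sorted(KNOWN_EXTENSIONS, key=len, reverse=True):
        if entry_name.endswith(extension):
            return entry_name[: -len(extension)]
    # Other files and directories keep their full name.
    return entry_name


if __name__ == "__main__":
    for entry in ("events.csv.gz", "users.parquet", "archive", "notes.txt"):
        print(entry, "->", asset_name_for(entry))
```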
- _default_reader_options
- recognized_batch_parameters
- property reader_options(self)
- property known_extensions(self)
- property reader_method(self)
- property base_directory(self)
- get_available_data_asset_names(self)
  Return the list of asset names known by this batch kwargs generator.
  Returns: A list of available names
- get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
  Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
  Parameters: data_asset_name – the data asset whose partitions should be returned.
  Returns: A list of partition_id strings
- _build_batch_kwargs(self, batch_parameters)
  Parameters: batch_parameters
  Returns: batch_kwargs
- _get_valid_file_options(self, base_directory=None)
- _get_iterator(self, data_asset_name, reader_options=None, limit=None)
- _build_batch_kwargs_path_iter(self, path_list, reader_options=None, limit=None)
- _build_batch_kwargs_from_path(self, path, reader_method=None, reader_options=None, limit=None)
class great_expectations.datasource.batch_kwargs_generator.TableBatchKwargsGenerator(name='default', datasource=None, assets=None)
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
Provide access to already materialized tables or views in a database.
TableBatchKwargsGenerator can be used to define specific data asset names that take and substitute parameters, for example to support referring to the same data asset but with different schemas depending on provided batch_kwargs.
The Python template language ($name placeholders) is used to substitute table name portions. For example, consider the following configuration:
    my_generator:
      class_name: TableBatchKwargsGenerator
      assets:
        my_table:
          schema: $schema
          table: my_table
In that case, the asset my_datasource/my_generator/my_table will refer to a table called my_table in a schema defined in batch_kwargs.
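A minimal Python sketch of how a `$schema` placeholder might be resolved is shown here, assuming `string.Template`-style substitution. The function `resolve_table`, the parameter dictionary, and the schema name "analytics" are hypothetical and only serve to show the idea.

```python
from string import Template

# Asset definition matching the example configuration above.
ASSET_CONFIG = {"schema": "$schema", "table": "my_table"}


def resolve_table(asset_config, parameters):
    """Substitute $name placeholders in the schema and table entries."""
    schema = Template(asset_config["schema"]).substitute(parameters)
    table = Template(asset_config["table"]).substitute(parameters)
    return {"schema": schema, "table": table}


if __name__ == "__main__":
    print(resolve_table(ASSET_CONFIG, {"schema": "analytics"}))
```

The same asset definition can therefore point at my_table in different schemas, depending on the parameters supplied with each batch request.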
- recognized_batch_parameters
- _get_iterator(self, data_asset_name, query_parameters=None, limit=None, offset=None, partition_id=None)
- get_available_data_asset_names(self)
  Return the list of asset names known by this batch kwargs generator.
  Returns: A list of available names
- _build_batch_kwargs(self, batch_parameters)
- get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
  Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
  Parameters: data_asset_name – the data asset whose partitions should be returned.
  Returns: A list of partition_id strings