great_expectations.datasource.batch_kwargs_generator.glob_reader_batch_kwargs_generator
Module Contents
Classes
GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data.
-
great_expectations.datasource.batch_kwargs_generator.glob_reader_batch_kwargs_generator.logger
-
class great_expectations.datasource.batch_kwargs_generator.glob_reader_batch_kwargs_generator.GlobReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, asset_globs=None, reader_method=None)
Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator
GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data.
An asset_globs entry that also extracts a partition_id from each filename might look like the following:
daily_logs:
  glob: daily_logs/*.csv
  partition_regex: daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv
The “glob” key ensures that every csv file in the daily_logs directory is considered a batch for this data asset. The “partition_regex” key ensures that files whose basename begins with a date (with components separated by a hyphen, space, forward slash, or period) are identified by a partition_id equal to just the date portion of their name.
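As a rough illustration of how such a partition_regex picks out the date, the sketch below applies the pattern from the example with Python's standard `re` module. The helper name `partition_id_for` and the sample paths are hypothetical, and the choice of the first capture group as the partition_id is an assumption based on the description above, not a statement about the library's internals.

```python
import re

# Pattern copied from the daily_logs example above.
PARTITION_REGEX = (
    r"daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.]"
    r"(0[1-9]|[12][0-9]|3[01]))_(.*)\.csv"
)


def partition_id_for(path, group=1):
    """Hypothetical helper: return the date portion of `path`, or None.

    Assumption: the partition_id is the first capture group of the regex.
    """
    match = re.match(PARTITION_REGEX, path)
    return match.group(group) if match else None


# Sample (made-up) paths relative to the base directory.
for sample in (
    "daily_logs/2020-01-15_app.csv",
    "daily_logs/1999.12.31_legacy.csv",
    "daily_logs/notes.csv",
):
    print(sample, "->", partition_id_for(sample))
```

A file such as `daily_logs/notes.csv` still matches the glob but not the regex, so this sketch returns `None` for it; how the generator itself labels such files is not covered here.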
A fully configured GlobReaderBatchKwargsGenerator in yml might look like the following:
my_datasource:
  class_name: PandasDatasource
  batch_kwargs_generators:
    my_generator:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/log
      reader_options:
        sep: "%"
        header: 0
      reader_method: csv
      asset_globs:
        wifi_logs:
          glob: wifi*.log
          partition_regex: wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log
          reader_method: csv
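To make the roles of base_directory, glob, and partition_regex concrete, here is a minimal standard-library sketch that mimics the documented behaviour for the wifi_logs asset. It does not use Great Expectations itself: `ASSET_GLOBS`, `list_partition_ids`, `demo`, and the file names are all hypothetical, and it assumes the regex is applied to the path relative to base_directory with the first capture group taken as the partition_id.

```python
import glob
import os
import re
import tempfile

# Hypothetical Python mirror of the asset_globs section in the YAML above.
ASSET_GLOBS = {
    "wifi_logs": {
        "glob": "wifi*.log",
        "partition_regex": (
            r"wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log"
        ),
    }
}


def list_partition_ids(base_directory, asset_name, asset_globs):
    """Return sorted partition ids for files matching the asset's glob.

    Assumptions: the regex is matched against the path relative to
    base_directory, and group 1 is the partition id.
    """
    config = asset_globs[asset_name]
    pattern = re.compile(config["partition_regex"])
    ids = []
    for path in glob.glob(os.path.join(base_directory, config["glob"])):
        relative = os.path.relpath(path, base_directory)
        match = pattern.match(relative)
        if match:  # files matching the glob but not the regex are skipped
            ids.append(match.group(1))
    return sorted(ids)


def demo():
    """Create a few empty sample files and list their partition ids."""
    names = (
        "wifi-01-15-2020.log",
        "wifi-02-03-2020-east.log",
        "wifi-notes.log",  # matches the glob, not the regex
        "other.txt",       # matches neither
    )
    with tempfile.TemporaryDirectory() as base:
        for name in names:
            open(os.path.join(base, name), "w").close()
        return list_partition_ids(base, "wifi_logs", ASSET_GLOBS)


print(demo())
```

In this sketch, files that match the glob but not the regex are simply dropped; the real generator may treat them differently.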
-
recognized_batch_parameters
-
property reader_options(self)
-
property asset_globs(self)
-
property reader_method(self)
-
property base_directory(self)
-
get_available_data_asset_names(self)
Return the list of asset names known by this batch kwargs generator.
- Returns
A list of available names
-
get_available_partition_ids(self, generator_asset=None, data_asset_name=None)
Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.
- Parameters
data_asset_name – the data asset whose partitions should be returned.
- Returns
A list of partition_id strings
-
_build_batch_kwargs(self, batch_parameters)
-
_get_data_asset_paths(self, data_asset_name)
Returns a list of file paths associated with the given data_asset_name.
- Parameters
data_asset_name – the data asset whose file paths should be returned.
- Returns
paths (list)
-
_get_data_asset_config(self, data_asset_name)
-
_get_iterator(self, data_asset_name, reader_method=None, reader_options=None, limit=None)
-
_build_batch_kwargs_path_iter(self, path_list, glob_config, reader_method=None, reader_options=None, limit=None)
-
_build_batch_kwargs_from_path(self, path, glob_config, reader_method=None, reader_options=None, limit=None)
-
_partitioner(self, path, glob_config)
-