great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator

Module Contents

Classes

S3GlobReaderBatchKwargsGenerator(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, boto3_options=None, max_keys=1000)

S3 BatchKwargGenerator provides support for generating batches of data from an S3 bucket.

great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator.boto3
great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator.logger
class great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator.S3GlobReaderBatchKwargsGenerator(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, boto3_options=None, max_keys=1000)

Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator

S3 BatchKwargGenerator provides support for generating batches of data from an S3 bucket. For the S3 batch kwargs generator, each asset must be defined individually using a prefix and glob. Several additional configuration parameters are available for assets (see below).

Example configuration:

datasources:
  my_datasource:
    ...
    batch_kwargs_generator:
      my_s3_generator:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: my_bucket.my_organization.priv
        reader_method: parquet  # This will be automatically inferred from suffix where possible, but can be explicitly specified as well
        reader_options:  # Note that reader options can be specified globally or per-asset
          sep: ","
        delimiter: "/"  # Note that this is the delimiter for the BUCKET KEYS. By default it is "/"
        boto3_options:
          endpoint_url: $S3_ENDPOINT  # Use the S3_ENDPOINT environment variable to determine which endpoint to use
        max_keys: 100  # The maximum number of keys to fetch in a single list_objects request to S3. When accessing batch_kwargs through an iterator, the iterator will silently refetch if more keys are available
        assets:
          my_first_asset:
            prefix: my_first_asset/
            regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
            directory_assets: True  # If True, the contents of the directory will be treated as one batch. Note that this option does not work with Pandas, since Pandas does not support loading multiple files from an S3 bucket into a data frame.
          access_logs:
            prefix: access_logs
            regex_filter: access_logs/2019.*\.csv\.gz
            sep: "~"
            max_keys: 100
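
The interaction between prefix, regex_filter and max_keys in the configuration above can be illustrated with a minimal sketch. It does not call S3 or Great Expectations. FAKE_KEYS, fake_list_page and iter_matching_keys are hypothetical names invented for this illustration, and the use of re.match for the filter is an assumption rather than a statement about the generator's internals. The sketch lists keys one page at a time (at most max_keys per page), keeps fetching while more keys remain, and yields only the keys that match the regex.

```python
import re
from typing import Iterator, List, Optional, Tuple

# Hypothetical stand-in for the keys stored in a bucket (illustration only).
FAKE_KEYS = [
    "access_logs/2018-12-31.csv.gz",
    "access_logs/2019-01-01.csv.gz",
    "access_logs/2019-01-02.csv.gz",
    "access_logs/2019-01-02.txt",
    "my_first_asset/part-0.parquet",
]


def fake_list_page(
    prefix: str, max_keys: int, start: int = 0
) -> Tuple[List[str], Optional[int]]:
    """Return one page of keys under `prefix` and the index of the next page.

    The second element is None when no further keys are available,
    playing the role of a missing continuation token.
    """
    matching = [key for key in FAKE_KEYS if key.startswith(prefix)]
    page = matching[start:start + max_keys]
    next_start = start + max_keys
    return page, (next_start if next_start < len(matching) else None)


def iter_matching_keys(
    prefix: str, regex_filter: str, max_keys: int
) -> Iterator[str]:
    """Yield keys under `prefix` that match `regex_filter`, page by page."""
    pattern = re.compile(regex_filter)
    start: Optional[int] = 0
    while start is not None:
        # Fetch the next page; the caller never sees the page boundary.
        page, start = fake_list_page(prefix, max_keys, start)
        for key in page:
            # Assumption: the filter is anchored at the start of the key.
            if pattern.match(key):
                yield key


keys = list(
    iter_matching_keys(
        prefix="access_logs",
        regex_filter=r"access_logs/2019.*\.csv\.gz",
        max_keys=2,
    )
)
print(keys)
```

With max_keys=2 the four keys under the access_logs prefix arrive in two pages, yet the caller simply iterates over the filtered result.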
recognized_batch_parameters
property reader_options(self)
property assets(self)
property bucket(self)
get_available_data_asset_names(self)

Return the list of asset names known by this batch kwargs generator.

Returns

A list of available names

_get_iterator(self, data_asset_name, reader_method=None, reader_options=None, limit=None)
_build_batch_kwargs_path_iter(self, path_list, reader_options=None, limit=None)
_build_batch_kwargs(self, batch_parameters)
_build_batch_kwargs_from_key(self, key, asset_config=None, reader_method=None, reader_options=None, limit=None)
_get_asset_options(self, asset_config, iterator_dict)
_build_asset_iterator(self, asset_config, iterator_dict, reader_method=None, reader_options=None, limit=None)
get_available_partition_ids(self, generator_asset=None, data_asset_name=None)

Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

data_asset_name – the data asset whose partitions should be returned.

Returns

A list of partition_id strings
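
As a rough illustration of applying a partitioner to the keys of an asset, the sketch below derives one partition_id per key. It is not the library's implementation. The functions partition_id_for_key and available_partition_ids, the "partition_regex" entry in the asset configuration, and the fallback to the full key are all assumptions made for this example.

```python
import re
from typing import Dict, List


def partition_id_for_key(key: str, asset_config: Dict[str, str]) -> str:
    """Derive a partition_id from a key (hypothetical rule).

    Assumption: if the asset config has a "partition_regex" whose first
    capture group matches the key, that group is the partition_id;
    otherwise the key itself is used.
    """
    regex = asset_config.get("partition_regex")
    if regex is None:
        return key
    match = re.match(regex, key)
    return match.group(1) if match else key


def available_partition_ids(
    keys: List[str], asset_config: Dict[str, str]
) -> List[str]:
    """Apply the partitioner to every key and collect the resulting ids."""
    return [partition_id_for_key(key, asset_config) for key in keys]


example_keys = [
    "access_logs/2019-01-01.csv.gz",
    "access_logs/2019-01-02.csv.gz",
]
example_config = {
    "partition_regex": r"access_logs/(\d{4}-\d{2}-\d{2})\.csv\.gz",
}
partition_ids = available_partition_ids(example_keys, example_config)
print(partition_ids)
```

Under these assumptions each key maps to the date embedded in its name, and a key with no configured regex would be its own partition_id.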

_partitioner(self, key, asset_config)