Datasource Module

class great_expectations.datasource.Datasource(name, data_context=None, data_asset_type=None, generators=None, **kwargs)

A Datasource connects to a compute environment and one or more storage environments and produces batches of data that Great Expectations can validate in that compute environment.

Each Datasource provides Batches connected to a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory Pandas DataFrame.

Datasources use Batch Kwargs to specify how to access data from relevant sources, such as an existing object from a DAG runner, a SQL database, an S3 bucket, or a local filesystem.

To bridge the gap between those worlds, Datasources interact closely with generators, which are aware of a source of data and can produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. Generators add flexibility in how data is obtained, for example through time-based partitioning, downsampling, or other techniques appropriate for the datasource.

For example, a generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
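To make that example concrete, the following minimal sketch shows how a generator-like helper might build batch_kwargs holding such a query. The function name `build_daily_events_batch_kwargs`, the table name `events`, and the column name `timestamp` are hypothetical illustrations and not part of the Great Expectations API; only the use of a `query` key follows the SqlAlchemyDatasource rules described in this module.

```python
import datetime


def build_daily_events_batch_kwargs(day: datetime.date) -> dict:
    """Return batch_kwargs whose query selects one day of rows (hypothetical helper)."""
    next_day = day + datetime.timedelta(days=1)
    # A half-open interval [day, next_day) covers every timestamp on that day.
    query = (
        "SELECT * FROM events "
        f"WHERE timestamp >= '{day.isoformat()}' "
        f"AND timestamp < '{next_day.isoformat()}'"
    )
    # The 'query' key is the one a SqlAlchemyDatasource looks for.
    return {"query": query}


batch_kwargs = build_daily_events_batch_kwargs(datetime.date(2012, 2, 7))
print(batch_kwargs["query"])
```

A dictionary of this shape is what would then be handed to the datasource's get_batch method as batch_kwargs.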

Opinionated DAG managers such as Airflow, dbt, prefect.io, and Dagster can also act as datasources and/or generators for a more generic datasource.

When adding custom expectations by subclassing an existing DataAsset type, use the data_asset_type parameter to configure the datasource to load and return DataAssets of the custom type.

recognized_batch_parameters = {'limit'}
classmethod from_configuration(**kwargs)

Build a new datasource from a configuration dictionary.

Parameters

**kwargs – configuration key-value pairs

Returns

the newly-created datasource

Return type

datasource (Datasource)

classmethod build_configuration(class_name, module_name='great_expectations.datasource', data_asset_type=None, generators=None, **kwargs)

Build a full configuration object for a datasource, potentially including generators with defaults.

Parameters
  • class_name – The name of the class for which to build the config

  • module_name – The name of the module in which the datasource class is located

  • data_asset_type – A ClassConfig dictionary

  • generators – Generator configuration dictionary

  • **kwargs – Additional kwargs to be part of the datasource constructor’s initialization

Returns

A complete datasource configuration.

property name

Property for datasource name

property config
property data_context

Property for attached DataContext

add_generator(name, class_name, **kwargs)

Add a generator to the datasource.

Parameters
  • name (str) – the name of the new generator to add

  • class_name – class of the generator to add

  • kwargs – additional keyword arguments will be passed directly to the new generator’s constructor

Returns

generator (Generator)

get_generator(generator_name)

Get the named generator from the datasource.

Parameters

generator_name (str) – name of generator (default value is ‘default’)

Returns

generator (Generator)

list_generators()

List currently-configured generators for this datasource.

Returns

each dictionary includes “name” and “type” keys

Return type

List(dict)

process_batch_parameters(limit=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters

limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

Returns

a tuple containing all defined batch_parameters and batch_kwargs. Result will include both parameters passed via argument and configured parameters.

Return type

batch_parameters, batch_kwargs
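The contract described above can be illustrated with a plain-Python sketch. It is not the library's implementation: the function `merge_batch_parameters` and its `configured` argument are hypothetical, and the rule that an explicitly passed value overrides a configured default is an assumption made only for illustration.

```python
from typing import Optional, Tuple


def merge_batch_parameters(configured: dict, limit: Optional[int] = None) -> Tuple[dict, dict]:
    """Combine configured defaults with passed arguments (hypothetical sketch)."""
    batch_parameters = dict(configured)
    if limit is not None:
        # Assumption: an explicitly passed value overrides the configured one.
        batch_parameters["limit"] = limit
    # 'limit' is the only recognized batch parameter on the base class,
    # so it is the only one translated into batch kwargs here.
    batch_kwargs = {}
    if "limit" in batch_parameters:
        batch_kwargs["limit"] = batch_parameters["limit"]
    return batch_parameters, batch_kwargs


params, kwargs = merge_batch_parameters({"limit": 1000}, limit=10)
print(params, kwargs)
```

The returned pair mirrors the documented return type: the first element records the parameters that describe the batch, and the second holds the kwargs that would be used to fetch it.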

get_batch(batch_kwargs, batch_parameters=None)

Get a batch of data from the datasource.

Parameters
  • batch_kwargs – the BatchKwargs to use to construct the batch

  • batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.

Returns

Batch

get_available_data_asset_names(generator_names=None)

Returns a dictionary of data_asset_names that the specified generators can provide. Note that some generators may not be capable of describing specific named data assets, and some generators (such as filesystem glob generators) require the user to configure data asset names.

Parameters

generator_names – the generators for which to get available data asset names.

Returns

{
  generator_name: {
    names: [ (data_asset_1, data_asset_1_type), (data_asset_2, data_asset_2_type) ... ]
  }
  ...
}

Return type

dictionary consisting of sets of generator assets available for the specified generators
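A short sketch of how a caller might walk a result of this shape. The sample value is hypothetical: the asset names `events` and `users` and the type labels `file` and `directory` are made up for illustration, and only the nesting (generator name, then a `names` list of name and type pairs) follows the structure documented above.

```python
# Hypothetical return value shaped like the documented structure.
available = {
    "subdir_reader": {
        "names": [("events", "file"), ("users", "directory")],
    },
}


def flatten_asset_names(available: dict) -> list:
    """Return sorted (generator_name, data_asset_name) pairs."""
    pairs = []
    for generator_name, info in available.items():
        # Each entry in 'names' is a (data_asset_name, data_asset_type) tuple.
        for asset_name, _asset_type in info.get("names", []):
            pairs.append((generator_name, asset_name))
    return sorted(pairs)


pairs = flatten_asset_names(available)
print(pairs)
```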

build_batch_kwargs(generator, name=None, partition_id=None, **kwargs)

PandasDatasource

class great_expectations.datasource.pandas_datasource.PandasDatasource(name='pandas', data_context=None, data_asset_type=None, generators=None, boto3_options=None, reader_method=None, reader_options=None, limit=None, **kwargs)

Bases: great_expectations.datasource.datasource.Datasource

The PandasDatasource produces PandasDataset objects and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and with existing in-memory dataframes.

recognized_batch_parameters = {'limit', 'reader_method', 'reader_options'}
classmethod build_configuration(data_asset_type=None, generators=None, boto3_options=None, reader_method=None, reader_options=None, limit=None, **kwargs)

Build a full configuration object for a datasource, potentially including generators with defaults.

Parameters
  • data_asset_type – A ClassConfig dictionary

  • generators – Generator configuration dictionary

  • boto3_options – Optional dictionary with key-value pairs to pass to boto3 during instantiation.

  • reader_method – Optional default reader_method for generated batches

  • reader_options – Optional default reader_options for generated batches

  • limit – Optional default limit for generated batches

  • **kwargs – Additional kwargs to be part of the datasource constructor’s initialization

Returns

A complete datasource configuration.

process_batch_parameters(reader_method=None, reader_options=None, limit=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters

limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

Returns

a tuple containing all defined batch_parameters and batch_kwargs. Result will include both parameters passed via argument and configured parameters.

Return type

batch_parameters, batch_kwargs

get_batch(batch_kwargs, batch_parameters=None)

Get a batch of data from the datasource.

Parameters
  • batch_kwargs – the BatchKwargs to use to construct the batch

  • batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.

Returns

Batch

static guess_reader_method_from_path(path)

SqlAlchemyDatasource

class great_expectations.datasource.sqlalchemy_datasource.SqlAlchemyDatasource(name='default', data_context=None, data_asset_type=None, credentials=None, generators=None, **kwargs)

Bases: great_expectations.datasource.datasource.Datasource

A SqlAlchemyDatasource provides data_assets by converting batch_kwargs according to the following rules:
  • if the batch_kwargs include a table key, the datasource will provide a dataset object connected to that table

  • if the batch_kwargs include a query key, the datasource will create a temporary table using that query. The query can be parameterized according to the standard Python Template engine, which uses $parameter, with additional kwargs passed to the get_batch method.
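
The $parameter syntax mentioned in the second rule is that of `string.Template` from the Python standard library. The sketch below shows only the substitution step in isolation; the table name `events`, the column `event_date`, and the parameter names `day` and `max_rows` are hypothetical, and how the datasource itself forwards parameters to the template is not shown here.

```python
from string import Template

# A query template using $parameter placeholders (names are illustrative).
query_template = Template(
    "SELECT * FROM events WHERE event_date = '$day' LIMIT $max_rows"
)

# substitute() raises KeyError if any placeholder is left without a value.
query = query_template.substitute(day="2012-02-07", max_rows=100)
print(query)
```

Template substitution is plain text replacement and performs no SQL escaping, so parameter values should come from trusted sources.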

recognized_batch_parameters = {'limit', 'query_parameters'}
classmethod build_configuration(data_asset_type=None, generators=None, **kwargs)

Build a full configuration object for a datasource, potentially including generators with defaults.

Parameters
  • data_asset_type – A ClassConfig dictionary

  • generators – Generator configuration dictionary

  • **kwargs – Additional kwargs to be part of the datasource constructor’s initialization

Returns

A complete datasource configuration.

get_batch(batch_kwargs, batch_parameters=None)

Get a batch of data from the datasource.

Parameters
  • batch_kwargs – the BatchKwargs to use to construct the batch

  • batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.

Returns

Batch

process_batch_parameters(query_parameters=None, limit=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters

limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

Returns

a tuple containing all defined batch_parameters and batch_kwargs. Result will include both parameters passed via argument and configured parameters.

Return type

batch_parameters, batch_kwargs

SparkDFDatasource

class great_expectations.datasource.sparkdf_datasource.SparkDFDatasource(name='default', data_context=None, data_asset_type=None, generators=None, spark_config=None, **kwargs)

Bases: great_expectations.datasource.datasource.Datasource

The SparkDFDatasource produces SparkDFDatasets and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and Databricks notebooks.

Accepted Batch Kwargs:
  • PathBatchKwargs (“path” or “s3” keys)

  • InMemoryBatchKwargs (“dataset” key)

  • QueryBatchKwargs (“query” key)
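
The key-based distinction in the list above can be sketched as a small dispatch function. `classify_batch_kwargs` is a hypothetical helper, not part of the library, and the order in which the keys are checked is an assumption; only the key names and the three kinds of batch kwargs come from the list.

```python
def classify_batch_kwargs(batch_kwargs: dict) -> str:
    """Name the kind of batch kwargs based on the keys present (hypothetical)."""
    if "path" in batch_kwargs or "s3" in batch_kwargs:
        return "PathBatchKwargs"
    if "dataset" in batch_kwargs:
        return "InMemoryBatchKwargs"
    if "query" in batch_kwargs:
        return "QueryBatchKwargs"
    # None of the documented keys is present.
    raise ValueError("batch_kwargs must include a path, s3, dataset, or query key")


print(classify_batch_kwargs({"path": "/data/events.csv"}))
print(classify_batch_kwargs({"query": "SELECT * FROM events"}))
```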

recognized_batch_parameters = {'limit', 'reader_method', 'reader_options'}
classmethod build_configuration(data_asset_type=None, generators=None, spark_config=None, **kwargs)

Build a full configuration object for a datasource, potentially including generators with defaults.

Parameters
  • data_asset_type – A ClassConfig dictionary

  • generators – Generator configuration dictionary

  • spark_config – dictionary of key-value pairs to pass to the spark builder

  • **kwargs – Additional kwargs to be part of the datasource constructor’s initialization

Returns

A complete datasource configuration.

process_batch_parameters(reader_method=None, reader_options=None, limit=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters

limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

Returns

a tuple containing all defined batch_parameters and batch_kwargs. Result will include both parameters passed via argument and configured parameters.

Return type

batch_parameters, batch_kwargs

get_batch(batch_kwargs, batch_parameters=None)

Get a batch of data from the datasource; this is the class-private implementation of get_data_asset.

static guess_reader_method_from_path(path)

last updated: Aug 13, 2020