Datasource Module

class great_expectations.datasource.Datasource(name, type_, data_context=None, generators=None)

Datasources are responsible for connecting to data infrastructure. Each Datasource is a source of materialized data, such as a SQL database, S3 bucket, or local file directory.

Each Datasource also provides access to Great Expectations data assets that are connected to a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory Pandas Dataframe.

To bridge the gap between those worlds, Datasources interact closely with generators, which are aware of a source of data and can produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. Generators add flexibility in how to obtain data, supporting techniques such as time-based partitioning, downsampling, or other approaches appropriate for the datasource.

For example, a generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
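To make the division of labor concrete, here is a minimal sketch in plain Python of a generator producing batch_kwargs that a datasource could consume. The function and key names here are illustrative, not taken from the library:

```python
from datetime import date, timedelta

def daily_query_generator(table, start, days):
    """Yield one batch_kwargs dict per day, each identifying a single batch.

    The 'query' key is illustrative: a SQL-backed datasource could use it
    to materialize the corresponding batch of rows for validation.
    """
    for offset in range(days):
        day = start + timedelta(days=offset)
        yield {
            "query": f"SELECT * FROM {table} "
                     f"WHERE DATE(timestamp) = '{day.isoformat()}'"
        }

# Each batch_kwargs dict identifies one batch of data:
batches = list(daily_query_generator("events", date(2012, 2, 7), 2))
```

The datasource never needs to know how the partitioning was chosen; it only consumes the resulting batch_kwargs.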

Opinionated DAG managers such as airflow, dbt, prefect.io, or dagster can also act as datasources and/or generators for a more generic datasource.

classmethod from_configuration(**kwargs)

Build a new datasource from a configuration dictionary.

Parameters:**kwargs – configuration key-value pairs
Returns:the newly-created datasource
Return type:datasource (Datasource)
data_context

Property for attached DataContext

name

Property for datasource name

get_credentials(profile_name)

Return credentials for the named profile from the attached data context.

Parameters:profile_name – the name of the profile for which to return credentials

Returns:the credentials for the named profile

get_config()

Get the current configuration.

Returns:datasource configuration dictionary
save_config()

Save the datasource config.

If there is no attached DataContext, a datasource will save its config in the current directory in a file called “great_expectations.yml”.

Returns:None
add_generator(name, type_, **kwargs)

Add a generator to the datasource.

The generator type_ must be one of the recognized types for the datasource.

Parameters:
  • name (str) – the name of the new generator to add
  • type_ (str) – the type of the new generator to add
  • kwargs – additional keyword arguments will be passed directly to the new generator’s constructor
Returns:

generator (Generator)

get_generator(generator_name='default')

Get the (named) generator from a datasource.

Parameters:generator_name (str) – name of generator (default value is ‘default’)
Returns:generator (Generator)
list_generators()

List currently-configured generators for this datasource.

Returns:each dictionary includes “name” and “type” keys
Return type:List(dict)
get_batch(data_asset_name, expectation_suite_name='default', batch_kwargs=None, **kwargs)

Get a batch of data from the datasource.

If a DataContext is attached, then expectation_suite_name can be used to define an expectation suite to attach to the data_asset being fetched. Otherwise, the expectation suite will be empty.

If no batch_kwargs are specified, the next batch_kwargs for the named data_asset will be fetched from the generator.

Specific datasource types implement the internal _get_data_asset method to use appropriate batch_kwargs to construct and return GE data_asset objects.

Parameters:
  • data_asset_name – the name of the data asset for which to fetch data.
  • expectation_suite_name – the name of the expectation suite to attach to the batch
  • batch_kwargs – dictionary of key-value pairs describing the batch to get, or a single identifier if that can be unambiguously translated to batch_kwargs
  • **kwargs – Additional key-value pairs to pass to the datasource, such as reader parameters
Returns:

A data_asset consisting of the specified batch of data with the named expectation suite connected.
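The fallback behavior described above (use the caller's batch_kwargs if given, otherwise ask the generator for the next ones) can be sketched in plain Python. The class and method names below are illustrative stand-ins, not the library's API:

```python
class ListBatchKwargsGenerator:
    """Toy generator holding precomputed batch_kwargs per data asset."""

    def __init__(self, assets):
        self._assets = assets  # {asset_name: list of batch_kwargs dicts}

    def yield_batch_kwargs(self, data_asset_name):
        yield from self._assets[data_asset_name]

def get_batch_kwargs(generator, data_asset_name, batch_kwargs=None):
    # If the caller supplied batch_kwargs, use them as-is; otherwise
    # fetch the next kwargs from the generator, as get_batch does.
    if batch_kwargs is None:
        batch_kwargs = next(generator.yield_batch_kwargs(data_asset_name))
    return batch_kwargs

gen = ListBatchKwargsGenerator(
    {"events": [{"path": "data/events_2012-02-07.csv"}]}
)
```

In the real method, the resulting batch_kwargs are then handed to the datasource's internal _get_data_asset to construct the data_asset object.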

get_available_data_asset_names(generator_names=None)

Returns a dictionary of data_asset_names that the specified generator can provide. Note that some generators, such as the “no-op” in-memory generator, may not be capable of describing specific named data assets, and some generators (such as filesystem glob generators) require the user to configure data asset names.

Parameters:generator_names – the generators for which to fetch available data asset names.
Returns:
{
  generator_name: [ data_asset_1, data_asset_2, ... ]
  ...
}
Return type:dictionary consisting of sets of generator assets available for the specified generators
build_batch_kwargs(*args, **kwargs)

Datasource-specific logic that can handle translation of in-line batch identification information to batch_kwargs understandable by the provided datasource.

For example, a PandasDatasource may construct a filesystem path from positional arguments to provide an easy way of specifying the batch needed by the user.

Parameters:
  • *args – positional arguments used by the datasource
  • **kwargs – key-value pairs used by the datasource
Returns:

a batch_kwargs dictionary understandable by the datasource
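The path-building behavior described for PandasDatasource can be sketched as follows. This is an illustrative stand-alone function, not the library's implementation, and the key names are assumptions:

```python
import os

def build_batch_kwargs_sketch(*args, **kwargs):
    """Illustrative only: treat the first positional argument as a
    filesystem path, as a PandasDatasource might, and fold remaining
    keyword arguments (e.g. reader options) into the batch_kwargs."""
    batch_kwargs = {"path": os.path.abspath(args[0])}
    batch_kwargs.update(kwargs)
    return batch_kwargs

# A user-friendly call becomes a batch_kwargs dict the datasource understands:
bk = build_batch_kwargs_sketch("data/events.csv", sep="|")
```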

get_data_context()

Getter for the currently-configured data context.

great_expectations.datasource.pandas_source

class great_expectations.datasource.pandas_source.PandasDatasource(name='pandas', data_context=None, generators=None, **kwargs)

Bases: great_expectations.datasource.datasource.Datasource

The PandasDatasource produces PandasDataset objects and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and with existing in-memory dataframes.

build_batch_kwargs(*args, **kwargs)

Datasource-specific logic that can handle translation of in-line batch identification information to batch_kwargs understandable by the provided datasource.

For example, a PandasDatasource may construct a filesystem path from positional arguments to provide an easy way of specifying the batch needed by the user.

Parameters:
  • *args – positional arguments used by the datasource
  • **kwargs – key-value pairs used by the datasource
Returns:

a batch_kwargs dictionary understandable by the datasource

great_expectations.datasource.sqlalchemy_source

class great_expectations.datasource.sqlalchemy_source.SqlAlchemyDatasource(name='default', data_context=None, profile=None, generators=None, **kwargs)

Bases: great_expectations.datasource.datasource.Datasource

A SqlAlchemyDatasource will provide data_assets, converting batch_kwargs using the following rules:
  • if the batch_kwargs include a table key, the datasource will provide a dataset object connected to that table
  • if the batch_kwargs include a query key, the datasource will create a temporary table using that query. The query can be parameterized according to the standard Python Template engine, which uses $parameter, with additional kwargs passed to the get_batch method.
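The query parameterization mentioned above follows Python's standard string.Template syntax. A minimal sketch, in which the table and parameter names are illustrative:

```python
from string import Template

# A parameterized query as it might appear in batch_kwargs; $run_date
# would be filled in from a keyword argument passed to get_batch.
query = Template("SELECT * FROM events WHERE event_date = '$run_date'")
rendered = query.safe_substitute(run_date="2012-02-07")
```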
build_batch_kwargs(*args, **kwargs)

Magically build batch_kwargs by guessing that the first non-keyword argument is a table name

great_expectations.datasource.spark_source

class great_expectations.datasource.spark_source.SparkDFDatasource(name='default', data_context=None, generators=None, **kwargs)

Bases: great_expectations.datasource.datasource.Datasource

The SparkDFDatasource produces SparkDFDatasets and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and Databricks notebooks.

build_batch_kwargs(*args, **kwargs)

Datasource-specific logic that can handle translation of in-line batch identification information to batch_kwargs understandable by the provided datasource.

For example, a datasource may construct a filesystem path from positional arguments to provide an easy way of specifying the batch needed by the user.

Parameters:
  • *args – positional arguments used by the datasource
  • **kwargs – key-value pairs used by the datasource
Returns:

a batch_kwargs dictionary understandable by the datasource