Datasource Module¶
-
class
great_expectations.datasource.
Datasource
(name, data_context=None, data_asset_type=None, generators=None, **kwargs)¶ A Datasource connects to a compute environment and one or more storage environments and produces batches of data that Great Expectations can validate in that compute environment.
Each Datasource provides Batches connected to a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory Pandas DataFrame.
Datasources use Batch Kwargs to specify instructions for how to access data from relevant sources such as an existing object from a DAG runner, a SQL database, S3 bucket, or local filesystem.
To bridge the gap between those worlds, Datasources interact closely with generators, which are aware of a source of data and can produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. Generators add flexibility in how to obtain data, such as with time-based partitioning, downsampling, or other techniques appropriate for the datasource.
For example, a generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
Opinionated DAG managers such as Airflow, dbt, Prefect, and Dagster can also act as datasources and/or generators for a more generic datasource.
When adding custom expectations by subclassing an existing DataAsset type, use the data_asset_type parameter to configure the datasource to load and return DataAssets of the custom type.
-
recognized_batch_parameters
= {'limit'}¶
-
classmethod
from_configuration
(**kwargs)¶ Build a new datasource from a configuration dictionary.
- Parameters
**kwargs – configuration key-value pairs
- Returns
the newly-created datasource
- Return type
datasource (Datasource)
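The from_configuration pattern can be sketched generically: the classmethod forwards configuration key-value pairs into the constructor. This is an illustrative stand-in mirroring the documented signature, not the library's internals; the class and argument values below are hypothetical.

```python
# Generic sketch of the from_configuration pattern: configuration
# key-value pairs are forwarded straight into the constructor.
class DatasourceSketch:
    def __init__(self, name, data_context=None, data_asset_type=None,
                 generators=None, **kwargs):
        self.name = name
        self.data_context = data_context
        self.data_asset_type = data_asset_type
        self.generators = generators or {}

    @classmethod
    def from_configuration(cls, **kwargs):
        # Build a new datasource from a configuration dictionary.
        return cls(**kwargs)

ds = DatasourceSketch.from_configuration(
    name="my_datasource",
    generators={"default": {}},
)
```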
-
classmethod
build_configuration
(class_name, module_name='great_expectations.datasource', data_asset_type=None, generators=None, **kwargs)¶ Build a full configuration object for a datasource, potentially including generators with defaults.
- Parameters
class_name – The name of the class for which to build the config
module_name – The name of the module in which the datasource class is located
data_asset_type – A ClassConfig dictionary
generators – Generator configuration dictionary
**kwargs – Additional kwargs to be part of the datasource constructor’s initialization
- Returns
A complete datasource configuration.
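The shape of the configuration such a call assembles can be sketched as follows. Only the documented parameters (class_name, module_name, data_asset_type, generators, extra kwargs) are taken from the text above; the merge behavior and the example generator values are assumptions for illustration.

```python
# Sketch of the configuration dict build_configuration assembles from
# its documented parameters; the merge behavior shown is an assumption.
def build_configuration_sketch(class_name,
                               module_name="great_expectations.datasource",
                               data_asset_type=None,
                               generators=None,
                               **kwargs):
    configuration = {
        "class_name": class_name,
        "module_name": module_name,
    }
    if data_asset_type is not None:
        configuration["data_asset_type"] = data_asset_type
    if generators is not None:
        configuration["generators"] = generators
    # Additional kwargs become part of the datasource's configuration.
    configuration.update(kwargs)
    return configuration

# Hypothetical generator name and base_directory, for illustration only.
config = build_configuration_sketch(
    "PandasDatasource",
    generators={"subdir_reader": {"base_directory": "../data"}},
)
```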
-
property
name
¶ Property for datasource name
-
property
config
¶
-
property
data_context
¶ Property for attached DataContext
-
add_generator
(name, class_name, **kwargs)¶ Add a generator to the datasource.
- Parameters
name (str) – the name of the new generator to add
class_name – class of the generator to add
kwargs – additional keyword arguments will be passed directly to the new generator’s constructor
- Returns
generator (Generator)
-
get_generator
(generator_name)¶ Get the (named) generator from a datasource.
- Parameters
generator_name (str) – name of generator (default value is ‘default’)
- Returns
generator (Generator)
-
list_generators
()¶ List currently-configured generators for this datasource.
- Returns
each dictionary includes “name” and “type” keys
- Return type
List(dict)
-
process_batch_parameters
(limit=None)¶ Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.
- Parameters
limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.
- Returns
a tuple containing all defined batch_parameters and batch_kwargs. Result will include both parameters passed via argument and configured parameters.
- Return type
batch_parameters, batch_kwargs
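The translation described above can be sketched as a merge of configured defaults with passed parameters, with "limit" carried into batch_kwargs. The merge order and internal representation here are assumptions for illustration, not the library's implementation.

```python
# Illustrative sketch of process_batch_parameters: configured parameters
# are combined with parameters passed via argument, and "limit" is
# translated into batch kwargs at the datasource level.
def process_batch_parameters_sketch(configured_defaults, limit=None):
    batch_parameters = dict(configured_defaults)  # configured parameters
    batch_kwargs = {}
    if limit is not None:
        batch_parameters["limit"] = limit  # record the passed parameter
        batch_kwargs["limit"] = limit      # ...and translate it to kwargs
    return batch_parameters, batch_kwargs

params, kwargs = process_batch_parameters_sketch(
    {"reader_method": "csv"}, limit=100
)
```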
-
get_batch
(batch_kwargs, batch_parameters=None)¶ Get a batch of data from the datasource.
- Parameters
batch_kwargs – the BatchKwargs to use to construct the batch
batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.
- Returns
Batch
-
get_available_data_asset_names
(generator_names=None)¶ Returns a dictionary of data_asset_names that the specified generator can provide. Note that some generators may not be capable of describing specific named data assets, and some generators (such as filesystem glob generators) require the user to configure data asset names.
- Parameters
generator_names – the generators for which to get available data asset names.
- Returns
{ generator_name: { names: [ (data_asset_1, data_asset_1_type), (data_asset_2, data_asset_2_type) ... ] } ... }
- Return type
dictionary consisting of sets of generator assets available for the specified generators
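The return shape documented above can be illustrated with a literal dictionary. The generator name, asset names, and asset types below are hypothetical.

```python
# Hypothetical return value of get_available_data_asset_names for a
# datasource with one configured generator; all names are made up.
available = {
    "subdir_reader": {
        "names": [
            ("users", "file"),   # (data_asset_name, data_asset_type) pairs
            ("events", "file"),
        ]
    }
}

# Flatten to just the asset names for one generator:
asset_names = [name for name, _type in available["subdir_reader"]["names"]]
```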
-
build_batch_kwargs
(generator, name=None, partition_id=None, **kwargs)¶
-
PandasDatasource¶
-
class
great_expectations.datasource.pandas_datasource.
PandasDatasource
(name='pandas', data_context=None, data_asset_type=None, generators=None, boto3_options=None, reader_method=None, reader_options=None, limit=None, **kwargs)¶ Bases:
great_expectations.datasource.datasource.Datasource
The PandasDatasource produces PandasDataset objects and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and with existing in-memory dataframes.
-
recognized_batch_parameters
= {'limit', 'reader_method', 'reader_options'}¶
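A hedged sketch of batch kwargs one might pass to a PandasDatasource-style get_batch call, and of what the recognized_batch_parameters set listed above means: only recognized parameters are honored. The file path, the reader_method value, and the filtering logic are assumptions for illustration.

```python
# Hypothetical batch kwargs for a filesystem-backed batch; "path" points
# at a (made-up) CSV file, and reader_method/reader_options follow the
# recognized_batch_parameters listed above.
batch_kwargs = {
    "path": "../data/events.csv",
    "reader_method": "read_csv",           # assumed value, for illustration
    "reader_options": {"sep": ",", "header": 0},
}

# Only recognized batch parameters would be processed as parameters;
# "path" is part of the batch kwargs proper, not a batch parameter.
recognized = {"limit", "reader_method", "reader_options"}
honored = {k: v for k, v in batch_kwargs.items() if k in recognized}
```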
-
classmethod
build_configuration
(data_asset_type=None, generators=None, boto3_options=None, reader_method=None, reader_options=None, limit=None, **kwargs)¶ Build a full configuration object for a datasource, potentially including generators with defaults.
- Parameters
data_asset_type – A ClassConfig dictionary
generators – Generator configuration dictionary
boto3_options – Optional dictionary with key-value pairs to pass to boto3 during instantiation.
reader_method – Optional default reader_method for generated batches
reader_options – Optional default reader_options for generated batches
limit – Optional default limit for generated batches
**kwargs – Additional kwargs to be part of the datasource constructor’s initialization
- Returns
A complete datasource configuration.
-
process_batch_parameters
(reader_method=None, reader_options=None, limit=None)¶ Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.
- Parameters
limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.
- Returns
a tuple containing all defined batch_parameters and batch_kwargs. Result will include both parameters passed via argument and configured parameters.
- Return type
batch_parameters, batch_kwargs
-
get_batch
(batch_kwargs, batch_parameters=None)¶ Get a batch of data from the datasource.
- Parameters
batch_kwargs – the BatchKwargs to use to construct the batch
batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.
- Returns
Batch
-
static
guess_reader_method_from_path
(path)¶
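What a reader-method guesser like this might do can be sketched as an extension-to-reader lookup. The mapping and the returned method names below are assumptions for illustration, not the library's actual table.

```python
import os

# Illustrative reimplementation: map a file extension to a pandas-style
# reader method name; the mapping here is an assumption.
def guess_reader_method_sketch(path):
    _, ext = os.path.splitext(path.lower())
    mapping = {
        ".csv": "read_csv",
        ".tsv": "read_csv",       # tab-separated files also use read_csv
        ".parquet": "read_parquet",
        ".xlsx": "read_excel",
        ".json": "read_json",
    }
    try:
        return mapping[ext]
    except KeyError:
        raise ValueError(f"Unable to determine reader method from path: {path}")
```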
-
SqlAlchemyDatasource¶
-
class
great_expectations.datasource.sqlalchemy_datasource.
SqlAlchemyDatasource
(name='default', data_context=None, data_asset_type=None, credentials=None, generators=None, **kwargs)¶ Bases:
great_expectations.datasource.datasource.Datasource
- A SqlAlchemyDatasource will provide data_assets converting batch_kwargs using the following rules:
if the batch_kwargs include a table key, the datasource will provide a dataset object connected to that table
if the batch_kwargs include a query key, the datasource will create a temporary table using that query. The query can be parameterized according to the standard Python Template engine, which uses $parameter, with additional kwargs passed to the get_batch method.
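The $parameter substitution described above uses Python's standard string.Template engine, which can be demonstrated standalone. The table and column names below are hypothetical.

```python
from string import Template

# $-style query parameterization via the standard Template engine;
# "events", "event_date", and the values are made-up examples.
query = Template(
    "SELECT * FROM events WHERE event_date = '$run_date' LIMIT $limit"
)
rendered = query.substitute(run_date="2012-02-07", limit=1000)
```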
-
recognized_batch_parameters
= {'limit', 'query_parameters'}¶
-
classmethod
build_configuration
(data_asset_type=None, generators=None, **kwargs)¶ Build a full configuration object for a datasource, potentially including generators with defaults.
- Parameters
data_asset_type – A ClassConfig dictionary
generators – Generator configuration dictionary
**kwargs – Additional kwargs to be part of the datasource constructor’s initialization
- Returns
A complete datasource configuration.
-
get_batch
(batch_kwargs, batch_parameters=None)¶ Get a batch of data from the datasource.
- Parameters
batch_kwargs – the BatchKwargs to use to construct the batch
batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.
- Returns
Batch
-
process_batch_parameters
(query_parameters=None, limit=None)¶ Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.
- Parameters
limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.
- Returns
a tuple containing all defined batch_parameters and batch_kwargs. Result will include both parameters passed via argument and configured parameters.
- Return type
batch_parameters, batch_kwargs
-
SparkDFDatasource¶
-
class
great_expectations.datasource.sparkdf_datasource.
SparkDFDatasource
(name='default', data_context=None, data_asset_type=None, generators=None, spark_config=None, **kwargs)¶ Bases:
great_expectations.datasource.datasource.Datasource
The SparkDFDatasource produces SparkDFDatasets and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and Databricks notebooks.
- Accepted Batch Kwargs:
PathBatchKwargs (“path” or “s3” keys)
InMemoryBatchKwargs (“dataset” key)
QueryBatchKwargs (“query” key)
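The three accepted batch-kwargs shapes listed above can be written out as literal dicts. The paths, query, and dataframe placeholder are hypothetical, as is the dispatch function, which only illustrates how the key determines the kwargs type.

```python
# The accepted batch-kwargs shapes, as literal dicts with made-up values.
path_batch_kwargs = {"path": "/data/events.parquet"}        # PathBatchKwargs
s3_batch_kwargs = {"s3": "s3://my-bucket/events.parquet"}   # PathBatchKwargs ("s3" key)
in_memory_batch_kwargs = {"dataset": "<a Spark DataFrame>"} # InMemoryBatchKwargs
query_batch_kwargs = {"query": "SELECT * FROM events"}      # QueryBatchKwargs

def classify(batch_kwargs):
    """Dispatch on the batch-kwargs shape, as a datasource might."""
    if "path" in batch_kwargs or "s3" in batch_kwargs:
        return "path"
    if "dataset" in batch_kwargs:
        return "in_memory"
    if "query" in batch_kwargs:
        return "query"
    raise ValueError("unrecognized batch_kwargs")
```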
-
recognized_batch_parameters
= {'limit', 'reader_method', 'reader_options'}¶
-
classmethod
build_configuration
(data_asset_type=None, generators=None, spark_config=None, **kwargs)¶ Build a full configuration object for a datasource, potentially including generators with defaults.
- Parameters
data_asset_type – A ClassConfig dictionary
generators – Generator configuration dictionary
spark_config – dictionary of key-value pairs to pass to the spark builder
**kwargs – Additional kwargs to be part of the datasource constructor’s initialization
- Returns
A complete datasource configuration.
-
process_batch_parameters
(reader_method=None, reader_options=None, limit=None)¶ Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.
- Parameters
limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.
- Returns
a tuple containing all defined batch_parameters and batch_kwargs. Result will include both parameters passed via argument and configured parameters.
- Return type
batch_parameters, batch_kwargs
-
get_batch
(batch_kwargs, batch_parameters=None)¶ class-private implementation of get_data_asset
-
static
guess_reader_method_from_path
(path)¶
last updated: Aug 13, 2020