Retrieve a Batch of sample data
Expectations can be individually validated against a Batch of data. This allows you to test newly created Expectations, or to create and validate Expectations to further your understanding of new data. But first, you must retrieve a Batch of data to validate your Expectations against.
GX provides two methods of retrieving sample data for testing or data exploration. The first is to request a Batch of data from any Batch Definition you have previously configured. The second is to use the built-in pandas_default Data Source to read a Batch of data from a datafile, such as a .csv or .parquet file, without first defining a corresponding Data Source, Data Asset, and Batch Definition.
- Batch Definition
- pandas_default
Batch Definitions both organize a Data Asset's records into Batches and provide a method for retrieving those records. Any Batch Definition can be used to retrieve a Batch of records for use in testing Expectations or data exploration.
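If you have not yet configured a Batch Definition, the following is a minimal sketch of how one might be added. It assumes a SQL table Data Asset; the Data Source, Data Asset, and column names are hypothetical placeholders for your own configuration.
Python
import great_expectations as gx

context = gx.get_context()

# Retrieve an existing Data Asset (hypothetical names).
data_asset = context.data_sources.get("my_data_source").get_asset("my_data_asset")

# A whole table Batch Definition provides all of the Data Asset's records as a single Batch.
whole_table_batch_definition = data_asset.add_batch_definition_whole_table(
    name="my_whole_table_batch_definition"
)

# A monthly Batch Definition subdivides the records by the year and month of a
# date column ("pickup_datetime" is a hypothetical column name).
monthly_batch_definition = data_asset.add_batch_definition_monthly(
    name="my_monthly_batch_definition", column="pickup_datetime"
)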
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
- A preconfigured Data Source, Data Asset, and Batch Definition connected to your data.
Procedure
- Instructions
- Sample code
- Retrieve your Batch Definition.
Update the values of data_source_name, data_asset_name, and batch_definition_name in the following code and execute it to retrieve your Batch Definition from the Data Context:
Python
data_source_name = "my_data_source"
data_asset_name = "my_data_asset"
batch_definition_name = "my_batch_definition"
batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)
- Optional. Specify the Batch to retrieve.
Some Batch Definitions can only provide a single Batch. Whole table Batch Definitions on SQL Data Assets, file path and whole directory Batch Definitions on filesystem Data Assets, and all Batch Definitions for dataframe Data Assets will provide all of the Data Asset's records as a single Batch. For these Batch Definitions there is no need to specify which Batch to retrieve because there is only one available.
Yearly, monthly, and daily Batch Definitions subdivide the Data Asset's records by date. This allows you to retrieve the data corresponding to a specific date from the Data Asset. If you do not specify a Batch to retrieve, these Batch Definitions will return the first valid Batch they find. By default, this will be the most recent Batch (sort ascending) or the oldest Batch if the Batch Definition has been configured to sort descending. A sketch of how sort order might be configured follows this step.
Sorting of records with invalid dates: Records that are missing the date information necessary to be sorted into a Batch will be treated as the "oldest" records and will be returned first when a Batch Definition is set to sort descending.
You are not limited to retrieving only the most recent (or oldest, if the Batch Definition is set to sort descending) Batch. You can also request a specific Batch by providing a Batch Parameter dictionary.
A Batch Parameter dictionary has keys indicating the year, month, and day of the data to retrieve, with values corresponding to those date components. Which keys are valid Batch Parameters depends on the type of date the Batch Definition is configured for:
- Yearly Batch Definitions accept the key year.
- Monthly Batch Definitions accept the keys year and month.
- Daily Batch Definitions accept the keys year, month, and day.
If the Batch Parameter dictionary is missing a key, the returned Batch will be the first Batch (as determined by the Batch Definition's sort ascending or sort descending configuration) that matches the date components that were provided.
The following are some sample Batch Parameter dictionaries for progressively more specific dates:
Python
yearly_batch_parameters = {"year": "2019"}
monthly_batch_parameters = {"year": "2019", "month": "01"}
daily_batch_parameters = {"year": "2019", "month": "01", "day": "01"}
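As referenced above, the following is a minimal sketch of how a daily Batch Definition's sort order might be configured and how a partial Batch Parameter dictionary behaves. The Data Asset and column names are hypothetical, and the sort_ascending argument is an assumption that may differ in your version of GX.
Python
# Hypothetical Data Asset and column names; adjust these to match your configuration.
data_asset = context.data_sources.get("my_data_source").get_asset("my_data_asset")

# sort_ascending=False is assumed here to configure a descending sort, so the
# oldest matching Batch is returned first.
daily_batch_definition = data_asset.add_batch_definition_daily(
    name="my_daily_batch_definition", column="pickup_datetime", sort_ascending=False
)

# Only the "year" key is provided, so the first Batch from 2019 (as determined
# by the configured sort order) is returned.
batch = daily_batch_definition.get_batch(batch_parameters={"year": "2019"})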
- Retrieve a Batch of data.
The Batch Definition's .get_batch(...) method is used to retrieve a Batch of data. The Batch Parameters provided to this method determine whether the first valid Batch or a Batch for a specific date is returned.
- First valid Batch
- Specific Batch
Execute the following code to retrieve the first available Batch from the Batch Definition:
Python
batch = batch_definition.get_batch()
Update the Batch Parameters in the following code and execute it to retrieve a specific Batch from the Batch Definition:
Python
batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
- Optional. Verify that the returned Batch is populated with records.
You can verify that your Batch Definition was able to read in data and return a populated Batch by printing the header and first few records of the returned Batch:
Python
print(batch.head())
import great_expectations as gx
context = gx.get_context()
# Retrieve the Batch Definition:
data_source_name = "my_data_source"
data_asset_name = "my_data_asset"
batch_definition_name = "my_batch_definition"
batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)
# Retrieve the first valid Batch of data:
batch = batch_definition.get_batch()
# Or use a Batch Parameter dictionary to specify a Batch to retrieve
# These are sample Batch Parameter dictionaries:
yearly_batch_parameters = {"year": "2019"}
monthly_batch_parameters = {"year": "2019", "month": "01"}
daily_batch_parameters = {"year": "2019", "month": "01", "day": "01"}
# This code retrieves the Batch from a monthly Batch Definition:
batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
print(batch.head())
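Once you have a Batch, you can test an Expectation against it, which is the use case described at the start of this topic. The following is a minimal sketch; the Data Source, Data Asset, Batch Definition, and column names are hypothetical placeholders.
Python
import great_expectations as gx
import great_expectations.expectations as gxe

context = gx.get_context()

# Retrieve a Batch as shown above (hypothetical names).
batch_definition = (
    context.data_sources.get("my_data_source")
    .get_asset("my_data_asset")
    .get_batch_definition("my_batch_definition")
)
batch = batch_definition.get_batch()

# Validate a single Expectation against the Batch; "passenger_count" is a
# hypothetical column name.
expectation = gxe.ExpectColumnValuesToNotBeNull(column="passenger_count")
validation_result = batch.validate(expectation)
print(validation_result.success)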
The pandas_default Data Source is built into every Data Context and can be found at .data_sources.pandas_default on your Data Context.
The pandas_default Data Source provides methods to read the contents of a single datafile in any format supported by pandas. These .read_*(...) methods do not create a Data Asset or Batch Definition for the datafile. Instead, they simply return a Batch of data.
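As a brief illustration (the file path is hypothetical), the object returned by a .read_*(...) call is a Batch, so the Batch methods shown elsewhere in this topic apply to it:
Python
# The file path below is hypothetical.
batch = context.data_sources.pandas_default.read_csv("./data/example.csv")

# The returned object is a GX Batch rather than a pandas DataFrame, so Batch
# methods such as .head() are available.
print(batch.head())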
Because the pandas_default Data Source's .read_*(...) methods only return a Batch and do not save configurations for reading files to the Data Context, they are less versatile than a fully configured Data Source, Data Asset, and Batch Definition. Therefore, the pandas_default Data Source is only intended to facilitate testing Expectations and data exploration; its .read_*(...) methods are less suited for use in production and automated workflows.
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
- Data in a file format supported by pandas, such as .csv or .parquet.
Procedure
- Instructions
- Sample code
- Define the path to the datafile.
The simplest method is to provide an absolute path to the datafile that you will retrieve records from. However, if you are using a File Data Context, you can also provide a path relative to the Data Context's base_directory.
The following example specifies a .csv datafile using a relative path:
Python
file_path = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
- Use the appropriate .read_*(...) method of the pandas_default Data Source to retrieve a Batch of data.
The pandas_default Data Source can read any file format supported by your current installation of pandas. Its .read_*(...) methods will return a Batch that contains all of the records in the provided datafile.
The following example reads a .csv file into a Batch of data:
Python
sample_batch = context.data_sources.pandas_default.read_csv(file_path)
GX supports all of the pandas .read_*(...) methods; a sketch showing a few other .read_*(...) methods follows the sample code at the end of this section. For more information on which pandas read_* methods are available, reference the official pandas Input/Output documentation for the version of pandas that you have installed.
- Optional. Verify that the returned Batch is populated with records.
You can verify that the pandas_default Data Source was able to read in data and return a populated Batch by printing the header and first few records of the returned Batch:
Python
print(sample_batch.head())
import great_expectations as gx
context = gx.get_context()
# Provide the path to a data file:
file_path = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
# Use the `pandas_default` Data Source to read the file:
sample_batch = context.data_sources.pandas_default.read_csv(file_path)
# Verify that data was read into `sample_batch`:
print(sample_batch.head())
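As referenced in the procedure above, the pandas_default Data Source also exposes the other pandas .read_*(...) methods. The following is a minimal sketch using hypothetical file paths; the formats you can read depend on your installed version of pandas and its optional dependencies (for example, a Parquet engine such as pyarrow).
Python
import great_expectations as gx

context = gx.get_context()

# The file paths below are hypothetical.
parquet_batch = context.data_sources.pandas_default.read_parquet(
    "./data/folder_with_data/sample_data.parquet"
)
json_batch = context.data_sources.pandas_default.read_json(
    "./data/folder_with_data/sample_data.json"
)

# Verify that each Batch is populated with records.
print(parquet_batch.head())
print(json_batch.head())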