Connect to dataframe data
A dataframe is a set of data that resides in memory and is referenced in your code through the variable it is assigned to. To connect to this in-memory data you will define a Data Source based on the type of dataframe you are connecting to, a Data Asset that connects to the dataframe in question, and a Batch Definition that returns all of the records in the dataframe as a single Batch of data.
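For orientation, the following is a condensed sketch of that workflow, assuming a pandas dataframe stored in a hypothetical variable named `df`; each step is covered in detail in the sections below.

import great_expectations as gx
import pandas as pd

# A hypothetical dataframe used only for illustration
df = pd.DataFrame({"passenger_count": [1, 2, 3]})

context = gx.get_context()

# Data Source -> Data Asset -> Batch Definition
data_source = context.data_sources.add_pandas(name="my_data_source")
data_asset = data_source.add_dataframe_asset(name="my_dataframe_data_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")

# The dataframe itself is supplied at runtime as a Batch Parameter
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})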
Create a Data Source
Because dataframes reside in memory, you do not need to specify the location of the data when you create your Data Source. Instead, the type of Data Source you create depends on the type of dataframe containing your data. Great Expectations has methods for connecting to both pandas and Spark dataframes.
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable `context` contains your Data Context.
Procedure
- Define the Data Source parameters.

A dataframe Data Source requires the following information:

- `name`: A name by which to reference the Data Source. This should be unique among all Data Sources on the Data Context.

Update `data_source_name` in the following code with a descriptive name for your Data Source:

data_source_name = "my_data_source"
- Create the Data Source.

To read a pandas dataframe you will need to create a pandas Data Source. Likewise, to read a Spark dataframe you will need to create a Spark Data Source.

Execute the following code to create a pandas Data Source:

data_source = context.data_sources.add_pandas(name=data_source_name)

Execute the following code to create a Spark Data Source:

data_source = context.data_sources.add_spark(name=data_source_name)
Sample code

pandas:
import great_expectations as gx
# Retrieve your Data Context
context = gx.get_context()
# Define the Data Source name
data_source_name = "my_data_source"
# Add the Data Source to the Data Context
data_source = context.data_sources.add_pandas(name=data_source_name)
Spark:

import great_expectations as gx
# Retrieve your Data Context
context = gx.get_context()
# Define the Data Source name
data_source_name = "my_data_source"
# Add the Data Source to the Data Context
data_source = context.data_sources.add_spark(name=data_source_name)
Create a Data Asset
A dataframe Data Asset is used to group your Validation Results. For instance, if you have a data pipeline with three stages and you want the Validation Results for each stage to be grouped together, you would create a Data Asset with a unique name representing each stage.
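A sketch of that approach, using only the methods shown on this page and hypothetical stage names, might look like this:

import great_expectations as gx

context = gx.get_context()
data_source = context.data_sources.add_pandas(name="my_pipeline_data_source")

# One Data Asset per pipeline stage, so Validation Results are grouped by stage.
# The stage names here are hypothetical.
extract_asset = data_source.add_dataframe_asset(name="extract_stage")
transform_asset = data_source.add_dataframe_asset(name="transform_stage")
load_asset = data_source.add_dataframe_asset(name="load_stage")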
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable `context` contains your Data Context.
- A pandas or Spark dataframe Data Source.
Procedure
- Optional. Retrieve your Data Source.

If you do not already have a variable referencing your pandas or Spark Data Source, you can retrieve a previously created one with:

data_source_name = "my_data_source"
data_source = context.data_sources.get(data_source_name)
- Define the Data Asset's parameters.

A dataframe Data Asset requires the following information:

- `name`: A name by which the Data Asset can be referenced. This should be unique among Data Assets on the Data Source.

Update the `data_asset_name` parameter in the following code with a descriptive name for your Data Asset:

data_asset_name = "my_dataframe_data_asset"
- Add a Data Asset to the Data Source.

Execute the following code to add a Data Asset to your Data Source:

data_asset = data_source.add_dataframe_asset(name=data_asset_name)
Sample code

import great_expectations as gx
context = gx.get_context()
# Retrieve the Data Source
data_source_name = "my_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset name
data_asset_name = "my_dataframe_data_asset"
# Add a Data Asset to the Data Source
data_asset = data_source.add_dataframe_asset(name=data_asset_name)
Create a Batch Definition
Typically, a Batch Definition is used to describe how the data within a Data Asset should be retrieved. With dataframes, all of the data in a given dataframe will always be retrieved as a Batch.
This means that Batch Definitions for dataframe Data Assets don't work to subdivide the data returned for validation. Instead, they serve as an additional layer of organization and allow you to further group your Validation Results. For example, if you have already used your dataframe Data Assets to group your Validation Results by pipeline stage, you could use two Batch Definitions to further group those results by having all automated validations use one Batch Definition and all manually executed validations use the other.
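A sketch of that pattern, assuming the Data Source and Data Asset names used elsewhere on this page and hypothetical Batch Definition names:

import great_expectations as gx

context = gx.get_context()
data_asset = context.data_sources.get("my_data_source").get_asset("my_dataframe_data_asset")

# Two Batch Definitions on the same Data Asset, used only to group Validation Results.
# Both return the whole dataframe; the names are hypothetical.
automated_runs = data_asset.add_batch_definition_whole_dataframe("automated_validations")
manual_runs = data_asset.add_batch_definition_whole_dataframe("manual_validations")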
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable `context` contains your Data Context.
- A pandas or Spark dataframe Data Asset.
Procedure
- Optional. Retrieve your Data Asset.

If you do not already have a variable referencing your pandas or Spark Data Asset, you can retrieve a previously created Data Asset with:

data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
- Define the Batch Definition's parameters.

A dataframe Batch Definition requires the following information:

- `name`: A name by which the Batch Definition can be referenced. This should be unique among Batch Definitions on the Data Asset.

Because dataframes are always provided in their entirety, dataframe Batch Definitions always use the `add_batch_definition_whole_dataframe()` method.

Update the value of `batch_definition_name` in the following code with something that describes your dataframe:

batch_definition_name = "my_batch_definition"
- Add the Batch Definition to the Data Asset.

Execute the following code to add a Batch Definition to your Data Asset:

batch_definition = data_asset.add_batch_definition_whole_dataframe(
batch_definition_name
)
Sample code

import great_expectations as gx
context = gx.get_context()
# Retrieve the Data Asset
data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
# Define the Batch Definition name
batch_definition_name = "my_batch_definition"
# Add a Batch Definition to the Data Asset
batch_definition = data_asset.add_batch_definition_whole_dataframe(
batch_definition_name
)
Provide a dataframe through Batch Parameters
Because dataframes exist in memory and cease to exist when the Python session ends, the dataframe itself is not saved as part of a Data Asset or Batch Definition. Instead, a dataframe created in the current Python session is passed in at runtime as a Batch Parameter dictionary.
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable `context` contains your Data Context.
- A Batch Definition on a pandas or Spark dataframe Data Asset.
- Data in a pandas or Spark dataframe. These examples assume the variable `dataframe` contains your pandas or Spark dataframe.
- Optional. A Validation Definition.
Procedure
- Define the Batch Parameter dictionary.

A dataframe can be added to a Batch Parameter dictionary by defining it as the value of the dictionary key `dataframe`:

batch_parameters = {"dataframe": dataframe}
The following examples create a dataframe by reading a .csv file and storing it in a Batch Parameter dictionary:

pandas:

import pandas
csv_path = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
dataframe = pandas.read_csv(csv_path)
batch_parameters = {"dataframe": dataframe}

Spark:

from pyspark.sql import SparkSession
csv = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
spark = SparkSession.builder.appName("Read CSV").getOrCreate()
dataframe = spark.read.csv(csv, header=True, inferSchema=True)
batch_parameters = {"dataframe": dataframe}
- Pass the Batch Parameter dictionary to a `get_batch()` or `validate()` method call.

Runtime Batch Parameters can be provided to the `get_batch()` method of a Batch Definition or to the `validate()` method of a Validation Definition.

Batch Definition:

The `get_batch()` method of a Batch Definition retrieves a single Batch of data. Runtime Batch Parameters can be provided to the `get_batch()` method to specify the data returned as a Batch. The `validate()` method of this Batch can then be used to test individual Expectations.

import great_expectations as gx
context = gx.get_context()
# Retrieve the dataframe Batch Definition
data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
batch_definition_name = "my_batch_definition"
batch_definition = (
context.data_sources.get(data_source_name)
.get_asset(data_asset_name)
.get_batch_definition(batch_definition_name)
)
# Create an Expectation to test
expectation = gx.expectations.ExpectColumnValuesToBeBetween(
column="passenger_count", max_value=6, min_value=1
)
# Get the dataframe as a Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
# Test the Expectation
validation_results = batch.validate(expectation)
print(validation_results)

The results generated by `batch.validate()` are not persisted in storage. This workflow is intended solely for interactively creating Expectations and exploring data.

For further information on using an individual Batch to test Expectations, see Test an Expectation.
Validation Definition:

A Validation Definition's `run()` method validates an Expectation Suite against a Batch returned by a Batch Definition. Runtime Batch Parameters can be provided to a Validation Definition's `run()` method to specify the data returned in the Batch. This allows you to validate your dataframe by executing the Expectation Suite included in the Validation Definition.

import great_expectations as gx
context = gx.get_context()
# Retrieve a Validation Definition that uses the dataframe Batch Definition
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
# Validate the dataframe by passing it to the Validation Definition as Batch Parameters.
validation_results = validation_definition.run(batch_parameters=batch_parameters)
print(validation_results)

For more information on Validation Definitions, see Run Validations.
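The example above assumes a Validation Definition named "my_validation_definition" already exists. If you do not have one, the following is a minimal sketch of one way to create it, assuming the dataframe Batch Definition from earlier on this page and a hypothetical Expectation Suite name:

import great_expectations as gx

context = gx.get_context()

# Retrieve the dataframe Batch Definition created earlier on this page
batch_definition = (
    context.data_sources.get("my_data_source")
    .get_asset("my_dataframe_data_asset")
    .get_batch_definition("my_batch_definition")
)

# Build an Expectation Suite (the suite name and Expectation are illustrative)
suite = context.suites.add(gx.ExpectationSuite(name="my_expectation_suite"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="passenger_count", min_value=1, max_value=6
    )
)

# Pair the Batch Definition and suite in a Validation Definition
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        data=batch_definition, suite=suite, name="my_validation_definition"
    )
)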