Manage SQL Data Assets
A Data Asset is a collection of records within a Data Source that define how Great Expectations (GX) organizes data into Batches. Use the information provided here to connect GX to SQL tables and data returned by SQL database queries and learn how to organize Batches in a SQL Data Asset.
Great Expectations (GX) uses SQLAlchemy to connect to SQL Data Assets, and most of the SQL dialects supported by SQLAlchemy are also supported by GX. For more information about the SQL dialects supported by SQLAlchemy, see Dialects.
- Connect to a SQL table
- Connect to SQL data using a query
- Organize Batches
Connect to a SQL table
Connect GX to a SQL table to access Data Assets.
The following code examples use a previously defined Data Source named "my_datasource"
to connect to a SQL database.
Prerequisites
- An installation of GX set up to work with SQL
- Source data stored in a SQL database
- A SQL-based Data Source
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Retrieve a SQL Data Source
Run the following Python code to retrieve the Data Source:
datasource = context.get_datasource("my_datasource")
Add a table to the Data Source as a Data Asset
You create a Data Asset to identify the table to connect to.
Run the following Python code to define the name
and table_name
variables:
table_asset = datasource.add_table_asset(name="my_asset", table_name=my_table_name)
Add additional tables (Optional)
To connect to additional tables in the same SQL Database, repeat the previous steps to add them as table Data Assets.
Related documentation
Connect to SQL data using a query
Connect GX to the data returned by a query in a SQL database.
The following code examples use a previously defined Data Source named "my_datasource"
to connect to a SQL database.
Prerequisites
- An installation of GX set up to work with SQL.
- Source data stored in a SQL database.
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Retrieve a SQL Data Source
Run the following Python code to retrieve the Data Source:
datasource = context.get_datasource("my_datasource")
Add a query to the Data Source as a Data Asset
Run the following Python code to define a Data Asset and the name
and query
variables:
query_asset = datasource.add_query_asset(
name="my_asset",
query="SELECT passenger_count, total_amount FROM yellow_tripdata_sample_2019_01",
)
Add additional queries (Optional)
To connect to the contents of additional queries in the same SQL Database, repeat the previous steps to add them as query Data Assets.
Related documentation
Organize Batches
Organize Batches in a SQL-based Data Asset. This includes using Splitters to divide the data in a table or query based on the contents of a provided field and adding Batch Sorters to a Data Asset to specify the order in which Batches are returned.
The following code examples use a previously defined Data Source named "my_datasource"
to connect to a SQL database.
Prerequisites
- A working installation of Great Expectations
- A Data Asset in a SQL-based Data Source
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Retrieve a SQL Data Source and Data Asset
Run the following Python code to retrieve the Data Source:
my_datasource = context.get_datasource("my_datasource")
my_table_asset = my_datasource.get_asset(asset_name="my_table_asset")
Add a Splitter to the Data Asset
Run the following Python code to split the TableAsset into Batches:
my_table_asset.add_splitter_year_and_month(column_name="pickup_datetime")
Add Batch Sorters to the Data Asset (Optional)
Adding Batch Sorters to your Data Asset lets you explicitly state the order in which your Batches are returned when you request data from the Data Asset. To add Batch Sorters, pass a list of sorters to the add_sorters(...)
method of your Data Asset.
Run the following Python code to split the "pickup_datetime"
column on "year"
and "month"
, so your list of sorters can have up to two elements. The code also adds an ascending sorter based on the contents of the splitter group "year"
and a descending sorter based on the contents of the splitter group "month"
:
my_asset = my_table_asset.add_sorters(["+year", "-month"])
Use a Batch Request to verify Data Asset functionality
Run the following Python code to verify that your Data Asset returns the necessary files as Batches:
my_batch_request = my_table_asset.build_batch_request()
batches = my_table_asset.get_batch_list_from_batch_request(my_batch_request)
A Batch List can contain a lot of metadata. To verify which files were included in the returned Batches, run the following Python code to review the batch_spec
for each returned Batch:
for batch in batches:
print(batch.batch_spec)