Use SQL to define a custom Expectation
Among the available Expectations, the UnexpectedRowsExpectation
is designed to facilitate the execution of SQL or Spark-SQL queries as the core logic for an Expectation. By default, UnexpectedRowsExpectation
considers validation successful when no rows are returned by the provided SQL query.
Like any other Expectation, you can instantiate the UnexpectedRowsExpectation
directly. You can also customize an UnexpectedRowsExpectation
in essentially the same manner as you would define a custom Expectation, by subclassing UnexpectedRowsExpectation
and providing customized default attributes and text for Data Docs. However, there are some caveats around the UnexpectedRowsExpectation
's unexpected_rows_query
attribute that deserve further detail.
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core.
- A preconfigured Data Context.
- Recommended. A preconfigured Data Source and Data Asset connected to your data for testing your customized Expectation.
Procedure
- Instructions
- Sample code
-
Create a new Expectation class that inherits the
UnexpectedRowsExpectation
class.The class name
UnexpectedRowsExpectation
describes the functionality of the Expectation: it finds rows with unexpected values. When you create a customized Expectation class you can provide a class name that is more indicative of your specific use case. In this example, the customized subclass ofUnexpectedRowsExpectation
will be used to find invalid passenger counts in taxi trip data:Pythonclass ExpectPassengerCountToBeLegal(gx.expectations.UnexpectedRowsExpectation):
-
Override the Expectation's
unexpected_rows_query
attribute.The
unexpected_rows_query
attribute is a SQL or Spark-SQL query that returns a selection of rows from the Batch of data being validated. By default, rows that are returned have failed the validation check.The
unexpected_rows_query
should be written in standard SQL or Spark-SQL syntax, except that it can also contain the special{batch}
named query. When the Expectation is evaluated, the{batch}
keyword will be replaced with the Batch of data that is configured for your Data Asset.In this example,
unexpected_rows_query
will select any rows where the passenger count is greater than6
or less than0
. These rows will fail validation for this Expectation:Pythonclass ExpectPassengerCountToBeLegal(gx.expectations.UnexpectedRowsExpectation):
unexpected_rows_query: str = (
"SELECT * FROM {batch} WHERE passenger_count > 6 or passenger_count < 0"
) -
Customize the rendering of the new Expectation when displayed in Data Docs.
As with other Expectations, the
description
attribute contains the text describing the customized Expectation when your results are rendered into Data Docs. It can be set when an Expectation class is defined or edited as an attribute of an Expectation instance. You can format thedescription
string with Markdown syntax:Pythonclass ExpectPassengerCountToBeLegal(gx.expectations.UnexpectedRowsExpectation):
unexpected_rows_query: str = (
"SELECT * FROM {batch} WHERE passenger_count > 6 or passenger_count < 0"
)
description: str = "There should be no more than **6** passengers." -
Use the customized subclass as an Expectation.
Once the customized Expectation subclass has been defined, instances of it can be created, added to Expectation Suites, and validated just like any other Expectation class:
Pythonexpectation = ExpectPassengerCountToBeLegal()
import great_expectations as gx
# Define a custom Expectation that uses SQL by subclassing UnexpectedRowsExpectation
class ExpectPassengerCountToBeLegal(gx.expectations.UnexpectedRowsExpectation):
unexpected_rows_query: str = (
"SELECT * FROM {batch} WHERE passenger_count > 6 or passenger_count < 0"
)
description: str = "There should be no more than **6** passengers."
context = gx.get_context()
# Instantiate the custom Expectation
expectation = ExpectPassengerCountToBeLegal()
# Test the Expectation
data_source_name = "my_sql_data_source"
data_asset_name = "my_data_asset"
batch_definition_name = "my_batch_definition"
batch = (
context.data_sources.get(data_source_name)
.get_asset(data_asset_name)
.get_batch_definition(batch_definition_name)
.get_batch()
)
batch.validate(expectation)