Create a Custom Batch Expectation
BatchExpectations are one of the most common types of Expectation (a verifiable assertion about data). They are evaluated for an entire Batch, and answer a semantic question about the Batch itself. For example, expect_table_column_count_to_equal and expect_table_row_count_to_equal answer how many columns and rows are in your Batch.
This guide will walk you through the process of creating your own custom BatchExpectation.
Prerequisites
Choose a name for your Expectation
First, decide on a name for your own Expectation. By convention, BatchExpectations start with expect_table_, as the built-in examples above do. Note that the example in this tutorial uses the prefix expect_batch_ instead.
For more on Expectation naming conventions, see the Expectations section of the Code Style Guide.
Your Expectation will have two versions of the same name: a CamelCaseName and a snake_case_name. For example, this tutorial will use:
ExpectBatchColumnsToBeUnique
expect_batch_columns_to_be_unique
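The relationship between the two names is mechanical, so it can be derived rather than typed twice. The following is a minimal sketch using only the standard library; the helper name camel_to_snake is hypothetical and is not part of Great Expectations.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert an UpperCamelCase class name to its snake_case form."""
    # Insert an underscore before every capital letter except the first,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(camel_to_snake("ExpectBatchColumnsToBeUnique"))
# prints: expect_batch_columns_to_be_unique
```

The snake_case result is also the file name used for the Expectation in the next step.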
Copy and rename the template file
By convention, each Expectation is kept in its own python file, named with the snake_case version of the Expectation's name.
Download the custom BatchExpectation template and then run the following code to rename it and save it to a directory:
cp batch_expectation_template.py /SOME_DIRECTORY/expect_batch_columns_to_be_unique.py
Store Expectation files
During development, you don't need to store Expectation files in a specific location. Expectation files are self-contained and can be executed anywhere as long as GX is installed. However, to use your new Expectation with other GX components, you'll need to make sure the file is stored in one of the following locations:
- If you're building a Custom Expectation (an extension of the Expectation class, developed outside of the Great Expectations library) for personal use, you'll need to put it in the great_expectations/plugins/expectations folder of your GX deployment, and import your Custom Expectation from that directory whenever it will be used. When you instantiate the corresponding DataContext, it will automatically make all Plugins (extensions of Great Expectations' components and/or functionality) in the directory available for use.
- If you're building a Custom Expectation to contribute to the open source project, you'll need to put it in the repo for the Great Expectations library itself. Most likely, this will be within a package within contrib/: great_expectations/contrib/SOME_PACKAGE/SOME_PACKAGE/expectations/. To use these Expectations, you'll need to install the package.
For more information about Custom Expectations, see Use a Custom Expectation.
Generate a diagnostic checklist for your Expectation
Once you've copied and renamed the template file, you can execute it as follows.
python expect_batch_columns_to_be_unique.py
The template file is set up so that this will run the Expectation's print_diagnostic_checklist() method. This will run a diagnostic script on your new Expectation, and return a checklist of steps to get it to full production readiness.
This guide will walk you through the first five steps, the minimum for a functioning Custom Expectation and all that is required for contribution back to open source at an Experimental level.
Completeness checklist for ExpectBatchToMeetSomeCriteria:
✔ Has a valid library_metadata object
Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
Has at least one positive and negative example case, and all test cases pass
Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
...
When in doubt, the next step to implement is the first one that doesn't have a ✔ next to it. This guide covers the first five steps on the checklist.
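The "first item without a ✔" rule can be illustrated with a few lines of Python. This is only a sketch of the rule applied to checklist text like the sample above. The function next_step is hypothetical, and it assumes the header line ends with a colon and that completed items start with ✔.

```python
from typing import Optional

SAMPLE = """Completeness checklist for ExpectBatchColumnsToBeUnique:
 ✔ Has a valid library_metadata object
 ✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
   Has at least one positive and negative example case, and all test cases pass
   Has core logic and passes tests on at least one Execution Engine
   Passes all linting checks
"""


def next_step(checklist: str) -> Optional[str]:
    """Return the first checklist item that is not marked with a check mark."""
    for line in checklist.splitlines():
        item = line.strip()
        # Skip blank lines, the header line, and the trailing ellipsis.
        if not item or item.endswith(":") or item == "...":
            continue
        if not item.startswith("✔"):
            return item
    return None  # every item is checked


print(next_step(SAMPLE))
```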
Change the Expectation class name and add a docstring
By convention, your Metric class is defined first in a Custom Expectation (a Metric is a computed attribute of data, such as the mean of a column). For now, we're going to skip to the Expectation class and begin laying the groundwork for the functionality of your Custom Expectation.
Let's start by updating your Expectation's name and docstring.
Replace the Expectation class name
class ExpectBatchToMeetSomeCriteria(BatchExpectation):
with your real Expectation class name, in upper camel case:
class ExpectBatchColumnsToBeUnique(BatchExpectation):
You can also go ahead and write a new one-line docstring, replacing
"""TODO: add a docstring here"""
with something like:
"""Expect batch to contain columns with unique contents."""
Make sure your one-line docstring begins with "Expect " and ends with a period. You'll also need to change the class name at the bottom of the file, by replacing this line:
ExpectBatchToMeetSomeCriteria().print_diagnostic_checklist()
with this one:
ExpectBatchColumnsToBeUnique().print_diagnostic_checklist()
Later, you can go back and write a more thorough docstring. See Expectation Docstring Formatting.
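If you want to sanity-check a docstring against the rule above before re-running the checklist, a small helper is enough. This is a sketch of the stated rule only (first line begins with "Expect " and ends with a period). It is not the check that the diagnostic script itself performs, and the function name is hypothetical.

```python
def has_valid_short_description(docstring: str) -> bool:
    """Check that the first docstring line starts with 'Expect ' and ends with a period."""
    stripped = docstring.strip()
    if not stripped:
        return False
    first_line = stripped.splitlines()[0].strip()
    return first_line.startswith("Expect ") and first_line.endswith(".")


print(has_valid_short_description("Expect batch to contain columns with unique contents."))  # True
print(has_valid_short_description("TODO: add a docstring here"))  # False
```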
At this point you can re-run your diagnostic checklist. You should see something like this:
$ python expect_batch_columns_to_be_unique.py
Completeness checklist for ExpectBatchColumnsToBeUnique:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
Has at least one positive and negative example case, and all test cases pass
Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
...
Congratulations! You're one step closer to implementing a Custom Expectation.
Add example cases
Search for examples = [] in your file, and replace it with at least two test examples. These examples serve the following purposes:
- They provide test fixtures that Great Expectations can execute automatically with pytest.
- They help users understand the logic of your Expectation by providing tidy examples of paired input and output. If you contribute your Expectation to open source, these examples will appear in the Gallery.
Your examples will look similar to this example:
examples = [
    {
        "dataset_name": "expect_batch_columns_to_be_unique_1",
        "data": {
            "col1": [1, 2, 3, 4, 5],
            "col2": [2, 3, 4, 5, 6],
            "col3": [3, 4, 5, 6, 7],
        },
        "tests": [
            {
                "title": "strict_positive_test",
                "exact_match_out": False,
                "include_in_gallery": True,
                "in": {"strict": True},
                "out": {"success": True},
            }
        ],
    },
    {
        "dataset_name": "expect_batch_columns_to_be_unique_2",
        "data": {
            "col1": [1, 2, 3, 4, 5],
            "col2": [1, 2, 3, 4, 5],
            "col3": [3, 4, 5, 6, 7],
        },
        "tests": [
            {
                "title": "loose_positive_test",
                "exact_match_out": False,
                "include_in_gallery": True,
                "in": {"strict": False},
                "out": {"success": True},
            },
            {
                "title": "strict_negative_test",
                "exact_match_out": False,
                "include_in_gallery": True,
                "in": {"strict": True},
                "out": {"success": False},
            },
        ],
    },
]
Here's a quick overview of how to create test cases to populate examples. The overall structure is a list of dictionaries. Each dictionary has two keys:
- data: defines the input data of the example as a Batch. In these examples the Batch has three columns (col1, col2 and col3). These columns have 5 rows. (Note: if you define multiple columns, make sure that they have the same number of rows.)
- tests: a list of test cases to validate against the data frame defined in the corresponding data.
  - title should be a descriptive name for the test case. Make sure to have no spaces.
  - include_in_gallery: This must be set to True if you want this test case to be visible in the Gallery as an example.
  - in contains exactly the parameters that you want to pass in to the Expectation. "in": {"strict": True} in the example above is equivalent to expect_batch_columns_to_be_unique(strict=True).
  - out is based on the Validation Result returned when executing the Expectation.
  - exact_match_out: if you set exact_match_out=False, then you don't need to include all the elements of the Validation Result object - only the ones that are important to test.
- only_for (optional): the list of backends that the Expectation should use for testing.
- suppress_test_for (optional): the list of backends that the Expectation should not use for testing.
- only_for and suppress_test_for can be specified at the top level (next to data and tests) or within specific tests (next to title, and so on).
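The structural rules listed above (equal column lengths, titles without spaces, an in and an out entry per test) are easy to check before running the diagnostic. The following is a minimal sketch; check_examples is a hypothetical helper, not a Great Expectations function, and it covers only the rules named in this section.

```python
def check_examples(examples: list) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for i, example in enumerate(examples):
        # All columns in "data" must have the same number of rows.
        lengths = {len(values) for values in example.get("data", {}).values()}
        if len(lengths) > 1:
            problems.append(f"example {i}: columns have different lengths")
        for test in example.get("tests", []):
            title = test.get("title", "")
            # Titles should be descriptive and contain no spaces.
            if not title or " " in title:
                problems.append(f"example {i}: bad title {title!r}")
            for key in ("in", "out"):
                if key not in test:
                    problems.append(f"example {i}: test {title!r} is missing {key!r}")
    return problems


good = [{"data": {"col1": [1, 2], "col2": [3, 4]},
         "tests": [{"title": "strict_positive_test",
                    "in": {"strict": True}, "out": {"success": True}}]}]
bad = [{"data": {"col1": [1, 2], "col2": [3]},
        "tests": [{"title": "has a space", "in": {}}]}]

print(check_examples(good))  # []
print(check_examples(bad))
```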
If you run your Expectation file again, you won't see any new checkmarks, as the logic for your Custom Expectation hasn't been implemented yet. However, you should see that the tests you've written are now being caught and reported in your checklist:
$ python expect_batch_columns_to_be_unique.py
Completeness checklist for ExpectBatchColumnsToBeUnique:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
...
Has core logic that passes tests for all applicable Execution Engines and SQL dialects
Only 0 / 2 tests for pandas are passing
Failing: basic_positive_test, basic_negative_test
...
For more information on tests and example cases, see our guide on creating example cases for a Custom Expectation.
Implement your Metric and connect it to your Expectation
This is the stage where you implement the actual business logic for your Expectation. To do so, you'll need to implement a function within a Metric class, and link it to your Expectation (a Metric is a computed attribute of data, such as the mean of a column). By the time your Expectation is complete, your Metric will have functions for all three Execution Engines (Pandas, Spark, and SQLAlchemy) supported by Great Expectations. For now, we're only going to define one.
Metrics answer questions about your data posed by your Expectation, and allow your Expectation to judge whether your data meets your expectations.
Your Metric function will have the @metric_value decorator, with the appropriate engine. Metric functions can be as complex as you like, but they're often very short. For example, here's the definition for a Metric function to find the unique columns of a Batch with the PandasExecutionEngine.
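@metric_value(engine=PandasExecutionEngine)
def _pandas(
    cls,
    execution_engine,
    metric_domain_kwargs,
    metric_value_kwargs,
    metrics,
    runtime_configuration,
):
    # Fetch the whole Batch as a DataFrame (TABLE domain).
    df, _, _ = execution_engine.get_compute_domain(
        metric_domain_kwargs, domain_type=MetricDomainTypes.TABLE
    )

    # Transpose so columns become rows, drop duplicate rows, transpose back,
    # and keep the names of the columns that survive.
    unique_columns = set(df.T.drop_duplicates().T.columns)

    return unique_columns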
The @metric_value decorator allows us to explicitly structure queries and directly access our compute domain.
While this can result in extra roundtrips to your database in some situations, it allows for advanced functionality and customization of your Custom Expectations.
This is all that you need to define for now. In the next step, we will implement the method to validate the result of this Metric.
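To see what this Metric returns without setting up an Execution Engine, the same idea can be reproduced in plain Python. The sketch below is a library-free stand-in, not the Metric itself. It assumes the Batch is given as a dictionary of column name to list of values, and it keeps the first column of each group of identical columns. The function name unique_columns is hypothetical.

```python
def unique_columns(data: dict) -> set:
    """Return the names of columns whose contents have not appeared in an earlier column."""
    seen_contents = set()
    unique = set()
    for name, values in data.items():
        contents = tuple(values)  # lists are not hashable, tuples are
        if contents not in seen_contents:
            seen_contents.add(contents)
            unique.add(name)
    return unique


batch = {
    "col1": [1, 2, 3, 4, 5],
    "col2": [1, 2, 3, 4, 5],  # duplicate of col1
    "col3": [3, 4, 5, 6, 7],
}
print(sorted(unique_columns(batch)))  # ['col1', 'col3']
```

With the second example dataset, col2 repeats col1 and is therefore left out of the result.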
Other parameters
Expectation Success Keys - A tuple consisting of values that must / could be provided by the user and defines how the Expectation evaluates success.
Expectation Default Kwarg Values (Optional) - Default values for success keys and the defined domain, among other values.
Metric Condition Value Keys (Optional) - Contains any additional arguments passed as parameters to compute the Metric.
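As an illustration of how success keys and default kwarg values interact, the sketch below merges user-supplied arguments with defaults and keeps only the success keys. The specific values are assumptions for this tutorial's Expectation: strict is the parameter used in the examples, while the default of True and the helper resolve_success_kwargs are hypothetical.

```python
# Assumed values for illustration only.
SUCCESS_KEYS = ("strict",)
DEFAULT_KWARG_VALUES = {"strict": True}


def resolve_success_kwargs(user_kwargs: dict) -> dict:
    """Fill in defaults, then keep only the keys that affect success."""
    merged = {**DEFAULT_KWARG_VALUES, **user_kwargs}
    return {key: merged.get(key) for key in SUCCESS_KEYS}


print(resolve_success_kwargs({}))                 # {'strict': True}
print(resolve_success_kwargs({"strict": False}))  # {'strict': False}
```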
Next, choose a Metric Identifier for your Metric. By convention, Metric Identifiers for Batch Expectations start with the prefix table. (including the dot).
The remainder of the Metric Identifier simply describes what the Metric computes, in snake case. For this example, we'll use table.columns.unique.
You'll need to substitute this metric into two places in the code. First, in the Metric class, replace
metric_name = "METRIC NAME GOES HERE"
with
metric_name = "table.columns.unique"
Second, in the Expectation class, replace
metric_dependencies = ("METRIC NAME GOES HERE",)
with
metric_dependencies = ("table.columns.unique", "table.columns")
It's essential to make sure to use matching Metric Identifier strings across your Metric class and Expectation class. This is how the Expectation knows which Metric to use for its internal logic.
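The matching requirement can be expressed as a one-line check. The classes below are plain stand-ins that carry only the two attributes discussed here. They do not inherit from the Great Expectations base classes, and metric_is_wired is a hypothetical helper.

```python
class BatchColumnsUnique:
    """Stand-in for the Metric class: only the identifier is shown."""
    metric_name = "table.columns.unique"


class ExpectBatchColumnsToBeUnique:
    """Stand-in for the Expectation class: only the dependencies are shown."""
    metric_dependencies = ("table.columns.unique", "table.columns")


def metric_is_wired(metric_cls, expectation_cls) -> bool:
    """True if the Expectation lists the Metric's identifier among its dependencies."""
    return metric_cls.metric_name in expectation_cls.metric_dependencies


print(metric_is_wired(BatchColumnsUnique, ExpectBatchColumnsToBeUnique))  # True
```

A typo in either string would make this check return False, which is the same mismatch that would keep the Expectation from finding its Metric.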
Finally, rename the Metric class name itself, using the camel case version of the Metric Identifier, minus any periods.
For example, replace:
class BatchMeetsSomeCriteria(TableMetricProvider):
with
class BatchColumnsUnique(TableMetricProvider):
Validate
In this step, we simply need to validate that the results of our Metrics meet our Expectation.
The validate method is implemented as _validate(...):
def _validate(
    self,
    metrics: Dict,
    runtime_configuration: Optional[dict] = None,
    execution_engine: Optional[ExecutionEngine] = None,
):
This method takes a dictionary named metrics, which contains all Metrics requested by your Metric dependencies, and performs a simple validation against your success keys (i.e. important thresholds) in order to return a dictionary indicating whether the Expectation has evaluated successfully or not.
To do so, we'll be accessing our success keys, as well as the result of our previously-calculated Metrics.
For example, here is the definition of a _validate(...) method to validate the results of our table.columns.unique Metric against our success keys:
def _validate(
    self,
    metrics: Dict,
    runtime_configuration: dict | None = None,
    execution_engine: ExecutionEngine | None = None,
):
    unique_columns = metrics.get("table.columns.unique")
    batch_columns = metrics.get("table.columns")
    strict = self.configuration.kwargs.get("strict")

    # Columns present in the Batch but missing from the unique set are duplicates.
    duplicate_columns = unique_columns.symmetric_difference(batch_columns)

    if strict is True:
        # Strict: no duplicate columns are allowed.
        success = len(duplicate_columns) == 0
    else:
        # Loose: at least one column must not be a duplicate.
        success = len(duplicate_columns) < len(batch_columns)

    return {
        "success": success,
        "result": {"observed_value": {"duplicate_columns": duplicate_columns}},
    }
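To check this logic against the example cases by hand, it helps to pull it out of the class. The sketch below repeats the same comparison as a standalone function. The name evaluate is hypothetical, the Metric values are passed in directly instead of being read from metrics, and the duplicates are sorted into a list only to make the printed output stable.

```python
def evaluate(unique_columns: set, batch_columns: list, strict: bool) -> dict:
    """Standalone version of the strict/loose comparison used in _validate."""
    duplicate_columns = unique_columns.symmetric_difference(batch_columns)
    if strict is True:
        success = len(duplicate_columns) == 0
    else:
        success = len(duplicate_columns) < len(batch_columns)
    return {
        "success": success,
        "result": {"observed_value": {"duplicate_columns": sorted(duplicate_columns)}},
    }


columns = ["col1", "col2", "col3"]

# First example dataset: all three columns differ.
print(evaluate({"col1", "col2", "col3"}, columns, strict=True)["success"])   # True

# Second example dataset: col2 duplicates col1.
print(evaluate({"col1", "col3"}, columns, strict=True)["success"])   # False
print(evaluate({"col1", "col3"}, columns, strict=False)["success"])  # True
```

These three results line up with strict_positive_test, strict_negative_test, and loose_positive_test in the example cases.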
Running your diagnostic checklist at this point should return something like this:
$ python expect_batch_columns_to_be_unique.py
Completeness checklist for ExpectBatchColumnsToBeUnique:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
✔ Has at least one positive and negative example case, and all test cases pass
✔ Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
...
Linting
Finally, we need to lint our now-functioning Custom Expectation. Our CI system will test your code using black and ruff.
If you've set up your dev environment, these libraries will already be available to you, and can be invoked from your command line to automatically lint your code:
black <PATH/TO/YOUR/EXPECTATION.py>
ruff <PATH/TO/YOUR/EXPECTATION.py> --fix
If desired, you can automate this to happen at commit time. See our guidance on linting for more on this process.
Once this is done, running your diagnostic checklist should now reflect your Custom Expectation as meeting our linting requirements:
$ python expect_batch_columns_to_be_unique.py
Completeness checklist for ExpectBatchColumnsToBeUnique:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
✔ Has at least one positive and negative example case, and all test cases pass
✔ Has core logic and passes tests on at least one Execution Engine
✔ Passes all linting checks
...
Contribute (Optional)
This guide will leave you with a Custom Expectation sufficient for contribution to Great Expectations at an Experimental level.
If you plan to contribute your Expectation to the public open source project, you should update the library_metadata object before submitting your Pull Request. For example:
library_metadata = {
    "tags": [],  # Tags for this Expectation in the Gallery
    "contributors": [  # Github handles for all contributors to this Expectation.
        "@your_name_here",  # Don't forget to add your github handle here!
    ],
}
would become
library_metadata = {
    "tags": ["uniqueness"],
    "contributors": ["@joegargery"],
}
This is particularly important because we want to make sure that you get credit for all your hard work!
For more information on our code standards and contribution, see our guide on Levels of Maturity for Expectations.
To view the full script used in this page, see it on GitHub: