How to create custom Expectations for pandas¶
Custom Expectations let you extend the logic for validating data to use any criteria you choose. This guide will show you how to extend the PandasDataset class with your own Expectations.
Prerequisites: This how-to guide assumes you have already:
- Launched a generic notebook (e.g. jupyter notebook, jupyter lab, etc.)
- Obtained data that can be accessed from your notebook
Steps¶
Import great_expectations, PandasDataset, and MetaPandasDataset

import great_expectations as ge
from great_expectations.dataset import (
    PandasDataset,
    MetaPandasDataset,
)

PandasDataset is the parent class used for executing Expectations on pandas DataFrames. Most of the core Expectations are built using decorators defined in MetaPandasDataset. These decorators greatly streamline the task of extending Great Expectations with custom Expectation logic.

Define a class inheriting from PandasDataset
class MyCustomPandasDataset(PandasDataset):
    _data_asset_type = "MyCustomPandasDataset"
Setting the _data_asset_type is not strictly necessary, but can be helpful for tracking the lineage of instantiated Expectations and Validation Results.

Within your new class, define Expectations using decorators from MetaPandasDataset
column_map_expectations are Expectations that are applied to a single column, on a row-by-row basis. To learn about other Expectation types, please see Other Expectation decorators below.

The @MetaPandasDataset.column_map_expectation decorator wraps your custom function with all the business logic required to turn it into a fully-fledged Expectation. This spares you the hassle of defining logic to handle required arguments like mostly and result_format. Your custom function can focus exclusively on the business logic of passing or failing the Expectation.

In the simplest case, an Expectation can be a one-line lambda function.
@MetaPandasDataset.column_map_expectation
def expect_column_values_to_be_even(self, column):
    return column.map(lambda x: x % 2 == 0)
To use the column_map_expectation decorator, your custom function must accept at least two arguments: self and column. When the user invokes your Expectation, they will pass a string containing the column name. The decorator will then fetch the appropriate column and pass all of the non-null values to your function as a pandas Series. Your function must then return a Series of boolean values in the same order, with the same index.

Custom functions can also accept additional arguments:
@MetaPandasDataset.column_map_expectation
def expect_column_values_to_be_less_than(self, column, value):
    return column.map(lambda x: x < value)
Custom functions can have complex internal logic:
import re

@MetaPandasDataset.column_map_expectation
def expect_column_value_word_counts_to_be_between(self, column, min_value=None, max_value=None):
    def count_words(string):
        word_list = re.findall(r"(\S+)", string)
        return len(word_list)

    word_counts = column.map(lambda x: count_words(str(x)))

    if min_value is not None and max_value is not None:
        return word_counts.map(lambda x: min_value <= x <= max_value)
    elif min_value is not None:
        return word_counts.map(lambda x: min_value <= x)
    elif max_value is not None:
        return word_counts.map(lambda x: x <= max_value)
    else:
        return word_counts.map(lambda x: True)
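To sanity-check the helper on its own, outside Great Expectations, note that the word-splitting logic is just a regular-expression token count (the example strings below are illustrative):

```python
import re

def count_words(string):
    # \S+ matches runs of non-whitespace characters, i.e. whitespace-separated tokens
    return len(re.findall(r"(\S+)", string))

print(count_words("Allen, Miss Elisabeth Walton"))  # 4
print(count_words(""))                              # 0
```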
Custom functions can reference external modules and methods:
import pytz

@MetaPandasDataset.column_map_expectation
def expect_column_values_to_be_valid_timezones(self, column, timezone_values=pytz.all_timezones):
    return column.map(lambda x: x in timezone_values)
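If it helps to see the contract in isolation, here is a plain-pandas sketch (no Great Expectations involved; the column values are illustrative) of what the decorator does: it drops nulls, passes the remaining values to your function as a Series, and expects a boolean Series with the same index back.

```python
import pandas as pd

# A column as it might arrive from a user's DataFrame, including a null
column = pd.Series([2, 3, None, 8])

# The decorator passes only the non-null values to your function...
nonnull_values = column.dropna()

# ...and your function returns booleans in the same order, with the same index
results = nonnull_values.map(lambda x: x % 2 == 0)

print(results.tolist())     # [True, False, True]
print(list(results.index))  # [0, 1, 3] -- the original index is preserved
```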
By convention, column_map_expectations always start with expect_column_values_... or expect_column_value_... (e.g. expect_column_value_word_counts_to_be_between). Following this pattern is highly recommended, but not strictly required. If you want to confuse yourself with bad names, the package won't stop you.

Load some data
To make your new Expectations available for validation, you can instantiate a MyCustomPandasDataset as follows:

my_df = ge.read_csv("./data/Titanic.csv", dataset_class=MyCustomPandasDataset)
You can also coerce an existing pandas DataFrame to your class using from_pandas:

import pandas as pd

my_pd_df = pd.read_csv("./data/Titanic.csv")
my_df = ge.from_pandas(my_pd_df, dataset_class=MyCustomPandasDataset)
As a third option, you can pass the DataFrame directly to your class constructor:

my_pd_df = pd.read_csv("./data/Titanic.csv")
my_df = MyCustomPandasDataset(my_pd_df)
Note: We're using the read_csv method to fetch data, instead of the more typical DataContext.get_batch. This is for convenience: it allows us to handle the full development loop for a custom Expectation within a notebook with a minimum of configuration. In a moment, we'll demonstrate how to configure a Datasource to use MyCustomPandasDataset when calling get_batch.

Test your Expectations
At this point, you can test your new Expectations exactly like built-in Expectations. All out-of-the-box Expectations will still be available, plus your new methods.
my_df.expect_column_values_to_be_even("Survived")
returns
{
  "success": false,
  "meta": {},
  "result": {
    "element_count": 1313,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 450,
    "unexpected_percent": 34.27265803503427,
    "unexpected_percent_nonmissing": 34.27265803503427,
    "partial_unexpected_list": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  },
  "exception_info": null
}
As mentioned previously, the column_map_expectation decorator automatically extends your function to accept additional arguments, like mostly. Please see the module documentation for full details.

my_df.expect_column_values_to_be_even("Survived", mostly=.6)
returns
{
  "success": true,
  "meta": {},
  "result": {
    "element_count": 1313,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 450,
    "unexpected_percent": 34.27265803503427,
    "unexpected_percent_nonmissing": 34.27265803503427,
    "partial_unexpected_list": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  },
  "exception_info": null
}
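You can reproduce the success flag by hand from the counts in the validation result: an Expectation with mostly succeeds when the fraction of passing, non-missing values is at least mostly.

```python
# Counts taken from the validation result for the "Survived" column
element_count = 1313
missing_count = 0
unexpected_count = 450

nonmissing = element_count - missing_count
passing_fraction = (nonmissing - unexpected_count) / nonmissing

print(round(passing_fraction, 4))  # 0.6573 -- about 65.7% of values are even
print(passing_fraction >= 0.6)     # True: so mostly=.6 succeeds
print(unexpected_count == 0)       # False: without mostly, any failing value fails
```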
Often, the best development loop for custom Expectations is iterative: edit the Expectations in MyCustomPandasDataset, then re-run the cells that load the data and execute the Expectations against it.

At this point, your custom Expectations work, but only within a notebook. Next, let's configure them to work from within a Datasource in your Data Context.
Save your MyCustomPandasDataset class to a Plugin module
The simplest way to do this is to create a new, single-file Python module within your great_expectations/plugins/ directory. Name it something like custom_pandas_dataset.py. Copy the full contents of your MyCustomPandasDataset class into this file. Make sure to include any required imports, too.

When you instantiate a Data Context, Great Expectations automatically adds plugins/ to the Python namespace, so your class can be imported as custom_pandas_dataset.MyCustomPandasDataset.
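For example, a project with this plugin might be laid out as follows (directory and data names other than plugins/ are illustrative):

great_expectations/
├── great_expectations.yml
├── expectations/
└── plugins/
    └── custom_pandas_dataset.py   # contains MyCustomPandasDataset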
Configure your Datasource(s)
Now, open your great_expectations.yml file. Assuming that you've previously configured a pandas Datasource, you should see a configuration block similar to this, under the datasources key:

my_data__dir:
  module_name: great_expectations.datasource
  class_name: PandasDatasource
  data_asset_type:
    module_name: great_expectations.dataset
    class_name: PandasDataset
  batch_kwargs_generators:
    subdir_reader:
      class_name: SubdirReaderBatchKwargsGenerator
      base_directory: ../my_data
In the data_asset_type section, replace module_name and class_name with the names of your module and class:

data_asset_type:
  module_name: custom_pandas_dataset
  class_name: MyCustomPandasDataset
Now, any time you load data through the my_data__dir Datasource, it will be loaded as a MyCustomPandasDataset, with all of your new Expectations available.

If you have other PandasDatasources in your configuration, you may want to switch them to use your new data_asset_type, too.

Test loading a new Batch through the DataContext
You can test this configuration as follows:
context = ge.DataContext()
context.create_expectation_suite("my_new_suite")
my_batch = context.get_batch({
    "path": "my_data/Titanic.csv",
    "datasource": "my_data__dir"
}, "my_new_suite")
my_batch.expect_column_values_to_be_even("Age")
Executing this Expectation should return something like:
{
  "result": {
    "element_count": 1313,
    "missing_count": 557,
    "missing_percent": 42.421934501142424,
    "unexpected_count": 344,
    "unexpected_percent": 26.199543031226202,
    "unexpected_percent_nonmissing": 45.5026455026455,
    "partial_unexpected_list": [
      29.0,
      25.0,
      0.92,
      ...
      59.0,
      45.0
    ]
  },
  "success": false,
  "meta": {},
  "exception_info": null
}
Additional notes¶
Other Expectation decorators¶
Aside from column_map_expectations, there are several other types of Expectations you can create. Please see the module docs for MetaPandasDataset for details.
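As one illustration, column_aggregate_expectations compute a single statistic for the whole column and return a dict rather than a boolean Series. The function below is a hypothetical sketch of that inner contract, shown in plain pandas; in a real class you would define it as a method on MyCustomPandasDataset and decorate it with @MetaPandasDataset.column_aggregate_expectation.

```python
import pandas as pd

def expect_column_mean_to_be_positive(column):
    # An aggregate expectation returns one result for the whole column:
    # a dict with "success" and an "observed_value" for reporting.
    observed = float(column.mean())
    return {"success": observed > 0, "result": {"observed_value": observed}}

outcome = expect_column_mean_to_be_positive(pd.Series([1, -2, 7]))
print(outcome["success"], outcome["result"]["observed_value"])  # True 2.0
```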
Additional resources¶
Here’s a single code block containing all the notebook code in this article:
import re

import pandas as pd
import pytz

import great_expectations as ge
from great_expectations.dataset import (
    PandasDataset,
    MetaPandasDataset,
)


class MyCustomPandasDataset(PandasDataset):
    _data_asset_type = "MyCustomPandasDataset"

    @MetaPandasDataset.column_map_expectation
    def expect_column_values_to_be_even(self, column):
        return column.map(lambda x: x % 2 == 0)

    @MetaPandasDataset.column_map_expectation
    def expect_column_values_to_be_less_than(self, column, value):
        return column.map(lambda x: x < value)

    @MetaPandasDataset.column_map_expectation
    def expect_column_value_word_counts_to_be_between(self, column, min_value=None, max_value=None):
        def count_words(string):
            word_list = re.findall(r"(\S+)", string)
            return len(word_list)

        word_counts = column.map(lambda x: count_words(str(x)))

        if min_value is not None and max_value is not None:
            return word_counts.map(lambda x: min_value <= x <= max_value)
        elif min_value is not None:
            return word_counts.map(lambda x: min_value <= x)
        elif max_value is not None:
            return word_counts.map(lambda x: x <= max_value)
        else:
            return word_counts.map(lambda x: True)

    @MetaPandasDataset.column_map_expectation
    def expect_column_values_to_be_valid_timezones(self, column, timezone_values=pytz.all_timezones):
        return column.map(lambda x: x in timezone_values)


# Instantiate the class in several different ways
my_df = ge.read_csv("my_data/Titanic.csv", dataset_class=MyCustomPandasDataset)

my_other_df = pd.read_csv("my_data/Titanic.csv")
my_other_df = ge.from_pandas(my_other_df, dataset_class=MyCustomPandasDataset)

my_third_df = pd.read_csv("my_data/Titanic.csv")
my_third_df = MyCustomPandasDataset(my_third_df)

# Run Expectations in assertions so that they can be used as tests for this guide
assert my_df.expect_column_values_to_be_in_set("Sex", value_set=["Male", "Female"]).success == False
assert my_df.expect_column_values_to_be_even("Survived").success == False
assert my_df.expect_column_values_to_be_even("Survived", mostly=.6).success == True
assert my_df.expect_column_value_word_counts_to_be_between("Name", 3, 5).success == False
assert my_df.expect_column_value_word_counts_to_be_between("Name", 3, 5, mostly=.9).success == True
assert my_df.expect_column_values_to_be_valid_timezones("Name", mostly=.9).success == False