Identify failed rows in an Expectation
Quickly identifying problematic rows can reduce troubleshooting effort when an Expectation fails. After identifying the failed row, you can modify or remove the value in the table and run the Expectation again.
The following table shows the sample data that is used to demonstrate failed row identification. The sample data is webpage visitor events.
event_id | visit_id | date | event_type |
---|---|---|---|
0 | 1470408760 | 20220104 | page_load |
1 | 1470429409 | 20220104 | page_load |
2 | 1470441005 | 20220104 | page_view |
3 | 1470387700 | 20220104 | user_signup |
4 | 1470438716 | 20220104 | purchase |
5 | 1470420524 | 20220104 | download |
In the example code you'll run, you'll use a preconfigured Checkpoint named my_checkpoint
that runs an Expectation Suite named visitors_exp
that contains the Expectation type ExpectColumnValuesToBeInSet
on the event_type
column. To demonstrate an Expectation failure, you'll set the set
parameter to ["page_load", "page_view]
so the rows with user_signup
, purchase
, and download
fail.
The example code is located in the primary_keys_in_validation_results GitHub repository.
Prerequisites
- pandas
- Spark
- SQLAlchemy
pandas
Identify failed rows using pandas.
Import the Great Expectations module and instantiate a Data Context
Use the get_context()
method to create a new Data Context:
import great_expectations as gx
context = gx.get_context(project_root_dir=".")
Import the Checkpoint
Run the following code to import the example Checkpoint:
my_checkpoint = context.get_checkpoint("my_checkpoint")
Set the unexpected_index_column_names
parameter
The failed rows are defined as values in the unexpected_index_column_names
parameter. In the following example, you are setting the parameter to event_id
to return the event_ids
of the rows that fail the Expectation. However, unexpected_index_column_names
can also return a list of columns that are representative of the failed rows. The Checkpoint also returns the unexpected_index_query
, which you can use to retrieve the full set of failed results.
In the following example, you're setting the result_format
to COMPLETE
to return the full set of results. For more information about result_format
, see Result format.
result_format: dict = {
"result_format": "COMPLETE",
"unexpected_index_column_names": ["event_id"],
}
Run Checkpoint using result_format
Run the following code to apply the updated result_format
to every Expectation in your Suite and pass it directly to the run()
method:
results = my_checkpoint.run(result_format=result_format)
Review Validation Results
After you run the Checkpoint, a dictionary for each failed row is returned. Each dictionary contains the row’s identifier and the failed value.
The following is an example of the information returned after running the Checkpoint. For pandas, a filter condition on the DataFrame is included.
{
"element_count": 6,
"unexpected_count": 3,
"unexpected_percent": 50.0,
"partial_unexpected_list": ["user_signup", "purchase", "download"],
"unexpected_index_column_names": ["event_id"],
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_percent_total": 50.0,
"unexpected_percent_nonmissing": 50.0,
"partial_unexpected_index_list": [
{"event_type": "user_signup", "event_id": 3},
{"event_type": "purchase", "event_id": 4},
{"event_type": "download", "event_id": 5},
],
"partial_unexpected_counts": [
{"value": "download", "count": 1},
{"value": "purchase", "count": 1},
{"value": "user_signup", "count": 1},
],
"unexpected_list": ["user_signup", "purchase", "download"],
"unexpected_index_list": [
{"event_type": "user_signup", "event_id": 3},
{"event_type": "purchase", "event_id": 4},
{"event_type": "download", "event_id": 5},
],
"unexpected_index_query": "df.filter(items=[3, 4, 5], axis=0)",
}
Open the Data Docs
Run the following code to open the Data Docs:
context.open_data_docs()
The following image shows the unexpected_index_list
values are displayed as part of the validation output. Click To retrieve all unexpected values to open the filter condition to retrieve all unexpected values.
Spark
Identify failed rows using Spark.
Import the Great Expectations module and instantiate a Data Context
Use the get_context()
method to create a new Data Context:
import great_expectations as gx
context = gx.get_context(project_root_dir=".")
Import the Checkpoint
Run the following code to import the example Checkpoint:
my_checkpoint = context.get_checkpoint("my_checkpoint")
Set the unexpected_index_column_names
parameter
The failed rows are defined as values in the unexpected_index_column_names
parameter. In the following example, you are setting the parameter to event_id
to return the event_ids
of the rows that fail the Expectation. However, unexpected_index_column_names
can also return a list of columns that are representative of the failed rows. The Checkpoint also returns the unexpected_index_query
, which you can use to retrieve the full set of failed results.
In the following example, you're setting the result_format
to COMPLETE
to return the full set of results. For more information about result_format
, see Result format.
result_format: dict = {
"result_format": "COMPLETE",
"unexpected_index_column_names": ["event_id"],
}
Run Checkpoint using result_format
Run the following code to apply the updated result_format
to every Expectation in your Suite and pass it directly to the run()
method:
results = my_checkpoint.run(result_format=result_format)
Review Validation Results
After you run the Checkpoint, a dictionary for each failed row is returned. Each dictionary contains the row’s identifier and the failed value.
The following is an example of the information returned after running the Checkpoint. For Spark, a filter condition on the DataFrame is included.
{
"element_count": 6,
"unexpected_count": 3,
"unexpected_percent": 50.0,
"partial_unexpected_list": ["user_signup", "purchase", "download"],
"unexpected_index_column_names": ["event_id"],
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_percent_total": 50.0,
"unexpected_percent_nonmissing": 50.0,
"partial_unexpected_index_list": [
{"event_id": 3, "event_type": "user_signup"},
{"event_id": 4, "event_type": "purchase"},
{"event_id": 5, "event_type": "download"},
],
"partial_unexpected_counts": [
{"value": "download", "count": 1},
{"value": "purchase", "count": 1},
{"value": "user_signup", "count": 1},
],
"unexpected_list": ["user_signup", "purchase", "download"],
"unexpected_index_list": [
{"event_id": 3, "event_type": "user_signup"},
{"event_id": 4, "event_type": "purchase"},
{"event_id": 5, "event_type": "download"},
],
"unexpected_index_query": "df.filter(F.expr((event_type IS NOT NULL) AND (NOT (event_type IN (page_load, page_view)))))",
}
Open the Data Docs
Run the following code to open the Data Docs:
context.open_data_docs()
The following image shows the unexpected_index_list
values are displayed as part of the validation output. Click To retrieve all unexpected values to open the filter condition to retrieve all unexpected values.
SQLAlchemy
Identify failed rows using SQLAlchemy.
Import the Great Expectations module and instantiate a Data Context
Use the get_context()
method to create a new Data Context:
import great_expectations as gx
context = gx.get_context(project_root_dir=".")
Import the Checkpoint
Run the following code to import the example Checkpoint:
my_checkpoint = context.get_checkpoint("my_checkpoint")
Set the unexpected_index_column_names
parameter
The failed rows are defined as values in the unexpected_index_column_names
parameter. In the following example, you are setting the parameter to event_id
to return the event_ids
of the rows that fail the Expectation. However, unexpected_index_column_names
can also return a list of columns that are representative of the failed rows. The Checkpoint also returns the unexpected_index_query
, which you can use to retrieve the full set of failed results.
In the following example, you're setting the result_format
to COMPLETE
to return the full set of results. For more information about result_format
, see Result format.
result_format: dict = {
"result_format": "COMPLETE",
"unexpected_index_column_names": ["event_id"],
}
Run Checkpoint using result_format
Run the following code to apply the updated result_format
to every Expectation in your Suite and pass it directly to the run()
method:
results = my_checkpoint.run(result_format=result_format)
Review Validation Results
After you run the Checkpoint, a dictionary for each failed row is returned. Each dictionary contains the row’s identifier and the failed value.
The following is an example of the information returned after running the Checkpoint. For SQLAlchemy, the output includes a query that can be used directly against the database backend.
{
"element_count": 6,
"unexpected_count": 3,
"unexpected_percent": 50.0,
"partial_unexpected_list": ["user_signup", "purchase", "download"],
"unexpected_index_column_names": ["event_id"],
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_percent_total": 50.0,
"unexpected_percent_nonmissing": 50.0,
"partial_unexpected_index_list": [
{"event_id": 3, "event_type": "user_signup"},
{"event_id": 4, "event_type": "purchase"},
{"event_id": 5, "event_type": "download"},
],
"partial_unexpected_counts": [
{"value": "download", "count": 1},
{"value": "purchase", "count": 1},
{"value": "user_signup", "count": 1},
],
"unexpected_list": ["user_signup", "purchase", "download"],
"unexpected_index_list": [
{"event_id": 3, "event_type": "user_signup"},
{"event_id": 4, "event_type": "purchase"},
{"event_id": 5, "event_type": "download"},
],
"unexpected_index_query": "SELECT event_id, event_type \nFROM event_names \nWHERE event_type IS NOT NULL AND (event_type NOT IN ('page_load', 'page_view'));",
}
Open the Data Docs
Run the following code to open the Data Docs:
context.open_data_docs()
The following image shows the unexpected_index_list
values are displayed as part of the validation output. Click To retrieve all unexpected values to view the SQL query that you can use to retrieve all unexpected values.