Version: 1.2.4

Apply Expectation conditions to specific rows within a Batch

By default, Expectations apply to the entire dataset retrieved in a Batch. However, there are instances when an Expectation may not be relevant for every row. Validating every row could lead to false positives or false negatives in the Validation Results.

For example, you might define an Expectation that a column indicating a product's country of origin should not be null. If this Expectation only applies to imported products, applying it to every row in the Batch could produce many false negatives, because the country of origin column is legitimately null for locally produced products.

To address this issue, GX allows you to define Expectation conditions that apply only to a subset of the data retrieved in a Batch.
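The country of origin scenario can be sketched in plain pandas, outside of GX. The DataFrame and its column names (product, is_import, country_of_origin) are hypothetical and exist only for illustration; the point is the difference between checking every row and checking only the rows where the rule is relevant.

```python
import pandas as pd

# Hypothetical product data; the column names are illustrative only.
products = pd.DataFrame(
    {
        "product": ["tea", "cheese", "bread", "olive oil"],
        "is_import": [True, False, False, True],
        "country_of_origin": ["India", None, None, "Spain"],
    }
)

# Checking every row counts the locally produced items as failures.
nulls_all_rows = int(products["country_of_origin"].isna().sum())

# Restricting the check to imports keeps only the rows where the rule applies.
imports = products.query("is_import == True")
nulls_in_imports = int(imports["country_of_origin"].isna().sum())

print(nulls_all_rows, nulls_in_imports)
```

An Expectation condition plays the role of the query() filter in this sketch: it narrows the rows before the not-null rule is evaluated.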

Create an Expectation condition

Great Expectations allows you to specify conditions for validating rows using the row_condition argument, which can be applied to all Expectations that assess rows within a Dataset. The row_condition argument should be a string that represents a boolean expression. Rows will be validated when the row_condition expression evaluates to True. Conversely, if the row_condition evaluates to False, the corresponding row will not be validated by the Expectation.

Prerequisites

Procedure

In this procedure, it is assumed that your Data Context is stored in the variable context, and your Expectation Suite is stored in the variable suite. The suite can either be a newly created and empty Expectation Suite or an existing Expectation Suite retrieved from the Data Context.

The examples in this procedure use passenger data from the Titanic, which includes details about the class of ticket held by the passenger and whether or not they survived the journey.

  1. Determine the condition_parser for your row_condition.

    The condition_parser defines the syntax of row_condition strings. When implementing Expectation conditions with pandas, set this argument to "pandas".

    Note that the Expectation with conditions will fail if the Batch being validated is from a different type of Data Source than indicated by the condition_parser.

  1. Determine the row_condition expression.

    The row_condition argument should be a boolean expression string that is evaluated for each row in the Batch that the Expectation validates. If the row_condition evaluates to True, the row will be included in the Expectation's validations. If it evaluates to False, the Expectation will be skipped for that row.

    The syntax of the row_condition argument is based on the condition_parser specified earlier.

  2. Create the Expectation.

    An Expectation with conditions is created like a regular Expectation, with the addition of the row_condition and condition_parser parameters alongside the Expectation's other arguments.

    In pandas, the row_condition value is passed to pandas.DataFrame.query() prior to Expectation Validation, and the resulting rows from the evaluated Batch will undergo validation by the Expectation.

    Python
    condition_parser="pandas",
    row_condition='PClass=="1st"',
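To see which rows a pandas row_condition selects, you can pass the same string to pandas.DataFrame.query() yourself. This is a minimal sketch that runs outside of GX; the small DataFrame is a made-up stand-in for the Titanic data, not the real dataset.

```python
import pandas as pd

# Small stand-in for the Titanic data; the values are made up for illustration.
passengers = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "PClass": ["1st", "2nd", "1st", "3rd"],
        "Survived": [1, 0, 1, 0],
    }
)

row_condition = 'PClass=="1st"'

# Rows for which the expression is True are the rows the Expectation would check.
selected = passengers.query(row_condition)
print(selected)
```

Running the expression directly like this is a quick way to confirm that a condition matches the rows you intend before attaching it to an Expectation.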

    Do not use single quotes, newlines, or \n in the row_condition string. The following examples show two invalid values and a valid alternative:

    Python
    row_condition = "PClass=='1st'"  # Don't do this. Single quotes aren't valid!

    row_condition="""
    PClass=="1st"
    """ # Don't do this. Newlines and \n aren't valid!

    row_condition = 'PClass=="1st"' # Do this instead.
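The restrictions above can be checked before a condition is handed to an Expectation. The function below is a hypothetical helper, not part of GX, and it only tests for the three patterns named above (single quotes, newline characters, and a literal \n sequence).

```python
def find_row_condition_problems(row_condition: str) -> list:
    """Hypothetical helper (not a GX function): list the restricted patterns found."""
    problems = []
    if "'" in row_condition:
        problems.append("single quote")
    if "\n" in row_condition:
        problems.append("newline")
    if "\\n" in row_condition:
        # A backslash followed by the letter n, written out literally.
        problems.append("literal \\n")
    return problems


print(find_row_condition_problems('PClass=="1st"'))
print(find_row_condition_problems("PClass=='1st'"))
```

An empty list means none of the restricted patterns were found; it does not guarantee that the expression is otherwise valid for the chosen condition_parser.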

    In pandas, you can reference variables from the environment by prefixing them with @. Additionally, when a column name contains spaces, you can reference it by enclosing the name in backticks (`).

    Some examples of valid row_condition values for pandas include:

    Python
    row_condition = '`foo foo`=="bar bar"'  # The value of the column "foo foo" is "bar bar"

    row_condition = 'foo==@bar'  # The value of the column "foo" equals the value of the variable bar, referenced from the environment with @

    For more information on the syntax accepted by pandas row_condition values see pandas.DataFrame.query.
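Both syntax features can be tried directly with pandas.DataFrame.query(). This sketch shows only how pandas itself resolves backticks and @ when query() is called directly; the DataFrame and the variable bar are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "foo foo": ["bar bar", "baz", "bar bar"],
        "foo": [1, 2, 3],
    }
)

# Backticks let the expression name a column that contains a space.
spaced = df.query('`foo foo`=="bar bar"')

# @bar resolves to the Python variable `bar` visible where query() is called.
bar = 2
matched = df.query("foo==@bar")

print(spaced)
print(matched)
```

In direct pandas use, @ looks up an ordinary Python variable in the calling scope, not an operating system environment variable.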

  3. Optional. Create additional Expectation conditions

    Expectations that have different conditions are treated as unique, even if they belong to the same type and apply to the same column within an Expectation Suite. This approach allows you to create one unconditional Expectation and an unlimited number of Conditional Expectations, each with a distinct condition.

    For instance, the following code establishes an Expectation that the value in the "Survived" column is either 0 or 1:

    Python
    expectation = gx.expectations.ExpectColumnValuesToBeInSet(
        column="Survived", value_set=[0, 1]
    )

    And this code creates a second Expectation on the same column, this time with a condition: for rows where the passenger held a first class ticket, the value in the "Survived" column must be 1:

    Python
    expectation_with_condition = gx.expectations.ExpectColumnValuesToBeInSet(
        column="Survived",
        value_set=[1],
        condition_parser="pandas",
        row_condition='PClass=="1st"',
    )
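To make the difference between the two Expectations concrete, the checks they describe can be approximated in plain pandas. This is only a rough sketch of the semantics under the assumption that the condition acts as a row filter; it does not call GX, and the sample data is invented.

```python
import pandas as pd

# Invented sample rows; not the real Titanic data.
passengers = pd.DataFrame(
    {
        "PClass": ["1st", "1st", "2nd", "3rd"],
        "Survived": [1, 0, 1, 0],
    }
)

# Unconditional check: every row's "Survived" value must be 0 or 1.
unconditional_ok = bool(passengers["Survived"].isin([0, 1]).all())

# Conditional check: only first class rows are considered, and each must be 1.
first_class = passengers.query('PClass=="1st"')
unexpected = first_class[~first_class["Survived"].isin([1])]
conditional_ok = bool(unexpected.empty)

print(unconditional_ok, conditional_ok, len(unexpected))
```

Because the two checks cover different row subsets and value sets, they can pass or fail independently, which is why both can coexist on the same column in one Expectation Suite.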

Data Docs and Expectation conditions

Expectations with conditions are presented differently from standard Expectations in the Data Docs. Each Expectation with conditions is prefaced with if 'row_condition_string', then values must be... as illustrated in the following image:

Image

If the 'row_condition_string' is a complex expression, it will be divided into several components to enhance readability.

Scope and limitations

While conditions can be applied to most Expectations, the following Expectations cannot be conditioned and do not accept the row_condition argument:

  • expect_column_to_exist
  • expect_table_columns_to_match_ordered_list
  • expect_table_columns_to_match_set
  • expect_table_column_count_to_be_between
  • expect_table_column_count_to_equal
  • unexpected_rows_expectation