Create your first Expectations

Expectations are the key concept in Great Expectations.

Each Expectation is a declarative, machine-verifiable assertion about the expected format, content, or behavior of your data. Great Expectations comes with dozens of built-in Expectations, and it’s possible to develop your own custom Expectations, too.
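
For example, here is roughly what such an assertion looks like when called through the Python API. This is a minimal sketch: the batch object stands for a Batch of data loaded through a Data Context, which we will set up later in this step.

# Sketch only: `batch` is a Batch of data loaded through a Data Context,
# as shown later in this step. The method is a built-in Expectation.
result = batch.expect_column_values_to_be_between(
    column="passenger_count",
    min_value=1,
    max_value=6,
)
# The returned validation result reports whether the data met the Expectation.
print(result.success)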

An admonition from Mr. Dickens:

“Take nothing on its looks; take everything on evidence. There’s no better rule.”

The CLI will help you create your first Expectation Suite. Expectation Suites are simply collections of Expectations. To create a new suite, we will use the scaffold command, which builds an Expectation Suite called taxi.demo with the help of a built-in profiler. Type the following into your terminal:

great_expectations suite scaffold taxi.demo

You will see the following output:

Heads up! This feature is Experimental. It may change. Please give us your feedback!

Which table would you like to use? (Choose one)
    1. yellow_tripdata_sample_2019_01 (table)
    2. yellow_tripdata_staging (table)
    Do not see the table in the list above? Just type the SQL query
: 1

Which data should I choose when creating Expectations?

In this example, we want to build an Expectation Suite based on what we know about our taxi data: each taxi ride has a passenger_count between 1 and 6. We then want to use that Expectation Suite to validate any future data that is loaded into the staging table. Hence, we choose the yellow_tripdata_sample_2019_01 table when creating the new Expectation Suite, which we will then use to validate the yellow_tripdata_staging table in a later step. Makes sense, right?

After selecting the table, Great Expectations will open a Jupyter notebook which will take you through the next part of the scaffold workflow.

Creating Expectations in Jupyter notebooks

In this section, we will explain how to create and edit Expectations in a Jupyter notebook. Notebooks are a simple way of interacting with the Great Expectations Python API. You could also write all of this in plain Python code, but for convenience, Great Expectations provides you with some boilerplate code in notebooks.

Since notebooks are often less permanent, creating Expectations in a notebook also helps reinforce that the source of truth about Expectations is the Expectation Suite, not the code that generates the Expectations.

[Animated GIF: executing the scaffold notebook in Jupyter]

Here’s what’s happening in the notebook (a condensed plain-Python sketch of the same flow follows this list):

  1. The first cell does several things: it imports all the relevant libraries, loads a Data Context, creates what we call a Batch of your data, and creates your new taxi.demo Expectation Suite.

  2. The second cell allows you to specify which columns you want to run the automated profiler on. Remember how we want to add some tests on the passenger_count column to ensure that its values range between 1 and 6? Let’s uncomment just that one line:

    included_columns = [
        # 'vendor_id',
        # 'pickup_datetime',
        # 'dropoff_datetime',
        'passenger_count',
        ...
    ]
    
  3. The next cell passes the profiler config to the BasicSuiteBuilderProfiler, which will then profile the data and create the relevant Expectations to add to your taxi.demo suite.

  4. The last cell again does several things: it saves the Expectation Suite to disk, runs validation against the loaded data batch, and then builds and opens Data Docs so you can look at the validation results.
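
If you are curious what the notebook boils down to in plain Python, here is a condensed sketch of the same flow. Exact import paths can vary slightly between Great Expectations releases, and the batch_kwargs depend on the Datasource you configured (the datasource name "mydb" below is an assumption):

import great_expectations as ge
from great_expectations.profile import BasicSuiteBuilderProfiler

# Cell 1: load the Data Context, create the new suite, and load a Batch.
context = ge.data_context.DataContext()
suite = context.create_expectation_suite("taxi.demo", overwrite_existing=True)
batch_kwargs = {
    "datasource": "mydb",  # assumption: whatever you named your Datasource
    "table": "yellow_tripdata_sample_2019_01",
}
batch = context.get_batch(batch_kwargs, suite)

# Cells 2-3: profile only the columns we care about.
suite, validation_result = BasicSuiteBuilderProfiler().profile(
    batch,
    profiler_configuration={"included_columns": ["passenger_count"]},
)

# Cell 4: save the suite, validate the batch, and build and open Data Docs.
context.save_expectation_suite(suite, "taxi.demo")
results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)
context.build_data_docs()
context.open_data_docs()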

Let’s execute all the cells and wait for Great Expectations to open a browser window with Data Docs. Before you explore them, pause here and read on to find out what just happened!

What just happened?

You can create and edit Expectations using several different workflows. The CLI just used one of the quickest and simplest: scaffolding Expectations using an automated Profiler.

This Profiler connected to your data (using the Datasource you configured in the previous step), took a quick look at the contents, and produced an initial set of Expectations. These Expectations are not intended to be very smart. Instead, the goal is to quickly provide some good examples, so that you’re not starting from a blank slate.

Later, you should also take a look at other workflows for Creating and editing Expectations. Creating and editing Expectations is a very active area of work in the Great Expectations community. Stay tuned for improvements over time.
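
One workflow you will likely use often is the CLI’s suite edit command, which opens another Jupyter notebook where you can revise the scaffolded Expectations by hand:

great_expectations suite edit taxi.demo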

A first look at real Expectations

The newly profiled Expectations are stored in an Expectation Suite.

By default, Expectation Suites are stored in a JSON file in a subdirectory of your great_expectations/ folder. You can also configure Great Expectations to store Expectations to other locations, such as S3, Postgres, etc. We’ll come back to these options in the last step of the tutorial.
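
Wherever the suite is stored, the Data Context gives you a uniform way to fetch it back in code. A minimal sketch, assuming the taxi.demo suite created above:

import great_expectations as ge

# Fetch the suite from whichever store backend is configured
# (a JSON file on disk by default; S3, a database, etc. if reconfigured).
context = ge.data_context.DataContext()
suite = context.get_expectation_suite("taxi.demo")
print(len(suite.expectations))  # how many Expectations the profiler created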

If you open up the file at great_expectations/expectations/taxi/demo.json in a text editor, you’ll see the following:

{
  "data_asset_type": "Dataset",
  "expectation_suite_name": "taxi.demo",
  "expectations": [

    ...

    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "passenger_count"
      },
      "meta": {
        "BasicSuiteBuilderProfiler": {
          "confidence": "very low"
        }
      }
    },
    {
      "expectation_type": "expect_column_distinct_values_to_be_in_set",
      "kwargs": {
        "column": "passenger_count",
        "value_set": [
          1.0,
          2.0,
          3.0,
          4.0,
          5.0,
          6.0
        ]
      },
      "meta": {
        "BasicSuiteBuilderProfiler": {
          "confidence": "very low"
        }
      }
    },
    ...
  ]
}
There’s a lot of information in the JSON file. We will focus on just the snippet above:

Every Expectation in the file expresses a test that can be validated against data. You can see that the Profiler generated several Expectations based on our data, including expect_column_distinct_values_to_be_in_set with a value_set containing the numbers 1 through 6. This is exactly what we wanted: an assertion that the passenger_count column contains only those values!
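
Note that each JSON entry maps one-to-one to a method call in the Python API. The Expectation above could equally have been created by hand, for example on the batch object from the notebook (sketch):

# Hand-written equivalent of the JSON entry above: the distinct values of
# passenger_count must all come from the set {1, 2, 3, 4, 5, 6}.
batch.expect_column_distinct_values_to_be_in_set(
    column="passenger_count",
    value_set=[1, 2, 3, 4, 5, 6],
)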

Now we only have two problems left to solve:

  1. These dense JSON objects are very hard to read. How can we have a nicer representation of our Expectations?

  2. How do we use this Expectation Suite to validate the new batch of data in our staging table?

In the next step, we’ll address the first question and show you how to use Data Docs to interact with Expectation Suites in a more human-friendly format.