Profiling

Profiling is a way of Rendering Validation Results to produce a summary of observed characteristics. When Validation Results are rendered as Profiling data, they create a new section in Data Docs. By computing the observed properties of data, Profiling helps you understand and reason about the properties the data is expected to have.

To produce a useful data overview, Great Expectations uses a profiler to build a special Expectation Suite. Unlike the Expectations that are typically used for data validation, Expectations used for Profiling do not necessarily apply any constraints. They can simply identify statistics or other data characteristics that should be evaluated and made available in Great Expectations. For example, when the included BasicDatasetProfiler encounters a numeric column, it adds an expect_column_mean_to_be_between Expectation but sets both min_value and max_value to None, which in effect says only that it expects a mean to exist.
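The idea of an Expectation with no constraints can be illustrated with a minimal sketch in plain Python. It does not use Great Expectations itself: the function expect_mean_to_be_between and the shape of the returned dictionary are hypothetical stand-ins, chosen only to show how None bounds turn a check into a measurement.

```python
from statistics import mean
from typing import Optional, Sequence


def expect_mean_to_be_between(
    values: Sequence[float],
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> dict:
    """Hypothetical stand-in for a column-mean Expectation (not library code)."""
    observed = mean(values)
    # A bound of None means "no constraint on this side".
    above_min = min_value is None or observed >= min_value
    below_max = max_value is None or observed <= max_value
    return {
        "success": above_min and below_max,
        "result": {"observed_value": observed},
    }


# Hypothetical column of numeric values.
ratings = [6.0, 7.0, 8.0, 9.0]

# Profiling style: both bounds are None, so the range check cannot fail
# and the call only records the observed mean.
profiled = expect_mean_to_be_between(ratings)

# Validation style: explicit bounds turn the same call into a constraint.
validated = expect_mean_to_be_between(ratings, min_value=8.0, max_value=10.0)
```

With both bounds left as None, the result carries the observed mean and always passes the range check, which is the kind of information a profiling page can display. With bounds supplied, the same function reports whether the mean falls inside them.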

The default BasicDatasetProfiler will thus produce a page for each table or DataFrame including an overview section:

[Screenshot: ../_images/movie_db_profiling_screenshot_2.jpg]

And then detailed statistics for each column:

[Screenshot: ../_images/movie_db_profiling_screenshot_1.jpg]

Profiling is still a beta feature in Great Expectations. Over time, we plan to extend and improve the BasicDatasetProfiler and to add further profilers.

Warning: BasicDatasetProfiler evaluates the entire batch without limits or sampling, which may be very time-consuming. As a rule of thumb, we recommend starting with small batches of data.
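One way to follow this rule of thumb is to cut a batch down before handing it to a profiler. The sketch below uses only the Python standard library. The names sample_rows and full_batch and the 500-row limit are hypothetical, chosen for illustration, and are not part of Great Expectations.

```python
import random
from typing import List, Sequence, TypeVar

Row = TypeVar("Row")


def sample_rows(rows: Sequence[Row], max_rows: int, seed: int = 0) -> List[Row]:
    """Return at most max_rows rows, drawn uniformly without replacement."""
    if len(rows) <= max_rows:
        # The batch is already small enough; keep every row.
        return list(rows)
    # A seeded generator keeps the sample reproducible between runs.
    return random.Random(seed).sample(rows, max_rows)


# Hypothetical full batch: 10,000 rows of (id, rating).
full_batch = [(i, i % 10) for i in range(10_000)]

# Smaller batch to profile first.
small_batch = sample_rows(full_batch, max_rows=500)
```

A random sample is used here rather than the first 500 rows so that the statistics are less likely to be skewed by the ordering of the data. Taking the first rows would be cheaper when reading the full batch is itself expensive.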

See the Profiling Reference for more information.

last updated: Aug 13, 2020