Profiling Reference¶
Profiling produces a special kind of Data Docs that are purely descriptive.
Expectations and Profiling¶
In order to characterize a data asset, Profiling uses an Expectation Suite. Unlike the Expectations that are
typically used for data validation, these Expectations do not necessarily apply any constraints; they can simply
identify statistics or other data characteristics that should be evaluated and made available in GE. For example, when
the BasicDatasetProfiler encounters a numeric column, it adds an expect_column_mean_to_be_between expectation but
sets both min_value and max_value to None, which amounts to saying only that it expects a mean to exist.
{
  "expectation_type": "expect_column_mean_to_be_between",
  "kwargs": {
    "column": "rating",
    "min_value": null,
    "max_value": null
  }
}
To “profile” a datasource, therefore, the BasicDatasetProfiler included in GE generates a large number of very
loosely specified expectations. In effect, it asserts that each statistic is relevant for evaluating batches of that
data asset, without yet committing to what the statistic’s value should be.
In addition to creating an Expectation Suite, profiling validates that suite against the data. The validation_result contains the output of the suite when validated against the same batch of data that was profiled. For a loosely specified expectation like the one in the example above, obtaining the observed value is the sole purpose of the expectation.
{
  "success": true,
  "result": {
    "observed_value": 4.05,
    "element_count": 10000,
    "missing_count": 0,
    "missing_percent": 0
  }
}
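The relationship between a loosely specified expectation and its validation result can be pictured with a small, self-contained sketch. It does not use the Great Expectations library: the function evaluate_mean_between and the sample ratings list are hypothetical stand-ins, written only to show why bounds of None always succeed while still reporting the observed value.

```python
from statistics import mean
from typing import Optional, Sequence


def evaluate_mean_between(
    values: Sequence[Optional[float]],
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> dict:
    """Hypothetical stand-in for a mean-between check; not the GE implementation."""
    present = [v for v in values if v is not None]
    observed = mean(present) if present else None
    # A bound of None places no constraint on that side of the range.
    above_min = min_value is None or (observed is not None and observed >= min_value)
    below_max = max_value is None or (observed is not None and observed <= max_value)
    return {
        "success": above_min and below_max,
        "result": {
            "observed_value": observed,
            "element_count": len(values),
            "missing_count": len(values) - len(present),
        },
    }


# Hypothetical sample data for illustration only.
ratings = [4.0, 5.0, 3.0, 4.0]

# Both bounds are None, as a profiler would specify: purely descriptive.
loose = evaluate_mean_between(ratings)

# The same statistic with real bounds behaves like a validation check.
strict = evaluate_mean_between(ratings, min_value=4.5, max_value=5.0)
```

With both bounds left as None the check cannot fail, so the only useful output is the observed value; once bounds are supplied, the same statistic acts as a constraint.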
Running a profiler on a data asset can also be useful to produce a large number of expectations to review and potentially transfer to a new expectation suite used for validation in a pipeline.
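As a rough illustration of that review step, the sketch below turns a profiled, unconstrained expectation into a constrained one by placing bounds around the observed value. It works on plain dictionaries shaped like the JSON above rather than on Great Expectations objects; the helper tighten_mean_expectation and the 10% tolerance are assumptions made for this example, not part of the library.

```python
import copy


def tighten_mean_expectation(
    expectation: dict, observed_value: float, tolerance: float = 0.1
) -> dict:
    """Return a copy of the expectation with bounds set around the observed mean.

    Hypothetical helper: the tolerance is an arbitrary choice for illustration.
    """
    tightened = copy.deepcopy(expectation)
    margin = abs(observed_value) * tolerance
    tightened["kwargs"]["min_value"] = observed_value - margin
    tightened["kwargs"]["max_value"] = observed_value + margin
    return tightened


# The loosely specified expectation produced by profiling.
profiled = {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {"column": "rating", "min_value": None, "max_value": None},
}

# Use the observed value from the validation result to choose real bounds.
reviewed = tighten_mean_expectation(profiled, observed_value=4.05)
```

In practice the bounds would come from a reviewer's knowledge of the data rather than a fixed percentage; the point is only that the profiled suite supplies the candidate expectations and the observed values to start from.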
How to Run Profiling¶
Run During Init¶
The great_expectations init command will auto-generate an example Expectation Suite using a very basic profiler that
quickly glances at 1,000 rows of your data. This is not a production suite; it is only meant to show examples
of Expectations, many of which may not be meaningful.
Expectation Suites generated by the profiler will be saved in the configured expectations directory for Expectation
Suites. The Expectation Suite name by default is the name of the profiler that generated it. Validation results will be
saved in the uncommitted/validations directory by default. When profiling is complete, Great Expectations will
build and launch Data Docs based on your data.
Run From Command Line¶
The GE command-line interface can profile a datasource:
great_expectations datasource profile DATASOURCE_NAME
Expectation Suites generated by the profiler will be saved in the configured expectations directory for
Expectation Suites. The Expectation Suite name by default is the name of the profiler that generated it.
Validation results will be saved in the uncommitted/validations directory by default.
When profiling is complete, Great Expectations will build and launch Data Docs based on your data.
See Data Docs for more information.
Run From Jupyter Notebook¶
If you want to profile just one data asset in a datasource (e.g., one table in the database), you can do it using Python in a Jupyter notebook:
import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler
# obtain the DataContext object
context = ge.data_context.DataContext()
# load a batch from the data asset
batch = context.get_batch('ratings')
# run the profiler on the batch - this returns an expectation suite and validation results for this suite
expectation_suite, validation_result = BasicDatasetProfiler.profile(batch)
# save the resulting expectation suite with a custom name
context.save_expectation_suite(expectation_suite, "ratings", "my_profiled_expectations")
Custom Profilers¶
Like most things in Great Expectations, Profilers are designed to be extensible. You can develop your own profiler
by subclassing DatasetProfiler or the parent DataAssetProfiler class itself. For help, advice, and ideas
on developing custom profilers, please get in touch on the Great Expectations Slack channel.
Profiling Limitations¶
Inferring Data Types¶
When profiling CSV files, the profiler makes assumptions, such as considering the first line to be the header. Overriding these assumptions is currently possible only when running profiling in Python by passing extra arguments to get_batch.
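To see why the header assumption matters, the sketch below reads a headerless CSV twice using only Python's standard csv module. It is an illustration of the parsing issue, not of Great Expectations itself: the sample data and the column names user_id and rating are hypothetical, and the extra arguments accepted by get_batch are not shown because this page does not name them.

```python
import csv
import io

# Hypothetical headerless file: the first line is data, not column names.
raw = "1,4.5\n2,3.0\n3,5.0\n"

# Treating the first line as a header silently consumes one data row
# and uses its values ("1" and "4.5") as column names.
with_header_assumed = list(csv.DictReader(io.StringIO(raw)))

# Supplying column names explicitly keeps every line as data.
explicit_names = list(
    csv.DictReader(io.StringIO(raw), fieldnames=["user_id", "rating"])
)
```

Under the header assumption the file appears to have two rows and columns named after the first record's values, so any profile built on it would describe the wrong columns and the wrong row count.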
Data Samples¶
Since profiling and expectations are so tightly linked, getting samples of expected data requires a slightly different approach than the normal path for profiling. Stay tuned for more in this area!
last updated: Aug 13, 2020