How to Configure a Data Connector to Sort Batches
Warning
Sorters are configured through a Data Connector, which is a concept that was introduced in the V3 (BatchRequest) API. The following configuration is not supported by the V2 (batch_kwargs) API.
This guide demonstrates how to sort Batches of data that Great Expectations can validate from a filesystem using a configured DataConnector. By default, Great Expectations sorts Batches lexicographically according to their data_references.
Sorters allow for more control by enabling sorting by fields that are captured by the DataAsset's regex pattern. The Batches can then be sorted lexicographically, numerically, by datetime, against a custom list, or in combination with custom filter functions.
For this how-to guide, we will use a ConfiguredAssetFilesystemDataConnector as an example.
To read more about DataConnectors, please refer to the doc: How to choose which DataConnector to use.
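To make the idea of "fields captured by the regex pattern" concrete, here is a minimal sketch in plain Python, independent of Great Expectations. The pattern and group names come from the lexicographic example in this guide; the helper to_batch_identifiers is hypothetical and only illustrates how a file name could be turned into a dictionary of fields that a Sorter then orders by.

```python
import re

# Pattern and group names as used in the lexicographic example of this guide.
pattern = re.compile(r"(.+)_(.+)\.csv")
group_names = ["name", "letter"]

data_references = ["test_ccc.csv", "test_aaa.csv", "test_bbb.csv"]


def to_batch_identifiers(data_reference: str) -> dict:
    """Hypothetical helper: map a file name to its captured fields."""
    match = pattern.fullmatch(data_reference)
    if match is None:
        raise ValueError(f"{data_reference!r} does not match the pattern")
    return dict(zip(group_names, match.groups()))


identifiers = [to_batch_identifiers(ref) for ref in data_references]

# A sorter orders Batches by one of the captured fields, here "letter".
by_letter = sorted(identifiers, key=lambda ids: ids["letter"])
```

Each Sorter in the sections below names one of these captured fields and an ordering direction.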
Prerequisites: This how-to guide assumes you have already:
Learned how to configure a DataContext using test_yaml_config
Learned how to create a Batch
Configuring a Lexicographical Sorter
Suppose you have the following lexicographic_example/ directory in your filesystem, and you want to treat *.csv files as Batches within the my_data_asset DataAsset:
    lexicographic_example/test_aaa.csv
    lexicographic_example/test_bbb.csv
    lexicographic_example/test_ccc.csv
    lexicographic_example/test_ddd.csv
    lexicographic_example/test_eee.csv
Note: In our example, the base_directory is set to /home/my_work_directory/, which is where the lexicographic_example/ folder lives (i.e. /home/my_work_directory/lexicographic_example/test_aaa.csv). However, it can also be assigned a relative path like ../, as can be seen in the examples below.
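As a rough illustration of how the two base_directory settings and the glob_directive relate, the sketch below joins the paths with the standard library. It assumes that a relative asset-level base_directory is resolved against the DataConnector-level one, which matches the full path given in the note above; the file notes.txt is hypothetical and only there to show the effect of the "*.csv" glob.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

connector_base_directory = PurePosixPath("/home/my_work_directory/")
asset_base_directory = PurePosixPath("lexicographic_example/")

# Assumption: the asset's relative directory is resolved against the connector's.
resolved_directory = connector_base_directory / asset_base_directory

# notes.txt is a hypothetical file that the "*.csv" glob should skip.
file_names = ["test_aaa.csv", "test_bbb.csv", "notes.txt"]
matched = [
    str(resolved_directory / name)
    for name in file_names
    if fnmatch(name, "*.csv")
]
```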
Load or create a DataContext
    import great_expectations as gx
    from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource
    from great_expectations.core.batch import BatchRequest

    context = gx.get_context()
Configure a Datasource
In the following configuration, a Datasource is configured with a PandasExecutionEngine and a ConfiguredAssetFilesystemDataConnector. The DataConnector is configured with a single DataAsset named my_data_asset. Its base_directory is set to lexicographic_example/ and the regex pattern is set to capture two group_names, name and letter. A LexicographicSorter is configured for the letter capture group, which captures the section of the file name that looks like aaa, and sorts the Batches in descending (reverse-alphabetical) order.
    config = f"""
    name: mydatasource
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      my_data_connector:
        module_name: great_expectations.datasource.data_connector
        class_name: ConfiguredAssetFilesystemDataConnector
        glob_directive: "*.csv"
        base_directory: /home/my_work_directory/
        default_regex:
          pattern: (.+)_(.+)\\.csv
          group_names:
            - name
            - letter
        sorters:
          - orderby: desc
            class_name: LexicographicSorter
            name: letter
        assets:
          my_data_asset:
            base_directory: lexicographic_example/
    """
(Optional) Run test_yaml_config() to ensure that your configuration is working.
    context.test_yaml_config(yaml_config=config)
If the configuration is correct, you should see output similar to this. Notice that the data references listed for the asset start with test_eee.csv, showing that the Batches have been sorted correctly.
    Attempting to instantiate class from config...
        Instantiating as a Datasource, since class_name is Datasource
        Successfully instantiated Datasource

    ExecutionEngine class name: PandasExecutionEngine
    Data Connectors:
        my_data_connector : ConfiguredAssetFilesystemDataConnector

        Available data_asset_names (1 of 1):
            my_data_asset (3 of 5): ['test_eee.csv', 'test_ddd.csv', 'test_ccc.csv']

        Unmatched data_references (0 of 0): []
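The descending order reported above can be reproduced in plain Python. This is only a sketch of the ordering, not of the Sorter's implementation: it assumes that orderby: desc on the letter group behaves like a reverse string sort on that field.

```python
# Captured fields for the five example files, deliberately out of order.
batch_identifiers = [
    {"name": "test", "letter": letter}
    for letter in ["ccc", "aaa", "eee", "bbb", "ddd"]
]

# "orderby: desc" on the "letter" group corresponds to reverse=True here.
sorted_desc = sorted(
    batch_identifiers, key=lambda ids: ids["letter"], reverse=True
)

# Rebuild the first three file names, as in the test_yaml_config() summary.
first_three = [f"{ids['name']}_{ids['letter']}.csv" for ids in sorted_desc[:3]]
```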
Save Configuration
    # save the configuration and re-instantiate the data context with our newly configured datasource
    sanitize_yaml_and_save_datasource(context, config, overwrite_existing=False)
    context = gx.get_context()
Obtain an ExpectationSuite
Your DataContext can be used to create or retrieve an ExpectationSuite.
    suite = context.get_expectation_suite("insert_your_expectation_suite_name_here")
Alternatively, if you have not already created a suite, you can do so now.
    suite = context.create_expectation_suite("insert_your_expectation_suite_name_here")
Construct a BatchRequest
The following BatchRequest will retrieve the first Batch from mydatasource, corresponding to test_eee.csv, by using index 0 as the data_connector_query.
    batch_request = BatchRequest(
        datasource_name="mydatasource",
        data_connector_name="my_data_connector",
        data_asset_name="my_data_asset",
        data_connector_query={"index": 0},
    )
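The effect of the index in data_connector_query can be pictured as indexing into the already-sorted list of Batches. The sketch below assumes the index behaves like an ordinary Python list index; the select function is a hypothetical stand-in and not part of the Great Expectations API.

```python
# Data references in the order produced by the descending LexicographicSorter.
sorted_references = [
    "test_eee.csv",
    "test_ddd.csv",
    "test_ccc.csv",
    "test_bbb.csv",
    "test_aaa.csv",
]


def select(references: list, data_connector_query: dict) -> str:
    """Hypothetical stand-in: pick one reference by integer index."""
    return references[data_connector_query["index"]]


first = select(sorted_references, {"index": 0})
```

Because the index is applied to the sorted list, changing the Sorter's orderby changes which file index 0 refers to.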
Construct a Validator
The BatchRequest and ExpectationSuite can be used to create a Validator.
    my_validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite=suite,
    )
Check your Validator
You can check that the correct Batch was retrieved by inspecting the active_batch's batch_definition.
    my_validator.active_batch.batch_definition
The expected output should show batch_identifiers corresponding to test_eee.csv, namely {'name': 'test', 'letter': 'eee'}.
    {'datasource_name': 'mydatasource', 'data_connector_name': 'my_data_connector', 'data_asset_name': 'my_data_asset', 'batch_identifiers': "{'name': 'test', 'letter': 'eee'}"}
You can also check that the first few lines of your Batch are what you expect by running:
    my_validator.active_batch.head()
Now that you have a Validator, you can use it to create Expectations or validate the data.
Configuring a Numeric Sorter
Suppose you have the following numeric_example/ directory in your filesystem, and you want to treat *.csv files as Batches within the my_data_asset DataAsset:
    numeric_example/test_111.csv
    numeric_example/test_222.csv
    numeric_example/test_333.csv
    numeric_example/test_444.csv
    numeric_example/test_555.csv
Note: In our example, the base_directory is set to ../. If we are running this notebook in the same folder as the Great Expectations home directory (i.e. great_expectations/), GX will begin looking for the files in the parent directory.
Load or create a DataContext
    import great_expectations as gx
    from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource
    from great_expectations.core.batch import BatchRequest

    context = gx.get_context()
Configure a Datasource
In the following configuration, a Datasource is configured with a PandasExecutionEngine and a ConfiguredAssetFilesystemDataConnector. The DataConnector is configured with a single DataAsset named my_data_asset. Its base_directory is set to numeric_example/ and the regex pattern is set to capture two group_names, name and number. A NumericSorter is configured for the number capture group, which captures the section of the file name that looks like 111, and sorts the Batches in decreasing order.
    config = f"""
    name: mydatasource
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      my_data_connector:
        module_name: great_expectations.datasource.data_connector
        class_name: ConfiguredAssetFilesystemDataConnector
        glob_directive: "*.csv"
        base_directory: ../
        default_regex:
          pattern: (.+)_(\\d.*)\\.csv
          group_names:
            - name
            - number
        sorters:
          - orderby: desc
            class_name: NumericSorter
            name: number
        assets:
          my_data_asset:
            base_directory: numeric_example/
    """
(Optional) Run test_yaml_config() to ensure that your configuration is working.
    context.test_yaml_config(yaml_config=config)
If the configuration is correct, you should see output similar to this. Notice that the data references listed for the asset start with test_555.csv, showing that the Batches have been sorted correctly.
    Attempting to instantiate class from config...
        Instantiating as a Datasource, since class_name is Datasource
        Successfully instantiated Datasource

    ExecutionEngine class name: PandasExecutionEngine
    Data Connectors:
        my_data_connector : ConfiguredAssetFilesystemDataConnector

        Available data_asset_names (1 of 1):
            my_data_asset (3 of 5): ['test_555.csv', 'test_444.csv', 'test_333.csv']

        Unmatched data_references (0 of 0): []
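The example files 111 to 555 all have three digits, so they would sort the same way lexicographically and numerically. The sketch below uses hypothetical values of different lengths to show why a numeric sort can matter; it is plain Python and makes no claim about how NumericSorter is implemented.

```python
# Hypothetical captured values with differing digit counts.
numbers = ["9", "10", "111", "25"]

# String comparison goes character by character, so "9" outranks "111".
lexicographic_desc = sorted(numbers, reverse=True)

# Converting to int first compares the values as numbers.
numeric_desc = sorted(numbers, key=int, reverse=True)
```

Here lexicographic_desc places "9" first, while numeric_desc places "111" first.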
Save Configuration
    # save the configuration and re-instantiate the data context with our newly configured datasource
    sanitize_yaml_and_save_datasource(context, config, overwrite_existing=False)
    context = gx.get_context()
Obtain an ExpectationSuite
Your DataContext can be used to create or retrieve an ExpectationSuite.
    suite = context.get_expectation_suite("insert_your_expectation_suite_name_here")
Alternatively, if you have not already created a suite, you can do so now.
    suite = context.create_expectation_suite("insert_your_expectation_suite_name_here")
Construct a BatchRequest
The following BatchRequest will retrieve the first Batch, corresponding to test_555.csv, by using index 0 as the data_connector_query.
    batch_request = BatchRequest(
        datasource_name="mydatasource",
        data_connector_name="my_data_connector",
        data_asset_name="my_data_asset",
        data_connector_query={"index": 0},
    )
Construct a Validator
The BatchRequest and ExpectationSuite can be used to create a Validator.
    my_validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite=suite,
    )
Check your Validator
You can check that the correct Batch was retrieved by inspecting the active_batch's batch_definition.
    my_validator.active_batch.batch_definition
The expected output should show batch_identifiers corresponding to test_555.csv, namely {'name': 'test', 'number': '555'}.
    {'datasource_name': 'mydatasource', 'data_connector_name': 'my_data_connector', 'data_asset_name': 'my_data_asset', 'batch_identifiers': "{'name': 'test', 'number': '555'}"}
You can also check that the first few lines of your Batch are what you expect by running:
    my_validator.active_batch.head()
Now that you have a Validator, you can use it to create Expectations or validate the data.
Configuring a Datetime Sorter
Suppose you have the following datetime_example/ directory in your filesystem, and you want to treat *.csv files as Batches within the my_data_asset DataAsset:
    datetime_example/test_20201229.csv
    datetime_example/test_20201230.csv
    datetime_example/test_20201231.csv
    datetime_example/test_20210101.csv
    datetime_example/test_20210102.csv
Note: In our example, the base_directory is set to ../. If we are running this notebook in the same folder as the Great Expectations home directory (i.e. great_expectations/), GX will begin looking for the files in the parent directory.
Load or create a DataContext
    import great_expectations as gx
    from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource
    from great_expectations.core.batch import BatchRequest

    context = gx.get_context()
Configure a Datasource
In the following configuration, a Datasource is configured with a PandasExecutionEngine and a ConfiguredAssetFilesystemDataConnector. The DataConnector is configured with a single DataAsset named my_data_asset. Its base_directory is set to datetime_example/ and the regex pattern is set to capture two group_names, name and date. A DateTimeSorter is configured for the date capture group, which captures the section of the file name that looks like 20210102, and sorts in descending order. The configuration for DateTimeSorter also includes an optional datetime_format parameter, which allows you to specify the datetime pattern (default is %Y%m%d).
    config = f"""
    name: mydatasource
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      my_data_connector:
        module_name: great_expectations.datasource.data_connector
        class_name: ConfiguredAssetFilesystemDataConnector
        glob_directive: "*.csv"
        base_directory: ../
        default_regex:
          pattern: (.+)_(.+)\\.csv
          group_names:
            - name
            - date
        sorters:
          - orderby: desc
            class_name: DateTimeSorter
            datetime_format: "%Y%m%d"
            name: date
        assets:
          my_data_asset:
            base_directory: datetime_example/
    """
(Optional) Run test_yaml_config() to ensure that your configuration is working.
    context.test_yaml_config(yaml_config=config)
If the configuration is correct, you should see output similar to this. Notice that the data references listed for the asset start with test_20210102.csv, showing that the Batches have been sorted correctly.
    Attempting to instantiate class from config...
        Instantiating as a Datasource, since class_name is Datasource
        Successfully instantiated Datasource

    ExecutionEngine class name: PandasExecutionEngine
    Data Connectors:
        my_data_connector : ConfiguredAssetFilesystemDataConnector

        Available data_asset_names (1 of 1):
            my_data_asset (3 of 5): ['test_20210102.csv', 'test_20210101.csv', 'test_20201231.csv']

        Unmatched data_references (0 of 0): []
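To see what datetime_format contributes, the following sketch parses the captured strings with the standard library's datetime.strptime before sorting. The second part uses a hypothetical day-first format (%d%m%Y), which does not appear in this guide, to show a case where a plain string sort would give the wrong chronological order. This illustrates the idea only and is not the DateTimeSorter implementation.

```python
from datetime import datetime

datetime_format = "%Y%m%d"
dates = ["20201231", "20210102", "20201229", "20210101", "20201230"]

# Parse each captured string into a datetime, then sort newest first.
sorted_desc = sorted(
    dates,
    key=lambda value: datetime.strptime(value, datetime_format),
    reverse=True,
)

# Hypothetical day-first file names: string order and date order disagree.
other_format = "%d%m%Y"
other_dates = ["31122020", "01012021"]
chronological = sorted(
    other_dates, key=lambda value: datetime.strptime(value, other_format)
)
as_strings = sorted(other_dates)
```

With the year-first format used in this guide, string order and date order coincide; with the day-first format they do not, which is where specifying the format becomes necessary.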
Save Configuration
    # save the configuration and re-instantiate the data context with our newly configured datasource
    sanitize_yaml_and_save_datasource(context, config, overwrite_existing=False)
    context = gx.get_context()
Obtain an ExpectationSuite
Your DataContext can be used to create or retrieve an ExpectationSuite.
    suite = context.get_expectation_suite("insert_your_expectation_suite_name_here")
Alternatively, if you have not already created a suite, you can do so now.
    suite = context.create_expectation_suite("insert_your_expectation_suite_name_here")
Construct a BatchRequest
The following BatchRequest will retrieve the first Batch, corresponding to test_20210102.csv, by using index 0 as the data_connector_query.
    batch_request = BatchRequest(
        datasource_name="mydatasource",
        data_connector_name="my_data_connector",
        data_asset_name="my_data_asset",
        data_connector_query={"index": 0},
    )
Construct a Validator
The BatchRequest and ExpectationSuite can be used to create a Validator.
    my_validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite=suite,
    )
Check your Validator
You can check that the correct Batch was retrieved by inspecting the active_batch's batch_definition.
    my_validator.active_batch.batch_definition
The expected output should show batch_identifiers corresponding to test_20210102.csv, namely {'name': 'test', 'date': '20210102'}.
    {'datasource_name': 'mydatasource', 'data_connector_name': 'my_data_connector', 'data_asset_name': 'my_data_asset', 'batch_identifiers': "{'name': 'test', 'date': '20210102'}"}
You can also check that the first few lines of your Batch are what you expect by running:
    my_validator.active_batch.head()
Now that you have a Validator, you can use it to create Expectations or validate the data.
Configuring a CustomList Sorter
Great Expectations also allows Sorters to be configured against an ordering defined in a custom list (such as the Periodic Table of Elements, or the list of Marvel movies leading up to Avengers: Endgame).
Suppose you have the following elements_example/ directory in your filesystem, and you want to treat *.csv files as Batches within the my_data_asset DataAsset:
    elements_example/test_H.csv
    elements_example/test_He.csv
    elements_example/test_Li.csv
    elements_example/test_Be.csv
    elements_example/test_B.csv
    elements_example/test_C.csv
Note: In our example, the base_directory is set to ../. If we are running this notebook in the same folder as the Great Expectations home directory (i.e. great_expectations/), GX will begin looking for the files in the parent directory.
Load or create a DataContext
    import great_expectations as gx
    from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource
    from great_expectations.core.batch import BatchRequest

    context = gx.get_context()
Configure a Datasource
In the following configuration, a Datasource is configured with a PandasExecutionEngine and a ConfiguredAssetFilesystemDataConnector. The DataConnector is configured with a single DataAsset named my_data_asset. Its base_directory is set to elements_example/ and the regex pattern is set to capture two group_names, name and element. A CustomListSorter is configured for the element capture group and sorts the Batches in ascending order. We also configure the required reference_list parameter, passing in a custom list (my_custom_list) containing the first 6 elements in the Periodic Table of Elements.
    # custom list that we are passing containing the ordering for the first 6 elements
    my_custom_list = ["H", "He", "Li", "Be", "B", "C"]

    config = f"""
    name: mydatasource
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      my_data_connector:
        module_name: great_expectations.datasource.data_connector
        class_name: ConfiguredAssetFilesystemDataConnector
        glob_directive: "*.csv"
        base_directory: ../
        default_regex:
          pattern: (.+)_(.+)\\.csv
          group_names:
            - name
            - element
        sorters:
          - orderby: asc
            class_name: CustomListSorter
            reference_list: {my_custom_list}
            name: element
        assets:
          my_data_asset:
            base_directory: elements_example/
    """
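The config string is an f-string so that my_custom_list can be interpolated into the YAML. The short sketch below shows what that interpolation produces for the reference_list line; Python renders the list with its usual repr, which has the shape of a YAML flow-style sequence of quoted strings. The variable name snippet is only for illustration.

```python
import ast

my_custom_list = ["H", "He", "Li", "Be", "B", "C"]

# The same interpolation that happens inside the f-string config.
snippet = f"reference_list: {my_custom_list}"

# The rendered value round-trips back to the original list.
rendered_value = snippet.split(": ", 1)[1]
round_tripped = ast.literal_eval(rendered_value)
```

This works for simple strings like element symbols; list entries containing quote characters would need more careful serialization.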
(Optional) Run test_yaml_config() to ensure that your configuration is working.
    context.test_yaml_config(yaml_config=config)
If the configuration is correct, you should see output similar to this. Notice that the data references listed for the asset start with test_H.csv, showing that the Batches have been sorted correctly.
    Attempting to instantiate class from config...
        Instantiating as a Datasource, since class_name is Datasource
        Successfully instantiated Datasource

    ExecutionEngine class name: PandasExecutionEngine
    Data Connectors:
        my_data_connector : ConfiguredAssetFilesystemDataConnector

        Available data_asset_names (1 of 1):
            my_data_asset (3 of 6): ['test_H.csv', 'test_He.csv', 'test_Li.csv']

        Unmatched data_references (0 of 0): []
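Sorting against a reference list amounts to ordering each value by its position in that list. The sketch below shows this in plain Python for the six element symbols; it illustrates the ordering and is not the CustomListSorter implementation. The rank dictionary is a hypothetical helper.

```python
reference_list = ["H", "He", "Li", "Be", "B", "C"]

# Captured element symbols in plain lexicographic order, as a starting point.
elements = ["B", "Be", "C", "H", "He", "Li"]

# Map each symbol to its position in the reference list.
rank = {symbol: position for position, symbol in enumerate(reference_list)}

# Ascending order by position in the reference list.
sorted_asc = sorted(elements, key=lambda symbol: rank[symbol])
```

In this sketch a symbol missing from reference_list raises a KeyError, which is one reason to keep the list complete for the files in the directory.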
Save Configuration
    # save the configuration and re-instantiate the data context with our newly configured datasource
    sanitize_yaml_and_save_datasource(context, config, overwrite_existing=False)
    context = gx.get_context()
Obtain an ExpectationSuite
Your DataContext can be used to create or retrieve an ExpectationSuite.
    suite = context.get_expectation_suite("insert_your_expectation_suite_name_here")
Alternatively, if you have not already created a suite, you can do so now.
    suite = context.create_expectation_suite("insert_your_expectation_suite_name_here")
Construct a BatchRequest
The following BatchRequest will retrieve the first Batch, corresponding to test_H.csv, by using index 0 as the data_connector_query.
    batch_request = BatchRequest(
        datasource_name="mydatasource",
        data_connector_name="my_data_connector",
        data_asset_name="my_data_asset",
        data_connector_query={"index": 0},
    )
Construct a Validator
The BatchRequest and ExpectationSuite can be used to create a Validator.
    my_validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite=suite,
    )
Check your Validator
You can check that the correct Batch was retrieved by inspecting the active_batch's batch_definition.
    my_validator.active_batch.batch_definition
The expected output should show batch_identifiers corresponding to test_H.csv, namely {'name': 'test', 'element': 'H'}.
    {'datasource_name': 'mydatasource', 'data_connector_name': 'my_data_connector', 'data_asset_name': 'my_data_asset', 'batch_identifiers': "{'name': 'test', 'element': 'H'}"}
You can also check that the first few lines of your Batch are what you expect by running:
    my_validator.active_batch.head()
Now that you have a Validator, you can use it to create Expectations or validate the data.
Configuring Multiple Sorters
If your configuration contains multiple sorters, they will be applied in order of their configuration. Suppose you have the following multiple_sorters_example/ directory in your filesystem, and you want to treat *.csv files as Batches within the my_data_asset DataAsset, sorting them 1) by datetime, 2) lexicographically, and 3) numerically:
    multiple_sorters_example/test_AAA_111_20201230.csv
    multiple_sorters_example/test_BBB_222_20201231.csv
    multiple_sorters_example/test_CCC_333_20210101.csv
    multiple_sorters_example/test_DDD_444_20210102.csv
    multiple_sorters_example/test_EEE_555_20210103.csv
Note: In our example, the base_directory is set to ../. If we are running this notebook in the same folder as the Great Expectations home directory (i.e. great_expectations/), GX will begin looking for the files in the parent directory.
Load or create a DataContext
    import great_expectations as gx
    from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource
    from great_expectations.core.batch import BatchRequest

    context = gx.get_context()
Configure a Datasource
In the following configuration, a Datasource is configured with a PandasExecutionEngine and a ConfiguredAssetFilesystemDataConnector. The DataConnector is configured with a single DataAsset named my_data_asset. Its base_directory is set to multiple_sorters_example/ and the regex pattern is set to capture four group_names: name, letter, number, and datetime. Three Sorters are also configured: first a DateTimeSorter for the datetime field (ascending order), then a LexicographicSorter for the letter field (descending order), and finally a NumericSorter for the number field (descending order).
    config = f"""
    name: mydatasource
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      my_data_connector:
        module_name: great_expectations.datasource.data_connector
        class_name: ConfiguredAssetFilesystemDataConnector
        glob_directive: "*.csv"
        base_directory: ../
        default_regex:
          pattern: (.+)_(.+)_(\\d.*)_(.+)\\.csv
          group_names:
            - name
            - letter
            - number
            - datetime
        sorters:
          - orderby: asc
            class_name: DateTimeSorter
            name: datetime
          - orderby: desc
            class_name: LexicographicSorter
            name: letter
          - orderby: desc
            class_name: NumericSorter
            name: number
        assets:
          my_data_asset:
            base_directory: multiple_sorters_example/
    """
(Optional) Run test_yaml_config() to ensure that your configuration is working.
    context.test_yaml_config(yaml_config=config)
If the configuration is correct, you should see output similar to this. Notice that the data references listed for the asset start with test_AAA_111_20201230.csv, showing that the Batches have been sorted correctly.
    Attempting to instantiate class from config...
        Instantiating as a Datasource, since class_name is Datasource
        Successfully instantiated Datasource

    ExecutionEngine class name: PandasExecutionEngine
    Data Connectors:
        my_data_connector : ConfiguredAssetFilesystemDataConnector

        Available data_asset_names (1 of 1):
            my_data_asset (3 of 5): ['test_AAA_111_20201230.csv', 'test_BBB_222_20201231.csv', 'test_CCC_333_20210101.csv']

        Unmatched data_references (0 of 0): []
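In the example files every date is distinct, so the datetime Sorter alone decides the order and the other two never break a tie. The sketch below adds one hypothetical entry (ZZZ, 999) that shares a date with the AAA file to show how lower-priority keys come into play. It assumes the first configured Sorter has the highest priority and realizes that with Python's stable sort applied from the lowest-priority key to the highest; this is one way to obtain such an ordering, not a statement about Great Expectations internals.

```python
from datetime import datetime

batches = [
    {"letter": "AAA", "number": "111", "datetime": "20201230"},
    {"letter": "CCC", "number": "333", "datetime": "20210101"},
    {"letter": "BBB", "number": "222", "datetime": "20201231"},
    # Hypothetical extra file sharing a date with AAA, to show tie-breaking.
    {"letter": "ZZZ", "number": "999", "datetime": "20201230"},
]

# (key function, reverse) pairs in configuration order, highest priority first.
sorters = [
    (lambda b: datetime.strptime(b["datetime"], "%Y%m%d"), False),  # asc
    (lambda b: b["letter"], True),  # desc
    (lambda b: int(b["number"]), True),  # desc
]

ordered = list(batches)
# sorted() is stable, so applying the lowest-priority key first and the
# highest-priority key last yields a combined multi-key ordering.
for key, reverse in reversed(sorters):
    ordered = sorted(ordered, key=key, reverse=reverse)

letters_in_order = [b["letter"] for b in ordered]
```

The two entries dated 20201230 come first because of the ascending datetime key, and ZZZ precedes AAA because the letter key is descending.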
Save Configuration
    # save the configuration and re-instantiate the data context with our newly configured datasource
    sanitize_yaml_and_save_datasource(context, config, overwrite_existing=False)
    context = gx.get_context()
Obtain an ExpectationSuite
Your DataContext can be used to create or retrieve an ExpectationSuite.
    suite = context.get_expectation_suite("insert_your_expectation_suite_name_here")
Alternatively, if you have not already created a suite, you can do so now.
    suite = context.create_expectation_suite("insert_your_expectation_suite_name_here")
Construct a BatchRequest
The following BatchRequest will retrieve the first Batch, corresponding to test_AAA_111_20201230.csv, by using index 0 as the data_connector_query.
    batch_request = BatchRequest(
        datasource_name="mydatasource",
        data_connector_name="my_data_connector",
        data_asset_name="my_data_asset",
        data_connector_query={"index": 0},
    )
Construct a Validator
The BatchRequest and ExpectationSuite can be used to create a Validator.
    my_validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite=suite,
    )
Check your Validator
You can check that the correct Batch was retrieved by inspecting the active_batch's batch_definition.
    my_validator.active_batch.batch_definition
The expected output should show batch_identifiers corresponding to test_AAA_111_20201230.csv, namely {'name': 'test', 'letter': 'AAA', 'number': '111', 'datetime': '20201230'}.
    {'datasource_name': 'mydatasource', 'data_connector_name': 'my_data_connector', 'data_asset_name': 'my_data_asset', 'batch_identifiers': "{'name': 'test', 'letter': 'AAA', 'number': '111', 'datetime': '20201230'}"}
You can also check that the first few lines of your Batch are what you expect by running:
    my_validator.active_batch.head()
Now that you have a Validator, you can use it to create Expectations or validate the data.
Configuring Sorters with Custom Filters
You can also use Sorters in combination with custom filter functions that are passed with a BatchRequest. Suppose you have the following year_reports/ directory in your filesystem, you want to treat *.csv files as Batches within the my_data_asset DataAsset, and you only want to consider the reports from 2000 onward, in ascending order:
    year_reports/report_1980.csv
    year_reports/report_1990.csv
    year_reports/report_2000.csv
    year_reports/report_2010.csv
    year_reports/report_2020.csv
Note: In our example, the base_directory is set to ../. If we are running this notebook in the same folder as the Great Expectations home directory (i.e. great_expectations/), GX will begin looking for the files in the parent directory.
Load or create a DataContext
    import great_expectations as gx
    from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource
    from great_expectations.core.batch import BatchRequest

    context = gx.get_context()
Configure a Datasource
In the following configuration, a Datasource is configured with a PandasExecutionEngine and a ConfiguredAssetFilesystemDataConnector. The DataConnector is configured with a single DataAsset named my_data_asset. Its base_directory is set to year_reports/ and the regex pattern is set to capture two group_names: name and year. A NumericSorter is also configured to sort the year field in ascending order.
    config = f"""
    name: mydatasource
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      my_data_connector:
        module_name: great_expectations.datasource.data_connector
        class_name: ConfiguredAssetFilesystemDataConnector
        glob_directive: "*.csv"
        base_directory: ../
        default_regex:
          pattern: (.+)_(\\d.*)\\.csv
          group_names:
            - name
            - year
        sorters:
          - orderby: asc
            class_name: NumericSorter
            name: year
        assets:
          my_data_asset:
            base_directory: year_reports/
    """
(Optional) Run test_yaml_config() to ensure that your configuration is working.
    context.test_yaml_config(yaml_config=config)
If the configuration is correct, you should see output similar to this. Notice that the data references listed for the asset start with report_1980.csv, showing that the Batches have been sorted correctly. At this point the reports have not yet been filtered to those from 2000 onward; the filter is applied with the BatchRequest.
    Attempting to instantiate class from config...
        Instantiating as a Datasource, since class_name is Datasource
        Successfully instantiated Datasource

    ExecutionEngine class name: PandasExecutionEngine
    Data Connectors:
        my_data_connector : ConfiguredAssetFilesystemDataConnector

        Available data_asset_names (1 of 1):
            my_data_asset (3 of 5): ['report_1980.csv', 'report_1990.csv', 'report_2000.csv']

        Unmatched data_references (0 of 0): []
Save Configuration
    # save the configuration and re-instantiate the data context with our newly configured datasource
    sanitize_yaml_and_save_datasource(context, config, overwrite_existing=False)
    context = gx.get_context()
Obtain an ExpectationSuite
Your DataContext can be used to create or retrieve an ExpectationSuite.
    suite = context.get_expectation_suite("insert_your_expectation_suite_name_here")
Alternatively, if you have not already created a suite, you can do so now.
    suite = context.create_expectation_suite("insert_your_expectation_suite_name_here")
Construct a BatchRequest
The following BatchRequest will retrieve the Batch corresponding to report_2000.csv by using index 0 and a custom_filter_function, which takes in batch_identifiers as a dictionary and applies a filter on the year key.
    # only select files from on or after 2000
    def my_custom_batch_selector(batch_identifiers: dict) -> bool:
        return int(batch_identifiers["year"]) >= 2000

    batch_request = BatchRequest(
        datasource_name="mydatasource",
        data_connector_name="my_data_connector",
        data_asset_name="my_data_asset",
        data_connector_query={
            "index": 0,
            "custom_filter_function": my_custom_batch_selector,
        },
    )
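The combined effect of the Sorter, the filter function, and the index can be traced in plain Python. The sketch below assumes that the index is applied to the Batches that remain after sorting and filtering; it reuses my_custom_batch_selector from the guide and otherwise relies only on built-in list operations rather than the Great Expectations API.

```python
def my_custom_batch_selector(batch_identifiers: dict) -> bool:
    # Captured values are strings, so convert before comparing.
    return int(batch_identifiers["year"]) >= 2000


# Captured fields for the five report files, deliberately out of order.
batches = [
    {"name": "report", "year": year}
    for year in ["2010", "1980", "2020", "2000", "1990"]
]

# NumericSorter with orderby: asc on the "year" field.
sorted_asc = sorted(batches, key=lambda b: int(b["year"]))

# Keep only the Batches accepted by the custom filter function.
filtered = [b for b in sorted_asc if my_custom_batch_selector(b)]

# "index": 0 then selects the first remaining Batch.
selected = filtered[0]
```

Without the filter, index 0 would point at the 1980 report; with it, the earliest remaining report is the one from 2000.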
Construct a Validator
The BatchRequest and ExpectationSuite can be used to create a Validator.
    my_validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite=suite,
    )
Check your Validator
You can check that the correct Batch was retrieved by inspecting the active_batch's batch_definition.
    my_validator.active_batch.batch_definition
The expected output should show batch_identifiers corresponding to report_2000.csv, namely {'name': 'report', 'year': '2000'}.
    {'datasource_name': 'mydatasource', 'data_connector_name': 'my_data_connector', 'data_asset_name': 'my_data_asset', 'batch_identifiers': "{'name': 'report', 'year': '2000'}"}
You can also check that the first few lines of your Batch are what you expect by running:
    my_validator.active_batch.head()
Now that you have a Validator, you can use it to create Expectations or validate the data.