Connect to filesystem Data Assets
Use the information provided here to connect to Data Assets stored on Amazon S3, Google Cloud Storage (GCS), Microsoft Azure Blob Storage, or local filesystems. Great Expectations (GX) uses the term Data Asset when referring to data in its original format, and the term Data Source when referring to the storage location for Data Assets.
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- Filesystem
Amazon S3 Data Source
Connect to an Amazon S3 Data Source.
- pandas
- Spark
The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- An installation of GX set up to work with S3
- Access to data in an S3 bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create an Amazon S3 Data Source:
- name: The Data Source name. In the following examples, this is "my_s3_datasource".
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.

Run the following Python code to define name, bucket_name, and boto3_options:

datasource_name = "my_s3_datasource"
bucket_name = "my_bucket"
boto3_options = {}

Additional options for boto3_options

The boto3_options parameter allows you to pass the following information:

- endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the corresponding environment variable.
- region_name: Your AWS region name.
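For example, a minimal sketch of non-default boto3_options (the endpoint and region values below are hypothetical placeholders):

boto3_options = {
    # "${S3_ENDPOINT}" is replaced at runtime with the S3_ENDPOINT environment variable
    "endpoint_url": "${S3_ENDPOINT}",
    # hypothetical region; use the region your bucket is hosted in
    "region_name": "us-east-1",
}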
Run the following Python code to pass name, bucket_name, and boto3_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_s3(
    name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
)
Add data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, s3_prefix=s3_prefix
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your S3 bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
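As a sketch of requesting a single Batch (assuming the data_asset defined above and the fluent build_batch_request API), you could ask for November 2021 like this:

# request only the Batch whose regex groups match year=2021 and month=11
batch_request = data_asset.build_batch_request(options={"year": "2021", "month": "11"})
batch_list = data_asset.get_batch_list_from_batch_request(batch_request)  # one Batch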
Use Spark to connect to an Amazon S3 Data Source. The following examples connect to .csv data.
Prerequisites
- An installation of GX set up to work with S3
- Access to data in an S3 bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create an Amazon S3 Data Source:
- name: The Data Source name. In the following examples, this is "my_s3_datasource".
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.

Run the following Python code to define name, bucket_name, and boto3_options:

datasource_name = "my_s3_datasource"
bucket_name = "my_bucket"
boto3_options = {}

Additional options for boto3_options

The boto3_options parameter allows you to pass the following information:

- endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the corresponding environment variable.
- region_name: Your AWS region name.

Run the following Python code to pass name, bucket_name, and boto3_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_s3(
    name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
)
Add data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
s3_prefix=s3_prefix,
header=True,
infer_schema=True,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your S3 bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
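You can also pass only some of the keys to request a group of Batches. A sketch (assuming the fluent build_batch_request and get_batch_list_from_batch_request methods):

# request every Batch from 2021 (two files in the example above)
batch_request = data_asset.build_batch_request(options={"year": "2021"})
batches = data_asset.get_batch_list_from_batch_request(batch_request)
for batch in batches:
    print(batch.batch_spec)  # inspect which file each Batch maps to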
Microsoft Azure Blob Storage
Connect to a Microsoft Azure Blob Storage Data Source.
- pandas
- Spark
Use Pandas to connect to data stored in Microsoft Azure Blob Storage. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- GX installed and set up to work with Azure Blob Storage
- Access to data in Azure Blob Storage
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: The Data Source name. In the following examples, this is "my_datasource".
- azure_options: Authentication settings.

Run the following Python code to define name and azure_options:

datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}

Run the following Python code to pass name and azure_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_abs(
    name=datasource_name, azure_options=azure_options
)

Where did that account URL come from?

In the previous example, the value for account_url is substituted with the contents of the AZURE_STORAGE_ACCOUNT_URL key you configured when you installed GX and set up your Azure Blob Storage dependencies.
Add data to the Data Source as a Data Asset
To specify the data to connect to, you can define the following elements:

- name: A name by which you can reference the Data Asset in the future.
- batching_regex: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: A string indicating what part of the batching_regex to truncate from the final Batch names.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating whether files should be searched for recursively in subfolders.

Run the following Python code to define the connection values:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"

Run the following Python code to create the Data Asset:

data_asset = datasource.add_csv_asset(
    name=asset_name,
    batching_regex=batching_regex,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your Azure Blob Storage container has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
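If your files sit in subfolders below abs_name_starts_with, you can opt in to recursive discovery with the optional flag described above. A minimal sketch, reusing the variables defined earlier (the asset name is hypothetical):

data_asset = datasource.add_csv_asset(
    name="my_recursive_taxi_asset",  # hypothetical name
    batching_regex=batching_regex,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
    abs_recursive_file_discovery=True,  # also search subfolders for matching files
)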
Use Spark to connect to data stored in Microsoft Azure Blob Storage. The following examples connect to .csv data.
Prerequisites
- GX installed and set up to work with Azure Blob Storage
- Access to data in Azure Blob Storage
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: The Data Source name. In the following examples, this is "my_datasource".
- azure_options: Authentication settings.

Run the following Python code to define name and azure_options:

datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
}

Run the following Python code to pass name and azure_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_abs(
    name=datasource_name, azure_options=azure_options
)

Where did that account URL come from?

In the previous example, the value for account_url is substituted with the contents of the AZURE_STORAGE_ACCOUNT_URL key you configured when you installed GX and set up your Azure Blob Storage dependencies.
Add data to the Data Source as a Data Asset
To specify the data to connect to, you can define the following elements:

- name: A name by which you can reference the Data Asset in the future.
- batching_regex: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: A string indicating what part of the batching_regex to truncate from the final Batch names.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating whether files should be searched for recursively in subfolders.

Run the following Python code to define the connection values:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"

Run the following Python code to create the Data Asset:

data_asset = datasource.add_csv_asset(
    name=asset_name,
    batching_regex=batching_regex,
    abs_container=abs_container,
    header=True,
    infer_schema=True,
    abs_name_starts_with=abs_name_starts_with,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your Azure Blob Storage container has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
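To check which option keys a Data Asset accepts in a Batch Request, a sketch (this assumes the fluent API exposes a batch_request_options property; verify against your GX version's API reference):

# the regex group names become the available Batch Request option keys
print(data_asset.batch_request_options)  # expected to include "year" and "month"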
GCS Data Source
Connect to a GCS Data Source.
- pandas
- Spark
The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data in a GCS bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a GCS Data Source:
- name: The Data Source name. In the following examples, this is "my_gcs_datasource".
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.

Run the following Python code to define name, bucket_or_name, and gcs_options:

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}

Run the following Python code to pass name, bucket_or_name, and gcs_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_gcs(
    name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Add GCS data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, gcs_prefix=gcs_prefix
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
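For instance, a sketch of the single-Batch case described above (the asset name is hypothetical):

single_batch_asset = datasource.add_csv_asset(
    name="my_january_2023_asset",  # hypothetical name
    batching_regex=r"yellow_tripdata_sample_2023-01\.csv",  # no regex groups: one Batch
    gcs_prefix=gcs_prefix,
)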
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Use Spark to connect to a GCS Data Source. The following examples connect to .csv data.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data in a GCS bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a GCS Data Source:
- name: The Data Source name. In the following examples, this is "my_gcs_datasource".
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.

Run the following Python code to define name, bucket_or_name, and gcs_options:

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}

Run the following Python code to pass name, bucket_or_name, and gcs_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_gcs(
    name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Add GCS data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
gcs_prefix=gcs_prefix,
header=True,
infer_schema=True,
)
header and infer_schema

In the previous example, header and infer_schema are optional parameters. If the file does not have a header line, the header parameter can be omitted; it defaults to false. If you do not want GX to infer the schema of your file, you can omit the infer_schema parameter; it also defaults to false.
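In other words (a sketch relying on those documented defaults), omitting both parameters is equivalent to reading header-less, untyped data (the asset name is hypothetical):

data_asset = datasource.add_csv_asset(
    name="my_headerless_asset",  # hypothetical name
    batching_regex=batching_regex,
    gcs_prefix=gcs_prefix,
    # header=False and infer_schema=False are the defaults: the first line is
    # treated as data and every column is read as a string
)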
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Filesystem Data Source
Connect to filesystem Data Assets.
- Single file with pandas
- Multiple files with pandas
- Multiple files with Spark
Use Pandas to connect to data stored in files on a filesystem. The following examples connect to .csv
data. However, GX supports most of the Pandas read methods.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Specify a file to read into a Data Asset
Run the following Python code to read the data in individual files directly into a Validator with Pandas:
validator = context.sources.pandas_default.read_csv(
"https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
In this example, we are connecting to a .csv file. However, Great Expectations supports connecting to most types of files that Pandas has read_* methods for.
Because you will be using Pandas to connect to these files, the specific add_*_asset methods that will be available to you will be determined by your currently installed version of Pandas.
For more information on which Pandas read_* methods are available to you as add_*_asset methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed.
In the GX Python API, add_*_asset methods require the same parameters as the corresponding Pandas read_* method, with one caveat: in Great Expectations, you will also be required to provide a value for an asset_name parameter.
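For example, a sketch of reading a different file type through the same default Pandas Data Source (the Parquet path is hypothetical, and this assumes your pandas installation can read Parquet):

# read_parquet mirrors pandas.read_parquet; other read_* methods work the same way
validator = context.sources.pandas_default.read_parquet(
    "./data/yellow_tripdata_sample_2019-01.parquet"  # hypothetical local file
)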
Create Data Source (Optional)
Modify the following code to connect to your Data Source. If you don't have data available for testing, you can use the NYC taxi data. The NYC taxi data is open source, and it is updated every month. An individual record in the data corresponds to one taxi trip.
Do not include sensitive information such as credentials in the configuration when you connect to your Data Source. This information appears as plain text in the database. If you must include credentials or a full connection string, GX recommends using a config variables file.
# Give your Datasource a name
datasource_name = None
datasource = context.sources.add_pandas(datasource_name)
# Give your first Asset a name
asset_name = None
path_to_data = None
# to use sample data uncomment next line
# path_to_data = "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)
# Build batch request
batch_request = asset.build_batch_request()
Use Pandas to connect to data stored in files on a filesystem. The following examples connect to .csv
data. However, GX supports most of the Pandas read methods.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Filesystem Data Source:
- name: The Data Source name.
- base_directory: The path to the folder containing the files the Data Source connects to.

Run the following Python code to define name and base_directory, and store the information in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Data Source

If you are using a Filesystem Data Context, you can provide a path for base_directory that is relative to the folder containing your Data Context.
However, an in-memory Ephemeral Data Context doesn't exist in a folder. Therefore, when using an Ephemeral Data Context, relative paths will be determined based on the folder your Python code is being executed in, instead.

Run the following Python code to pass name and base_directory as parameters when you create your Data Source:

datasource = context.sources.add_pandas_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)

What if my Data Assets are located in different folders?

You can access files that are nested in folders under your Data Source's base_directory.
If your Data Assets are located in multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
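As a sketch of that nested-folder approach (the subfolder name is hypothetical):

# files live in <base_directory>/taxi_data/, so the subfolder is part of the regex
batching_regex = r"taxi_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"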
Add a Data Asset to the Data Source
A Data Asset requires the following information to be defined:
- name: The Data Asset name. Helpful when you define multiple Data Assets in the same Data Source.
- batching_regex: A regular expression that matches the files to be included in the Data Asset.

What if batching_regex matches multiple files?

Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Data Source pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names "year" and "month". When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned.
For example, you could return all Batches in the year of 2021, or the one Batch for July of 2020, as shown in the sketch below.
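A minimal sketch of those two Batch Requests (assuming the Data Asset created in the steps below and the fluent build_batch_request API):

data_asset = datasource.get_asset(asset_name)  # retrieve the Data Asset created below
# every Batch from 2021
batch_request_2021 = data_asset.build_batch_request(options={"year": "2021"})
# the single Batch for July of 2020
batch_request_july_2020 = data_asset.build_batch_request(options={"year": "2020", "month": "07"})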
Run the following Python code to define name and batching_regex, and store the information in the Python variables asset_name and batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

Run the following Python code to pass name and batching_regex as parameters when you create your Data Asset:

datasource.add_csv_asset(name=asset_name, batching_regex=batching_regex)
Using Pandas to connect to different file types

In this example, we are connecting to a .csv file. However, Great Expectations supports connecting to most types of files that Pandas has read_* methods for.
Because you will be using Pandas to connect to these files, the specific add_*_asset methods that will be available to you will be determined by your currently installed version of Pandas.
For more information on which Pandas read_* methods are available to you as add_*_asset methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed.
In the GX Python API, add_*_asset methods require the same parameters as the corresponding Pandas read_* method, with one caveat: in Great Expectations, you will also be required to provide a value for an asset_name parameter.
Add additional files as Data Assets (Optional)
Your Data Source can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Data Source. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
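For example, a sketch that adds a second Data Asset to the same Data Source (the green_tripdata pattern is hypothetical):

# a second Data Asset in the same Data Source, matching a different set of files
datasource.add_csv_asset(
    name="my_green_taxi_data_asset",  # hypothetical name
    batching_regex=r"green_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
)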
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Related documentation
For more information on Pandas read_* methods, see the Pandas Input/output documentation.
Use Spark to connect to data stored in files on a filesystem. The following examples connect to .csv
data.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Filesystem Data Source:
- name: The Data Source name.
- base_directory: The path to the folder containing the files the Data Source connects to.

Run the following Python code to define name and base_directory, and store the information in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Data Source

If you are using a Filesystem Data Context, you can provide a path for base_directory that is relative to the folder containing your Data Context.
However, an in-memory Ephemeral Data Context doesn't exist in a folder. Therefore, when using an Ephemeral Data Context, relative paths will be determined based on the folder your Python code is being executed in, instead.

Run the following Python code to pass name and base_directory as parameters when you create your Data Source:

datasource = context.sources.add_spark_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)

What if my Data Assets are located in different folders?

You can access files that are nested in folders under your Data Source's base_directory.
If your Data Assets are located in multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
Add a Data Asset to the Data Source
A Data Asset requires the following information to be defined:
- name: The Data Asset name. Helpful when you define multiple Data Assets in the same Data Source.
- batching_regex: A regular expression that matches the files to be included in the Data Asset.

What if batching_regex matches multiple files?

Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Data Source pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names "year" and "month". When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned.
For example, you could return all Batches in the year of 2021, or the one Batch for July of 2020.
Run the following Python code to define name and batching_regex, and store the information in the Python variables asset_name and batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

In addition, the argument header informs the Spark DataFrame reader that the files contain a header row, while the argument infer_schema instructs the Spark DataFrame reader to make a best effort to determine the schema of the columns automatically.
Run the following Python code to pass name and batching_regex, along with the optional header and infer_schema arguments, as parameters when you create your Data Asset:

datasource.add_csv_asset(
    name=asset_name, batching_regex=batching_regex, header=True, infer_schema=True
)
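Once the Data Asset exists, you can retrieve it again in a later session. A sketch (assuming the fluent get_datasource and get_asset methods in your GX version):

datasource = context.get_datasource("my_new_datasource")
data_asset = datasource.get_asset("my_taxi_data_asset")
batch_request = data_asset.build_batch_request()  # no options: request all Batches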
Add additional files as Data Assets (Optional)
Your Data Source can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Data Source. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
Related documentation
For more information about storing credentials for use with GX, see How to configure credentials.