How to configure a Pandas/filesystem Datasource

This guide shows how to configure a Pandas Datasource for data that is stored as files on a local or NFS-type filesystem.

Prerequisites: This how-to guide assumes you have already:

  • Set up a working deployment of Great Expectations


To add a filesystem-backed Pandas Datasource, follow these steps:

  1. Run datasource new

    From the command line, run:

    great_expectations datasource new
  2. Choose “Files on a filesystem (for processing with Pandas or Spark)”

    What data would you like Great Expectations to connect to?
        1. Files on a filesystem (for processing with Pandas or Spark)
        2. Relational database (SQL)
    : 1
  3. Choose Pandas

    What are you processing your files with?
        1. Pandas
        2. PySpark
    : 1
  4. Specify the directory path for data files

    Enter the path (relative or absolute) of the root directory where the data files are stored.
    : /path/to/directory/containing/your/data/files
  5. Give your Datasource a name

    When prompted, provide a custom name for your filesystem-backed Pandas Datasource, or press Enter to accept the default.

    Give your new Datasource a short name.

    Great Expectations will now add a new Datasource ‘my_data_files_dir’ to your deployment by adding this entry to your great_expectations.yml:

      my_data_files_dir:
        data_asset_type:
          class_name: PandasDataset
          module_name: great_expectations.dataset
        batch_kwargs_generators:
          subdir_reader:
            class_name: SubdirReaderBatchKwargsGenerator
            base_directory: /path/to/directory/containing/your/data/files
        class_name: PandasDatasource

    Would you like to proceed? [Y/n]:
  6. Wait for confirmation

    If the Datasource is added successfully, you will see the message:

    A new datasource 'my_data_files_dir' was added to your project.

    If you run into an error, you will see something like:

    Error: Directory '/nonexistent/path/to/directory/containing/your/data/files' does not exist.
    Enter the path (relative or absolute) of the root directory where the data files are stored.

    In this case, check the data directory path and its permissions, then try again.

  7. Once you receive the confirmation in your terminal, you can start exploring the data sets in your new filesystem-backed Pandas Datasource.
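
The "does not exist" error in step 6 can often be avoided by checking the directory before running the CLI. The following is a minimal sketch using only the Python standard library; it is not part of Great Expectations and does not reproduce its internal validation. The function names check_data_directory and list_data_files, and the default ".csv" suffix filter, are assumptions made for illustration.

```python
import os
from pathlib import Path


def check_data_directory(path):
    """Return a list of problems found with the directory (empty if none)."""
    directory = Path(path).expanduser()
    if not directory.exists():
        return [f"Directory '{directory}' does not exist."]
    if not directory.is_dir():
        return [f"'{directory}' is not a directory."]
    problems = []
    # Listing a directory requires both read and execute permission.
    if not os.access(directory, os.R_OK | os.X_OK):
        problems.append(
            f"Directory '{directory}' is not readable by the current user."
        )
    return problems


def list_data_files(path, suffixes=(".csv",)):
    """Return names of files directly under path that have a given suffix.

    The suffix comparison is case-insensitive and subdirectories are
    skipped. The default suffix tuple is a hypothetical choice.
    """
    directory = Path(path).expanduser()
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix.lower() in suffixes
    )
```

If check_data_directory returns an empty list for the path you plan to enter in step 4, the directory exists and is readable by the current user; list_data_files then gives a quick view of the files you might expect to find in the new Datasource.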

Additional Notes

  1. Relative paths should be specified relative to the directory in which the

    great_expectations datasource new

    command is executed.
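
To see which directory a relative path refers to before entering it, the rule in the note above can be sketched in a few lines of standard-library Python. This only illustrates resolving a path against the working directory; it is not Great Expectations code and says nothing about how the path is stored in great_expectations.yml. The function name resolve_data_directory and its working_directory parameter are hypothetical.

```python
import os
from pathlib import Path


def resolve_data_directory(path, working_directory=None):
    """Interpret path the way a shell would from working_directory.

    Absolute paths are returned unchanged. Relative paths are joined to
    working_directory (the current directory by default) and normalized.
    Normalization is purely textual: symbolic links are not followed.
    """
    base = Path(working_directory) if working_directory is not None else Path.cwd()
    candidate = Path(path).expanduser()
    if candidate.is_absolute():
        return candidate
    return Path(os.path.normpath(base / candidate))


if __name__ == "__main__":
    # Show where "data" would point if the CLI were run from here.
    print(resolve_data_directory("data"))
```

Running the script from the directory where you intend to execute the command prints the absolute location that a relative entry such as data would refer to.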