great_expectations.datasource.sparkdf_datasource

Module Contents

Classes

SparkDFDatasource(name='default', data_context=None, data_asset_type=None, batch_kwargs_generators=None, spark_config=None, **kwargs)

The SparkDFDatasource produces SparkDFDatasets and supports generators capable of interacting with local filesystem (the default subdir_reader batch kwargs generator) and databricks notebooks.

great_expectations.datasource.sparkdf_datasource.logger
great_expectations.datasource.sparkdf_datasource.SparkSession
class great_expectations.datasource.sparkdf_datasource.SparkDFDatasource(name='default', data_context=None, data_asset_type=None, batch_kwargs_generators=None, spark_config=None, **kwargs)

Bases: great_expectations.datasource.datasource.Datasource

The SparkDFDatasource produces SparkDFDatasets and supports generators capable of interacting with local filesystem (the default subdir_reader batch kwargs generator) and databricks notebooks.

Accepted Batch Kwargs:
  • PathBatchKwargs ("path" or "s3" keys)

  • InMemoryBatchKwargs ("dataset" key)

  • QueryBatchKwargs ("query" key)
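The three accepted batch kwargs classes above can be illustrated as plain dictionaries. The paths, bucket name, and query below are hypothetical examples, not values from the library:

```python
# Illustrative shapes of the batch_kwargs accepted by SparkDFDatasource.
# All concrete values here (paths, bucket, table name) are made up.

# PathBatchKwargs: point at a local file or an S3 object
path_kwargs = {"path": "/data/events.csv", "reader_method": "csv"}
s3_kwargs = {"s3": "s3a://example-bucket/events.parquet"}

# InMemoryBatchKwargs: wrap an existing Spark DataFrame
in_memory_kwargs = {"dataset": None}  # would hold a pyspark DataFrame object

# QueryBatchKwargs: run a query against tables registered in the SparkSession
query_kwargs = {"query": "SELECT * FROM events"}
```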

Feature Maturity

Datasource - HDFS - How-to Guide
Use HDFS as an external datasource in conjunction with Spark.
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Unknown
Unit Test Coverage: Minimal (none)
Integration Infrastructure/Test Coverage: Minimal (none)
Documentation Completeness: Minimal (none)
Bug Risk: Unknown
recognized_batch_parameters
classmethod build_configuration(cls, data_asset_type=None, batch_kwargs_generators=None, spark_config=None, **kwargs)

Build a full configuration object for a datasource, potentially including generators with defaults.

Parameters
  • data_asset_type – A ClassConfig dictionary

  • batch_kwargs_generators – Generator configuration dictionary

  • spark_config – dictionary of key-value pairs to pass to the spark builder

  • **kwargs – Additional kwargs to be part of the datasource constructor's initialization

Returns

A complete datasource configuration.
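A minimal sketch of the kind of dictionary build_configuration returns, based only on the parameters documented above. The default asset type shown and the exact key layout are assumptions, not the library's implementation:

```python
# Sketch of assembling a datasource configuration from the documented
# parameters. The default data_asset_type and key names are assumptions.
def build_configuration_sketch(data_asset_type=None, batch_kwargs_generators=None,
                               spark_config=None, **kwargs):
    if data_asset_type is None:
        # Assumed default: a SparkDFDataset ClassConfig dictionary
        data_asset_type = {
            "class_name": "SparkDFDataset",
            "module_name": "great_expectations.dataset",
        }
    configuration = dict(kwargs)
    configuration["data_asset_type"] = data_asset_type
    configuration["spark_config"] = spark_config or {}
    if batch_kwargs_generators is not None:
        configuration["batch_kwargs_generators"] = batch_kwargs_generators
    return configuration

config = build_configuration_sketch(spark_config={"spark.executor.memory": "2g"})
```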

process_batch_parameters(self, reader_method=None, reader_options=None, limit=None, dataset_options=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters
  • limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

  • dataset_options (dict) – a set of kwargs that will be passed to the constructor of a dataset built using these batch_kwargs

Returns

Result will include both parameters passed via argument and configured parameters.

Return type

batch_kwargs
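As a sketch of the translation described above, batch parameters such as limit and dataset_options can be folded into a batch_kwargs dictionary. This is an illustration of the documented behavior, not the library's code:

```python
# Sketch: translate batch parameters into batch_kwargs at the datasource
# level. Key names mirror the documented parameters; merging details are
# assumptions.
def process_batch_parameters_sketch(reader_method=None, reader_options=None,
                                    limit=None, dataset_options=None):
    batch_kwargs = {}
    if reader_method is not None:
        batch_kwargs["reader_method"] = reader_method
    if reader_options:
        # Reader options are passed through to the underlying Spark reader
        batch_kwargs.setdefault("reader_options", {}).update(reader_options)
    if limit is not None:
        # All datasources must accept limit to cap the batch's row count
        batch_kwargs["limit"] = limit
    if dataset_options:
        # Forwarded to the constructor of the dataset built from these kwargs
        batch_kwargs.setdefault("dataset_options", {}).update(dataset_options)
    return batch_kwargs
```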

get_batch(self, batch_kwargs, batch_parameters=None)

Class-private implementation of get_data_asset.

static guess_reader_method_from_path(path)
_get_reader_fn(self, reader, reader_method=None, path=None)

Static helper for providing a reader_fn.

Parameters
  • reader – the base spark reader to use; this should have had reader_options applied already

  • reader_method – the name of the reader_method to use, if specified

  • path (str) – the path to use to guess reader_method if it was not specified

Returns

ReaderMethod to use for the filepath
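The path-based guessing described above can be sketched as a file-extension lookup. The extension-to-method mapping below is an assumption for illustration; the library's actual table may differ:

```python
import os

# Sketch of guessing a Spark reader method from a file path's extension.
# The mapping is illustrative; it is not the library's actual table.
_EXTENSION_TO_READER_METHOD = {
    ".csv": "csv",
    ".tsv": "csv",
    ".parquet": "parquet",
    ".json": "json",
}

def guess_reader_method_sketch(path):
    _, ext = os.path.splitext(path)
    try:
        return _EXTENSION_TO_READER_METHOD[ext.lower()]
    except KeyError:
        # Mirrors the documented behavior: the method must be supplied
        # explicitly when it cannot be inferred from the path.
        raise ValueError(
            "Unable to determine reader method from path: %s" % path
        )
```

A caller would fall back to an explicit reader_method argument when this guess fails.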