data module

spark_matcher.data.datasets.load_data(spark: pyspark.sql.session.SparkSession, kind: Optional[str] = 'voters') → Union[Tuple[pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame], pyspark.sql.dataframe.DataFrame]

Load example datasets for experimenting with spark-matcher. For matching problems, set kind to voters for North Carolina voter registry data or to library for bibliography data. For deduplication problems, set kind to stoxx50 for EuroStoxx 50 company names and addresses.

Voter data:
Library data:
Parameters
  • spark – Spark session

  • kind – kind of data to load: voters (the default), library or stoxx50

Returns

Two Spark dataframes when kind is voters or library; a single Spark dataframe when kind is stoxx50
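
Because the return type depends on kind, calling code has to handle both a tuple of two dataframes and a single dataframe. The following is a minimal sketch of one way to do that, not part of spark-matcher itself: the helpers normalize_result, expected_frame_count, load_example and demo are hypothetical names introduced here for illustration, and the local Spark session settings in demo are assumptions. Only load_data and its spark and kind parameters come from the documentation above.

```python
from typing import Any, Tuple

# Kinds taken from the documentation above, grouped by the shape of the result.
MATCHING_KINDS = ("voters", "library")   # documented to return two dataframes
DEDUP_KINDS = ("stoxx50",)               # documented to return one dataframe


def expected_frame_count(kind: str) -> int:
    """Return how many dataframes load_data is documented to return for `kind`."""
    if kind in MATCHING_KINDS:
        return 2
    if kind in DEDUP_KINDS:
        return 1
    raise ValueError(f"unknown kind: {kind!r}")


def normalize_result(result: Any) -> Tuple[Any, ...]:
    """Wrap a single dataframe in a 1-tuple; leave a tuple of dataframes as-is."""
    return result if isinstance(result, tuple) else (result,)


def load_example(spark: Any, kind: str = "voters") -> Tuple[Any, ...]:
    """Hypothetical wrapper: call load_data and always return a tuple of dataframes."""
    # Imported lazily so the helpers above can be used without Spark installed.
    from spark_matcher.data.datasets import load_data

    frames = normalize_result(load_data(spark, kind=kind))
    if len(frames) != expected_frame_count(kind):
        raise RuntimeError(f"unexpected number of dataframes for kind={kind!r}")
    return frames


def demo() -> None:
    """Illustrative usage; requires pyspark and spark-matcher to be installed."""
    from pyspark.sql import SparkSession

    # Assumed local session settings, adjust for your own cluster.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("spark-matcher-demo")
        .getOrCreate()
    )
    voters_a, voters_b = load_example(spark, "voters")   # matching problem
    (companies,) = load_example(spark, "stoxx50")        # deduplication problem
    voters_a.printSchema()
    companies.printSchema()
```

Calling demo() runs the example end to end. Normalizing to a tuple keeps downstream code free of isinstance checks, at the cost of unpacking a 1-tuple in the stoxx50 case.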