data module
spark_matcher.data.datasets.load_data(spark: pyspark.sql.session.SparkSession, kind: Optional[str] = 'voters') → Union[Tuple[pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame], pyspark.sql.dataframe.DataFrame]
Load example datasets for experimenting with spark-matcher. For matching problems, set kind to voters for North Carolina voter registry data or library for bibliography data. For deduplication problems, set kind to stoxx50 for EuroStoxx 50 company names and addresses.
- Voter data:
Provided by Prof. Erhard Rahm, https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution
- Library data:
DBLP bibliography, http://www.informatik.uni-trier.de/~ley/db/index.html
ACM Digital Library, http://portal.acm.org/portal.cfm
- Parameters
spark – Spark session
kind – kind of data: voters (the default), library or stoxx50
- Returns
two Spark dataframes for voters or library, a single dataframe for stoxx50
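
Because the return type depends on kind, calling code has to handle both a pair of dataframes and a single dataframe. The sketch below shows one way to do that. The names FRAMES_PER_KIND, as_frames and load_example are hypothetical helpers invented for this illustration and are not part of spark-matcher; only the load_data import path and its (spark, kind) signature come from the documentation above.

```python
from typing import Any, Tuple

# Number of dataframes load_data is documented to return for each kind.
# (Hypothetical lookup table for this example, not part of spark-matcher.)
FRAMES_PER_KIND = {"voters": 2, "library": 2, "stoxx50": 1}


def as_frames(result: Any) -> Tuple[Any, ...]:
    """Normalise a load_data result to a tuple of dataframes.

    A tuple is returned unchanged; a single dataframe is wrapped in a
    one-element tuple.
    """
    return result if isinstance(result, tuple) else (result,)


def load_example(spark: Any, kind: str = "voters") -> Tuple[Any, ...]:
    """Hypothetical wrapper: validate kind, call load_data, return a tuple."""
    if kind not in FRAMES_PER_KIND:
        raise ValueError(
            f"unknown kind {kind!r}; expected one of {sorted(FRAMES_PER_KIND)}"
        )
    # Imported lazily so the helpers above can be used (and tested)
    # without pyspark or spark-matcher installed.
    from spark_matcher.data.datasets import load_data

    frames = as_frames(load_data(spark, kind=kind))
    if len(frames) != FRAMES_PER_KIND[kind]:
        raise RuntimeError(
            f"expected {FRAMES_PER_KIND[kind]} dataframe(s) for {kind!r}, "
            f"got {len(frames)}"
        )
    return frames
```

With an active SparkSession named spark, a matching dataset could then be unpacked as `left, right = load_example(spark, "voters")` and the deduplication dataset as `(companies,) = load_example(spark, "stoxx50")`.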