matching_base module

class spark_matcher.matching_base.MatchingBase(spark_session: pyspark.sql.session.SparkSession, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, checkpoint_dir: Optional[str] = None, col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, blocking_rules: Optional[List[spark_matcher.blocker.blocking_rules.BlockingRule]] = None, blocking_recall: float = 1.0, n_perfect_train_matches=1, n_train_samples: int = 100000, ratio_hashed_samples: float = 0.5, scorer: Optional[spark_matcher.scorer.scorer.Scorer] = None, verbose: int = 0)

Bases: object

fit(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → spark_matcher.matching_base.matching_base.MatchingBase

Fit the MatchingBase instance on the dataframes sdf_1 and sdf_2 using active learning. You will be prompted to indicate whether each presented pair is a match. sdf_2 is optional: pass it for Matcher (i.e. matching one table to another). For deduplication, providing only sdf_1 is sufficient; in that case sdf_1 is deduplicated.

Parameters:

  • sdf_1 – Spark dataframe

  • sdf_2 – Optional: Spark dataframe

Returns: Fitted MatchingBase instance
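
The call pattern described for fit can be sketched as follows. This is a minimal illustration, not part of the library: fit_and_save is a hypothetical helper, and matcher is assumed to be an already constructed instance of a MatchingBase subclass. Only the fit and save methods documented in this module are called. Note that fit is interactive, so in practice the call waits for labelling input.

```python
from typing import Any, Optional


def fit_and_save(matcher: Any, path: str, sdf_1: Any, sdf_2: Optional[Any] = None) -> Any:
    """Fit a MatchingBase-style object and save it (hypothetical helper).

    `matcher` is assumed to expose fit() and save() as documented here;
    `sdf_1` and `sdf_2` are assumed to be Spark dataframes.
    """
    if sdf_2 is None:
        # Deduplication: only one table is provided.
        fitted = matcher.fit(sdf_1)
    else:
        # Matching: one table is matched to another.
        fitted = matcher.fit(sdf_1, sdf_2)
    # fit() is documented to return the fitted instance;
    # save() writes that instance to a pickle file at `path`.
    fitted.save(path)
    return fitted
```

Taking the matcher as an argument keeps the helper independent of how the instance was constructed, so the same code path serves both the matching and the deduplication case.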

load(path: str) → None

Load a previously trained and saved Matcher instance.


Parameters: path – Path and file name of the pickle file

save(path: str) → None

Save the current instance to a pickle file.


Parameters: path – Path and file name of the pickle file
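
Both save and load return None, which suggests that load restores state into the instance it is called on rather than returning a new object. The sketch below shows one way such a pickle round trip could work. PicklableState is a hypothetical stand-in class, and the choice to pickle the instance attributes is an assumption; this is not the spark_matcher implementation.

```python
import pickle


class PicklableState:
    """Hypothetical stand-in; not the spark_matcher implementation."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    def save(self, path: str) -> None:
        # Assumption: persist the instance attributes as a pickle file.
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    def load(self, path: str) -> None:
        # Restore the saved attributes into this existing instance.
        with open(path, "rb") as f:
            self.__dict__.update(pickle.load(f))
```

As with any pickle file, only load files from a trusted source, since unpickling can execute arbitrary code.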