matching_base module

class spark_matcher.matching_base.MatchingBase(spark_session: pyspark.sql.session.SparkSession, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, checkpoint_dir: Optional[str] = None, col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, blocking_rules: Optional[List[spark_matcher.blocker.blocking_rules.BlockingRule]] = None, blocking_recall: float = 1.0, n_perfect_train_matches=1, n_train_samples: int = 100000, ratio_hashed_samples: float = 0.5, scorer: Optional[spark_matcher.scorer.scorer.Scorer] = None, verbose: int = 0)

Bases: object

fit(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → spark_matcher.matching_base.matching_base.MatchingBase

Fit the MatchingBase instance on the dataframes sdf_1 and sdf_2 using active learning. You will be prompted to label each presented pair as a match or not. sdf_2 is optional: provide it for a Matcher (i.e. matching one table to another). For Deduplication, providing only sdf_1 is sufficient, and sdf_1 itself will be deduplicated.

Parameters

sdf_1 – Spark dataframe
sdf_2 – Optional: Spark dataframe

Returns

Fitted MatchingBase instance

load(path: str) → None

Load a previously trained and saved Matcher instance.

Parameters: path – Path and file name of the pickle file to load

save(path: str) → None

Save the current instance to a pickle file.

Parameters: path – Path and file name of the pickle file to write
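
Both save and load are documented as returning None, which suggests that load restores the saved state into the instance it is called on rather than returning a new object. The sketch below illustrates that call pattern with a hypothetical stand-in class, TinyModel, which is not part of spark_matcher. Its pickling of the instance attributes is an assumption made for illustration and may differ from what the library does internally.

```python
import os
import pickle
import tempfile


class TinyModel:
    """Hypothetical stand-in class; NOT part of spark_matcher."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.fitted_ = False

    def fit(self) -> "TinyModel":
        # Pretend to train. Returning self mirrors fit() returning
        # the fitted instance.
        self.fitted_ = True
        return self

    def save(self, path: str) -> None:
        # Assumption: persist the instance attributes as a pickle file.
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    def load(self, path: str) -> None:
        # Returns None, so the state is restored into this instance.
        with open(path, "rb") as f:
            state = pickle.load(f)
        self.__dict__.update(state)


def roundtrip_demo() -> TinyModel:
    """Fit and save one instance, then load its state into a fresh one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.pkl")
        TinyModel(threshold=0.8).fit().save(path)

        restored = TinyModel()  # create an instance first
        restored.load(path)     # then load into it; nothing is returned
    return restored
```

With this pattern, writing `model = TinyModel().load(path)` would bind `model` to None, so the instance should be created first and load called on it afterwards. As with any pickle file, only load files from a source you trust.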