matching_base module
-
class spark_matcher.matching_base.MatchingBase(spark_session: pyspark.sql.session.SparkSession, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, checkpoint_dir: Optional[str] = None, col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, blocking_rules: Optional[List[spark_matcher.blocker.blocking_rules.BlockingRule]] = None, blocking_recall: float = 1.0, n_perfect_train_matches=1, n_train_samples: int = 100000, ratio_hashed_samples: float = 0.5, scorer: Optional[spark_matcher.scorer.scorer.Scorer] = None, verbose: int = 0)
Bases: object
-
fit(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → spark_matcher.matching_base.matching_base.MatchingBase
Fit the MatchingBase instance on the dataframes sdf_1 and sdf_2 using active learning. You will be prompted to indicate whether each presented pair is a match. sdf_2 is optional: provide it when matching one table to another (Matcher). For deduplication, providing only sdf_1 is sufficient, and sdf_1 itself is deduplicated.
- Parameters
sdf_1 – Spark dataframe
sdf_2 – Optional: Spark dataframe
- Returns
Fitted MatchingBase instance
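The calling convention of fit (an optional second input that switches between deduplication and matching, and a return value that is the fitted instance itself) can be sketched without Spark. The class ToyMatchingBase and its attributes mode and candidate_pairs are hypothetical stand-ins, not part of spark_matcher. Plain Python lists replace Spark dataframes, and the blocking and interactive labelling steps are not modelled.

```python
from itertools import combinations, product
from typing import List, Optional, Tuple


class ToyMatchingBase:
    """Hypothetical stand-in that mimics only the fit() calling convention."""

    def __init__(self) -> None:
        self.mode: Optional[str] = None
        self.candidate_pairs: List[Tuple[str, str]] = []

    def fit(self, rows_1: List[str], rows_2: Optional[List[str]] = None) -> "ToyMatchingBase":
        if rows_2 is None:
            # One input only: pair rows_1 with itself, each unordered pair once.
            self.mode = "deduplication"
            self.candidate_pairs = list(combinations(rows_1, 2))
        else:
            # Two inputs: pair every row of rows_1 with every row of rows_2.
            self.mode = "matching"
            self.candidate_pairs = list(product(rows_1, rows_2))
        # Returning self allows chained calls on the fitted instance.
        return self


dedup = ToyMatchingBase().fit(["a", "b", "c"])
match = ToyMatchingBase().fit(["a", "b"], ["x", "y", "z"])
```

Because fit returns the instance, the result can be assigned or chained directly, as in the last two lines of the sketch.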
-
load(path: str) → None
Load a previously trained and saved Matcher instance.
- Parameters
path – Path and file name of pickle file
-
save(path: str) → None
Save the current instance to a pickle file.
- Parameters
path – Path and file name of pickle file
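Both save and load are documented as returning None, so one plausible reading is that load restores the saved state into an existing instance instead of returning a new one. The sketch below illustrates that round trip with the standard-library pickle module. The class ToyPicklable, its attributes, and the file name toy_matcher.pkl are hypothetical, and this is an assumption about the pattern, not spark_matcher's actual implementation.

```python
import os
import pickle
import tempfile


class ToyPicklable:
    """Hypothetical stand-in showing a save/load pair that both return None."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.fitted = False

    def save(self, path: str) -> None:
        # Persist the instance state (its attribute dictionary) to a pickle file.
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    def load(self, path: str) -> None:
        # Restore the saved state in place; nothing is returned.
        with open(path, "rb") as f:
            self.__dict__.update(pickle.load(f))


trained = ToyPicklable(threshold=0.8)
trained.fitted = True

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_matcher.pkl")
    trained.save(path)

    restored = ToyPicklable()      # fresh instance with default state
    result = restored.load(path)   # state is overwritten in place
```

Under this reading, a caller first constructs an instance and then calls load on it. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.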
-