matcher module

class spark_matcher.matcher.Matcher(spark_session: pyspark.sql.session.SparkSession, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, checkpoint_dir: Optional[str] = None, col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, blocking_rules: Optional[List[spark_matcher.blocker.blocking_rules.BlockingRule]] = None, blocking_recall: float = 1.0, n_perfect_train_matches=1, n_train_samples: int = 100000, ratio_hashed_samples: float = 0.5, scorer: Optional[spark_matcher.scorer.scorer.Scorer] = None, verbose: int = 0)

Bases: spark_matcher.matching_base.matching_base.MatchingBase

Matcher class to apply record linkage. Provide either the column names col_names, in which case the default string similarity metrics are used, or explicitly define the string similarity metrics per column in a dict field_info, as in the example below. If blocking_rules is left empty, default blocking rules are used; otherwise, provide the blocking rules as a list containing BlockingRule instances (see example below). The number of perfect matches used during training is set by n_perfect_train_matches.

E.g.:

from spark_matcher.blocker.blocking_rules import FirstNChars

myMatcher = Matcher(spark_session,
                    field_info={'name': [metric_function_1, metric_function_2],
                                'address': [metric_function_1, metric_function_3]},
                    blocking_rules=[FirstNChars('name', 3)])
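
When the default string similarity metrics suffice, it is enough to pass only col_names, e.g.:

myMatcher = Matcher(spark_session, col_names=['name', 'address'])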

Parameters
  • spark_session – Spark session

  • table_checkpointer – TableCheckpointer object used to store intermediate tables

  • checkpoint_dir – checkpoint directory for intermediate tables, used if no table_checkpointer is provided

  • col_names – list of column names to use for matching

  • field_info – dict with column names as keys and lists of string similarity metrics as values

  • blocking_rules – list of BlockingRule instances

  • blocking_recall – minimum recall required from the blocking rules

  • n_train_samples – number of pair samples to be created for training

  • ratio_hashed_samples – ratio of hashed samples to be created for training; the rest is sampled randomly

  • n_perfect_train_matches – number of perfect matches used for training

  • scorer – a Scorer object used for scoring pairs

  • verbose – sets verbosity
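
A typical workflow constructs the Matcher and trains it before calling predict. A minimal sketch, assuming two Spark dataframes sdf_1 and sdf_2 that both have 'name' and 'address' columns (the fit call starts an interactive labelling session that trains the underlying scoring model):

from spark_matcher.matcher import Matcher

# sdf_1 and sdf_2 are assumed to be Spark dataframes with 'name' and 'address' columns
myMatcher = Matcher(spark_session, col_names=['name', 'address'])

# interactive labelling session to train the scoring model
myMatcher.fit(sdf_1, sdf_2)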

predict(sdf_1, sdf_2, threshold=0.5, top_n=None)

Method to predict matches between sdf_1 and sdf_2; can be applied to the data used for training or to new data.

Parameters
  • sdf_1 – first table

  • sdf_2 – second table

  • threshold – minimum match probability for a pair to be returned

  • top_n – if set, return only the best top_n matches above the threshold

Returns

Spark dataframe with the matching result
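
E.g., assuming the fitted myMatcher from the sketch above, keeping only the best three matches with a match probability of at least 0.8:

result = myMatcher.predict(sdf_1, sdf_2, threshold=0.8, top_n=3)
result.show()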