matcher module
class spark_matcher.matcher.Matcher(spark_session: pyspark.sql.session.SparkSession, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, checkpoint_dir: Optional[str] = None, col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, blocking_rules: Optional[List[spark_matcher.blocker.blocking_rules.BlockingRule]] = None, blocking_recall: float = 1.0, n_perfect_train_matches=1, n_train_samples: int = 100000, ratio_hashed_samples: float = 0.5, scorer: Optional[spark_matcher.scorer.scorer.Scorer] = None, verbose: int = 0)

Bases: spark_matcher.matching_base.matching_base.MatchingBase
Matcher class to apply record linkage. Provide either the column names col_names to use the default string similarity metrics, or explicitly define the string similarity metrics in a dict field_info as in the example below. If blocking_rules is left empty, default blocking rules are used; otherwise provide the blocking rules as a list of BlockingRule instances (see example below). The number of perfect matches used during training is set by n_perfect_train_matches.
E.g.:

from spark_matcher.blocker.blocking_rules import FirstNChars

myMatcher = Matcher(spark_session,
                    field_info={'name': [metric_function_1, metric_function_2],
                                'address': [metric_function_1, metric_function_3]},
                    blocking_rules=[FirstNChars('name', 3)])
Parameters

spark_session – Spark session

col_names – list of column names to use for matching

field_info – dict with column names as keys and lists of string similarity metrics as values

blocking_rules – list of BlockingRule instances

n_train_samples – number of pair samples to be created for training

ratio_hashed_samples – ratio of hashed samples to be created for training; the rest is sampled randomly

n_perfect_train_matches – number of perfect matches used for training

scorer – a Scorer object used for scoring pairs

verbose – sets verbosity
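The metric functions supplied via field_info are plain callables that take two strings and return a similarity score. As a minimal sketch of what such a metric might look like, here is one built on the standard library's difflib; the name ratio_metric is illustrative, and the exact signature spark_matcher expects should be checked against its source:

```python
from difflib import SequenceMatcher


def ratio_metric(s1: str, s2: str) -> float:
    """Similarity between two strings as a float in [0, 1]."""
    return SequenceMatcher(None, s1, s2).ratio()


# Near-identical strings score close to 1, unrelated strings close to 0
print(ratio_metric("john smith", "jon smith"))
print(ratio_metric("john smith", "acme corp"))
```

A function like this could then appear in a field_info list, e.g. field_info={'name': [ratio_metric]}.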
predict(sdf_1, sdf_2, threshold=0.5, top_n=None)

Method to predict on data used for training or on new data.
Parameters
sdf_1 – first table
sdf_2 – second table
threshold – probability threshold
top_n – only return the best top_n matches above the threshold
Returns
Spark dataframe with the matching result
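The combined effect of threshold and top_n can be illustrated on plain scored pairs; this is a sketch of the filtering semantics only (the hypothetical records and scores below are made up, and predict itself operates on Spark dataframes):

```python
# Hypothetical (record_1, record_2, score) tuples standing in for scored pairs
scored = [
    ("john smith", "jon smith", 0.93),
    ("acme corp", "acme corporation", 0.81),
    ("jane doe", "john doe", 0.55),
    ("foo ltd", "bar inc", 0.12),
]


def filter_matches(pairs, threshold=0.5, top_n=None):
    """Keep pairs scoring above threshold, then take the best top_n of those."""
    kept = [p for p in pairs if p[2] > threshold]
    kept.sort(key=lambda p: p[2], reverse=True)
    return kept if top_n is None else kept[:top_n]


# threshold=0.5 drops the 0.12 pair; top_n=2 then keeps the two best
print(filter_matches(scored, threshold=0.5, top_n=2))
```

With threshold=0.5 and top_n=None, three of the four pairs survive; adding top_n=2 returns only the two highest-scoring of them.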