matcher module

class spark_matcher.matcher.Matcher(spark_session: pyspark.sql.session.SparkSession, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, checkpoint_dir: Optional[str] = None, col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, blocking_rules: Optional[List[spark_matcher.blocker.blocking_rules.BlockingRule]] = None, blocking_recall: float = 1.0, n_perfect_train_matches=1, n_train_samples: int = 100000, ratio_hashed_samples: float = 0.5, scorer: Optional[spark_matcher.scorer.scorer.Scorer] = None, verbose: int = 0)

Bases: spark_matcher.matching_base.matching_base.MatchingBase

Matcher class to apply record linkage. Provide either the column names col_names, in which case the default string similarity metrics are used, or explicitly define the string similarity metrics per column in a dict field_info, as in the example below. If blocking_rules is left empty, default blocking rules are used; otherwise, provide the blocking rules as a list containing BlockingRule instances (see example below). The number of perfect matches used during training is set by n_perfect_train_matches.

E.g.:

from spark_matcher.blocker.blocking_rules import FirstNChars

myMatcher = Matcher(spark_session,
                    field_info={'name': [metric_function_1, metric_function_2],
                                'address': [metric_function_1, metric_function_3]},
                    blocking_rules=[FirstNChars('name', 3)])
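
When the default string similarity metrics suffice, it is enough to pass only col_names, e.g.:

myMatcher = Matcher(spark_session, col_names=['name', 'address'])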

Parameters
  • spark_session – Spark session

  • table_checkpointer – TableCheckpointer object used to store intermediate tables

  • checkpoint_dir – checkpoint directory for intermediate tables, used if no table_checkpointer is provided

  • col_names – list of column names to use for matching

  • field_info – dict with column names as keys and lists of string similarity metrics as values

  • blocking_rules – list of BlockingRule instances

  • blocking_recall – minimum recall required from the blocking rules

  • n_train_samples – number of pair samples to be created for training

  • ratio_hashed_samples – ratio of hashed samples to be created for training; the rest is sampled randomly

  • n_perfect_train_matches – number of perfect matches used for training

  • scorer – a Scorer object used for scoring pairs

  • verbose – sets verbosity
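
A typical workflow constructs the Matcher and trains it before calling predict. A minimal sketch, assuming two Spark dataframes sdf_1 and sdf_2 that both have 'name' and 'address' columns (the fit call starts an interactive labelling session that trains the underlying scoring model):

from spark_matcher.matcher import Matcher

# sdf_1 and sdf_2 are assumed to be Spark dataframes with 'name' and 'address' columns
myMatcher = Matcher(spark_session, col_names=['name', 'address'])

# interactive labelling session to train the scoring model
myMatcher.fit(sdf_1, sdf_2)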

predict(sdf_1, sdf_2, threshold=0.5, top_n=None)

Method to predict matches between sdf_1 and sdf_2; can be applied to the data used for training or to new data.

Parameters
  • sdf_1 – first table

  • sdf_2 – second table

  • threshold – minimum match probability for a pair to be returned

  • top_n – if set, return only the best top_n matches above the threshold

Returns

Spark dataframe with the matching result
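
E.g., assuming the fitted myMatcher from the sketch above, keeping only the best three matches with a match probability of at least 0.8:

result = myMatcher.predict(sdf_1, sdf_2, threshold=0.8, top_n=3)
result.show()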