deduplicator module

class spark_matcher.deduplicator.Deduplicator(spark_session: pyspark.sql.session.SparkSession, col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, blocking_rules: Optional[List[spark_matcher.blocker.blocking_rules.BlockingRule]] = None, blocking_recall: float = 1.0, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, checkpoint_dir: Optional[str] = None, n_perfect_train_matches=1, n_train_samples: int = 100000, ratio_hashed_samples: float = 0.5, scorer: Optional[spark_matcher.scorer.scorer.Scorer] = None, verbose: int = 0, max_edges_clustering: int = 500000, edge_filter_thresholds: List[float] = [0.45, 0.55, 0.65, 0.75, 0.85, 0.95], cluster_score_threshold: float = 0.5, cluster_linkage_method: str = 'centroid')

Bases: spark_matcher.matching_base.matching_base.MatchingBase

Deduplicator class to apply deduplication. Provide either the column names col_names using the default string similarity metrics or explicitly define the string similarity metrics in a dict field_info as in the example below. If blocking_rules is left empty, default blocking rules are used. Otherwise, provide blocking rules as a list containing BlockingRule instances (see example below). The number of perfect matches used during training is set by n_perfect_train_matches.


from spark_matcher.blocker.blocking_rules import FirstNChars

myDeduplicator = Deduplicator(spark_session, field_info={‘name’:[metric_function_1, metric_function_2],

‘address:[metric_function_1, metric_function_3]}, blocking_rules=[FirstNChars(‘name’, 3)])

  • spark_session – Spark session

  • col_names – list of column names to use for matching

  • field_info – dict of column names as keys and lists of string similarity metrics as values

  • blocking_rules – list of BlockingRule instances

  • table_checkpointer – pointer object to store cached tables

  • checkpoint_dir – checkpoint directory if provided

  • n_train_samples – nr of pair samples to be created for training

  • ratio_hashed_samples – ratio of hashed samples to be created for training, rest is sampled randomly

  • n_perfect_train_matches – nr of perfect matches used for training

  • scorer – a Scorer object used for scoring pairs

  • verbose – sets verbosity

  • max_edges_clustering – max number of edges per component that enters clustering

  • edge_filter_thresholds – list of score thresholds to use for filtering when components are too large

  • cluster_score_threshold – threshold value between [0.0, 1.0], only pairs are put together in clusters if cluster similarity scores are >= cluster_score_threshold

  • cluster_linkage_method – linkage method to be used within hierarchical clustering, can take values such as

  • 'centroid'

  • 'single'

  • 'complete'

  • 'average'

  • 'weighted'

  • 'median'

  • etc. ('ward') –

predict(sdf: pyspark.sql.dataframe.DataFrame, threshold: float = 0.5)

Method to predict on data used for training or new data.

  • sdf – table to be applied entity deduplication

  • threshold – probability threshold for similarity score


Spark dataframe with the deduplication result