deduplicator module

class spark_matcher.deduplicator.Deduplicator(spark_session: pyspark.sql.session.SparkSession, col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, blocking_rules: Optional[List[spark_matcher.blocker.blocking_rules.BlockingRule]] = None, blocking_recall: float = 1.0, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, checkpoint_dir: Optional[str] = None, n_perfect_train_matches=1, n_train_samples: int = 100000, ratio_hashed_samples: float = 0.5, scorer: Optional[spark_matcher.scorer.scorer.Scorer] = None, verbose: int = 0, max_edges_clustering: int = 500000, edge_filter_thresholds: List[float] = [0.45, 0.55, 0.65, 0.75, 0.85, 0.95], cluster_score_threshold: float = 0.5, cluster_linkage_method: str = 'centroid')

Bases: spark_matcher.matching_base.matching_base.MatchingBase

Deduplicator class to apply deduplication. Provide either the column names col_names using the default string similarity metrics or explicitly define the string similarity metrics in a dict field_info as in the example below. If blocking_rules is left empty, default blocking rules are used. Otherwise, provide blocking rules as a list containing BlockingRule instances (see example below). The number of perfect matches used during training is set by n_perfect_train_matches.

E.g.:

from spark_matcher.blocker.blocking_rules import FirstNChars

myDeduplicator = Deduplicator(spark_session,
                              field_info={'name': [metric_function_1, metric_function_2],
                                          'address': [metric_function_1, metric_function_3]},
                              blocking_rules=[FirstNChars('name', 3)])
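For the simpler case, a minimal sketch using only col_names with the default string similarity metrics; the column names 'name' and 'address' are illustrative assumptions about the input table:

from spark_matcher.deduplicator import Deduplicator

# use default string similarity metrics and default blocking rules
myDeduplicator = Deduplicator(spark_session, col_names=['name', 'address'])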

Parameters
  • spark_session – Spark session

  • col_names – list of column names to use for matching

  • field_info – dict of column names as keys and lists of string similarity metrics as values

  • blocking_rules – list of BlockingRule instances

  • table_checkpointer – TableCheckpointer object used to store intermediate (cached) tables

  • checkpoint_dir – checkpoint directory, used when no table_checkpointer is provided

  • n_train_samples – number of pair samples to be created for training

  • ratio_hashed_samples – ratio of hashed samples to be created for training; the rest is sampled randomly

  • n_perfect_train_matches – number of perfect matches used for training

  • scorer – a Scorer object used for scoring pairs

  • verbose – sets verbosity

  • max_edges_clustering – maximum number of edges per connected component that enters clustering

  • edge_filter_thresholds – list of score thresholds used to filter out low-scoring edges when connected components are too large

  • cluster_score_threshold – threshold in the range [0.0, 1.0]; pairs are only placed in the same cluster if their cluster similarity score is >= cluster_score_threshold

  • cluster_linkage_method – linkage method to be used within hierarchical clustering; can take values such as 'centroid', 'single', 'complete', 'average', 'weighted', 'median' or 'ward' (see the example after this list)
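To illustrate the clustering-related parameters, a sketch constructing a Deduplicator with custom clustering settings; the column names and threshold values below are arbitrary examples, not recommendations:

from spark_matcher.deduplicator import Deduplicator

# tighter clustering: fewer edges per component, stricter cluster score threshold
myDeduplicator = Deduplicator(
    spark_session,
    col_names=['name', 'address'],
    max_edges_clustering=200000,
    edge_filter_thresholds=[0.5, 0.7, 0.9],
    cluster_score_threshold=0.7,
    cluster_linkage_method='average',
)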

predict(sdf: pyspark.sql.dataframe.DataFrame, threshold: float = 0.5)

Method to predict on the data used for training or on new data.

Parameters
  • sdf – table to which entity deduplication is applied

  • threshold – probability threshold applied to the similarity score

Returns

Spark dataframe with the deduplication result
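A typical usage sketch, assuming the deduplicator has first been fitted with the fit method inherited from MatchingBase (fit accepting a single Spark dataframe is an assumption here) and that sdf contains the columns used for matching:

# fit on the input table, then assign entities by scoring and clustering pairs
myDeduplicator.fit(sdf)
result_sdf = myDeduplicator.predict(sdf, threshold=0.5)
result_sdf.show()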