blocker module

class spark_matcher.blocker.BlockLearner(blocking_rules: List[spark_matcher.blocker.blocking_rules.BlockingRule], recall: float, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, verbose=0)

Bases: object

Class to learn blocking rules from training data.


Parameters:
  • blocking_rules – list of BlockingRule objects that are taken into account during block learning
  • recall – the minimum required percentage of training pairs that must be covered by the learned blocking rules
  • verbose – set verbosity

fit(sdf: pyspark.sql.dataframe.DataFrame) → spark_matcher.blocker.block_learner.BlockLearner

This method fits, i.e. learns, the blocking rules needed to cover recall percent of the training-set pairs. The fitting is framed as a set-cover problem and solved with a greedy algorithm.


Parameters:
  sdf – a labelled training set containing record pairs

Returns:
  the object itself
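The greedy set-cover step described above can be sketched in plain Python. This is a hypothetical illustration (the function and rule names are invented stand-ins; the actual implementation operates on Spark dataframes): each rule covers a set of training pairs, and rules are picked greedily until the requested recall is reached.

```python
# Hypothetical sketch of the greedy set-cover step behind fit():
# repeatedly pick the rule that covers the most not-yet-covered
# training pairs until the required recall is reached.

def greedy_set_cover(rule_coverage, n_pairs, recall):
    """rule_coverage: dict mapping rule name -> set of covered pair ids."""
    selected, covered = [], set()
    remaining = dict(rule_coverage)
    while len(covered) / n_pairs < recall and remaining:
        # rule that adds the most pairs not covered so far
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        if not remaining[best] - covered:
            break  # no rule adds new pairs; required recall is unreachable
        selected.append(best)
        covered |= remaining.pop(best)
    return selected

# toy coverage sets for three invented rules over 6 training pairs
rules = {
    "first3(name)": {0, 1, 2, 3},
    "exact(zip)": {2, 3, 4},
    "exact(city)": {4, 5},
}
print(greedy_set_cover(rules, n_pairs=6, recall=0.8))
# → ['first3(name)', 'exact(city)']
```

With recall=0.8, the first rule alone covers 4/6 pairs, so a second rule is added; "exact(city)" wins the second round because it contributes two new pairs versus one for "exact(zip)".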

transform(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → Union[pyspark.sql.dataframe.DataFrame, Tuple[pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame]]

This method adds block-keys to the input dataframe(s). It applies all the learned blocking rules to the input data and unifies the results, so the output is the input dataframe(s) extended with the block-keys from the learned blocking rules.

Parameters:
  • sdf_1 – dataframe containing records
  • sdf_2 – optional second dataframe containing records

Returns:
  dataframe(s) containing block-keys from the learned blocking rules
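Conceptually, applying all learned rules and unifying the results can be sketched in pure Python with toy stand-ins for the rules (the real method works on Spark dataframes and column expressions; every name below is hypothetical):

```python
# Hypothetical sketch of transform(): every learned rule contributes a
# block-key per record, and the per-rule results are unified, so a
# record appears once for each rule applied to it.

records = [
    {"name": "john smith", "zip": "1234"},
    {"name": "john smyth", "zip": "1234"},
]

# toy stand-ins for learned blocking rules; prefixing the rule name
# keeps keys from different rules in separate blocks
rules = {
    "first3_name": lambda r: "first3_name:" + r["name"][:3],
    "exact_zip": lambda r: "exact_zip:" + r["zip"],
}

def add_block_keys(records, rules):
    out = []
    for record in records:
        for rule in rules.values():
            out.append({**record, "block_key": rule(record)})
    return out

blocked = add_block_keys(records, rules)
# records that share a block_key become candidate pairs downstream
```

Here both toy records land in the same block under either rule ("first3_name:joh" and "exact_zip:1234"), so the pair would be generated as a matching candidate.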

class spark_matcher.blocker.BlockingRule(blocking_column: str)

Bases: abc.ABC

Abstract base class for blocking rules; it contains the functionality shared by all blocking rules.


Parameters:
  blocking_column – the column on which the BlockingRule is applied

calculate_training_set_coverage(sdf: pyspark.sql.dataframe.DataFrame) → spark_matcher.blocker.blocking_rules.BlockingRule

This method calculates the training-set coverage of the blocking rule, i.e. how many record pairs in the training set end up in the same block. The BlockLearner uses this coverage to sort blocking rules in the greedy set-cover algorithm.

Parameters:
  sdf – a dataframe containing record pairs for training


Returns:
  the object itself
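A single rule's coverage, as described above, amounts to counting the labelled pairs whose two records share a block-key under that rule. A minimal pure-Python sketch, with an invented `first3` rule and toy pairs (the real method computes this on Spark dataframes):

```python
# Hypothetical sketch of training-set coverage for one blocking rule:
# a pair is covered when both of its records get the same block-key.

pairs = [
    ({"name": "anna"}, {"name": "anne"}),   # same first 3 letters
    ({"name": "bob"},  {"name": "rob"}),    # different first 3 letters
]

def first3(record):
    # toy blocking rule: first three characters of the name column
    return record["name"][:3]

def coverage(pairs, rule):
    # ids of the training pairs the rule places in the same block
    return {i for i, (a, b) in enumerate(pairs) if rule(a) == rule(b)}

print(coverage(pairs, first3))
# → {0}
```

The sizes of these coverage sets are what lets the BlockLearner greedily rank rules: a rule covering many still-uncovered pairs is picked first.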

create_block_key(sdf: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

This method calculates the block-key for each record and adds it as a column to the input dataframe.


Parameters:
  sdf – a dataframe with records that need to be matched

Returns:
  the dataframe with the block-key column added