blocker module

class spark_matcher.blocker.BlockLearner(blocking_rules: List[spark_matcher.blocker.blocking_rules.BlockingRule], recall: float, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, verbose=0)

Bases: object

Class to learn blocking rules from training data.

blocking_rules

list of BlockingRule objects that are taken into account during block learning

recall

the minimum required percentage of training pairs that are covered by the learned blocking rules

verbose

set verbosity
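
A minimal construction sketch follows. The concrete rule class FirstNChars and its import path are assumptions, not part of this reference; substitute whichever concrete BlockingRule subclass your installation provides.

    # Sketch only: FirstNChars is an assumed concrete BlockingRule subclass;
    # replace it with the rule classes available in your spark_matcher version.
    from spark_matcher.blocker import BlockLearner
    from spark_matcher.blocker.blocking_rules import FirstNChars  # assumed import path

    block_learner = BlockLearner(
        blocking_rules=[FirstNChars('name'), FirstNChars('surname')],
        recall=0.9,  # minimum share of training pairs the learned rules must cover
        verbose=1,
    )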

fit(sdf: pyspark.sql.dataframe.DataFrame) → spark_matcher.blocker.block_learner.BlockLearner

This method fits, i.e. learns, the blocking rules that are needed to cover at least recall percent of the training-set pairs. Fitting is formulated as a set-cover problem and solved with a greedy algorithm.

Parameters

sdf – a labelled training set containing pairs.

Returns

the object itself
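
Continuing the sketch above, fitting is a single call on a labelled Spark DataFrame of record pairs. The parquet path and the column layout of the pairs dataframe are placeholders; they follow whatever the rest of your spark_matcher pipeline produces.

    # Sketch: fit on a labelled training set of record pairs and keep the fitted object.
    training_pairs_sdf = spark.read.parquet('training_pairs.parquet')  # placeholder path
    block_learner = block_learner.fit(training_pairs_sdf)  # returns the object itself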

transform(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → Union[pyspark.sql.dataframe.DataFrame, Tuple[pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame]]

This method adds block-keys to the input dataframe(s). It applies all learned blocking rules to the input data and unifies the results. The return value is the input dataframe(s) with the block-keys from the learned blocking rules added.

Parameters
  • sdf_1 – dataframe containing records

  • sdf_2 – optional second dataframe containing records; if omitted, only sdf_1 is transformed

Returns

dataframe(s) containing block-keys from the learned blocking rules
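
A short usage sketch, assuming a fitted block_learner as above: with a single dataframe one block-keyed dataframe is returned, with two dataframes a tuple of two is returned.

    # Deduplication within one dataframe: a single dataframe with block-keys comes back.
    sdf_1_blocked = block_learner.transform(sdf_1)

    # Matching across two dataframes: both come back with block-keys added.
    sdf_1_blocked, sdf_2_blocked = block_learner.transform(sdf_1, sdf_2)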

class spark_matcher.blocker.BlockingRule(blocking_column: str)

Bases: abc.ABC

Abstract class for blocking rules. This class contains all the base functionality for blocking rules.

blocking_column

the column on which the BlockingRule is applied

calculate_training_set_coverage(sdf: pyspark.sql.dataframe.DataFrame) → spark_matcher.blocker.blocking_rules.BlockingRule

This method calculates the set coverage of the blocking rule on the training pairs. The set coverage of a rule is determined by counting how many record pairs in the training set end up in the same block. This coverage is used by the BlockLearner to rank blocking rules in the greedy set-cover algorithm.

Parameters

sdf – a dataframe containing record pairs for training

Returns

the object itself
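
To illustrate how these per-rule coverages are used, the sketch below shows a plain-Python greedy set-cover over coverage sets. It is a conceptual example only, not the library's implementation; the rule names and pair ids are made up.

    # Conceptual greedy set-cover: repeatedly pick the rule that covers the most
    # not-yet-covered training pairs until the required recall is reached.
    def greedy_set_cover(rule_coverage: dict, n_pairs: int, recall: float) -> list:
        selected, covered = [], set()
        while len(covered) / n_pairs < recall and rule_coverage:
            best = max(rule_coverage, key=lambda r: len(rule_coverage[r] - covered))
            if not rule_coverage[best] - covered:
                break  # no remaining rule adds new coverage
            covered |= rule_coverage.pop(best)
            selected.append(best)
        return selected

    # Example with three hypothetical rules covering subsets of 4 training pairs:
    rules = {'first_3_chars_name': {0, 1, 2}, 'first_word_address': {2, 3}, 'whole_field_city': {1}}
    greedy_set_cover(rules, n_pairs=4, recall=0.9)  # -> ['first_3_chars_name', 'first_word_address']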

create_block_key(sdf: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

This method calculates the block-key column and adds it to the input dataframe.

Parameters

sdf – a dataframe with records that need to be matched

Returns

the dataframe with the block-key column
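
As a conceptual PySpark illustration (not the library's internal code), a "first three characters of name" rule would add a block-key column roughly like the one below; records sharing the same key end up in the same block. The column names are assumptions.

    import pyspark.sql.functions as F

    # Hypothetical block-key: first three characters of the 'name' column.
    sdf_with_key = sdf.withColumn('block_key', F.substring('name', 1, 3))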