blocker module

class spark_matcher.blocker.BlockLearner(blocking_rules: List[spark_matcher.blocker.blocking_rules.BlockingRule], recall: float, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, verbose=0)

Bases: object

Class to learn blocking rules from training data.

blocking_rules

list of BlockingRule objects that are taken into account during block learning

recall

the minimum required percentage of training pairs that are covered by the learned blocking rules

verbose

set verbosity
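
A minimal construction sketch follows. The concrete rule class FirstNChars and its import path are assumptions, not part of this reference; substitute whichever concrete BlockingRule subclass your installation provides.

    # Sketch only: FirstNChars is an assumed concrete BlockingRule subclass;
    # replace it with the rule classes available in your spark_matcher version.
    from spark_matcher.blocker import BlockLearner
    from spark_matcher.blocker.blocking_rules import FirstNChars  # assumed import path

    block_learner = BlockLearner(
        blocking_rules=[FirstNChars('name'), FirstNChars('surname')],
        recall=0.9,  # minimum share of training pairs the learned rules must cover
        verbose=1,
    )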

fit(sdf: pyspark.sql.dataframe.DataFrame) → spark_matcher.blocker.block_learner.BlockLearner

This method fits, i.e. learns, the blocking rules that are needed to cover at least recall percent of the training-set pairs. Fitting is formulated as a set-cover problem and solved with a greedy algorithm.

Parameters

sdf – a labelled training set containing pairs.

Returns

the object itself
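
Continuing the sketch above, fitting is a single call on a labelled Spark DataFrame of record pairs. The parquet path and the column layout of the pairs dataframe are placeholders; they follow whatever the rest of your spark_matcher pipeline produces.

    # Sketch: fit on a labelled training set of record pairs and keep the fitted object.
    training_pairs_sdf = spark.read.parquet('training_pairs.parquet')  # placeholder path
    block_learner = block_learner.fit(training_pairs_sdf)  # returns the object itself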

transform(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → Union[pyspark.sql.dataframe.DataFrame, Tuple[pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame]]

This method adds block-keys to the input dataframe(s). It applies all learned blocking rules to the input data and unifies the results. The return value is the input dataframe(s) with the block-keys from the learned blocking rules added.

Parameters
  • sdf_1 – dataframe containing records

  • sdf_2 – optional second dataframe containing records; if omitted, only sdf_1 is transformed

Returns

dataframe(s) containing block-keys from the learned blocking rules
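
A short usage sketch, assuming a fitted block_learner as above: with a single dataframe one block-keyed dataframe is returned, with two dataframes a tuple of two is returned.

    # Deduplication within one dataframe: a single dataframe with block-keys comes back.
    sdf_1_blocked = block_learner.transform(sdf_1)

    # Matching across two dataframes: both come back with block-keys added.
    sdf_1_blocked, sdf_2_blocked = block_learner.transform(sdf_1, sdf_2)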

class spark_matcher.blocker.BlockingRule(blocking_column: str)

Bases: abc.ABC

Abstract class for blocking rules. This class contains all the base functionality for blocking rules.

blocking_column

the column on which the BlockingRule is applied

calculate_training_set_coverage(sdf: pyspark.sql.dataframe.DataFrame) → spark_matcher.blocker.blocking_rules.BlockingRule

This method calculates the set coverage of the blocking rule on the training pairs. The set coverage of a rule is determined by counting how many record pairs in the training set end up in the same block. This coverage is used by the BlockLearner to rank blocking rules in the greedy set-cover algorithm.

Parameters

sdf – a dataframe containing record pairs for training

Returns

the object itself
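
To illustrate how these per-rule coverages are used, the sketch below shows a plain-Python greedy set-cover over coverage sets. It is a conceptual example only, not the library's implementation; the rule names and pair ids are made up.

    # Conceptual greedy set-cover: repeatedly pick the rule that covers the most
    # not-yet-covered training pairs until the required recall is reached.
    def greedy_set_cover(rule_coverage: dict, n_pairs: int, recall: float) -> list:
        selected, covered = [], set()
        while len(covered) / n_pairs < recall and rule_coverage:
            best = max(rule_coverage, key=lambda r: len(rule_coverage[r] - covered))
            if not rule_coverage[best] - covered:
                break  # no remaining rule adds new coverage
            covered |= rule_coverage.pop(best)
            selected.append(best)
        return selected

    # Example with three hypothetical rules covering subsets of 4 training pairs:
    rules = {'first_3_chars_name': {0, 1, 2}, 'first_word_address': {2, 3}, 'whole_field_city': {1}}
    greedy_set_cover(rules, n_pairs=4, recall=0.9)  # -> ['first_3_chars_name', 'first_word_address']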

create_block_key(sdf: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

This method calculates the block-key column and adds it to the input dataframe.

Parameters

sdf – a dataframe with records that need to be matched

Returns

the dataframe with the block-key column
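
As a conceptual PySpark illustration (not the library's internal code), a "first three characters of name" rule would add a block-key column roughly like the one below; records sharing the same key end up in the same block. The column names are assumptions.

    import pyspark.sql.functions as F

    # Hypothetical block-key: first three characters of the 'name' column.
    sdf_with_key = sdf.withColumn('block_key', F.substring('name', 1, 3))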