blocker module

class spark_matcher.blocker.BlockLearner(blocking_rules: List[spark_matcher.blocker.blocking_rules.BlockingRule], recall: float, table_checkpointer: Optional[spark_matcher.table_checkpointer.TableCheckpointer] = None, verbose=0)

Bases: object

Class to learn blocking rules from training data.
blocking_rules
    list of BlockingRule objects that are taken into account during block learning

recall
    the minimum required percentage of training pairs that must be covered by the learned blocking rules

verbose
    verbosity level
fit(sdf: pyspark.sql.dataframe.DataFrame) → spark_matcher.blocker.block_learner.BlockLearner

This method fits, i.e. learns, the blocking rules needed to cover the required recall percentage of the training-set pairs. Fitting is done by solving the set-cover problem with a greedy algorithm.
- Parameters
sdf – a labelled training set containing pairs.
- Returns
the object itself
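The greedy set-cover selection described above can be sketched in plain Python. This is a conceptual illustration, not the spark_matcher implementation: each candidate rule is mapped to the set of training-pair ids it covers, and rules are selected one at a time by largest marginal coverage until the recall target is met. The function and rule names are hypothetical.

```python
# Conceptual sketch (NOT the spark_matcher implementation): greedy set cover.
# Each rule maps to the set of training-pair ids it covers; we repeatedly add
# the rule covering the most not-yet-covered pairs until the recall target
# is reached or no rule adds new coverage.

def greedy_set_cover(rule_coverage: dict, n_pairs: int, recall: float) -> list:
    """Select rules until `recall` (fraction) of the n_pairs training pairs is covered."""
    selected, covered = [], set()
    remaining = dict(rule_coverage)
    while len(covered) / n_pairs < recall and remaining:
        # pick the rule adding the most new pairs
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        if not remaining[best] - covered:
            break  # no rule adds new coverage; recall target is unreachable
        covered |= remaining.pop(best)
        selected.append(best)
    return selected

# Example: 4 training pairs (ids 0..3), three hypothetical blocking rules
coverage = {
    "first_3_chars_name": {0, 1, 2},
    "exact_zip":          {2, 3},
    "exact_city":         {1},
}
print(greedy_set_cover(coverage, n_pairs=4, recall=0.9))
# -> ['first_3_chars_name', 'exact_zip']
```

Note that "exact_city" is never selected: every pair it covers is already covered by a higher-ranked rule, so it adds no marginal coverage.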
transform(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → Union[pyspark.sql.dataframe.DataFrame, Tuple[pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame]]

This method adds block-keys to the input dataframe(s). It applies all the learned blocking rules to the input data and unifies the results. The result is the input dataframe(s) with the block-keys from the learned blocking rules added.
- Parameters
sdf_1 – dataframe containing records
sdf_2 – optional second dataframe containing records, for linking two tables
- Returns
dataframe(s) containing block-keys from the learned blocking-rules
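How applying several rules and "unifying the results" plays out can be sketched without Spark. This is a conceptual illustration only, assuming hypothetical rules and column names: each rule turns a record into a block-key string, the keyed copies are unioned, and records sharing a key under any rule land in the same block.

```python
# Conceptual sketch (not the spark_matcher implementation): apply several
# blocking rules to one table and unify the results. Each record appears
# once per rule; records that share a block-key under ANY rule end up in
# the same block.

records = [
    {"id": 1, "name": "john smith", "zip": "1011"},
    {"id": 2, "name": "jon smith",  "zip": "1011"},
    {"id": 3, "name": "mary jones", "zip": "2044"},
]

# hypothetical learned rules, each mapping a record to a block-key string
rules = {
    "first_3_name": lambda r: "first_3_name:" + r["name"][:3],
    "exact_zip":    lambda r: "exact_zip:" + r["zip"],
}

# Union of one keyed copy of the table per rule; this resembles transform's
# output shape: the input rows plus a block_key column.
keyed = [dict(r, block_key=rule(r)) for rule in rules.values() for r in records]
for row in keyed:
    print(row["id"], row["block_key"])
```

Records 1 and 2 disagree on the first three name characters but share `exact_zip:1011`, so they still end up in a common block and will be compared downstream.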
class spark_matcher.blocker.BlockingRule(blocking_column: str)

Bases: abc.ABC

Abstract class for blocking rules. This class contains all the base functionality for blocking rules.
blocking_column
    the column on which the BlockingRule is applied
calculate_training_set_coverage(sdf: pyspark.sql.dataframe.DataFrame) → spark_matcher.blocker.blocking_rules.BlockingRule

This method calculates the set coverage of the blocking rule on the training pairs. The coverage of a rule is determined by how many record pairs in the training set end up in the same block. This coverage is used by the BlockLearner to sort blocking rules in the greedy set-covering algorithm.
- Parameters
sdf – a dataframe containing record pairs for training
- Returns
the object itself
create_block_key(sdf: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

This method calculates and adds the block-key column to the input dataframe.
- Parameters
sdf – a dataframe with records that need to be matched
- Returns
the dataframe with the block-key column
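A concrete rule in the spirit of this abstract class can be sketched without Spark. This is a hypothetical, pure-Python stand-in, not the spark_matcher API: the class names `SketchBlockingRule` and `FirstNChars` and the `_key` hook are inventions for illustration. A subclass only defines how the value in `blocking_column` becomes a block key; the base class prefixes the key with the rule name so keys from different rules never collide when their outputs are unified.

```python
from abc import ABC, abstractmethod

class SketchBlockingRule(ABC):  # hypothetical stand-in for BlockingRule
    def __init__(self, blocking_column: str):
        self.blocking_column = blocking_column

    @abstractmethod
    def _key(self, value: str) -> str:
        """Turn one column value into a block key."""

    def create_block_key(self, records: list) -> list:
        """Return copies of the records with a 'block_key' column added."""
        name = type(self).__name__
        return [
            dict(r, block_key=f"{name}:{self._key(r[self.blocking_column])}")
            for r in records
        ]

class FirstNChars(SketchBlockingRule):  # hypothetical example rule
    def __init__(self, blocking_column: str, n: int = 3):
        super().__init__(blocking_column)
        self.n = n

    def _key(self, value: str) -> str:
        return value[: self.n]

rule = FirstNChars("name", n=3)
out = rule.create_block_key([{"name": "john smith"}, {"name": "johanna"}])
print([r["block_key"] for r in out])  # both rows share the key 'FirstNChars:joh'
```

Prefixing keys with the rule name mirrors why block-keys from different rules can be safely unified: a zip code that happens to equal the first characters of a name can never cause a spurious collision across rules.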