sampler module

class spark_matcher.sampler.HashSampler(table_checkpointer: spark_matcher.table_checkpointer.TableCheckpointer, col_names: List[str], n_train_samples: int, threshold: float = 0.5, num_hash_tables: int = 10)

Bases: spark_matcher.sampler.training_sampler.Sampler

create_pairs_table(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → pyspark.sql.dataframe.DataFrame

Create hashed pairs that are used for training. sdf_2 is only required for record matching; for deduplication, only sdf_1 is required.

Parameters
  • sdf_1 – Spark dataframe containing the first table that should be matched

  • sdf_2 – Optional: Spark dataframe containing the second table that should be matched to the first table

Returns

Spark dataframe that contains sampled pairs selected with MinHashLSH technique
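To illustrate the MinHashLSH idea behind HashSampler, here is a minimal pure-Python sketch (the real implementation uses Spark ML's MinHashLSH on hashed feature vectors; the helper names `shingles`, `minhash_signature`, and `candidate_pairs` below are illustrative, not part of the library):

```python
import hashlib
from itertools import combinations


def shingles(s, n=3):
    """Character n-grams of a string, used as its set representation."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def minhash_signature(shingle_set, num_hashes=20):
    """MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}-{sh}".encode()).hexdigest(), 16)
            for sh in shingle_set
        ))
    return tuple(sig)


def candidate_pairs(records, num_hashes=20):
    """Records whose signatures collide on at least one hash position become candidate pairs."""
    buckets = {}
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec), num_hashes)
        for pos, h in enumerate(sig):
            buckets.setdefault((pos, h), []).append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Similar strings share many shingles, so they are very likely to collide
# on at least one MinHash position and end up as a candidate pair.
names = ["John Smith", "Jon Smith", "Alice Jones"]
pairs = candidate_pairs(names)
```

Dissimilar records rarely collide, so the candidate set is much smaller than the full cross product — which is the point of using LSH before pairwise scoring.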

static is_non_zero_vector(vector)

Check if a vector has at least 1 non-zero entry. This function can deal with dense or sparse vectors. This is needed as the VectorAssembler can return dense or sparse vectors, dependent on what is more memory efficient.

Parameters

vector – vector

Returns

boolean whether a vector has at least 1 non-zero entry
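The check itself is simple; what matters is handling both vector layouts. Below is a pure-Python analogue (assumed, not the library's code) where a dense vector is a list of values and a sparse vector is a dict of index → value, mirroring how VectorAssembler may emit either form:

```python
def is_non_zero_vector(vector):
    """Return True if the vector has at least one non-zero entry.

    Handles a dense representation (a sequence of values) and a sparse
    representation (a dict mapping index -> value). This mirrors the need
    to accept both vector types from VectorAssembler.
    """
    if isinstance(vector, dict):                    # sparse: stored entries only
        return any(v != 0 for v in vector.values())
    return any(v != 0 for v in vector)              # dense: scan all entries


dense_result = is_non_zero_vector([0.0, 0.0, 1.0])
sparse_result = is_non_zero_vector({2: 0.0})
```

Note that for a sparse vector it suffices to scan the stored entries, since absent indices are zero by definition.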

class spark_matcher.sampler.RandomSampler(table_checkpointer: spark_matcher.table_checkpointer.TableCheckpointer, col_names: List[str], n_train_samples: int)

Bases: spark_matcher.sampler.training_sampler.Sampler

create_pairs_table(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → pyspark.sql.dataframe.DataFrame

Create random pairs that are used for training. sdf_2 is only required for record matching; for deduplication, only sdf_1 is required.

Parameters
  • sdf_1 – Spark dataframe containing the first table that should be matched

  • sdf_2 – Optional: Spark dataframe containing the second table that should be matched to the first table

Returns

Spark dataframe that contains randomly sampled pairs
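As a conceptual sketch of random pair sampling, here is a pure-Python version (the actual RandomSampler works on Spark dataframes and avoids materializing the full cross join; the function below is illustrative only):

```python
import random


def create_random_pairs(table_1, table_2=None, n_train_samples=5, seed=42):
    """Uniformly sample candidate pairs for training.

    For record matching, pairs are drawn from the cross product of the two
    tables. For deduplication (table_2 is None), pairs are drawn from
    table_1 against itself, excluding self-pairs.
    """
    rng = random.Random(seed)
    if table_2 is None:
        # deduplication: unordered pairs within one table, no self-pairs
        all_pairs = [(a, b) for i, a in enumerate(table_1)
                     for b in table_1[i + 1:]]
    else:
        # record matching: full cross product between the two tables
        all_pairs = [(a, b) for a in table_1 for b in table_2]
    return rng.sample(all_pairs, min(n_train_samples, len(all_pairs)))


# Record matching: sample pairs across two tables
pairs = create_random_pairs(["a", "b", "c"], ["x", "y"], n_train_samples=4)

# Deduplication: sample pairs within one table
dedup_pairs = create_random_pairs(["a", "b", "c"], n_train_samples=10)
```

Random pairs are cheap to produce but mostly non-matches, which is why the library also offers the HashSampler above to surface likely matches for labelling.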