sampler module

class spark_matcher.sampler.HashSampler(table_checkpointer: spark_matcher.table_checkpointer.TableCheckpointer, col_names: List[str], n_train_samples: int, threshold: float = 0.5, num_hash_tables: int = 10)

Bases: spark_matcher.sampler.training_sampler.Sampler

create_pairs_table(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → pyspark.sql.dataframe.DataFrame

Create hashed pairs that are used for training. sdf_2 is only required for record matching; for deduplication, only sdf_1 is required.

Parameters
  • sdf_1 – Spark dataframe containing the first table that should be matched

  • sdf_2 – Optional: Spark dataframe containing the second table that should be matched to the first table

Returns

Spark dataframe that contains sampled pairs selected with MinHashLSH technique
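To illustrate the MinHashLSH idea behind HashSampler, here is a minimal pure-Python sketch (the real implementation uses Spark ML's MinHashLSH on hashed feature vectors; the helper names `shingles`, `minhash_signature`, and `candidate_pairs` below are illustrative, not part of the library):

```python
import hashlib
from itertools import combinations


def shingles(s, n=3):
    """Character n-grams of a string, used as its set representation."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def minhash_signature(shingle_set, num_hashes=20):
    """MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}-{sh}".encode()).hexdigest(), 16)
            for sh in shingle_set
        ))
    return tuple(sig)


def candidate_pairs(records, num_hashes=20):
    """Records whose signatures collide on at least one hash position become candidate pairs."""
    buckets = {}
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec), num_hashes)
        for pos, h in enumerate(sig):
            buckets.setdefault((pos, h), []).append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Similar strings share many shingles, so they are very likely to collide
# on at least one MinHash position and end up as a candidate pair.
names = ["John Smith", "Jon Smith", "Alice Jones"]
pairs = candidate_pairs(names)
```

Dissimilar records rarely collide, so the candidate set is much smaller than the full cross product — which is the point of using LSH before pairwise scoring.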

static is_non_zero_vector(vector)

Check if a vector has at least 1 non-zero entry. This function can deal with dense or sparse vectors. This is needed as the VectorAssembler can return dense or sparse vectors, dependent on what is more memory efficient.

Parameters

vector – vector

Returns

boolean whether a vector has at least 1 non-zero entry
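The check itself is simple; what matters is handling both vector layouts. Below is a pure-Python analogue (assumed, not the library's code) where a dense vector is a list of values and a sparse vector is a dict of index → value, mirroring how VectorAssembler may emit either form:

```python
def is_non_zero_vector(vector):
    """Return True if the vector has at least one non-zero entry.

    Handles a dense representation (a sequence of values) and a sparse
    representation (a dict mapping index -> value). This mirrors the need
    to accept both vector types from VectorAssembler.
    """
    if isinstance(vector, dict):                    # sparse: stored entries only
        return any(v != 0 for v in vector.values())
    return any(v != 0 for v in vector)              # dense: scan all entries


dense_result = is_non_zero_vector([0.0, 0.0, 1.0])
sparse_result = is_non_zero_vector({2: 0.0})
```

Note that for a sparse vector it suffices to scan the stored entries, since absent indices are zero by definition.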

class spark_matcher.sampler.RandomSampler(table_checkpointer: spark_matcher.table_checkpointer.TableCheckpointer, col_names: List[str], n_train_samples: int)

Bases: spark_matcher.sampler.training_sampler.Sampler

create_pairs_table(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → pyspark.sql.dataframe.DataFrame

Create random pairs that are used for training. sdf_2 is only required for record matching; for deduplication, only sdf_1 is required.

Parameters
  • sdf_1 – Spark dataframe containing the first table that should be matched

  • sdf_2 – Optional: Spark dataframe containing the second table that should be matched to the first table

Returns

Spark dataframe that contains randomly sampled pairs
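As a conceptual sketch of random pair sampling, here is a pure-Python version (the actual RandomSampler works on Spark dataframes and avoids materializing the full cross join; the function below is illustrative only):

```python
import random


def create_random_pairs(table_1, table_2=None, n_train_samples=5, seed=42):
    """Uniformly sample candidate pairs for training.

    For record matching, pairs are drawn from the cross product of the two
    tables. For deduplication (table_2 is None), pairs are drawn from
    table_1 against itself, excluding self-pairs.
    """
    rng = random.Random(seed)
    if table_2 is None:
        # deduplication: unordered pairs within one table, no self-pairs
        all_pairs = [(a, b) for i, a in enumerate(table_1)
                     for b in table_1[i + 1:]]
    else:
        # record matching: full cross product between the two tables
        all_pairs = [(a, b) for a in table_1 for b in table_2]
    return rng.sample(all_pairs, min(n_train_samples, len(all_pairs)))


# Record matching: sample pairs across two tables
pairs = create_random_pairs(["a", "b", "c"], ["x", "y"], n_train_samples=4)

# Deduplication: sample pairs within one table
dedup_pairs = create_random_pairs(["a", "b", "c"], n_train_samples=10)
```

Random pairs are cheap to produce but mostly non-matches, which is why the library also offers the HashSampler above to surface likely matches for labelling.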