sampler module¶
class spark_matcher.sampler.HashSampler(table_checkpointer: spark_matcher.table_checkpointer.TableCheckpointer, col_names: List[str], n_train_samples: int, threshold: float = 0.5, num_hash_tables: int = 10)¶

Bases: spark_matcher.sampler.training_sampler.Sampler
create_pairs_table(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → pyspark.sql.dataframe.DataFrame¶

Create hashed pairs that are used for training. sdf_2 is only required for record matching; for deduplication, only sdf_1 is required.
- Parameters
sdf_1 – Spark dataframe containing the first table to be matched
sdf_2 – Optional: Spark dataframe containing the second table that should be matched to the first table
- Returns
Spark dataframe containing sampled candidate pairs selected with the MinHashLSH technique
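HashSampler relies on MinHash-based locality-sensitive hashing to avoid a full cross join. As an illustration of the underlying idea only (not spark_matcher's actual implementation, which uses Spark's MinHashLSH), a pure-Python sketch of banded MinHash candidate generation, with hypothetical helper names `minhash_signature` and `candidate_pairs`, could look like:

```python
import random
from zlib import crc32

def minhash_signature(tokens, seeds):
    """One MinHash value per seed: the minimum hash over all tokens."""
    return tuple(min(crc32(f"{seed}:{t}".encode()) for t in tokens)
                 for seed in seeds)

def candidate_pairs(records_1, records_2, num_hash_tables=10, bands=5):
    """Return index pairs whose signatures collide in at least one LSH band."""
    rng = random.Random(0)  # fixed seed for reproducibility
    seeds = [rng.randrange(2 ** 32) for _ in range(num_hash_tables)]
    rows = num_hash_tables // bands  # hash values per band

    # Bucket every record of the first table by each band of its signature.
    buckets = {}
    for i, rec in enumerate(records_1):
        sig = minhash_signature(rec, seeds)
        for b in range(bands):
            buckets.setdefault((b, sig[b * rows:(b + 1) * rows]), []).append(i)

    # A pair is a candidate when any band of the signatures collides.
    pairs = set()
    for j, rec in enumerate(records_2):
        sig = minhash_signature(rec, seeds)
        for b in range(bands):
            for i in buckets.get((b, sig[b * rows:(b + 1) * rows]), []):
                pairs.add((i, j))
    return pairs
```

Records with many shared tokens collide in at least one band with high probability, so only those pairs need to be compared in detail; this is what keeps the pair table small compared to a full cross join.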
static is_non_zero_vector(vector)¶

Check whether a vector has at least one non-zero entry. This function can handle both dense and sparse vectors; this is needed because the VectorAssembler can return dense or sparse vectors, depending on which is more memory efficient.
- Parameters
vector – the vector to check
- Returns
boolean indicating whether the vector has at least one non-zero entry
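A minimal pure-Python sketch of this check, assuming dense vectors are plain lists of floats and sparse vectors are (size, indices, values) triples mirroring Spark's SparseVector layout (these representations are assumptions for illustration, not spark_matcher's code):

```python
def is_non_zero_vector(vector):
    """Return True if the vector has at least one non-zero entry.

    Dense vectors are plain lists of floats; sparse vectors are
    (size, indices, values) triples, like Spark's SparseVector fields.
    """
    if isinstance(vector, list):       # dense representation
        return any(v != 0 for v in vector)
    _size, _indices, values = vector   # sparse representation
    return any(v != 0 for v in values)
```

Filtering out all-zero vectors before hashing matters because MinHashLSH cannot hash a vector with no non-zero entries.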
class spark_matcher.sampler.RandomSampler(table_checkpointer: spark_matcher.table_checkpointer.TableCheckpointer, col_names: List[str], n_train_samples: int)¶

Bases: spark_matcher.sampler.training_sampler.Sampler
create_pairs_table(sdf_1: pyspark.sql.dataframe.DataFrame, sdf_2: Optional[pyspark.sql.dataframe.DataFrame] = None) → pyspark.sql.dataframe.DataFrame¶

Create random pairs that are used for training. sdf_2 is only required for record matching; for deduplication, only sdf_1 is required.
- Parameters
sdf_1 – Spark dataframe containing the first table to be matched
sdf_2 – Optional: Spark dataframe containing the second table that should be matched to the first table
- Returns
Spark dataframe containing randomly sampled pairs
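The random-sampling strategy can be sketched in pure Python (an illustration only; the hypothetical helper `create_random_pairs` is not spark_matcher's API, and the real implementation samples distributed Spark dataframes rather than lists):

```python
import random

def create_random_pairs(table_1, table_2=None, n_train_samples=5, seed=42):
    """Sample up to n_train_samples record pairs uniformly from the cross join.

    When table_2 is None (deduplication), table_1 is paired with itself
    and self-pairs / duplicate orderings are skipped.
    """
    rng = random.Random(seed)
    if table_2 is None:
        # Deduplication: unordered pairs within table_1, no self-pairs.
        candidates = [(a, b) for i, a in enumerate(table_1)
                      for j, b in enumerate(table_1) if i < j]
    else:
        # Record matching: full cross join of the two tables.
        candidates = [(a, b) for a in table_1 for b in table_2]
    return rng.sample(candidates, min(n_train_samples, len(candidates)))
```

Unlike HashSampler, this makes no attempt to favour likely matches, so it mostly yields non-matching pairs; it is the simpler, blocking-free counterpart used to seed training.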