similarity_metrics module
class spark_matcher.similarity_metrics.SimilarityMetrics(field_info: Dict)
Bases: object
Class to calculate similarity metrics for pairs of records. The field_info dict maps column names (keys) to lists of similarity functions (values). For example:
    field_info = {'name': [token_set_ratio, token_sort_ratio],
                  'postcode': [ratio]}
where token_set_ratio, token_sort_ratio and ratio are string similarity functions that take two strings as arguments and return a numeric value.
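As a minimal sketch of what such a field_info dict could look like, the block below defines two similarity functions with the required shape (two strings in, a number out) using only the standard library. The functions here are hypothetical stand-ins written for illustration, not the implementations the package expects; in practice, functions with the same names from a string-matching library would typically be passed instead.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def ratio(a: str, b: str) -> float:
    """Hypothetical stand-in: character-level similarity between 0 and 100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Hypothetical stand-in: compare the strings after sorting their tokens."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return ratio(sorted_a, sorted_b)


# Column names map to the list of similarity functions applied to that column.
field_info: Dict[str, List[Callable[[str, str], float]]] = {
    "name": [ratio, token_sort_ratio],
    "postcode": [ratio],
}
```

Any callable that accepts two strings and returns a numeric value fits this structure, so custom domain-specific metrics can be listed alongside general-purpose ones.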
field_info
Dict containing column names as keys and lists of similarity functions as values.
transform(pairs_table: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame
Applies the similarity metrics to a pairs table. The method uses method dispatching to support both Pandas and Spark dataframes.
Parameters
    pairs_table – Spark or Pandas dataframe containing the pairs table
Returns
    Pandas or Spark dataframe containing the pairs table and a newly created similarity_metrics column
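To make the idea of the similarity_metrics column concrete, here is a conceptual sketch that works on plain Python rows rather than on Pandas or Spark dataframes. It is not the package's implementation: the helper add_similarity_metrics, the stand-in ratio function, and the col_1 / col_2 naming of the paired values are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

SimilarityFn = Callable[[str, str], float]


def ratio(a: str, b: str) -> float:
    """Hypothetical stand-in similarity function returning a value in [0, 100]."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def add_similarity_metrics(
    pairs: List[Dict[str, str]],
    field_info: Dict[str, List[SimilarityFn]],
) -> List[Dict[str, object]]:
    """Return copies of the rows with an added 'similarity_metrics' list.

    Assumes each row stores the two values of column `col` under the
    keys `col_1` and `col_2` (an illustrative convention).
    """
    result = []
    for row in pairs:
        # One metric per (column, function) combination, in field_info order.
        metrics = [
            fn(row[f"{col}_1"], row[f"{col}_2"])
            for col, fns in field_info.items()
            for fn in fns
        ]
        result.append({**row, "similarity_metrics": metrics})
    return result


pairs = [
    {"name_1": "john smith", "name_2": "john smith",
     "postcode_1": "1011AB", "postcode_2": "1011AC"},
]
scored = add_similarity_metrics(pairs, {"name": [ratio], "postcode": [ratio]})
```

In this sketch each row gains one number per column and function combination, which mirrors the description of a pairs table extended with a single similarity_metrics column.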