similarity_metrics module

class spark_matcher.similarity_metrics.SimilarityMetrics(field_info: Dict)

Bases: object

Calculates similarity metrics for pairs of records. The field_info dict maps each column name (key) to a list of similarity functions (value) to apply to that column. For example:

field_info = {'name': [token_set_ratio, token_sort_ratio],
              'postcode': [ratio]}

where token_set_ratio, token_sort_ratio and ratio are string similarity functions that each take two strings as arguments and return a numeric value.
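To make the expected shape of field_info concrete, here is a minimal sketch that builds such a dict from two hand-written similarity functions. The functions below are stand-ins based on the standard library's difflib, written for illustration only; they are not the functions named above, and their 0-100 scale is an assumption. Any callable that takes two strings and returns a number fits the same contract.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def ratio(a: str, b: str) -> float:
    """Stand-in similarity function: character-level similarity scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Stand-in: compare the strings after lowercasing and sorting their tokens."""
    def sort_tokens(s: str) -> str:
        return " ".join(sorted(s.lower().split()))

    return ratio(sort_tokens(a), sort_tokens(b))


# Column names as keys, lists of similarity functions as values.
field_info: Dict[str, List[Callable[[str, str], float]]] = {
    "name": [token_sort_ratio, ratio],
    "postcode": [ratio],
}
```

Each function in a list produces one numeric value per pair of records, so this field_info would yield three values per pair: two for name and one for postcode.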

field_info

dict containing column names as keys and lists of similarity functions as values

transform(pairs_table: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

Applies the similarity metrics to a pairs table. The method uses method dispatching to support both Pandas and Spark dataframes.

Parameters

pairs_table – Spark or Pandas dataframe containing pairs table

Returns

Pandas or Spark dataframe with pairs table and newly created similarity_metrics column
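The following sketch illustrates, in plain Python, the kind of per-pair computation that transform describes: every function listed for a column is applied to the two values of that column, and the results are collected under a similarity_metrics key. It is not the library's implementation. The helper name compute_similarity_metrics, the "_1"/"_2" column suffixes, the use of dicts instead of dataframes, and the stand-in similarity functions are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def ratio(a: str, b: str) -> float:
    """Stand-in similarity function: character-level similarity scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def exact(a: str, b: str) -> float:
    """Stand-in similarity function: 100 for identical strings, otherwise 0."""
    return 100.0 if a == b else 0.0


def compute_similarity_metrics(
    pair: Dict[str, str],
    field_info: Dict[str, List[Callable[[str, str], float]]],
    suffixes: Tuple[str, str] = ("_1", "_2"),  # assumed naming of the paired columns
) -> List[float]:
    """Apply each column's similarity functions to the two values of that column."""
    left, right = suffixes
    return [
        func(pair[col + left], pair[col + right])
        for col, funcs in field_info.items()
        for func in funcs
    ]


field_info = {"name": [ratio, exact], "postcode": [exact]}

# Hypothetical pairs table, one dict per candidate pair of records.
pairs = [
    {"name_1": "John Smith", "name_2": "John Smith",
     "postcode_1": "1234AB", "postcode_2": "1234AB"},
    {"name_1": "John Smith", "name_2": "Jane Doe",
     "postcode_1": "1234AB", "postcode_2": "9876ZY"},
]

# Keep the original columns and add the newly created similarity_metrics entry.
scored = [
    {**pair, "similarity_metrics": compute_similarity_metrics(pair, field_info)}
    for pair in pairs
]
```

The order of values in each similarity_metrics list follows the order of columns and functions in field_info, which keeps the vectors comparable across pairs.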