Spark-Matcher example

This notebook shows how to use the spark_matcher. First we create a Spark session:

[ ]:
%config Completer.use_jedi = False  # for proper autocompletion
from pyspark.sql import SparkSession
[ ]:
spark = (SparkSession
             .builder
             .master("local")
             .enableHiveSupport()
             .getOrCreate())
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/24 13:44:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In this notebook we use some example data that comes with spark_matcher:

[ ]:
from spark_matcher.data import load_data
[ ]:
a, b = load_data(spark)

The two dataframes a and b both contain records from the North Carolina Voter Registry. For each person there is a name, a suburb and a postcode:

[ ]:
a.limit(2).toPandas()
   name            suburb     postcode
0  kiera matthews  charlotte  28216
1  khimerc thomas  charlotte  2826g
[ ]:
b.limit(2).toPandas()
   name           suburb     postcode
0  kiea matthews  charlotte  28218
1  chimerc thmas  chaflotte  28269
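
As a quick sanity check on the size of these datasets, you can count the records with plain PySpark (output not shown here; this is just standard Spark, not a spark_matcher feature):

[ ]:
# number of records in each example dataframe
a.count(), b.count()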

We use the spark_matcher to link the records in dataframe a with the records in dataframe b. First import the Matcher and create an instance. The fields that are used for matching are given as the col_names argument:

[ ]:
from spark_matcher.matcher import Matcher
[ ]:
myMatcher = Matcher(spark, col_names=['name', 'suburb', 'postcode'])

Now we are ready to fit the Matcher object using ‘active learning’: the Matcher presents candidate pairs and the user indicates whether each pair is a match or not. Enter ‘y’ if a pair is a match and ‘n’ if it is not. You will be notified when the model has converged, after which you can stop training by entering ‘f’.

[ ]:
myMatcher.fit(a, b)


Nr. 1 (0+/0-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: kiea matthews
suburb_1: charlotte
postcode_1: 28218

name_2: kiea matthews
suburb_2: charlotte
postcode_2: 28218

 y

Nr. 2 (1+/0-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: khimerc thomas
suburb_1: charlotte
postcode_1: 2826g

name_2: kiea matthews
suburb_2: charlotte
postcode_2: 28218

 n

Nr. 3 (1+/1-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: john bentzen
suburb_1: waxhaw
postcode_1: 28173

name_2: john hanegraaff
suburb_2: waxhaw
postcode_2: 28173

 n

Nr. 4 (1+/2-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: willie greene
suburb_1: mooresville
postcode_1: 28115

name_2: lois greene
suburb_2: mooresboro
postcode_2: 28114

 n

Nr. 5 (1+/3-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: jennifer hannen
suburb_1: greensboro
postcode_1: 27405

name_2: jennifer bentz
suburb_2: greensboro
postcode_2: 27407

 n

Nr. 6 (1+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: crystal boone
suburb_1: green mountain
postcode_1: 28740

name_2: crystnal boone
suburb_2: green mountain
postcode_2: 28750

 y

Nr. 7 (2+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: latonja yarborovgh
suburb_1: knightdale
postcode_1: 27945

name_2: latonja yarborough
suburb_2: knivhtdale
postcode_2: 2754s

 y

Nr. 8 (3+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: jerome oliveah
suburb_1: selms
postcode_1: 27576

name_2: jerome oliver
suburb_2: selma
postcode_2: 27576

 y

Nr. 9 (4+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: latoyw oneal
suburb_1: smihtfield
postcode_1: 27577

name_2: lato6a oneal
suburb_2: smithfield
postcode_2: 27537

 y

Nr. 10 (5+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: patricia adams
suburb_1: rocky mount
postcode_1: 27804

name_2: patricia barus
suburb_2: valdese
postcode_2: 28690

 n

Nr. 11 (5+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: kimberly gay
suburb_1: kinston
postcode_1: 28504

name_2: kimbeahly gav
suburb_2: kinston
postcode_2: 28504

 y
Classifier converged, enter 'f' to stop training

Nr. 12 (6+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: de1>ra benf:eld
suburb_1: concord
postcode_1: 28025

name_2: debra benfield
suburb_2: concord
postcode_2: 28025

 y
Classifier converged, enter 'f' to stop training

Nr. 13 (7+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: ruth edwards
suburb_1: henderson
postcode_1: 27536

name_2: raechaun edwards
suburb_2: lillington
postcode_2: 27546

 f
<spark_matcher.matcher.matcher.Matcher at 0x7fab47c8e610>

The Matcher is now trained and can be used to predict on all data. This can be the data used for training or new data that the model has not seen yet. By default the threshold is 0.5; a lower threshold results in more matches, but also in more incorrect matches. With top_n you choose how many candidate matches to return if there is more than one match for a particular record.

[ ]:
result = myMatcher.predict(a, b, threshold=0.5, top_n=3)

Now let’s have a look at the results:

[ ]:
result_pdf = result.toPandas()

[ ]:
result_pdf.sort_values('score')
      name_1            suburb_1     postcode_1  name_2             suburb_2      postcode_2  score
1011  teresina fontana  newport      28570       teres'lna fontana  newport       28571       0.802804
721   melissa wa5d      greensboro   274|0       melissa ward       greensboro    27410       0.954172
1026  thomas dy5on      statesville  z8677       thomas dyson       statesvile    28697       0.958674
512   judirh coile      charlotte    2821q       judith coile       charlott      28224       0.959523
340   ge0ffrey ryan     wilmington   z8403       geoffrey ryan      wilmnigton    28400       0.962752
...   ...               ...          ...         ...                ...           ...         ...
368   helem farmer      lincolnton   2809z       helen farmer       lincolnton    28092       1.000000
367   heather stewart   concord      28027       heather stewart    concord       28027       1.000000
366   heather caywood   charlotte    28278       heather cayw0od    char1otte     28278       1.000000
372   henry wyatt       waynesville  28786       henty wyatt        waynesvillte  28786       1.000000
1105  zedekiah dawkins  hikh point   27z60       zedekiah dawkins   hikh point    27z60       1.000000

1106 rows × 7 columns
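
Since result_pdf is a regular pandas dataframe, you can also post-filter the matches on the score column. A minimal sketch, where the 0.9 cut-off is an arbitrary example and not a spark_matcher default:

[ ]:
# keep only the high-confidence matches; 0.9 is an arbitrary example cut-off
confident_matches = result_pdf[result_pdf['score'] >= 0.9]
confident_matches.shape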

If you want to use the Matcher again later without retraining, you can save it and load it back in:

[ ]:
myMatcher.save('myMatcher.pkl')
[ ]:
myRestoredMatcher = Matcher(spark)
[ ]:
myRestoredMatcher.load('myMatcher.pkl')

The restored Matcher object can be used to predict on new data.
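
As a minimal sketch (reusing the example dataframes a and b as a stand-in for new data; predict has the same signature as shown above):

[ ]:
# score candidate pairs with the restored Matcher instead of the freshly trained one
new_result = myRestoredMatcher.predict(a, b, threshold=0.5, top_n=3)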