Spark-Matcher example

This notebook shows how to use spark_matcher. First, we create a Spark session:

[ ]:

%config Completer.use_jedi = False  # for proper autocompletion
from pyspark.sql import SparkSession

[ ]:

spark = (SparkSession
             .builder
             .master("local")
             .enableHiveSupport()
             .getOrCreate())

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/24 13:44:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In this notebook we use some example data that comes with spark_matcher:

[ ]:

from spark_matcher.data import load_data

[ ]:

a, b = load_data(spark)

The two dataframes a and b both contain records from the North Carolina Voter Registry. For each person there is a name, a suburb and a postcode:

[ ]:

a.limit(2).toPandas()

	name	suburb	postcode
0	kiera matthews	charlotte	28216
1	khimerc thomas	charlotte	2826g

[ ]:

b.limit(2).toPandas()

	name	suburb	postcode
0	kiea matthews	charlotte	28218
1	chimerc thmas	chaflotte	28269

We use the spark_matcher to link the records in dataframe a with the records in dataframe b. First import the Matcher and create an instance. The fields that are used for matching are given as the col_names argument:

[ ]:

from spark_matcher.matcher import Matcher

[ ]:

myMatcher = Matcher(spark, col_names=['name', 'suburb', 'postcode'])

Now we are ready to fit the Matcher object using ‘active learning’: you are shown candidate pairs and indicate for each one whether it is a match. Enter ‘y’ if a pair is a match or ‘n’ if it is not. The prompt also offers ‘p’ to return to the previous pair and ‘s’ to skip a pair. You will be notified when the model has converged, after which you can stop training by entering ‘f’.

[ ]:

myMatcher.fit(a, b)


Nr. 1 (0+/0-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: kiea matthews
suburb_1: charlotte
postcode_1: 28218

name_2: kiea matthews
suburb_2: charlotte
postcode_2: 28218


Nr. 2 (1+/0-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: khimerc thomas
suburb_1: charlotte
postcode_1: 2826g

name_2: kiea matthews
suburb_2: charlotte
postcode_2: 28218


Nr. 3 (1+/1-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: john bentzen
suburb_1: waxhaw
postcode_1: 28173

name_2: john hanegraaff
suburb_2: waxhaw
postcode_2: 28173


Nr. 4 (1+/2-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: willie greene
suburb_1: mooresville
postcode_1: 28115

name_2: lois greene
suburb_2: mooresboro
postcode_2: 28114


Nr. 5 (1+/3-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: jennifer hannen
suburb_1: greensboro
postcode_1: 27405

name_2: jennifer bentz
suburb_2: greensboro
postcode_2: 27407


Nr. 6 (1+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: crystal boone
suburb_1: green mountain
postcode_1: 28740

name_2: crystnal boone
suburb_2: green mountain
postcode_2: 28750


Nr. 7 (2+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: latonja yarborovgh
suburb_1: knightdale
postcode_1: 27945

name_2: latonja yarborough
suburb_2: knivhtdale
postcode_2: 2754s


Nr. 8 (3+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: jerome oliveah
suburb_1: selms
postcode_1: 27576

name_2: jerome oliver
suburb_2: selma
postcode_2: 27576


Nr. 9 (4+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: latoyw oneal
suburb_1: smihtfield
postcode_1: 27577

name_2: lato6a oneal
suburb_2: smithfield
postcode_2: 27537


Nr. 10 (5+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: patricia adams
suburb_1: rocky mount
postcode_1: 27804

name_2: patricia barus
suburb_2: valdese
postcode_2: 28690


Nr. 11 (5+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: kimberly gay
suburb_1: kinston
postcode_1: 28504

name_2: kimbeahly gav
suburb_2: kinston
postcode_2: 28504

Classifier converged, enter 'f' to stop training

Nr. 12 (6+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: de1>ra benf:eld
suburb_1: concord
postcode_1: 28025

name_2: debra benfield
suburb_2: concord
postcode_2: 28025

Classifier converged, enter 'f' to stop training

Nr. 13 (7+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish

name_1: ruth edwards
suburb_1: henderson
postcode_1: 27536

name_2: raechaun edwards
suburb_2: lillington
postcode_2: 27546

<spark_matcher.matcher.matcher.Matcher at 0x7fab47c8e610>
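The labelling session above follows the general pattern of active learning: the model chooses which pair to ask about next. The notebook does not say which query strategy spark_matcher uses, so the following is only a minimal sketch of one common strategy, uncertainty sampling, in plain Python. The names `most_uncertain` and `label_loop`, the `ask` callback and all scores are hypothetical, and retraining between questions is left out for brevity.

```python
def most_uncertain(candidates, predict_proba):
    """Return the candidate pair whose match probability is closest to 0.5."""
    return min(candidates, key=lambda pair: abs(predict_proba(pair) - 0.5))


def label_loop(candidates, predict_proba, ask, max_labels=10):
    """Ask for labels on the most uncertain pairs until 'f' or the pool is empty."""
    pool = list(candidates)
    labelled = []
    while pool and len(labelled) < max_labels:
        pair = most_uncertain(pool, predict_proba)
        answer = ask(pair)  # 'y', 'n', 's' (skip) or 'f' (finish)
        if answer == "f":
            break
        pool.remove(pair)
        if answer in ("y", "n"):
            labelled.append((pair, answer == "y"))
        # A real system would retrain the classifier here, so that
        # predict_proba reflects the new label before the next question.
    return labelled


# Hypothetical match probabilities, invented for illustration only.
scores = {
    ("kiera matthews", "kiea matthews"): 0.95,
    ("john bentzen", "john hanegraaff"): 0.55,
    ("patricia adams", "patricia barus"): 0.30,
}

# Scripted answers stand in for the interactive prompt.
answers = iter(["n", "f"])
labelled = label_loop(list(scores), scores.get, lambda pair: next(answers))
print(labelled)
```

In this sketch the pair scored nearest to 0.5 is asked about first, because its label is expected to be the most informative for the model.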

The Matcher is now trained and can be used to predict on all data. This can be the data used for training or new data that the model has not seen yet. By default the threshold is 0.5. A lower threshold results in more matches but also in more incorrect matches. With top_n you can choose how many matches are returned when there is more than one match for a particular record.

[ ]:

result = myMatcher.predict(a, b, threshold=0.5, top_n=3)
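To make the roles of threshold and top_n concrete, here is a small plain-Python sketch of the filtering they describe. It is not spark_matcher's implementation: the function `filter_matches`, the choice to keep scores at or above the threshold, and the example scores are all assumptions made for illustration.

```python
from collections import defaultdict


def filter_matches(scored_pairs, threshold=0.5, top_n=None):
    """Keep pairs scoring >= threshold, then at most top_n per left-hand record."""
    by_left = defaultdict(list)
    for left, right, score in scored_pairs:
        if score >= threshold:  # assumption: the boundary value is included
            by_left[left].append((left, right, score))
    kept = []
    for candidates in by_left.values():
        # Highest-scoring candidates first, so slicing keeps the best ones.
        candidates.sort(key=lambda pair: pair[2], reverse=True)
        kept.extend(candidates if top_n is None else candidates[:top_n])
    return kept


# Hypothetical scores, invented for illustration only.
scored_pairs = [
    ("kiera matthews", "kiea matthews", 0.97),
    ("kiera matthews", "kira mathews", 0.62),
    ("kiera matthews", "lois greene", 0.08),
    ("khimerc thomas", "chimerc thmas", 0.91),
    ("khimerc thomas", "thomas dyson", 0.41),
]

strict = filter_matches(scored_pairs, threshold=0.5, top_n=1)
loose = filter_matches(scored_pairs, threshold=0.3, top_n=3)
print(len(strict), len(loose))
```

Lowering the threshold in this sketch lets weaker candidates through, which mirrors the trade-off between more matches and more incorrect matches.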

Now let’s have a look at the results:

[ ]:

result_pdf = result.toPandas()

[ ]:

result_pdf.sort_values('score')

	name_1	suburb_1	postcode_1	name_2	suburb_2	postcode_2	score
1011	teresina fontana	newport	28570	teres'lna fontana	newport	28571	0.802804
721	melissa wa5d	greensboro	274|0	melissa ward	greensboro	27410	0.954172
1026	thomas dy5on	statesville	z8677	thomas dyson	statesvile	28697	0.958674
512	judirh coile	charlotte	2821q	judith coile	charlott	28224	0.959523
340	ge0ffrey ryan	wilmington	z8403	geoffrey ryan	wilmnigton	28400	0.962752
...	...	...	...	...	...	...	...
368	helem farmer	lincolnton	2809z	helen farmer	lincolnton	28092	1.000000
367	heather stewart	concord	28027	heather stewart	concord	28027	1.000000
366	heather caywood	charlotte	28278	heather cayw0od	char1otte	28278	1.000000
372	henry wyatt	waynesville	28786	henty wyatt	waynesvillte	28786	1.000000
1105	zedekiah dawkins	hikh point	27z60	zedekiah dawkins	hikh point	27z60	1.000000

1106 rows × 7 columns

If you want to reuse the Matcher without having to retrain it, you can save it and load it again later:

[ ]:

myMatcher.save('myMatcher.pkl')

[ ]:

myRestoredMatcher = Matcher(spark)

[ ]:

myRestoredMatcher.load('myMatcher.pkl')

This Matcher object can be used to predict on new data.