Spark-Matcher example
This notebook shows how to use spark_matcher. First we create a Spark session:
[ ]:
%config Completer.use_jedi = False # for proper autocompletion
from pyspark.sql import SparkSession
[ ]:
spark = (SparkSession
         .builder
         .master("local")
         .enableHiveSupport()
         .getOrCreate())
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/24 13:44:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
In this notebook we use some example data that comes with spark_matcher:
[ ]:
from spark_matcher.data import load_data
[ ]:
a, b = load_data(spark)
The two dataframes a and b both contain records from the North Carolina Voter Registry data. For each person there is a name, suburb and postcode:
[ ]:
a.limit(2).toPandas()
| | name | suburb | postcode |
|---|---|---|---|
| 0 | kiera matthews | charlotte | 28216 |
| 1 | khimerc thomas | charlotte | 2826g |
[ ]:
b.limit(2).toPandas()
| | name | suburb | postcode |
|---|---|---|---|
| 0 | kiea matthews | charlotte | 28218 |
| 1 | chimerc thmas | chaflotte | 28269 |
We use spark_matcher to link the records in dataframe a with the records in dataframe b. First import the Matcher and create an instance. The fields used for matching are passed as the col_names argument:
[ ]:
from spark_matcher.matcher import Matcher
[ ]:
myMatcher = Matcher(spark, col_names=['name', 'suburb', 'postcode'])
Now we are ready to fit the Matcher object using ‘active learning’: you are shown candidate pairs and have to indicate whether each pair is a match. Enter ‘y’ if a pair is a match or ‘n’ if it is not. You will be notified when the model has converged, after which you can stop training by entering ‘f’.
[ ]:
myMatcher.fit(a, b)
Nr. 1 (0+/0-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: kiea matthews
suburb_1: charlotte
postcode_1: 28218
name_2: kiea matthews
suburb_2: charlotte
postcode_2: 28218
y
Nr. 2 (1+/0-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: khimerc thomas
suburb_1: charlotte
postcode_1: 2826g
name_2: kiea matthews
suburb_2: charlotte
postcode_2: 28218
n
Nr. 3 (1+/1-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: john bentzen
suburb_1: waxhaw
postcode_1: 28173
name_2: john hanegraaff
suburb_2: waxhaw
postcode_2: 28173
n
Nr. 4 (1+/2-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: willie greene
suburb_1: mooresville
postcode_1: 28115
name_2: lois greene
suburb_2: mooresboro
postcode_2: 28114
n
Nr. 5 (1+/3-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: jennifer hannen
suburb_1: greensboro
postcode_1: 27405
name_2: jennifer bentz
suburb_2: greensboro
postcode_2: 27407
n
Nr. 6 (1+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: crystal boone
suburb_1: green mountain
postcode_1: 28740
name_2: crystnal boone
suburb_2: green mountain
postcode_2: 28750
y
Nr. 7 (2+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: latonja yarborovgh
suburb_1: knightdale
postcode_1: 27945
name_2: latonja yarborough
suburb_2: knivhtdale
postcode_2: 2754s
y
Nr. 8 (3+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: jerome oliveah
suburb_1: selms
postcode_1: 27576
name_2: jerome oliver
suburb_2: selma
postcode_2: 27576
y
Nr. 9 (4+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: latoyw oneal
suburb_1: smihtfield
postcode_1: 27577
name_2: lato6a oneal
suburb_2: smithfield
postcode_2: 27537
y
Nr. 10 (5+/4-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: patricia adams
suburb_1: rocky mount
postcode_1: 27804
name_2: patricia barus
suburb_2: valdese
postcode_2: 28690
n
Nr. 11 (5+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: kimberly gay
suburb_1: kinston
postcode_1: 28504
name_2: kimbeahly gav
suburb_2: kinston
postcode_2: 28504
y
Classifier converged, enter 'f' to stop training
Nr. 12 (6+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: de1>ra benf:eld
suburb_1: concord
postcode_1: 28025
name_2: debra benfield
suburb_2: concord
postcode_2: 28025
y
Classifier converged, enter 'f' to stop training
Nr. 13 (7+/5-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
name_1: ruth edwards
suburb_1: henderson
postcode_1: 27536
name_2: raechaun edwards
suburb_2: lillington
postcode_2: 27546
f
<spark_matcher.matcher.matcher.Matcher at 0x7fab47c8e610>
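The labelling session above is an example of active learning: the model picks which pairs it asks you about. A common query strategy is uncertainty sampling, where the next question is the pair the current classifier is least sure about. The sketch below illustrates that idea in plain Python. It is a generic illustration, not spark_matcher's implementation: whether the library selects pairs in exactly this way is an assumption, and the function name and scores are hypothetical.

```python
def most_uncertain(probabilities):
    """Return the index of the pair whose match probability is closest to 0.5."""
    return min(range(len(probabilities)),
               key=lambda i: abs(probabilities[i] - 0.5))


# Hypothetical match probabilities from a partially trained classifier
candidate_scores = [0.97, 0.52, 0.08, 0.71]

# The pair at this index would be shown to the user next
next_pair = most_uncertain(candidate_scores)
print(next_pair)  # 1
```

Pairs that the classifier already scores close to 0 or 1 add little information, which is why a handful of well-chosen labels can be enough for the model to converge.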
The Matcher is now trained and can be used to predict on all data. This can be the data used for training or new data that the model has not seen yet. By default the threshold is 0.5. A lower threshold results in more matches but also in more incorrect matches. By setting top_n you can choose how many matches you want to see if there is more than one match for a particular record.
[ ]:
result = myMatcher.predict(a, b, threshold=0.5, top_n=3)
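To make the effect of threshold and top_n concrete, here is a small plain-Python sketch of this kind of filtering. It is an illustration only: the function filter_matches, the record ids and the scores are hypothetical, and it assumes that a pair is kept when its score is at least the threshold and that top_n is applied per record of the first dataframe. The library's exact behaviour may differ.

```python
from collections import defaultdict


def filter_matches(scored_pairs, threshold=0.5, top_n=3):
    """Keep pairs scoring at least `threshold`, then at most `top_n` per left record."""
    by_left = defaultdict(list)
    for left, right, score in scored_pairs:
        if score >= threshold:
            by_left[left].append((left, right, score))

    kept = []
    for pairs in by_left.values():
        # Highest score first, then cut off after top_n candidates
        pairs.sort(key=lambda pair: pair[2], reverse=True)
        kept.extend(pairs[:top_n])
    return kept


# Hypothetical scored candidate pairs: (id in a, id in b, score)
scored_pairs = [
    ("a1", "b1", 0.95),
    ("a1", "b2", 0.60),
    ("a1", "b3", 0.40),
    ("a2", "b4", 0.55),
]

strict = filter_matches(scored_pairs, threshold=0.5, top_n=1)
loose = filter_matches(scored_pairs, threshold=0.3, top_n=3)
print(len(strict), len(loose))  # 2 4
```

With the stricter settings only the best candidate per record survives; lowering the threshold and raising top_n lets weaker candidates such as ("a1", "b3", 0.40) through, which is the trade-off described above.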
Now let’s have a look at the results:
[ ]:
result_pdf = result.toPandas()
[ ]:
result_pdf.sort_values('score')
| | name_1 | suburb_1 | postcode_1 | name_2 | suburb_2 | postcode_2 | score |
|---|---|---|---|---|---|---|---|
| 1011 | teresina fontana | newport | 28570 | teres'lna fontana | newport | 28571 | 0.802804 |
| 721 | melissa wa5d | greensboro | 274\|0 | melissa ward | greensboro | 27410 | 0.954172 |
| 1026 | thomas dy5on | statesville | z8677 | thomas dyson | statesvile | 28697 | 0.958674 |
| 512 | judirh coile | charlotte | 2821q | judith coile | charlott | 28224 | 0.959523 |
| 340 | ge0ffrey ryan | wilmington | z8403 | geoffrey ryan | wilmnigton | 28400 | 0.962752 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 368 | helem farmer | lincolnton | 2809z | helen farmer | lincolnton | 28092 | 1.000000 |
| 367 | heather stewart | concord | 28027 | heather stewart | concord | 28027 | 1.000000 |
| 366 | heather caywood | charlotte | 28278 | heather cayw0od | char1otte | 28278 | 1.000000 |
| 372 | henry wyatt | waynesville | 28786 | henty wyatt | waynesvillte | 28786 | 1.000000 |
| 1105 | zedekiah dawkins | hikh point | 27z60 | zedekiah dawkins | hikh point | 27z60 | 1.000000 |
1106 rows × 7 columns
If you want to use the Matcher later without having to retrain, you can save it and load it later:
[ ]:
myMatcher.save('myMatcher.pkl')
[ ]:
myRestoredMatcher = Matcher(spark)
[ ]:
myRestoredMatcher.load('myMatcher.pkl')
The restored Matcher object can now be used to predict on new data.