Tutorial

Simple deduplication

[7]:

import pandas as pd
from deduplipy.datasets import load_data

Load data

[8]:

df_train = load_data(kind='voters')

Column names: 'name', 'suburb', 'postcode'

[9]:

df_train.head(2)

[9]:

	name	suburb	postcode
0	khimerc thomas	charlotte	2826g
1	lucille richardst	kannapolis	28o81

Import Deduplicator class

[10]:

from deduplipy.deduplicator import Deduplicator

Instantiate Deduplicator class with the column names

[11]:

myDedupliPy = Deduplicator(['name', 'suburb', 'postcode'])

Perform the fitting using active learning

[ ]:

myDedupliPy.fit(df_train)

Predict on (new) data; here we reuse the training data

[13]:

res = myDedupliPy.predict(df_train)
res.sort_values('deduplication_id').head(10)

[13]:

	name	suburb	postcode	deduplication_id
0	khimerc thomas	charlotte	2826g	1
1190	chimerc thomas	charlotte	28269	1
1302	chimerc thmas	chaflotte	28269	1
1255	kimberly craddock	charlotte	28214	4
15	kimbefly craddock	charlotte	28264	4
1313	kimberly craddoclc	charlotte	282|4	4
39	l douglas loujdin	charlotte	28225	7
1139	l douglas loudin	charlotte	28205	7
423	timothy lowder	charlotte	28227	9
1564	timothy lowder	cbarlotte	282z7	9

The Deduplicator instance can be saved as a pickle file and applied to new data after training:

[14]:

import pickle

[15]:

with open('myDeduplipy.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

[16]:

with open('myDeduplipy.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)

[17]:

res = loaded_obj.predict(df_train)
res.sort_values('deduplication_id').head(10)

[17]:

	name	suburb	postcode	deduplication_id
0	khimerc thomas	charlotte	2826g	1
1190	chimerc thomas	charlotte	28269	1
1302	chimerc thmas	chaflotte	28269	1
1255	kimberly craddock	charlotte	28214	4
15	kimbefly craddock	charlotte	28264	4
1313	kimberly craddoclc	charlotte	282|4	4
39	l douglas loujdin	charlotte	28225	7
1139	l douglas loudin	charlotte	28205	7
423	timothy lowder	charlotte	28227	9
1564	timothy lowder	cbarlotte	282z7	9

Advanced deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

[18]:

from deduplipy.datasets import load_data

[19]:

df = load_data(kind='voters')

Column names: 'name', 'suburb', 'postcode'

Create a Deduplicator instance and provide advanced settings

The similarity metrics per field are passed in a dict. A similarity metric can be any function that takes two strings and outputs a number.

[20]:

from deduplipy.deduplicator import Deduplicator
from thefuzz.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio

[21]:

field_info = {'name':[ratio, partial_ratio], 'suburb':[token_set_ratio, token_sort_ratio], 'postcode':[ratio]}
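Since a similarity metric only needs to take two strings and return a number, a custom metric can be written with the standard library alone. The sketch below is an illustration, not part of DedupliPy: the name difflib_ratio is hypothetical, and scaling the result to 0-100 is a choice made here so that it is comparable with the thefuzz scores.

```python
from difflib import SequenceMatcher


def difflib_ratio(a: str, b: str) -> float:
    """Return a similarity score between 0 (nothing shared) and 100 (identical)."""
    # SequenceMatcher.ratio() lies in [0, 1]; scale it to the 0-100 range
    # that the thefuzz metrics use so the scores are comparable.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# Two spellings of the same suburb from the sample data.
print(difflib_ratio("charlotte", "chaflotte"))
```

A function like this could then be listed in field_info next to, or instead of, the thefuzz metrics for a given column.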

We define our own blocking rule and apply it only to the ‘name’ column:

[22]:

def first_two_characters(x):
    return x[:2]
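To see what a blocking rule does, the following sketch groups a few names from the sample data by their blocking key and only pairs up rows that share a key. It illustrates the general idea of blocking under simplifying assumptions; candidate_pairs is a hypothetical helper, not DedupliPy's internal implementation.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


def candidate_pairs(values, rule):
    """Group values by blocking key and pair up the rows within each block."""
    blocks = defaultdict(list)
    for index, value in enumerate(values):
        blocks[rule(value)].append(index)
    pairs = []
    for indices in blocks.values():
        # Only rows that share a blocking key are ever compared.
        pairs.extend(combinations(indices, 2))
    return pairs


names = [
    "khimerc thomas",
    "chimerc thomas",
    "chimerc thmas",
    "kimberly craddock",
    "kimbefly craddock",
]
pairs = candidate_pairs(names, first_two_characters)
print(pairs)  # [(1, 2), (3, 4)]
```

Only 2 of the 10 possible pairs are compared. In this simplified sketch, 'khimerc thomas' lands in a block of its own, which shows the trade-off: a stricter rule means fewer comparisons, but a typo in the blocking key can hide a true duplicate.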

interaction=True makes the classifier include interaction features, e.g. ratio('name') * token_set_ratio('suburb'). When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting.
We set verbose=1 to print progress information and the distribution of scores.

[23]:

myDedupliPy = Deduplicator(field_info=field_info, interaction=True, rules={'name': [first_two_characters]}, verbose=1)
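As a rough picture of what interaction features are, the sketch below multiplies every pair of similarity scores for one candidate pair. The feature names and score values are made up for illustration, and exactly which products DedupliPy builds internally is an assumption here.

```python
from itertools import combinations


def add_interactions(scores):
    """Return the original scores plus the product of every pair of scores."""
    features = dict(scores)
    for (name_a, value_a), (name_b, value_b) in combinations(scores.items(), 2):
        features[f"{name_a} * {name_b}"] = value_a * value_b
    return features


# Hypothetical similarity scores for one candidate pair (0-100 scale).
scores = {"name_ratio": 90, "suburb_token_set_ratio": 100, "postcode_ratio": 80}
features = add_interactions(scores)
print(features)
```

The number of products grows quadratically with the number of metrics, which is why a sparsity-inducing penalty such as L1 is a natural fit once interactions are switched on.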

Fit the Deduplicator using active learning: for each pair shown, enter whether it is a match (y) or not (n). Once training has converged, you will be notified and can finish training by entering ‘f’.

[ ]:

myDedupliPy.fit(df)
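Active learning asks for labels on the pairs that are expected to be most informative. One common query strategy is uncertainty sampling, sketched below: the pair whose predicted match probability is closest to 0.5 is shown next. Whether DedupliPy uses exactly this strategy is an assumption; most_uncertain and the probabilities are hypothetical.

```python
def most_uncertain(probabilities):
    """Return the index of the probability closest to 0.5."""
    return min(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))


# Hypothetical match probabilities for four unlabelled candidate pairs.
probabilities = [0.97, 0.55, 0.10, 0.30]
print(most_uncertain(probabilities))  # 1
```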

Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting:

Apply the trained Deduplicator to (new) data. The column deduplication_id identifies the cluster: rows with the same deduplication_id are considered the same real-world entity.

[25]:

res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values('deduplication_id').head(10)

blocking started
blocking finished
Nr of pairs: 27350
scoring started
scoring finished
Nr of filtered pairs: 954
Clustering started
Clustering finished

[25]:

	name	suburb	postcode	deduplication_id
1	lucille richardst	kannapolis	28o81	1
1194	lucille richards	kannapolis	28081	1
604	lutta baldwin	whiteville	28472	3
995	lutta baldwin	whitevill	28475	3
2	reb3cca bauerboand	raleigh	27615	5
1134	rebecca bauerband	raleigh	27615	5
1024	rebecca harrell	witnon	27926	7
1456	rebecca harrell	winton	27986	7
92	repecca harrell	winton	27q86	7
675	rebeccah shelton	whittier	28789	10
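The effect of score_threshold on the final clusters can be illustrated with a simplified sketch: scored pairs below the threshold are dropped, and the remaining pairs are merged into connected components. DedupliPy's own clustering step may use a different algorithm; the function cluster, the pair scores, and the use of >= for the comparison are all assumptions made for illustration.

```python
def cluster(n_rows, scored_pairs, score_threshold):
    """Label each row with a cluster representative via connected components."""
    parent = list(range(n_rows))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, score in scored_pairs:
        # Pairs scoring below the threshold never link two rows.
        if score >= score_threshold:
            parent[find(b)] = find(a)

    return [find(i) for i in range(n_rows)]


# Hypothetical (row_a, row_b, match probability) triples.
scored_pairs = [(0, 1, 0.95), (1, 2, 0.80), (3, 4, 0.05)]
labels = cluster(5, scored_pairs, score_threshold=0.1)
print(labels)  # [0, 0, 0, 3, 4]
```

Rows 0, 1 and 2 end up in one cluster even though rows 0 and 2 were never compared directly, while the low-scoring pair (3, 4) is ignored and both rows stay on their own.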