Tutorial

Simple deduplication

[1]:

import pandas as pd
from deduplipy.datasets import load_data

Load data

[2]:

df_train = load_data(kind='voters')

Column names: 'name', 'suburb', 'postcode'

[3]:

df_train.head(2)

[3]:

	name	suburb	postcode
0	khimerc thomas	charlotte	2826g
1	lucille richardst	kannapolis	28o81

Import Deduplicator class

[4]:

from deduplipy.deduplicator import Deduplicator

Instantiate Deduplicator class with the column names

[5]:

myDedupliPy = Deduplicator(['name', 'suburb', 'postcode'])

Perform the fitting using active learning

[ ]:

myDedupliPy.fit(df_train)

Predict on (new) data; here we reuse the training data

[7]:

res = myDedupliPy.predict(df_train)
res.sort_values('deduplication_id').head(10)

[7]:

	name	suburb	postcode	deduplication_id
252	kiera matthews	charlotte	28216	1
1380	kiea matthews	charlotte	28218	1
0	khimerc thomas	charlotte	2826g	2
1190	chimerc thomas	charlotte	28269	2
1302	chimerc thmas	chaflotte	28269	2
1255	kimberly craddock	charlotte	28214	6
15	kimbefly craddock	charlotte	28264	6
1313	kimberly craddoclc	charlotte	282|4	6
39	l douglas loujdin	charlotte	28225	9
1139	l douglas loudin	charlotte	28205	9

After training, the Deduplicator instance can be saved as a pickle file and later applied to new data:

[8]:

import pickle

[9]:

with open('myDeduplipy.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

[11]:

with open('myDeduplipy.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)

[13]:

res = loaded_obj.predict(df_train)
res.sort_values('deduplication_id').head(10)

[13]:

	name	suburb	postcode	deduplication_id
252	kiera matthews	charlotte	28216	1
1380	kiea matthews	charlotte	28218	1
0	khimerc thomas	charlotte	2826g	2
1190	chimerc thomas	charlotte	28269	2
1302	chimerc thmas	chaflotte	28269	2
1255	kimberly craddock	charlotte	28214	6
15	kimbefly craddock	charlotte	28264	6
1313	kimberly craddoclc	charlotte	282|4	6
39	l douglas loujdin	charlotte	28225	9
1139	l douglas loudin	charlotte	28205	9

Advanced deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

[14]:

from deduplipy.datasets import load_data

[15]:

df = load_data(kind='voters')

Column names: 'name', 'suburb', 'postcode'

Create a Deduplicator instance and provide advanced settings

The similarity metrics for each field are given in a dict. A similarity metric can be any function that takes two strings and returns a number.

[16]:

from deduplipy.deduplicator import Deduplicator
from fuzzywuzzy.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio

[17]:

field_info = {'name':[ratio, partial_ratio], 'suburb':[token_set_ratio, token_sort_ratio], 'postcode':[ratio]}
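Because a similarity metric only needs to take two strings and return a number, you can also write your own. The sketch below is a hypothetical metric built on the standard library's `difflib`; the name `sequence_similarity`, the 0-100 scale (chosen to resemble the fuzzywuzzy functions) and the `custom_field_info` dict are assumptions made for illustration, not part of DedupliPy.

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Hypothetical custom metric: similarity of two strings on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# Hypothetical field_info that uses the custom metric for one column
custom_field_info = {'postcode': [sequence_similarity]}

# Identical strings score 100; strings differing in one of five characters score lower
print(sequence_similarity('28269', '28269'))
print(sequence_similarity('28269', '2826g'))
```

Any callable with this two-strings-in, number-out shape should fit the description above; whether a particular scale works well with the classifier is something to check on your own data.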

We define our own set of blocking rules.

[18]:

def first_two_characters(x):
    return x[:2]
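To see what such a rule does, the following sketch illustrates the general idea of blocking: records are grouped by the key the rule returns, and only records that share a key are compared. This is a simplified stand-in written with the standard library, not DedupliPy's internal implementation; the `names` list is a small hand-picked sample.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


names = ['khimerc thomas', 'chimerc thomas', 'chimerc thmas',
         'kiera matthews', 'kiea matthews']

# Group record indices by blocking key
blocks = defaultdict(list)
for idx, name in enumerate(names):
    blocks[first_two_characters(name)].append(idx)

# Only records inside the same block become candidate pairs
candidate_pairs = [pair
                   for members in blocks.values()
                   for pair in combinations(members, 2)]

all_pairs = len(names) * (len(names) - 1) // 2
print(f'{len(candidate_pairs)} candidate pairs instead of {all_pairs}')
```

Note that 'khimerc thomas' and 'chimerc thomas' land in different blocks here, so a single rule can miss true duplicates. That is one reason to combine several rules in practice.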

interaction=True makes the classifier include interaction features, e.g. ratio('name') * token_set_ratio('suburb'). When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting.
We set verbose=1 to get information on the progress and a distribution of scores.

[19]:

myDedupliPy = Deduplicator(field_info=field_info, interaction=True, rules=[first_two_characters], verbose=1)
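As a rough illustration of what interaction features are, the sketch below multiplies every pair of similarity scores for one candidate pair. The feature names and values are hypothetical, and how DedupliPy actually constructs these features internally is not shown in this tutorial, so treat this only as a picture of the idea.

```python
from itertools import combinations

# Hypothetical similarity scores for one candidate pair, scaled to 0-1
features = {
    'name_ratio': 0.9,
    'suburb_token_set_ratio': 1.0,
    'postcode_ratio': 0.8,
}

# One interaction feature per unordered pair of base features
interactions = {
    f'{a} * {b}': features[a] * features[b]
    for a, b in combinations(features, 2)
}

for name, value in interactions.items():
    print(name, round(value, 2))
```

With n base features this adds n * (n - 1) / 2 extra columns, which is why regularisation becomes useful once interactions are switched on.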

Fit the Deduplicator by active learning: for each pair shown, enter whether it is a match (y) or not (n). Once training has converged, you will be notified and can finish training by entering 'f'.

[ ]:

myDedupliPy.fit(df)
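Active learning asks you to label the pairs that are expected to be most informative. A common strategy is uncertainty sampling, where the next pair shown is the one whose predicted match probability is closest to 0.5. The sketch below illustrates that strategy only; whether DedupliPy uses exactly this query strategy is an assumption, and `most_uncertain` is a hypothetical helper.

```python
def most_uncertain(probabilities):
    """Return the index of the pair whose match probability is closest to 0.5."""
    return min(range(len(probabilities)),
               key=lambda i: abs(probabilities[i] - 0.5))


# Hypothetical match probabilities for four unlabelled pairs
probabilities = [0.02, 0.97, 0.55, 0.30]
next_pair = most_uncertain(probabilities)
print(next_pair)
```

Pairs the model is already confident about (here 0.02 and 0.97) add little when labelled, so they are queried last.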

Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting.

Apply the trained Deduplicator to (new) data. The column deduplication_id identifies a cluster: rows with the same deduplication_id are considered to be the same real-world entity.

[21]:

res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values('deduplication_id').head(10)

blocking started
blocking finished
Nr of pairs: 748767
scoring started
scoring finished
Nr of filtered pairs: 944
Clustering started
Clustering finished

[21]:

	name	suburb	postcode	deduplication_id
0	khimerc thomas	charlotte	2826g	1
1190	chimerc thomas	charlotte	28269	1
1302	chimerc thmas	chaflotte	28269	1
1	lucille richardst	kannapolis	28o81	4
1194	lucille richards	kannapolis	28081	4
1449	darryl perry	fayetteville	28301	6
5	darr6l perry	fayetteville	28321	6
966	daryl perry	fayetteville	2830l	6
7	judith gerde5	sunset beach	28q68	9
1188	judith gerdes	sunset beach	28468	9
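The log above shows the order of steps: candidate pairs are scored, pairs below the score threshold are filtered out, and the remaining pairs are clustered. The sketch below mimics the last two steps with a simple connected-components grouping (union-find). This is a stand-in chosen for brevity; DedupliPy's actual clustering step may be more involved, and `cluster_pairs` and the sample scores are hypothetical.

```python
def cluster_pairs(scored_pairs, n_records, score_threshold=0.1):
    """Assign a cluster id to each record from (i, j, score) pairs."""
    parent = list(range(n_records))

    def find(i):
        # Follow parent links to the root, compressing the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, score in scored_pairs:
        if score >= score_threshold:  # drop pairs below the threshold
            parent[find(i)] = find(j)

    # Number the clusters 1, 2, ... in order of first appearance
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(n_records)]


# Hypothetical scored pairs: (record index, record index, match probability)
scored_pairs = [(0, 1, 0.95), (1, 2, 0.80), (3, 4, 0.90), (2, 3, 0.05)]
cluster_ids = cluster_pairs(scored_pairs, n_records=5, score_threshold=0.1)
print(cluster_ids)
```

Lowering the threshold below 0.05 would keep the weak link between records 2 and 3 and merge everything into one cluster, which shows why the threshold is worth choosing from the score distribution.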