Tutorial

Simple deduplication

[7]:
import pandas as pd
from deduplipy.datasets import load_data

Load data

[8]:
df_train = load_data(kind='voters')
Column names: 'name', 'suburb', 'postcode'
[9]:
df_train.head(2)
[9]:
   name               suburb      postcode
0  khimerc thomas     charlotte   2826g
1  lucille richardst  kannapolis  28o81

Import Deduplicator class

[10]:
from deduplipy.deduplicator import Deduplicator

Instantiate Deduplicator class with the column names

[11]:
myDedupliPy = Deduplicator(['name', 'suburb', 'postcode'])

Perform the fitting using active learning

[ ]:
myDedupliPy.fit(df_train)

Predict on new data

[13]:
res = myDedupliPy.predict(df_train)
res.sort_values('deduplication_id').head(10)
[13]:
      name                suburb     postcode  deduplication_id
0     khimerc thomas      charlotte  2826g     1
1190  chimerc thomas      charlotte  28269     1
1302  chimerc thmas       chaflotte  28269     1
1255  kimberly craddock   charlotte  28214     4
15    kimbefly craddock   charlotte  28264     4
1313  kimberly craddoclc  charlotte  282|4     4
39    l douglas loujdin   charlotte  28225     7
1139  l douglas loudin    charlotte  28205     7
423   timothy lowder      charlotte  28227     9
1564  timothy lowder      cbarlotte  282z7     9

The Deduplicator instance can be saved as a pickle file and applied to new data after training:

[14]:
import pickle
[15]:
with open('myDeduplipy.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)
[16]:
with open('myDeduplipy.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)
[17]:
res = loaded_obj.predict(df_train)
res.sort_values('deduplication_id').head(10)
[17]:
      name                suburb     postcode  deduplication_id
0     khimerc thomas      charlotte  2826g     1
1190  chimerc thomas      charlotte  28269     1
1302  chimerc thmas       chaflotte  28269     1
1255  kimberly craddock   charlotte  28214     4
15    kimbefly craddock   charlotte  28264     4
1313  kimberly craddoclc  charlotte  282|4     4
39    l douglas loujdin   charlotte  28225     7
1139  l douglas loudin    charlotte  28205     7
423   timothy lowder      charlotte  28227     9
1564  timothy lowder      cbarlotte  282z7     9

Advanced deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

[18]:
from deduplipy.datasets import load_data
[19]:
df = load_data(kind='voters')
Column names: 'name', 'suburb', 'postcode'

Create a Deduplicator instance and provide advanced settings

  • The similarity metrics per field are specified in a dict. A similarity metric can be any function that takes two strings and outputs a number.

[20]:
from deduplipy.deduplicator import Deduplicator
from fuzzywuzzy.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio
[21]:
field_info = {'name':[ratio, partial_ratio], 'suburb':[token_set_ratio, token_sort_ratio], 'postcode':[ratio]}
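Since any function that takes two strings and returns a number can serve as a similarity metric, a custom metric might look like the sketch below. The name `sequence_similarity` is hypothetical and not part of DedupliPy; it uses the standard-library `difflib.SequenceMatcher` and scales the result to 0-100 so that it is comparable to the fuzzywuzzy functions, which is a choice made here rather than a documented requirement.

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Hypothetical custom metric: similarity of two strings on a 0-100 scale."""
    # SequenceMatcher.ratio() returns a float in [0, 1]; multiply by 100 to
    # match the 0-100 range returned by the fuzzywuzzy functions.
    return 100 * SequenceMatcher(None, a, b).ratio()


print(sequence_similarity("charlotte", "charlotte"))  # identical strings
print(sequence_similarity("charlotte", "chaflotte"))  # one substituted character
```

Such a function could then presumably be listed in `field_info` next to the fuzzywuzzy metrics, for example `'suburb': [token_set_ratio, sequence_similarity]`.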
  • We define our own blocking rule and apply it only to the ‘name’ column:

[22]:
def first_two_characters(x):
    return x[:2]
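To see what a blocking rule buys, the sketch below applies `first_two_characters` to a handful of names from the sample output and counts the candidate pairs that remain. This only illustrates the general idea of blocking (compare records only if they share a key); it is not DedupliPy's internal implementation.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


# A few names taken from the sample output earlier in this tutorial.
names = [
    "khimerc thomas",
    "chimerc thomas",
    "chimerc thmas",
    "kimberly craddock",
    "kimbefly craddock",
    "timothy lowder",
]

# Group record indices by their blocking key.
blocks = defaultdict(list)
for idx, name in enumerate(names):
    blocks[first_two_characters(name)].append(idx)

# Only records that share a key become candidate pairs.
candidate_pairs = [
    pair for members in blocks.values() for pair in combinations(members, 2)
]

all_pairs = len(names) * (len(names) - 1) // 2
print(f"{len(candidate_pairs)} candidate pairs instead of {all_pairs}")
# prints "2 candidate pairs instead of 15"
```

The example also shows the trade-off: "khimerc thomas" and "chimerc thomas" get different keys, so this rule alone would never compare them. A stricter rule means fewer comparisons but a higher risk of missing true duplicates.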
  • interaction=True makes the classifier include interaction features, e.g. ratio('name') * token_set_ratio('suburb'). When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting.

  • We set verbose=1 to print progress information and a distribution of scores.

[23]:
myDedupliPy = Deduplicator(field_info=field_info, interaction=True, rules={'name': [first_two_characters]}, verbose=1)
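As a rough illustration of what an interaction feature is, the sketch below multiplies base similarity scores pairwise for one candidate pair. The feature names and values are hypothetical, and the way DedupliPy builds and names these features internally may differ.

```python
from itertools import combinations

# Hypothetical similarity scores for one candidate pair, scaled to [0, 1].
# The feature names are illustrative, not DedupliPy's internal column names.
features = {
    "name_ratio": 0.93,
    "suburb_token_set_ratio": 1.00,
    "postcode_ratio": 0.80,
}

# An interaction feature is the product of two base features, so it is
# high only when both underlying similarities are high.
interactions = {
    f"{a} * {b}": features[a] * features[b]
    for a, b in combinations(features, 2)
}

for name, value in interactions.items():
    print(f"{name}: {value:.3f}")
```

With n base features there are n(n-1)/2 such products, so the feature count grows quickly. This is where L1 regularisation helps: it can shrink the coefficients of uninformative terms to exactly zero.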

Fit the Deduplicator using active learning: for each pair shown, enter whether it is a match (y) or not (n). Once training has converged, you will be notified and can finish training by entering ‘f’.

[ ]:
myDedupliPy.fit(df)

Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting:

Apply the trained Deduplicator to (new) data. The column deduplication_id identifies a cluster: rows with the same deduplication_id are considered to be the same real-world entity.

[25]:
res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values('deduplication_id').head(10)
blocking started
blocking finished
Nr of pairs: 27350
scoring started
scoring finished
Nr of filtered pairs: 954
Clustering started
Clustering finished
[25]:
      name                suburb      postcode  deduplication_id
1     lucille richardst   kannapolis  28o81     1
1194  lucille richards    kannapolis  28081     1
604   lutta baldwin       whiteville  28472     3
995   lutta baldwin       whitevill   28475     3
2     reb3cca bauerboand  raleigh     27615     5
1134  rebecca bauerband   raleigh     27615     5
1024  rebecca harrell     witnon      27926     7
1456  rebecca harrell     winton      27986     7
92    repecca harrell     winton      27q86     7
675   rebeccah shelton    whittier    28789     10
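The sketch below shows how a score threshold and a clustering step can turn scored pairs into cluster identifiers. It makes two assumptions: the match probabilities are made up for illustration, and clusters are formed as plain connected components using union-find. DedupliPy's own clustering step may use a different algorithm.

```python
# Hypothetical scored pairs: (row index, row index, match probability).
# The probabilities are made up for illustration.
scored_pairs = [
    (1, 1194, 0.97),
    (604, 995, 0.88),
    (1024, 1456, 0.91),
    (1456, 92, 0.76),
    (2, 675, 0.03),
]
score_threshold = 0.1

# Drop pairs whose match probability falls below the threshold.
kept = [(a, b) for a, b, p in scored_pairs if p >= score_threshold]

# Union-find: each record starts as its own cluster.
parent = {}


def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(a, b):
    parent[find(a)] = find(b)


for a, b in kept:
    union(a, b)

# Collect the records that share a root into one cluster.
clusters = {}
for record in list(parent):
    clusters.setdefault(find(record), []).append(record)

print(sorted(sorted(c) for c in clusters.values()))
# prints [[1, 1194], [92, 1024, 1456], [604, 995]]
```

Rows 1024 and 92 end up in the same cluster through their shared link to row 1456, even though that pair was never scored directly. The pair (2, 675) falls below the threshold and is ignored.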