Tutorial
Simple deduplication
[7]:
import pandas as pd
from deduplipy.datasets import load_data
Load data
[8]:
df_train = load_data(kind='voters')
Column names: 'name', 'suburb', 'postcode'
[9]:
df_train.head(2)
[9]:
| | name | suburb | postcode |
|---|---|---|---|
| 0 | khimerc thomas | charlotte | 2826g |
| 1 | lucille richardst | kannapolis | 28o81 |
Import Deduplicator class
[10]:
from deduplipy.deduplicator import Deduplicator
Instantiate Deduplicator class with the column names
[11]:
myDedupliPy = Deduplicator(['name', 'suburb', 'postcode'])
Perform the fitting using active learning
[ ]:
myDedupliPy.fit(df_train)
Predict on new data
[13]:
res = myDedupliPy.predict(df_train)
res.sort_values('deduplication_id').head(10)
[13]:
| | name | suburb | postcode | deduplication_id |
|---|---|---|---|---|
| 0 | khimerc thomas | charlotte | 2826g | 1 |
| 1190 | chimerc thomas | charlotte | 28269 | 1 |
| 1302 | chimerc thmas | chaflotte | 28269 | 1 |
| 1255 | kimberly craddock | charlotte | 28214 | 4 |
| 15 | kimbefly craddock | charlotte | 28264 | 4 |
| 1313 | kimberly craddoclc | charlotte | 282\|4 | 4 |
| 39 | l douglas loujdin | charlotte | 28225 | 7 |
| 1139 | l douglas loudin | charlotte | 28205 | 7 |
| 423 | timothy lowder | charlotte | 28227 | 9 |
| 1564 | timothy lowder | cbarlotte | 282z7 | 9 |
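Once every row carries a deduplication_id, ordinary pandas operations are enough to inspect or collapse the clusters. The sketch below rebuilds a few rows of the output above by hand (the small frame is an illustration, not a call into DedupliPy), counts the rows per cluster, and keeps the first row of each cluster as its representative. Which row should represent a cluster is a choice left to you; keeping the first one is only an assumption made here.

```python
import pandas as pd

# Hand-copied subset of the prediction output shown above (illustrative only).
res = pd.DataFrame(
    {
        "name": ["khimerc thomas", "chimerc thomas", "chimerc thmas",
                 "kimberly craddock", "kimbefly craddock"],
        "suburb": ["charlotte", "charlotte", "chaflotte",
                   "charlotte", "charlotte"],
        "postcode": ["2826g", "28269", "28269", "28214", "28264"],
        "deduplication_id": [1, 1, 1, 4, 4],
    },
    index=[0, 1190, 1302, 1255, 15],
)

# Number of rows assigned to each cluster.
cluster_sizes = res.groupby("deduplication_id").size()

# Keep the first row of every cluster as its representative.
deduplicated = res.drop_duplicates(subset="deduplication_id", keep="first")
```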
The Deduplicator instance can be saved as a pickle file and applied to new data after training:
[14]:
import pickle
[15]:
with open('myDeduplipy.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)
[16]:
with open('myDeduplipy.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)
[17]:
res = loaded_obj.predict(df_train)
res.sort_values('deduplication_id').head(10)
[17]:
| | name | suburb | postcode | deduplication_id |
|---|---|---|---|---|
| 0 | khimerc thomas | charlotte | 2826g | 1 |
| 1190 | chimerc thomas | charlotte | 28269 | 1 |
| 1302 | chimerc thmas | chaflotte | 28269 | 1 |
| 1255 | kimberly craddock | charlotte | 28214 | 4 |
| 15 | kimbefly craddock | charlotte | 28264 | 4 |
| 1313 | kimberly craddoclc | charlotte | 282\|4 | 4 |
| 39 | l douglas loujdin | charlotte | 28225 | 7 |
| 1139 | l douglas loudin | charlotte | 28205 | 7 |
| 423 | timothy lowder | charlotte | 28227 | 9 |
| 1564 | timothy lowder | cbarlotte | 282z7 | 9 |
Advanced deduplication
Load your data. In this example we take a sample dataset that comes with DedupliPy:
[18]:
from deduplipy.datasets import load_data
[19]:
df = load_data(kind='voters')
Column names: 'name', 'suburb', 'postcode'
Create a Deduplicator instance and provide advanced settings
The similarity metrics for each field are specified in a dict. A similarity metric can be any function that takes two strings and outputs a number.
[20]:
from deduplipy.deduplicator import Deduplicator
from thefuzz.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio
[21]:
field_info = {'name':[ratio, partial_ratio], 'suburb':[token_set_ratio, token_sort_ratio], 'postcode':[ratio]}
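Because a similarity metric only has to accept two strings and return a number, you can also write your own. The two functions below are hypothetical examples built on the standard library; their names and the 0 to 100 scale (chosen to match the scores returned by thefuzz) are assumptions of this sketch, not part of DedupliPy.

```python
from difflib import SequenceMatcher


def sequence_ratio(a: str, b: str) -> float:
    """Similarity of two strings on a 0-100 scale (hypothetical helper)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def same_first_char(a: str, b: str) -> int:
    """Return 100 if both strings start with the same character, else 0."""
    return 100 if a[:1] != "" and a[:1] == b[:1] else 0


# Hypothetical usage, mirroring the dict structure shown above:
# field_info = {'name': [sequence_ratio], 'postcode': [same_first_char]}
```

Keeping custom metrics on the same scale as the built-in ones is not required by the description above, but it makes the resulting features easier to compare.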
We define our own blocking rule and apply it only to the ‘name’ column:
[22]:
def first_two_characters(x):
    return x[:2]
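To see what a blocking rule buys you, the following sketch groups values by the key the rule returns and only pairs up rows inside the same group. This illustrates the general idea of blocking under simplifying assumptions; the helper candidate_pairs is hypothetical and is not DedupliPy's internal implementation.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


def candidate_pairs(values, rule):
    """Yield index pairs whose values share the same blocking key."""
    blocks = defaultdict(list)
    for idx, value in enumerate(values):
        blocks[rule(value)].append(idx)
    for members in blocks.values():
        # Only rows within the same block are compared with each other.
        yield from combinations(members, 2)


names = [
    "kimberly craddock",
    "kimbefly craddock",
    "timothy lowder",
    "timothy lowder",
    "lucille richards",
]
pairs = sorted(candidate_pairs(names, first_two_characters))
all_pairs = list(combinations(range(len(names)), 2))
```

With these five names, only 2 of the 10 possible pairs remain as candidates. The trade-off is that a typo within the first two characters (for example "khimerc" versus "chimerc") would keep a true duplicate out of the same block under this rule alone.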
interaction=True makes the classifier include interaction features, e.g. ratio('name') * token_set_ratio('suburb'). When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting. We set verbose=1 to get information on the progress and a distribution of scores.
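As a rough illustration of what interaction features are, the sketch below appends the product of every pair of base similarity scores to the feature list. The function add_interactions and the example scores are hypothetical; how DedupliPy constructs these features internally may differ.

```python
from itertools import combinations


def add_interactions(scores):
    """Return the base scores followed by the product of every pair of them."""
    products = [a * b for a, b in combinations(scores, 2)]
    return list(scores) + products


# Three made-up base scores, e.g. one per similarity metric.
base = [0.9, 0.5, 1.0]
features = add_interactions(base)
```

Three base scores yield three extra pairwise products, so the feature count grows quadratically with the number of metrics. That growth is the reason regularisation becomes useful once interactions are switched on.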
[23]:
myDedupliPy = Deduplicator(field_info=field_info, interaction=True, rules={'name': [first_two_characters]}, verbose=1)
Fit the Deduplicator by active learning; enter whether a pair is a match (y) or not (n). When training has converged, you will be notified and can finish training by entering ‘f’.
[ ]:
myDedupliPy.fit(df)
Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting:
Apply the trained Deduplicator to (new) data. The column deduplication_id is the identifier of a cluster. Rows with the same deduplication_id are considered to be the same real-world entity.
[25]:
res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values('deduplication_id').head(10)
blocking started
blocking finished
Nr of pairs: 27350
scoring started
scoring finished
Nr of filtered pairs: 954
Clustering started
Clustering finished
[25]:
| | name | suburb | postcode | deduplication_id |
|---|---|---|---|---|
| 1 | lucille richardst | kannapolis | 28o81 | 1 |
| 1194 | lucille richards | kannapolis | 28081 | 1 |
| 604 | lutta baldwin | whiteville | 28472 | 3 |
| 995 | lutta baldwin | whitevill | 28475 | 3 |
| 2 | reb3cca bauerboand | raleigh | 27615 | 5 |
| 1134 | rebecca bauerband | raleigh | 27615 | 5 |
| 1024 | rebecca harrell | witnon | 27926 | 7 |
| 1456 | rebecca harrell | winton | 27986 | 7 |
| 92 | repecca harrell | winton | 27q86 | 7 |
| 675 | rebeccah shelton | whittier | 28789 | 10 |
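The effect of score_threshold on the final clusters can be sketched as follows: pairs scoring below the threshold are dropped, and the remaining pairs are merged into groups. This sketch uses plain connected components (via union-find) as a simple stand-in for the clustering step; the function cluster_pairs and the example scores are hypothetical, and DedupliPy's own clustering may work differently.

```python
def cluster_pairs(n_rows, scored_pairs, score_threshold=0.1):
    """Drop pairs scoring below the threshold, then merge the rest into clusters."""
    parent = list(range(n_rows))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, score in scored_pairs:
        if score >= score_threshold:
            parent[find(i)] = find(j)

    # Rows with the same root belong to the same cluster.
    return [find(i) for i in range(n_rows)]


# (row_a, row_b, match probability); the values are made up for illustration.
scored = [(0, 1, 0.95), (1, 2, 0.80), (2, 3, 0.05), (3, 4, 0.60)]
ids = cluster_pairs(5, scored)
```

Here rows 0, 1 and 2 end up in one cluster and rows 3 and 4 in another, because the pair (2, 3) scores below 0.1 and is ignored. A higher threshold discards more pairs early, which reduces the work in the clustering step at the risk of splitting true duplicates.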