deduplicator

class deduplipy.deduplicator.Deduplicator(col_names=None, field_info=None, interaction=False, rules=None, recall=1.0, save_intermediate_steps=False, verbose=0)

Bases: object

fit(X, n_samples=10000)

Fit the deduplicator instance

Args:

X: Pandas dataframe to be used for fitting

n_samples: number of pairs to be created for active learning

Returns: trained deduplicator instance

predict(X, score_threshold=0.1)

Predict on new data using the trained deduplicator.

Args:

X: Pandas dataframe with the same columns as used when fitting the deduplicator instance

score_threshold: Classification threshold to use for filtering before starting hierarchical clustering

Returns: Pandas dataframe with a new column deduplication_id. Rows that share the same deduplication_id are considered duplicates of one another.
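
The sketch below shows how score_threshold and deduplication_id could relate. It is not DedupliPy's implementation. It assumes that pairs scoring below the threshold are dropped and that the remaining pairs link rows into groups. Union-find grouping stands in for the hierarchical clustering step to keep the example short. The helper assign_deduplication_ids, the sample rows and the pair scores are hypothetical.

```python
from collections import defaultdict


def assign_deduplication_ids(n_rows, scored_pairs, score_threshold=0.1):
    """Hypothetical helper: drop low-scoring pairs, then group linked rows.

    scored_pairs holds (row_i, row_j, score) tuples. Union-find is a
    simplified stand-in for hierarchical clustering (an assumption).
    """
    parent = list(range(n_rows))

    def find(i):
        # Follow parent links to the group's root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, score in scored_pairs:
        # Assumed filtering rule: keep pairs at or above the threshold.
        if score >= score_threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Use the smaller row index as the group's id.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    return [find(i) for i in range(n_rows)]


# Made-up records and pair scores, for illustration only.
rows = [
    "Jane Doe, 1 Main St",
    "Jane  Doe, 1 Main Street",
    "John Smith, 9 Oak Ave",
    "J. Smith, 9 Oak Avenue",
]
scored_pairs = [(0, 1, 0.93), (2, 3, 0.71), (1, 2, 0.04)]

ids = assign_deduplication_ids(len(rows), scored_pairs, score_threshold=0.1)

# Rows that share an id are treated as one entity.
groups = defaultdict(list)
for row, dedup_id in zip(rows, ids):
    groups[dedup_id].append(row)

for dedup_id, members in groups.items():
    print(dedup_id, members)
```

In this sketch a lower threshold keeps more candidate pairs and so tends to merge more rows into one group, while a higher threshold keeps groups smaller. To keep one row per entity from the dataframe that predict returns, a common pandas approach is drop_duplicates(subset="deduplication_id").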