clustering¶
-
deduplipy.clustering.
hierarchical_clustering
(scored_pairs_table, col_names, cluster_threshold=0.5)¶ Apply hierarchical clustering to scored_pairs_table and perform the actual deduplication by adding a cluster id to each record
- Args:
scored_pairs_table: Pandas dataframe containg all pairs and the similarity probability score col_names: name to use for deduplication cluster_threshold: threshold to apply in hierarchical clustering
- Returns:
Pandas dataframe containing records with cluster id