clustering¶

deduplipy.clustering.hierarchical_clustering(scored_pairs_table, col_names, cluster_threshold=0.5)¶

Apply hierarchical clustering to scored_pairs_table and perform the actual deduplication by adding a cluster id to each record

Args:: scored_pairs_table: Pandas dataframe containg all pairs and the similarity probability score col_names: name to use for deduplication cluster_threshold: threshold to apply in hierarchical clustering
Returns:: Pandas dataframe containing records with cluster id

Read the Docs v: v0.2

Versions: latest; stable; v0.2; v0.1.2

Downloads

On Read the Docs: Project Home; Builds