clustering

deduplipy.clustering.fill_missing_links(matrix, convergence_threshold=0.01)

Fill missing values in an adjacency matrix using SoftImpute. Entries equal to zero are treated as missing, because zero is the default value that the nx.to_numpy_matrix function writes when there is no edge between two nodes.

Parameters

matrix – adjacency matrix
convergence_threshold – convergence threshold for SoftImpute algorithm

Returns

Numpy adjacency matrix with imputed missing values
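The zero-as-missing convention can be illustrated with a small standard-library sketch. It is not deduplipy's code: build_adjacency and missing_links are hypothetical helpers, and plain nested lists stand in for the Numpy matrix. The sketch only shows which entries an imputation step would be asked to fill.

```python
# Hypothetical helpers for illustration only; not part of deduplipy.

def build_adjacency(scored_pairs, n_records):
    """Build a symmetric adjacency matrix; pairs without a score stay at 0."""
    adjacency = [[0.0] * n_records for _ in range(n_records)]
    for i, j, score in scored_pairs:
        adjacency[i][j] = score
        adjacency[j][i] = score
    return adjacency


def missing_links(adjacency):
    """Return the off-diagonal pairs (i < j) whose entry is zero."""
    n = len(adjacency)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if adjacency[i][j] == 0
    ]


# Records 0-1 and 1-2 were scored, but the pair 0-2 was never compared.
scored_pairs = [(0, 1, 0.9), (1, 2, 0.8)]
adjacency = build_adjacency(scored_pairs, n_records=3)
to_impute = missing_links(adjacency)
```

Here to_impute contains only the pair (0, 2): the one link with no score, which under this convention cannot be told apart from a true score of zero.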

deduplipy.clustering.hierarchical_clustering(scored_pairs_table: pandas.core.frame.DataFrame, col_names: List, cluster_threshold: float = 0.5, fill_missing=True) → pandas.core.frame.DataFrame

Apply hierarchical clustering to scored_pairs_table and perform the actual deduplication by adding a cluster id to each record.

Parameters

scored_pairs_table – Pandas dataframe containing all pairs and the similarity probability score
col_names – names of the columns to use for deduplication
cluster_threshold – threshold to apply in hierarchical clustering
fill_missing – whether to impute missing values in the adjacency matrix using SoftImpute; if False, missing values in the adjacency matrix are filled with zeros

Returns

Pandas dataframe containing records with cluster id
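The effect of a cluster threshold on scored pairs can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: cluster_scored_pairs is a hypothetical function, distance is assumed to be 1 minus the similarity score, average linkage is assumed, merging is assumed to stop once the closest pair of clusters is farther apart than 1 - cluster_threshold, and unscored pairs are assumed to have similarity zero (as with fill_missing=False).

```python
from itertools import combinations


def cluster_scored_pairs(scores, n_records, cluster_threshold=0.5):
    """Assign a cluster id to each record (hypothetical helper).

    scores maps (i, j) with i < j to a similarity in [0, 1].
    Pairs that are absent from scores count as similarity 0 (assumption).
    """
    def similarity(i, j):
        return scores.get((min(i, j), max(i, j)), 0.0)

    # Start with every record in its own cluster.
    clusters = [[i] for i in range(n_records)]
    max_distance = 1.0 - cluster_threshold  # assumed cut-off

    while len(clusters) > 1:
        # Find the two clusters with the smallest average distance.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            avg_dist = sum(1.0 - similarity(i, j) for i, j in pairs) / len(pairs)
            if best is None or avg_dist < best[0]:
                best = (avg_dist, a, b)

        avg_dist, a, b = best
        if avg_dist > max_distance:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a

    return {
        record: cluster_id
        for cluster_id, members in enumerate(clusters)
        for record in members
    }


scores = {(0, 1): 0.9, (0, 2): 0.85, (1, 2): 0.8, (3, 4): 0.95}
labels = cluster_scored_pairs(scores, n_records=5, cluster_threshold=0.5)
```

With these scores, records 0, 1 and 2 end up sharing one cluster id and records 3 and 4 share another, since no scored pair links the two groups. Under the assumptions above, raising cluster_threshold makes merging stricter and yields more, smaller clusters.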