clustering
-
deduplipy.clustering.fill_missing_links(matrix, convergence_threshold=0.01)
Fill missing values in an adjacency matrix using SoftImpute. Missing values are assumed to be encoded as zero, because the nx.to_numpy_matrix function writes zero by default when there is no edge between two nodes.
- Args:
matrix: adjacency matrix
convergence_threshold: convergence threshold for the SoftImpute algorithm
- Returns:
Numpy adjacency matrix with imputed missing values
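The zero-as-missing convention described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper names `adjacency_from_pairs` and `missing_mask` are hypothetical, and the sketch only shows which entries would be candidates for imputation, not the SoftImpute step itself.

```python
def adjacency_from_pairs(nodes, scored_pairs):
    """Build a symmetric adjacency matrix; pairs that were never scored stay 0.0."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for a, b, score in scored_pairs:
        i, j = index[a], index[b]
        matrix[i][j] = score
        matrix[j][i] = score
    return matrix


def missing_mask(matrix):
    """Flag every zero entry as missing (assumption: this includes the diagonal)."""
    return [[value == 0.0 for value in row] for row in matrix]


nodes = ["a", "b", "c"]
pairs = [("a", "b", 0.9), ("b", "c", 0.8)]  # the pair (a, c) was never scored
adjacency = adjacency_from_pairs(nodes, pairs)
mask = missing_mask(adjacency)
```

Here `mask[0][2]` is true because no score exists for the pair (a, c), so that entry would be one of the values an imputation step tries to fill.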
-
deduplipy.clustering.hierarchical_clustering(scored_pairs_table: pandas.core.frame.DataFrame, col_names: List, cluster_threshold: float = 0.5, fill_missing=True) → pandas.core.frame.DataFrame
Apply hierarchical clustering to scored_pairs_table and perform the actual deduplication by adding a cluster id to each record.
- Args:
scored_pairs_table: Pandas dataframe containing all pairs and their similarity probability scores
col_names: column names to use for deduplication
cluster_threshold: threshold to apply in hierarchical clustering
fill_missing: whether to impute missing values in the adjacency matrix using SoftImpute; otherwise missing values in the adjacency matrix are filled with zeros
- Returns:
Pandas dataframe containing records with cluster id
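To make the role of cluster_threshold concrete, here is a minimal pure-Python sketch of threshold-based agglomerative clustering. It rests on assumptions not confirmed by this page: distance is taken as 1 minus the similarity score, average linkage is used, and merging stops once the closest clusters are farther apart than the threshold. The function name `cluster_by_threshold` is hypothetical, and the library's own linkage method may differ.

```python
def cluster_by_threshold(nodes, similarity, cluster_threshold=0.5):
    """Average-linkage agglomerative clustering on distance = 1 - similarity.

    Merging stops once the closest pair of clusters is farther apart than
    cluster_threshold. Returns a dict mapping each node to a cluster id.
    """
    clusters = [[i] for i in range(len(nodes))]

    def average_distance(c1, c2):
        total = sum(1.0 - similarity[i][j] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > cluster_threshold:
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    return {
        nodes[i]: cluster_id
        for cluster_id, members in enumerate(clusters)
        for i in members
    }


nodes = ["r1", "r2", "r3"]
similarity = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
cluster_ids = cluster_by_threshold(nodes, similarity, cluster_threshold=0.5)
```

In this toy input, r1 and r2 end up with the same cluster id because their distance (about 0.1) is below the threshold, while r3 stays in its own cluster.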