blocking
class deduplipy.blocking.Blocking(col_names, rules=None, recall=1.0, save_intermediate_steps=False)
Bases: sklearn.base.BaseEstimator
fit(X, y)
Fit the Blocking instance on data.
- Args:
    X: array containing pairs
    y: array indicating whether each pair is a match or not
- Returns:
    fitted instance
transform(X)
Apply the blocking rules to new data.
- Args:
    X: Pandas dataframe containing the data to which the blocking rules should be applied
- Returns:
    Pandas dataframe containing the result of applying the blocking rules to the new data
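To illustrate what applying a blocking rule means in general, the sketch below groups records by a key derived from one column and only pairs up records that share a key. It is a conceptual illustration, not deduplipy's implementation: it uses plain Python lists instead of the Pandas dataframe that transform expects, and the names `first_three_chars` and `candidate_pairs` as well as the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def first_three_chars(value):
    """Hypothetical blocking rule: lower-cased first three characters."""
    return value.strip().lower()[:3]


def candidate_pairs(records, col_name, rule):
    """Group records by the rule's key and pair up records sharing a key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[rule(record[col_name])].append(index)

    pairs = []
    for indices in blocks.values():
        # Only records inside the same block become candidate pairs.
        pairs.extend(combinations(indices, 2))
    return sorted(pairs)


# Made-up sample data, for illustration only.
records = [
    {"name": "Johnson Ltd"},
    {"name": "johnson limited"},
    {"name": "Smith & Co"},
    {"name": "Smyth and Co"},
]
pairs = candidate_pairs(records, "name", first_three_chars)
```

Here only the two "Johnson" records end up in the same block. "Smith" and "Smyth" receive different keys, which shows why a single rule can miss true matches and why several rules are usually combined.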
deduplipy.blocking.greedy_set_cover(subsets, parent_set, recall=1.0)
Greedy set cover algorithm; stops when the recall threshold is reached.
- Args:
    subsets: subsets that should cover the parent_set
    parent_set: set that should be covered by the subsets
    recall: minimum recall to reach
- Returns:
    list containing the selection of rules that collectively cover the parent_set up to the recall threshold
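A greedy set cover with a recall threshold could be sketched as follows. This is a minimal, independent illustration of the technique, not deduplipy's code: the function name `greedy_cover_sketch` is hypothetical, and `subsets` is assumed to be a dict mapping a rule name to the set of items that rule covers, which may differ from the input structure the library expects.

```python
def greedy_cover_sketch(subsets, parent_set, recall=1.0):
    """Greedily select subsets until `recall` of parent_set is covered.

    Assumption: `subsets` maps a rule name to the set of items it covers.
    """
    parent = set(parent_set)
    required = recall * len(parent)  # number of items that must be covered
    # Ignore items that are not part of the parent set.
    remaining = {name: set(items) & parent for name, items in subsets.items()}

    covered = set()
    selected = []
    while len(covered) < required and remaining:
        # Pick the subset that adds the most not-yet-covered items.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining subset improves coverage
        selected.append(best)
        covered |= gain
    return selected


# Made-up example: ten items and four hypothetical rules.
items = set(range(1, 11))
rules = {
    "rule_a": {1, 2, 3, 4, 5, 6},
    "rule_b": {5, 6, 7, 8},
    "rule_c": {9, 10},
    "rule_d": {1, 2},
}
full_cover = greedy_cover_sketch(rules, items)
partial_cover = greedy_cover_sketch(rules, items, recall=0.75)
```

With `recall=1.0` the sketch selects three rules and skips `rule_d`, which adds nothing new. With `recall=0.75` it stops after two rules, because eight of the ten items are already covered. Lowering the recall trades a few missed items for a shorter list of rules.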