blocking
-
class deduplipy.blocking.Blocking(col_names: List[str], rules_info: Dict, recall: float = 1.0, save_intermediate_steps: bool = False)
Bases: sklearn.base.BaseEstimator
Class for fitting blocking rules and applying them to new pairs
- Parameters
col_names – list of all column names, including those not used in blocking
rules_info – dict with column names as keys and a list of blocking functions as values
recall – minimum recall required
save_intermediate_steps – whether to save intermediate results
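
As an illustration of the shape `rules_info` is described as having, here is a minimal sketch. The two blocking functions and the column names are hypothetical and defined locally. It also assumes that a blocking function maps a single field value to a blocking key, which the description above does not state.

```python
from typing import Callable, Dict, List


def first_two_characters(value: str) -> str:
    """Hypothetical blocking function: key on the first two characters."""
    return value[:2]


def whole_field(value: str) -> str:
    """Hypothetical blocking function: key on the entire field value."""
    return value


# All columns in the data, including ones that get no blocking rule.
col_names: List[str] = ["name", "address", "phone"]

# Column name -> list of candidate blocking functions for that column.
rules_info: Dict[str, List[Callable[[str], str]]] = {
    "name": [first_two_characters, whole_field],
    "address": [first_two_characters],
}

# Under the assumption above, two values fall into the same block
# for a rule when the rule produces the same key for both.
key_a = rules_info["name"][0]("Johnson")
key_b = rules_info["name"][0]("Jonson")
same_block = key_a == key_b
```

Under these assumptions, such objects would be passed as the `col_names` and `rules_info` arguments of `Blocking`.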
-
fit(X: Union[pandas.core.frame.DataFrame, numpy.ndarray], y: Union[pandas.core.frame.DataFrame, numpy.ndarray]) → deduplipy.blocking.blocking.Blocking
Fit Blocking instance on data
- Parameters
X – array containing pairs
y – array indicating whether each pair is a match
- Returns
fitted instance
-
transform(X: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Applies blocking rules to new data
- Parameters
X – Pandas dataframe containing data on which blocking rules should be applied
- Returns
Pandas dataframe containing the result of applying the blocking rules to the new data
-
deduplipy.blocking.greedy_set_cover(subsets: List, parent_set: Set, recall: float = 1.0) → List
Greedy set cover algorithm that stops when the recall threshold is reached
- Parameters
subsets – subsets that should cover the parent_set
parent_set – parent_set that should be covered by subsets
recall – minimum recall to reach
- Returns
list containing the selection of rules that collectively cover the parent_set up to the recall threshold
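
The general technique can be sketched as follows. This is not the library's implementation. It is a plain greedy set cover written for illustration, under the assumptions that each subset is a Python set and that recall means the fraction of `parent_set` covered so far. The function name `greedy_set_cover_sketch` is hypothetical.

```python
from typing import Hashable, List, Set


def greedy_set_cover_sketch(
    subsets: List[Set[Hashable]],
    parent_set: Set[Hashable],
    recall: float = 1.0,
) -> List[Set[Hashable]]:
    """Greedily pick subsets until `recall` of parent_set is covered."""
    total = len(parent_set)
    if total == 0:
        return []

    uncovered = set(parent_set)
    remaining = [set(s) for s in subsets]
    selected: List[Set[Hashable]] = []

    # Stop once the covered fraction reaches the recall threshold.
    while (total - len(uncovered)) / total < recall and remaining:
        # Pick the subset that covers the most still-uncovered elements.
        best = max(remaining, key=lambda s: len(s & uncovered))
        if not best & uncovered:
            break  # no remaining subset adds coverage
        selected.append(best)
        uncovered -= best
        remaining.remove(best)

    return selected


subsets = [{1, 2, 3}, {3, 4}, {4, 5}, {1}]
parent_set = {1, 2, 3, 4, 5}

full_cover = greedy_set_cover_sketch(subsets, parent_set, recall=1.0)
partial_cover = greedy_set_cover_sketch(subsets, parent_set, recall=0.6)
```

Lowering `recall` lets the loop stop earlier, so fewer subsets are selected at the cost of leaving part of `parent_set` uncovered.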