blocking

class deduplipy.blocking.Blocking(col_names: List[str], rules_info: Dict, recall: float = 1.0, save_intermediate_steps: bool = False)

Bases: sklearn.base.BaseEstimator

Class for fitting blocking rules and applying them on new pairs

Parameters
  • col_names – list of column names, also the ones not included in blocking

  • rules_info – dict with column names as keys and a list of blocking functions as values

  • recall – minimum recall required

  • save_intermediate_steps – whether to save intermediate results
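
As an illustration of the shape of rules_info, here is a minimal sketch. It assumes that a blocking function maps a single field value to a blocking key, so that records sharing a key fall into the same block. The functions whole_field and first_three_chars and the column names are hypothetical stand-ins defined here for the example; they are not taken from deduplipy.

```python
from typing import Callable, Dict, List


def whole_field(value: str) -> str:
    """Hypothetical blocking function: the full field value is the key."""
    return value


def first_three_chars(value: str) -> str:
    """Hypothetical blocking function: the first three characters are the key."""
    return value[:3]


# Column names as keys, lists of candidate blocking functions as values.
# The column names "name" and "address" are assumptions for this sketch.
rules_info: Dict[str, List[Callable[[str], str]]] = {
    "name": [whole_field, first_three_chars],
    "address": [first_three_chars],
}

# Two records end up in the same block when a rule yields the same key.
same_block = first_three_chars("Johnson") == first_three_chars("Johnston")
```

A dict of this shape would be passed as the rules_info argument, with col_names listing every column, including those that have no entry in rules_info.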

fit(X: Union[pandas.core.frame.DataFrame, numpy.ndarray], y: Union[pandas.core.frame.DataFrame, numpy.ndarray]) → deduplipy.blocking.blocking.Blocking

Fit Blocking instance on data

Parameters
  • X – array containing pairs

  • y – array containing whether pairs are a match or not

Returns

fitted instance

transform(X: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Applies blocking rules on new data

Parameters

X – Pandas dataframe containing data on which blocking rules should be applied

Returns

Pandas dataframe containing blocking rules applied on new data

deduplipy.blocking.greedy_set_cover(subsets: List, parent_set: Set, recall: float = 1.0) → List

Greedy set cover algorithm, stops when recall threshold is reached

Parameters
  • subsets – subsets that should cover the parent_set

  • parent_set – parent_set that should be covered by subsets

  • recall – minimum recall to reach

Returns

list containing the selected rules, which together cover at least the fraction of parent_set given by recall
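
The greedy strategy described above can be sketched as follows. This is an independent illustration, not deduplipy's implementation: the function name greedy_set_cover_sketch, the representation of each subset as a (rule name, set) tuple, and the tie-breaking behaviour are all assumptions made for the example.

```python
from typing import Hashable, List, Set, Tuple


def greedy_set_cover_sketch(
    subsets: List[Tuple[str, Set[Hashable]]],
    parent_set: Set[Hashable],
    recall: float = 1.0,
) -> List[str]:
    """Pick subsets greedily until `recall` of parent_set is covered."""
    uncovered = set(parent_set)
    remaining = list(subsets)
    selected: List[str] = []
    target = recall * len(parent_set)  # number of elements that must be covered
    covered = 0

    while covered < target and remaining:
        # Choose the subset that covers the most still-uncovered elements.
        best = max(remaining, key=lambda item: len(item[1] & uncovered))
        gain = len(best[1] & uncovered)
        if gain == 0:
            break  # no remaining subset adds coverage
        selected.append(best[0])
        uncovered -= best[1]
        covered += gain
        remaining.remove(best)

    return selected


# Hypothetical data: each rule covers some of the matching pairs 1..5.
parent = {1, 2, 3, 4, 5}
rules = [("a", {1, 2, 3}), ("b", {3, 4}), ("c", {5}), ("d", {4, 5})]

full_cover = greedy_set_cover_sketch(rules, parent)
partial_cover = greedy_set_cover_sketch(rules, parent, recall=0.6)
```

With recall below 1.0 the loop stops earlier, so fewer rules are selected at the cost of leaving some elements of the parent set uncovered.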