blocking

class deduplipy.blocking.Blocking(col_names: List[str], rules_info: Dict, recall: float = 1.0, save_intermediate_steps: bool = False)

Bases: sklearn.base.BaseEstimator

fit(X: Union[pandas.core.frame.DataFrame, numpy.ndarray], y: Union[pandas.core.frame.DataFrame, numpy.ndarray]) -> deduplipy.blocking.blocking.Blocking

Fit Blocking instance on data

Args:

X: array containing pairs

y: array indicating whether each pair is a match

Returns:

fitted instance

transform(X: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame

Apply the fitted blocking rules to new data

Args:

X: Pandas dataframe containing data on which blocking rules should be applied

Returns:

Pandas dataframe containing blocking rules applied on new data
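
As a conceptual illustration of what applying a blocking rule means, the sketch below groups records by a key and only pairs records that share it. It uses only the standard library and does not reproduce deduplipy's implementation or the layout of the dataframe returned by transform; the rule `first_three_chars` and the sample names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def first_three_chars(value: str) -> str:
    """Hypothetical blocking rule: lowercased first three characters."""
    return value[:3].lower()


names = ["Jansen", "Janssen", "Smith", "Smit", "Jones"]

# Group records by the key that the blocking rule produces.
blocks = defaultdict(list)
for name in names:
    blocks[first_three_chars(name)].append(name)

# Only records within the same block become candidate pairs.
candidate_pairs = [
    pair
    for members in blocks.values()
    for pair in combinations(members, 2)
]
```

Here only "Jansen"/"Janssen" and "Smith"/"Smit" are compared further, instead of all ten possible pairs.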

deduplipy.blocking.greedy_set_cover(subsets: List, parent_set: Set, recall: float = 1.0) -> List

Greedy set cover algorithm, stops when recall threshold is reached

Args:

subsets: subsets that should cover the parent_set

parent_set: parent_set that should be covered by subsets

recall: minimum recall to reach

Returns:

list containing selection of rules that collectively span the parent_set
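
The general technique can be sketched as follows. This is a minimal, independent sketch of greedy set cover with a recall threshold, not deduplipy's own code: the function name `greedy_cover_sketch`, the assumption that each subset is a plain Python set, and the choice to return indices rather than the subsets themselves are all made for illustration.

```python
from typing import Hashable, List, Set


def greedy_cover_sketch(subsets: List[Set[Hashable]],
                        parent_set: Set[Hashable],
                        recall: float = 1.0) -> List[int]:
    """Return indices of subsets picked greedily until `recall` is reached."""
    if not parent_set or not subsets:
        return []
    uncovered = set(parent_set)
    chosen: List[int] = []
    # Continue while the covered fraction of parent_set is below the threshold.
    while (len(parent_set) - len(uncovered)) / len(parent_set) < recall:
        # Pick the subset that covers the most still-uncovered elements.
        best = max(range(len(subsets)),
                   key=lambda i: len(subsets[i] & uncovered))
        gain = subsets[best] & uncovered
        if not gain:
            break  # no subset adds coverage, so the threshold is unreachable
        chosen.append(best)
        uncovered -= gain
    return chosen


subsets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {8}]
parent = {1, 2, 3, 4, 5, 6, 7, 8}
full = greedy_cover_sketch(subsets, parent)          # recall = 1.0
partial = greedy_cover_sketch(subsets, parent, 0.8)  # stops once 80% is covered
```

With a recall below 1.0 the loop can stop early, which in this example drops the subset that only contributes the single element 8.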