deduplicator

class deduplipy.deduplicator.Deduplicator(col_names: Optional[List[str]] = None, field_info: Optional[Dict] = None, interaction: bool = False, rules: Optional[Union[List[Callable], Dict]] = None, recall=1.0, save_intermediate_steps: bool = False, verbose: Union[int, bool] = 0)

Bases: object

Deduplicate entries in a Pandas dataframe using the columns named in col_names. Training takes place during a short, interactive session (interactive learning).

Example

>>> df = ...
>>> myDedupliPy = Deduplicator(['name', 'address'])
>>> myDedupliPy.fit(df)
>>> myDedupliPy.predict(df)

The result is a dataframe with a new column deduplication_id. Rows that share a deduplication_id are considered duplicates of one another.
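As a rough illustration of working with the deduplication_id column, the sketch below builds a small dataframe by hand that mimics the shape of the output described above. The row values and the names result, unique_rows and group_sizes are made up for the example. It uses only standard pandas calls and does not call DedupliPy itself.

```python
import pandas as pd

# Hypothetical stand-in for the output of predict(): the input columns
# plus a deduplication_id column (values invented for illustration).
result = pd.DataFrame({
    "name": ["John Smith", "Jon Smith", "Mary Jones"],
    "address": ["1 Main St", "1 Main Street", "5 Oak Ave"],
    "deduplication_id": [0, 0, 1],
})

# Keep one representative row per group of duplicates.
unique_rows = result.drop_duplicates(subset="deduplication_id", keep="first")

# Count how many rows fall into each group.
group_sizes = result.groupby("deduplication_id").size()
```

Keeping the first row per group is only one possible policy. Depending on the data, merging fields across the rows of a group may be more appropriate.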

Parameters
  • col_names – list of column names to be used for deduplication. If col_names is provided, field_info is ignored and can be left as None

  • field_info – dict with column names as keys and lists of metrics per column as values. Only used when col_names is None

  • interaction – whether to include interaction features

  • rules – either a list of blocking functions to use for all columns, or a dict with column names as keys and lists of blocking functions as values. If not provided, all default rules are used for all columns

  • recall – desired recall to be reached by the blocking rules

  • save_intermediate_steps – whether to save intermediate results in csv files for analysis

  • verbose – sets verbosity
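The shapes of field_info and rules, and the idea of blocking recall, can be sketched in plain Python. Everything below is hypothetical: the helper names (similarity_ratio, first_three_chars, first_token, blocking_recall) are invented for illustration, and the example assumes that a metric takes two strings and returns a score and that a blocking function maps one string to a block key. The library's own default metrics and rules may differ.

```python
from difflib import SequenceMatcher


def similarity_ratio(a: str, b: str) -> float:
    """Hypothetical metric: similarity of two strings in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def first_three_chars(value: str) -> str:
    """Hypothetical blocking function: block on the first three characters."""
    return value[:3].lower()


def first_token(value: str) -> str:
    """Hypothetical blocking function: block on the first word."""
    parts = value.split()
    return parts[0].lower() if parts else ""


# field_info: column name -> list of metrics for that column
field_info = {"name": [similarity_ratio], "address": [similarity_ratio]}

# rules as a dict: column name -> list of blocking functions for that column
rules = {"name": [first_three_chars, first_token], "address": [first_token]}


def blocking_recall(rule, duplicate_pairs):
    """Fraction of known duplicate pairs that a rule places in the same block."""
    if not duplicate_pairs:
        return 0.0
    hits = sum(1 for a, b in duplicate_pairs if rule(a) == rule(b))
    return hits / len(duplicate_pairs)


# Invented labelled duplicates, used only to show what recall measures.
known_duplicates = [
    ("John Smith", "John Smyth"),
    ("Mary Jones", "Marie Jones"),
    ("Bob Lee", "Robert Lee"),
]
recall_chars = blocking_recall(first_three_chars, known_duplicates)  # 2 of 3 pairs
recall_token = blocking_recall(first_token, known_duplicates)        # 1 of 3 pairs
```

A rule with low recall discards true duplicates before they are ever compared, which is why a recall target for the blocking step matters. Under the assumptions above, the dicts would be passed as Deduplicator(field_info=field_info, rules=rules).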

fit(X: pandas.core.frame.DataFrame, n_samples: int = 10000) → deduplipy.deduplicator.deduplicator.Deduplicator

Fit the deduplicator instance

Parameters
  • X – Pandas dataframe to be used for fitting

  • n_samples – number of pairs to be created for active learning

Returns

trained deduplicator instance

predict(X: pandas.core.frame.DataFrame, score_threshold: float = 0.1, cluster_threshold: float = 0.5, fill_missing=True) → pandas.core.frame.DataFrame

Predict on new data using the trained deduplicator.

Parameters
  • X – Pandas dataframe with the same columns as used when fitting the deduplicator instance

  • score_threshold – classification threshold used to filter pairs before hierarchical clustering starts

  • cluster_threshold – threshold to apply in hierarchical clustering

  • fill_missing – whether to apply missing value imputation on adjacency matrix

Returns

Pandas dataframe with a new column deduplication_id. Rows that share a deduplication_id are considered duplicates of one another.
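To give a feel for how the two thresholds interact, here is a simplified sketch, not the library's implementation. It filters scored pairs by score_threshold, then groups rows with a single-linkage cut at cluster_threshold, implemented with union-find as a stand-in for full hierarchical clustering. The function cluster_pairs, the (i, j, score) pair format and the choice of 1 - score as the distance are all assumptions made for this example.

```python
def cluster_pairs(n_rows, scored_pairs, score_threshold=0.1, cluster_threshold=0.5):
    """Assign a group id to each row index from scored candidate pairs.

    scored_pairs is an iterable of (i, j, score) with score in [0, 1].
    """
    # Step 1: drop pairs whose score falls below score_threshold.
    kept = [(i, j, s) for i, j, s in scored_pairs if s >= score_threshold]

    parent = list(range(n_rows))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 2: single-linkage cut. Treat 1 - score as a distance and merge
    # two rows whenever that distance is at most cluster_threshold.
    for i, j, s in kept:
        if 1.0 - s <= cluster_threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Relabel roots as consecutive ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n_rows)]


# Invented scores for five rows.
pairs = [(0, 1, 0.9), (1, 2, 0.6), (2, 3, 0.05), (3, 4, 0.3)]
group_ids = cluster_pairs(5, pairs)  # [0, 0, 0, 1, 2]
```

In this toy run the pair (2, 3) is removed by the score filter, and the pair (3, 4) survives the filter but is too distant to be merged, so rows 3 and 4 each end up in their own group.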