active_learning

class deduplipy.active_learning.ActiveStringMatchLearner(col_names: List[str], interaction: bool = False, uncertainty_threshold: float = 0.1, verbose: Union[int, bool] = 0, uncertainty_improvement_threshold: float = 0.01, min_nr_entries: int = 10)

Bases: object

Class to train a string matching model using active learning.

Parameters
  • col_names – column names to use for matching

  • interaction – whether to include interaction features

  • uncertainty_threshold – threshold on the classifier's uncertainty during active learning, used to determine whether the model has converged

  • uncertainty_improvement_threshold – threshold on the improvement in the classifier's uncertainty during active learning, used to determine whether the model has converged

  • verbose – verbosity level, given as an int or a bool

  • min_nr_entries – minimum number of responses required before classifier convergence is tested
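
The three convergence-related parameters above can be read together as a stopping rule. The sketch below is only one plausible interpretation, written for illustration; it is not taken from deduplipy's source. The function name `has_converged`, the list of per-response uncertainties, and the choice to require both thresholds at once are all assumptions.

```python
from typing import List


def has_converged(
    uncertainties: List[float],
    uncertainty_threshold: float = 0.1,
    uncertainty_improvement_threshold: float = 0.01,
    min_nr_entries: int = 10,
) -> bool:
    """Hypothetical convergence check (an assumption, not deduplipy's code).

    `uncertainties` holds one classifier uncertainty value per labelled
    response, oldest first.
    """
    # Convergence is not tested until enough responses have been collected.
    # At least two values are needed to compute an improvement.
    if len(uncertainties) < max(min_nr_entries, 2):
        return False

    latest = uncertainties[-1]
    improvement = uncertainties[-2] - latest

    # Assumed rule: the classifier is certain enough AND the last response
    # barely changed its uncertainty.
    is_certain = latest < uncertainty_threshold
    has_plateaued = abs(improvement) < uncertainty_improvement_threshold
    return is_certain and has_plateaued
```

Under this reading, a larger `min_nr_entries` delays the first convergence test, while smaller thresholds make the learner ask for more labels before stopping.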

fit(X: pandas.core.frame.DataFrame) → deduplipy.active_learning.active_learning.ActiveStringMatchLearner

Fit ActiveStringMatchLearner instance on pairs of strings

Parameters

X – Pandas dataframe containing pairs of strings

predict(X: Union[pandas.core.frame.DataFrame, numpy.ndarray]) → numpy.ndarray

Predict whether the pairs in new data are a match or not

Parameters

X – Pandas dataframe or NumPy array to predict on

Returns

predictions

predict_proba(X: Union[pandas.core.frame.DataFrame, numpy.ndarray]) → numpy.ndarray

Predict the probability that each pair in new data is a match

Parameters

X – Pandas dataframe or NumPy array to predict on

Returns

match probabilities
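
Match probabilities are often turned into match/no-match labels by applying a cutoff. The sketch below illustrates that step in plain Python. It assumes one match probability per pair and a cutoff of 0.5; the exact shape of the array returned by predict_proba and the cutoff used by predict are not stated on this page, and the helper name `labels_from_probabilities` is hypothetical.

```python
from typing import List


def labels_from_probabilities(
    match_probabilities: List[float], threshold: float = 0.5
) -> List[int]:
    """Hypothetical helper: label a pair 1 (match) when its probability
    reaches the threshold, otherwise 0 (no match)."""
    return [1 if p >= threshold else 0 for p in match_probabilities]


# Illustrative values only, not output from a trained model.
example_probabilities = [0.9, 0.2, 0.5]
example_labels = labels_from_probabilities(example_probabilities)
```

Raising the threshold trades recall for precision: fewer pairs are labelled as matches, but those that are labelled carry higher confidence.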