sampling

class deduplipy.sampling.MinHashSampler(col_names: List[str], n_hash_tables=10, ngram_range: Tuple[int] = (1, 1), analyzer: str = 'word')

Bases: deduplipy.sampling.sampler.Sampler

Class that creates a sample of record pairs from the columns in col_names by applying minhashing with n_hash_tables hash tables. The scikit-learn CountVectorizer is used for tokenization.

Parameters
  • col_names – column names to use for creating pairs

  • n_hash_tables – number of hash tables to use for hashing

  • analyzer – how CountVectorizer creates tokens, e.g. 'word' or 'char'

  • ngram_range – range of n-gram sizes the CountVectorizer uses
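
To illustrate the idea behind the parameters above, the following is a minimal, standard-library-only sketch of minhashing over word tokens. It is not deduplipy's implementation: the helper names (`tokens`, `minhash_signature`, `estimated_jaccard`) are hypothetical, a plain `split()` stands in for CountVectorizer, and one seeded hash function stands in for each hash table.

```python
import hashlib


def tokens(text):
    """Lowercased word tokens; a rough stand-in for a word-level analyzer."""
    return set(text.lower().split())


def _hash(token, seed):
    """Deterministic 64-bit hash of a token for a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(token_set, n_hashes=10):
    """One minimum hash value per seeded hash function (token_set must be non-empty)."""
    return [min(_hash(t, seed) for t in token_set) for seed in range(n_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = minhash_signature(tokens("12 main street amsterdam"))
b = minhash_signature(tokens("12 main st amsterdam"))
print(estimated_jaccard(a, b))  # an estimate between 0.0 and 1.0
```

With more hash functions the estimate tends to get closer to the true Jaccard similarity, at the cost of more hashing work per record.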

sample(X: pandas.core.frame.DataFrame, n_samples: int, threshold: float = 0.2) → pandas.core.frame.DataFrame

Method to draw a sample of n_samples pairs from dataframe X. Note that fewer than n_samples pairs may be returned if too few pairs have a similarity above the threshold.

Parameters
  • X – Pandas dataframe containing records to create a sample of pairs from

  • n_samples – number of samples to create

  • threshold – Jaccard threshold for pair inclusion

Returns

Pandas dataframe containing the sampled pairs
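
The effect of `threshold` on the sample size can be sketched with exact Jaccard similarity on plain Python sets. This is an illustration under stated assumptions, not deduplipy's code: `jaccard` and `sample_pairs` are hypothetical helpers, records are plain strings rather than dataframe rows, and "above the threshold" is assumed to mean strictly greater than.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sample_pairs(records, n_samples, threshold=0.2, seed=0):
    """Sample up to n_samples index pairs whose Jaccard similarity exceeds threshold."""
    token_sets = [set(r.lower().split()) for r in records]
    # Assumption: a pair qualifies only if its similarity is strictly above threshold.
    candidates = [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) > threshold
    ]
    rng = random.Random(seed)
    # Fewer than n_samples pairs come back when there are not enough candidates.
    return rng.sample(candidates, min(n_samples, len(candidates)))


records = ["12 main street amsterdam", "12 main st amsterdam", "5 church lane utrecht"]
pairs = sample_pairs(records, n_samples=5)
print(pairs)  # [(0, 1)]
```

Five pairs were requested, but only the first two records share enough tokens (3 of 5 distinct tokens, similarity 0.6), so a single pair is returned.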

class deduplipy.sampling.NaiveSampler(col_names: List[str], n_perfect_matches: int = 3)

Bases: deduplipy.sampling.sampler.Sampler

Class to create a pairs table sample by naively comparing all rows with all other rows. The resulting pairs will mostly contain non-matches.

Parameters
  • col_names – column names to use for creating pairs

  • n_perfect_matches – number of perfect matches to include; this helps during the active learning phase

sample(X: pandas.core.frame.DataFrame, n_samples: int) → pandas.core.frame.DataFrame

Method to draw a sample of n_samples pairs from dataframe X.

Parameters
  • X – Pandas dataframe containing records to create a sample of pairs from

  • n_samples – number of samples to create

Returns

Pandas dataframe containing the sampled pairs
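
As a rough illustration of naive sampling, the sketch below draws random pairs from all row combinations and adds a few perfect matches. It is not deduplipy's implementation: `naive_sample` is a hypothetical helper, records are plain strings, and it assumes a perfect match can be represented by pairing a record with itself.

```python
import random
from itertools import combinations


def naive_sample(records, n_samples, n_perfect_matches=3, seed=0):
    """Return up to n_samples index pairs, some of which are self-pairs."""
    rng = random.Random(seed)
    n_perfect = min(n_perfect_matches, n_samples, len(records))
    # Assumption: a perfect match is a record paired with itself.
    perfect = [(i, i) for i in rng.sample(range(len(records)), n_perfect)]
    # Every distinct pair of rows; this grows quadratically with len(records).
    all_pairs = list(combinations(range(len(records)), 2))
    n_random = min(n_samples - n_perfect, len(all_pairs))
    return perfect + rng.sample(all_pairs, n_random)


records = ["alice smith", "alice smyth", "bob jones", "carol white", "dave brown"]
sample = naive_sample(records, n_samples=6)
print(len(sample))  # 6
```

With n rows there are n(n-1)/2 distinct pairs and typically only a handful of true duplicates among them, which is why a random draw consists mostly of non-matches.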