sampling

class deduplipy.sampling.MinHashSampler(col_names: List[str], n_hash_tables=10, ngram_range: Tuple[int] = (1, 1), analyzer: str = 'word')

Bases: deduplipy.sampling.sampler.Sampler

Class to create a sample of a pairs table from the columns in col_names by applying minhashing with n_hash_tables hash tables. The scikit-learn CountVectorizer is used for tokenization.

Parameters

col_names – column names to use for creating pairs
n_hash_tables – number of hash tables to use for hashing
ngram_range – range of n-gram sizes that CountVectorizer uses, given as (min_n, max_n)
analyzer – how CountVectorizer creates tokens, for example 'word' or 'char'
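To illustrate what minhashing estimates, the following is a minimal, standard-library sketch and not the deduplipy implementation. It assumes one salted hash function per hash table and a plain whitespace tokenizer in place of CountVectorizer; the names tokenize, minhash_signature, estimated_jaccard and exact_jaccard are hypothetical.

```python
import hashlib
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased whitespace tokens (simplified stand-in for a word analyzer)."""
    return set(text.lower().split())


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token, salted with the hash index."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: Set[str], n_hashes: int = 10) -> List[int]:
    """Keep the minimum hash value per hash function (tokens must be non-empty)."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(n_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = tokenize("Acme Corp Amsterdam")
    b = tokenize("Acme Corporation Amsterdam")
    print("exact:", exact_jaccard(a, b))
    print("estimate:", estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

With more hash functions the estimate tends to move closer to the exact value, at the cost of more hashing work per record.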

sample(X: pandas.core.frame.DataFrame, n_samples: int, threshold: float = 0.2) → pandas.core.frame.DataFrame

Method to draw a sample of n_samples pairs from dataframe X. Note that the full n_samples pairs cannot be returned if too few pairs are above the threshold.

Parameters

X – Pandas dataframe containing records to create a sample of pairs from
n_samples – number of samples to create
threshold – Jaccard similarity threshold that a pair must exceed to be included

Returns

Pandas dataframe containing the sampled pairs
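The effect of threshold on the sample size can be sketched with exact Jaccard similarities. This is an illustration only: it compares all pairs exhaustively, which minhashing is meant to avoid, and it is not how MinHashSampler is implemented. The function names, the whitespace tokenizer and the seed parameter are assumptions made for the example.

```python
import random
from itertools import combinations
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity of the lowercased word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def sample_pairs_above_threshold(
    records: List[str], n_samples: int, threshold: float = 0.2, seed: int = 0
) -> List[Tuple[str, str]]:
    """Sample up to n_samples pairs whose similarity exceeds threshold."""
    candidates = [
        (a, b) for a, b in combinations(records, 2) if jaccard(a, b) > threshold
    ]
    rng = random.Random(seed)
    # If fewer candidates exist than requested, return all of them.
    return rng.sample(candidates, min(n_samples, len(candidates)))


if __name__ == "__main__":
    records = [
        "acme corp amsterdam",
        "acme corporation amsterdam",
        "globex ltd london",
        "initech inc austin",
    ]
    pairs = sample_pairs_above_threshold(records, n_samples=3, threshold=0.2)
    print(len(pairs), pairs)
```

In this example only one pair shares any tokens, so a request for three pairs yields a single pair. Lowering threshold or supplying more similar records increases the number of candidates.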

class deduplipy.sampling.NaiveSampler(col_names: List[str], n_perfect_matches: int = 3)

Bases: deduplipy.sampling.sampler.Sampler

Class to create a pairs table sample by naively comparing all rows with all other rows. The resulting pairs will mostly contain non-matches.

Parameters

col_names – column names to use for creating pairs
n_perfect_matches – number of perfect matches to include; these help during the active learning phase

sample(X: pandas.core.frame.DataFrame, n_samples: int) → pandas.core.frame.DataFrame

Method to draw a sample of n_samples pairs from dataframe X.

Parameters

X – Pandas dataframe containing records to create a sample of pairs from
n_samples – number of samples to create

Returns

Pandas dataframe containing the sampled pairs
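The naive approach can be sketched with the standard library as follows. This is not the NaiveSampler source. It assumes that a perfect match is a record paired with itself and that perfect matches are added on top of the n_samples random pairs; the documentation above confirms neither. The function name naive_sample and the seed parameter are hypothetical.

```python
import random
from itertools import combinations
from typing import List, Tuple


def naive_sample(
    records: List[str], n_samples: int, n_perfect_matches: int = 3, seed: int = 0
) -> List[Tuple[str, str]]:
    """Randomly sample pairs from all row combinations, then add perfect matches."""
    rng = random.Random(seed)

    # Every row compared with every other row: n * (n - 1) / 2 pairs.
    all_pairs = list(combinations(records, 2))
    sampled = rng.sample(all_pairs, min(n_samples, len(all_pairs)))

    # Assumption: a perfect match is a record paired with itself.
    chosen = rng.sample(records, min(n_perfect_matches, len(records)))
    perfect = [(r, r) for r in chosen]

    return sampled + perfect


if __name__ == "__main__":
    records = ["alice", "bob", "carol", "dave", "erin"]
    for pair in naive_sample(records, n_samples=4, n_perfect_matches=2):
        print(pair)
```

Because the number of possible pairs grows quadratically with the number of rows, materializing all of them as done here is only practical for small inputs.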