sampling
class deduplipy.sampling.MinHashSampler(col_names: List[str], n_hash_tables=10, ngram_range: Tuple[int] = (1, 1), analyzer: str = 'word')
Bases: deduplipy.sampling.sampler.Sampler
Class that creates a sample of record pairs from the columns in col_names by applying MinHashing with n_hash_tables hash tables. The scikit-learn CountVectorizer is used for tokenization.
- Parameters
col_names – column names to use for creating pairs
n_hash_tables – number of hash tables to use for hashing
analyzer – how the CountVectorizer creates tokens ('word' by default)
ngram_range – range of n-gram sizes the CountVectorizer uses
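The idea behind MinHashing can be sketched in plain Python. This is a minimal illustration, not deduplipy's implementation: the helper names (word_tokens, minhash_signature, estimated_jaccard) are hypothetical, whitespace splitting only approximates what CountVectorizer does with analyzer='word', and treating n_hashes as the counterpart of n_hash_tables is an assumption.

```python
import hashlib
from typing import List, Set


def word_tokens(text: str) -> Set[str]:
    # Lower-cased, whitespace-separated words: a rough stand-in for
    # analyzer='word' with ngram_range=(1, 1).
    return set(text.lower().split())


def minhash_signature(tokens: Set[str], n_hashes: int = 10) -> List[int]:
    # Keep the minimum hash value per salted hash function. n_hashes is
    # assumed to play the role of n_hash_tables in the class above.
    if not tokens:
        raise ValueError("cannot hash an empty token set")
    signature = []
    for seed in range(n_hashes):
        hashes = (
            int.from_bytes(hashlib.md5(f"{seed}:{tok}".encode()).digest()[:8], "big")
            for tok in tokens
        )
        signature.append(min(hashes))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    # The fraction of positions where two signatures agree estimates the
    # Jaccard similarity of the underlying token sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_signature(word_tokens("John Smith London"))
b = minhash_signature(word_tokens("Jon Smith London"))
# The exact Jaccard similarity of these token sets is 2/4 = 0.5;
# the printed value is only an approximation of it.
print(estimated_jaccard(a, b))
```

More hash functions give a finer-grained estimate at the cost of more hashing work, which is the usual trade-off when choosing the number of hash tables.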
sample(X: pandas.core.frame.DataFrame, n_samples: int, threshold: float = 0.2) → pandas.core.frame.DataFrame
Method to draw a sample of n_samples pairs from dataframe X. Note that the full n_samples pairs cannot be returned if too few pairs have a similarity above the threshold.
- Parameters
X – Pandas dataframe containing records to create a sample of pairs from
n_samples – number of samples to create
threshold – Jaccard threshold for pair inclusion
- Returns
Pandas dataframe containing the sampled pairs
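The effect of the Jaccard threshold on the sample size can be illustrated with exact Jaccard similarities on a toy list of strings. This is a sketch under stated assumptions, not the library's code: sample_pairs_above_threshold is a hypothetical helper, pairs are kept when their similarity is greater than or equal to the threshold, and a shortfall is assumed to simply yield fewer pairs.

```python
import itertools
import random
from typing import List, Tuple


def jaccard(a: set, b: set) -> float:
    # Size of the intersection divided by the size of the union.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sample_pairs_above_threshold(
    texts: List[str], n_samples: int, threshold: float = 0.2, seed: int = 0
) -> List[Tuple[int, int]]:
    token_sets = [set(t.lower().split()) for t in texts]
    # Keep only index pairs whose similarity reaches the threshold
    # (using >= here is an assumption).
    candidates = [
        (i, j)
        for i, j in itertools.combinations(range(len(texts)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]
    # If fewer candidates than n_samples exist, return all of them.
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_samples, len(candidates)))


texts = [
    "john smith london",
    "jon smith london",
    "mary jones paris",
    "mary jones rome",
]
pairs = sample_pairs_above_threshold(texts, n_samples=5)
print(len(pairs))  # 2: only two pairs reach the threshold, although 5 were requested
```

Lowering the threshold admits more dissimilar pairs into the candidate pool, which makes it easier to reach n_samples but adds more likely non-matches.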
class deduplipy.sampling.NaiveSampler(col_names: List[str], n_perfect_matches: int = 3)
Bases: deduplipy.sampling.sampler.Sampler
Class to create a pairs table sample by naively comparing all rows with all other rows. The resulting pairs will mostly contain non-matches.
- Parameters
col_names – column names to use for creating pairs
n_perfect_matches – number of perfect matches to include; these help during the active learning phase
sample(X: pandas.core.frame.DataFrame, n_samples: int) → pandas.core.frame.DataFrame
Method to draw a sample of n_samples pairs from dataframe X.
- Parameters
X – Pandas dataframe containing records to create a sample of pairs from
n_samples – number of samples to create
- Returns
Pandas dataframe containing the sampled pairs
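Naive sampling can be sketched with the standard library alone. The helper naive_pair_sample below is hypothetical and works on a list of dicts rather than a dataframe; building perfect matches by pairing a record with itself is an assumption about what "perfect match" means here, not a description of deduplipy's internals.

```python
import itertools
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]


def naive_pair_sample(
    records: List[Record],
    n_samples: int,
    n_perfect_matches: int = 3,
    seed: int = 0,
) -> List[Tuple[Record, Record]]:
    rng = random.Random(seed)
    # Compare every row with every other row: n * (n - 1) / 2 pairs.
    all_pairs = list(itertools.combinations(records, 2))
    sampled = rng.sample(all_pairs, min(n_samples, len(all_pairs)))
    # Append a few records paired with themselves so the sample is
    # guaranteed to contain some matches (assumed behaviour).
    chosen = rng.sample(records, min(n_perfect_matches, len(records)))
    perfect = [(rec, rec) for rec in chosen]
    return sampled + perfect


records = [{"name": n} for n in ["ann", "bob", "cat", "dan", "eve"]]
pairs = naive_pair_sample(records, n_samples=4, n_perfect_matches=2)
print(len(pairs))  # 6: four random pairs plus two self-pairs
```

Because the number of pairs grows quadratically with the number of rows, this approach is only practical for small inputs, and most random pairs will be non-matches.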