LSH Methods For Data Deduplication In A Wikipedia Artificial Dataset

Juan Ciro, Daniel Galvez, Tim Schlippe, David Kanter. arXiv 2021

[Paper]
Tags: ARXIV, Independent, LSH

This paper illustrates locality-sensitive hashing (LSH) models for the identification and removal of nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-Under-Curve (AUC) scores above 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that, because of the repeated data, differs from the real one.
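As a rough illustration of the family of methods the paper evaluates, below is a minimal, self-contained Python sketch of banded MinHash LSH for near-duplicate text detection. All specifics here (word-level 3-gram shingles, 128 hash functions, 32 bands of 4 rows, SHA-1 as the hash family) are illustrative assumptions, not the configuration or pipeline used by the authors.

```python
"""Hypothetical MinHash + banded LSH sketch for near-duplicate detection.
Parameters are illustrative, not the paper's evaluated configuration."""
import hashlib
from collections import defaultdict

NUM_HASHES = 128     # MinHash signature length (assumed)
BANDS, ROWS = 32, 4  # banding parameters; BANDS * ROWS == NUM_HASHES

def shingles(text: str, k: int = 3) -> set[str]:
    """Word-level k-gram shingles of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(shingle_set: set[str]) -> list[int]:
    """Signature: for each seed, the minimum hash value over all shingles."""
    def h(seed: int, s: str) -> int:
        digest = hashlib.sha1(f"{seed}:{s}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return [min(h(seed, s) for s in shingle_set) for seed in range(NUM_HASHES)]

def near_duplicate_pairs(docs: dict[str, str]) -> set[tuple[str, str]]:
    """Bucket signatures band by band; docs sharing any band are candidates."""
    sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
    buckets: defaultdict[tuple, list[str]] = defaultdict(list)
    for doc_id, sig in sigs.items():
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(doc_id)
    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs

if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog today",  # near-duplicate of "a"
        "c": "an entirely different sentence about wikipedia articles",
    }
    print(near_duplicate_pairs(docs))  # "a" and "b" should collide in some band
```

The banding parameters control the precision-recall trade-off: with b bands of r rows, two documents with Jaccard similarity s collide in at least one band with probability 1 - (1 - s^r)^b, so larger r suppresses false candidates while more bands raise recall for highly similar pairs.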

Similar Work