Multilingual De-duplication Strategies: Applying Scalable Similarity Search With Monolingual & Multilingual Embedding Models

Stefan Pasch, Dimitirios Petridis, Jannic Cutura. arXiv 2024. 1 citation

Tags: Efficiency, Similarity Search

This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare two approaches: a two-step method that translates text to English and then embeds it with mpnet, and a single-step method that embeds the original text with a multilingual embedding model (distiluse). The two-step approach achieved a higher F1 score (82% vs. 60%), with the gap most pronounced for less widely used languages. Its F1 score can be raised to as much as 89% by adding expert rules based on domain knowledge. We also highlight limitations related to token length constraints and computational efficiency. Our methodology suggests improvements for future multilingual deduplication tasks.
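The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding and translation functions are passed in as parameters, and `toy_embed`, `toy_translate`, `VOCAB`, `GLOSSARY`, and the 0.9 threshold are hypothetical stand-ins for mpnet, distiluse, and a real translation step. The pairwise loop is brute force, not the scalable similarity search the title refers to; it only shows how a cosine-similarity threshold turns embeddings into duplicate pairs and how F1 is computed against labeled pairs.

```python
import math
from itertools import combinations
from typing import Callable, Optional, Sequence, Set, Tuple

Pair = Tuple[int, int]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_duplicates(
    texts: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float,
    translate: Optional[Callable[[str], str]] = None,
) -> Set[Pair]:
    """Return index pairs (i, j), i < j, with similarity >= threshold.

    If `translate` is given, texts are translated before embedding
    (the two-step variant); otherwise they are embedded as-is.
    """
    if translate is not None:
        texts = [translate(t) for t in texts]
    vectors = [embed(t) for t in texts]
    # Brute-force O(n^2) comparison, for illustration only.
    return {
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    }


def f1_score(predicted: Set[Pair], actual: Set[Pair]) -> float:
    """F1 of predicted duplicate pairs against labeled duplicate pairs."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy stand-ins so the sketch runs without any model download.
VOCAB = ["acme", "bank", "ltd", "corp", "global"]
GLOSSARY = {"banque": "bank"}


def toy_embed(text: str) -> list:
    """Bag-of-words counts over a tiny English-only vocabulary."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def toy_translate(text: str) -> str:
    """Word-by-word glossary lookup standing in for machine translation."""
    return " ".join(GLOSSARY.get(w, w) for w in text.lower().split())


texts = ["Acme Bank Ltd", "Acme Banque Ltd", "Global Corp"]
labeled = {(0, 1)}

two_step = find_duplicates(texts, toy_embed, 0.9, translate=toy_translate)
direct = find_duplicates(texts, toy_embed, 0.9)
print("two-step F1:", f1_score(two_step, labeled))
print("direct F1:", f1_score(direct, labeled))
```

In this toy setup the English-only embedder misses the French variant unless the text is translated first. A real multilingual model would behave differently, so the example says nothing about the reported 82% vs. 60% figures. With real models, inputs longer than the model's token limit would also be truncated, which is one of the limitations the abstract mentions.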
