
Deduplication In A Massive Clinical Note Dataset

Sanjeev Shenoy, Tsung-Ting Kuo, Rodney Gabriel, Julian McAuley, Chun-Nan Hsu. arXiv 2017 – 1 citation

[Paper]
Tags: Datasets, Hashing Methods, Locality-Sensitive-Hashing

Duplication, whether exact or partial, is a common issue in many datasets. In clinical notes, duplication (and near duplication) can arise for many reasons, such as the pervasive use of templates, copy-pasting, or notes generated by automated procedures. A key challenge in removing such near duplicates is the size of these datasets; our own dataset consists of more than 10 million notes. Detecting and correcting such duplicates requires algorithms that are both accurate and highly scalable. We describe a solution based on Minhashing with Locality Sensitive Hashing. In this paper, we present the theory behind this method and a database-inspired approach that makes it scalable. We also present a clustering technique using disjoint sets to produce dense clusters, which speeds up our algorithm.
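The sketch below illustrates the general pipeline the abstract describes: character shingling, MinHash signatures, LSH banding to surface candidate duplicate pairs, and a disjoint-set (union-find) pass to merge candidates into clusters. All parameters (5-character shingles, 128 hash functions, 32 bands of 4 rows) and helper names are illustrative assumptions, not the paper's settings, and the paper's database-inspired scaling approach is not reproduced here.

```python
import hashlib
import itertools
from collections import defaultdict

def shingles(text, k=5):
    """Set of character k-shingles of a note (illustrative k, not the paper's)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shingle_set, num_hashes=128):
    """MinHash: for each seeded hash function, keep the minimum hash value
    over the shingle set. Two sets agree in any one slot with probability
    equal to their Jaccard similarity."""
    return [
        min(int.from_bytes(
                hashlib.blake2b(f"{seed}|{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def lsh_buckets(signatures, bands=32, rows=4):
    """LSH banding: notes whose signatures agree on every row of some band
    share a bucket and become candidate duplicate pairs."""
    buckets = defaultdict(list)
    for note_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(note_id)
    return buckets

class DisjointSet:
    """Union-find with path halving, used to merge candidate pairs
    into duplicate clusters."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Toy corpus: notes 1 and 2 are near duplicates, note 3 is unrelated.
notes = {
    1: "Patient reports mild headache; advised rest and hydration.",
    2: "Patient reports mild headache, advised rest and hydration.",
    3: "Follow-up visit for hypertension; blood pressure stable.",
}
sigs = {nid: minhash_signature(shingles(text)) for nid, text in notes.items()}

ds = DisjointSet()
for ids in lsh_buckets(sigs).values():
    for a, b in itertools.combinations(ids, 2):
        ds.union(a, b)  # in practice one might verify exact Jaccard first

clusters = defaultdict(list)
for nid in notes:
    clusters[ds.find(nid)].append(nid)
print(list(clusters.values()))  # near duplicates land in the same cluster
```

With 32 bands of 4 rows, two notes with Jaccard similarity s collide in at least one bucket with probability 1 - (1 - s^4)^32, so highly similar notes are found with near certainty while dissimilar pairs rarely collide; the disjoint-set pass then turns pairwise collisions into whole clusters in near-linear time.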

Similar Work