Distortion-resistant Hashing For Rapid Search Of Similar DNA Subsequence | Awesome Learning to Hash Add your paper to Learning2Hash

Distortion-resistant Hashing For Rapid Search Of Similar DNA Subsequence

Duda Jarek. Arxiv 2016

[Paper]    
ARXIV Independent

One of the basic tasks in bioinformatics is localizing a short subsequence \(S\), read while sequencing, in a long reference sequence \(R\), like the human geneome. A natural rapid approach would be finding a hash value for \(S\) and compare it with a prepared database of hash values for each of length \(|S|\) subsequences of \(R\). The problem with such approach is that it would only spot a perfect match, while in reality there are lots of small changes: substitutions, deletions and insertions. This issue could be repaired if having a hash function designed to tolerate some small distortion accordingly to an alignment metric (like Needleman-Wunch): designed to make that two similar sequences should most likely give the same hash value. This paper discusses construction of Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid search of similar subsequences. The proposed approach is based on the rate distortion theory: in a nearly uniform subset of length \(|S|\) sequences, the hash value represents the closest sequence to \(S\). This gives some control of the distance of collisions: sequences having the same hash value.

Similar Work