Distortion-resistant Hashing For Rapid Search Of Similar DNA Subsequence
Duda Jarek. Arxiv 2016
[Paper]
ARXIV
Independent
One of the basic tasks in bioinformatics is localizing a short subsequence
, read while sequencing, in a long reference sequence , like the human
geneome. A natural rapid approach would be finding a hash value for and
compare it with a prepared database of hash values for each of length
subsequences of . The problem with such approach is that it would only spot
a perfect match, while in reality there are lots of small changes:
substitutions, deletions and insertions.
This issue could be repaired if having a hash function designed to tolerate
some small distortion accordingly to an alignment metric (like
Needleman-Wunch): designed to make that two similar sequences should most
likely give the same hash value. This paper discusses construction of
Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid
search of similar subsequences. The proposed approach is based on the rate
distortion theory: in a nearly uniform subset of length sequences, the
hash value represents the closest sequence to . This gives some control of
the distance of collisions: sequences having the same hash value.
Similar Work