1 research outputs found

    Distortion-Resistant Hashing for rapid search of similar DNA subsequence

    Full text link
    One of the basic tasks in bioinformatics is localizing a short subsequence SS, read while sequencing, in a long reference sequence RR, like the human geneome. A natural rapid approach would be finding a hash value for SS and compare it with a prepared database of hash values for each of length ∣S∣|S| subsequences of RR. The problem with such approach is that it would only spot a perfect match, while in reality there are lots of small changes: substitutions, deletions and insertions. This issue could be repaired if having a hash function designed to tolerate some small distortion accordingly to an alignment metric (like Needleman-Wunch): designed to make that two similar sequences should most likely give the same hash value. This paper discusses construction of Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid search of similar subsequences. The proposed approach is based on the rate distortion theory: in a nearly uniform subset of length ∣S∣|S| sequences, the hash value represents the closest sequence to SS. This gives some control of the distance of collisions: sequences having the same hash value.Comment: 5 pages, 4 figure
    corecore