551 research outputs found

    A practical and efficient algorithm for the k-mismatch shortest unique substring finding problem

    This thesis revisits the k-mismatch shortest unique substring (SUS) finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log^k n), while maintaining a practical space complexity at O(kn), where n is the string length. When k > 0, which is the hard case, the new proposal significantly improves the any-case O(n^2) time complexity of the prior best method for k-mismatch SUS finding. Experimental study shows that the new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, the proposed method processes a 200KB sample DNA sequence with k = 1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, resulting in further significant practical performance improvement. As an example, when using 8 cores to process a 10MB sample DNA sequence with k = 2, two parallel implementations each achieved processing times less than 1/4 that of the serial implementation. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that trading additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may require years to process a 200MB DNA sample for any k > 0, while this new proposal, using 24 cores, finished processing a sample of this size with k = 1 in 206.376 seconds with a peak memory usage of 46GB, which is easily available and affordable for many users. It is expected that this new practical and efficient algorithm for k-mismatch SUS finding will prove useful to those using the measure on long sequences in fields such as computational biology.
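
To make the problem statement concrete, the following is a minimal brute-force sketch of k-mismatch SUS finding. It is not the algorithm from the thesis and makes no attempt at its O(n log^k n) expected time. It assumes the definition commonly used in the literature: a substring is k-mismatch unique if no other substring of the same length, starting at a different position, is within Hamming distance k of it. The function names and the 0-based inclusive index convention are illustrative assumptions.

```python
def hamming_within(s, i, j, length, k):
    """True if s[i:i+length] and s[j:j+length] differ in at most k positions."""
    mismatches = 0
    for a, b in zip(s[i:i + length], s[j:j + length]):
        if a != b:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def is_k_mismatch_unique(s, i, length, k):
    """True if no other start position yields a substring within distance k."""
    n = len(s)
    return all(
        j == i or not hamming_within(s, i, j, length, k)
        for j in range(n - length + 1)
    )


def k_mismatch_sus(s, p, k):
    """Return (start, end), 0-based inclusive, of a shortest k-mismatch
    unique substring covering position p (leftmost among ties).

    Brute force: tries lengths in increasing order, so it is only
    suitable for short strings.
    """
    n = len(s)
    for length in range(1, n + 1):
        # Start positions i such that i <= p <= i + length - 1.
        for i in range(max(0, p - length + 1), min(p, n - length) + 1):
            if is_k_mismatch_unique(s, i, length, k):
                return (i, i + length - 1)
    return None  # unreachable for valid p: the whole string is always unique
```

A reference implementation of this kind is mainly useful for cross-checking a faster implementation on small inputs.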

    Estimating seed sensitivity on homogeneous alignments

    We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.
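
As background for the term "sensitivity", here is a small sketch that computes the probability that a single (possibly spaced) seed hits a random gapless alignment. It deliberately uses the simplest i.i.d. Bernoulli model and exhaustive enumeration, so it ignores the homogeneity condition that the abstract is concerned with, and it is not the paper's algorithm. The encoding (alignment columns as '1' for match and '0' for mismatch, seed positions as '1' for required match and '0' for don't-care) and the function names are assumptions made for illustration.

```python
from itertools import product


def seed_hits(alignment, seed):
    """True if the seed matches the alignment at some offset.

    alignment: string over {'1' (match), '0' (mismatch)}
    seed:      string over {'1' (must match), '0' (don't care)}
    """
    span = len(seed)
    return any(
        all(alignment[pos + t] == '1' for t, c in enumerate(seed) if c == '1')
        for pos in range(len(alignment) - span + 1)
    )


def sensitivity_bernoulli(seed, length, p):
    """Probability that the seed hits a random alignment of the given
    length in which each column is a match independently with probability p.

    Enumerates all 2**length alignments, so only usable for small lengths.
    """
    total = 0.0
    for cols in product('01', repeat=length):
        alignment = ''.join(cols)
        if seed_hits(alignment, seed):
            matches = alignment.count('1')
            total += p ** matches * (1 - p) ** (length - matches)
    return total
```

For instance, the contiguous seed `"11"` on alignments of length 3 with p = 0.5 is hit by 3 of the 8 equally likely alignments (110, 011, 111), giving a sensitivity of 0.375.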

    PUF authentication and key-exchange by substring matching

    Mechanisms for operating a prover device and a verifier device so that the verifier device can verify the authenticity of the prover device. The prover device generates a data string by: (a) submitting a challenge to a physical unclonable function (PUF) to obtain a response string, (b) selecting a substring from the response string, (c) injecting the selected substring into the data string, and (d) injecting random bits into bit positions of the data string not assigned to the selected substring. The verifier: (e) generates an estimated response string by evaluating a computational model of the PUF based on the challenge; (f) performs a search process to identify the selected substring within the data string using the estimated response string; and (g) determines whether the prover device is authentic based on a measure of similarity between the identified substring and a corresponding substring of the estimated response string.
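
Steps (a) through (g) can be sketched as a toy simulation. This is only an illustration under stated assumptions: the PUF response and the verifier's estimated response are passed in as plain bit lists, the selected substring is assumed to be contiguous and injected contiguously, the search is an exhaustive scan over all alignments, and similarity is measured by Hamming distance against a hypothetical threshold `max_dist`. None of these choices are confirmed by the abstract.

```python
import random


def prover_build(response, sub_len, data_len, rng):
    """Steps (b)-(d): pick a random substring of the PUF response, place it
    at a random position in the data string, fill the rest with random bits."""
    start = rng.randrange(len(response) - sub_len + 1)
    substring = response[start:start + sub_len]
    pos = rng.randrange(data_len - sub_len + 1)
    data = [rng.randrange(2) for _ in range(data_len)]
    data[pos:pos + sub_len] = substring
    return data


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def verifier_check(data, estimated, sub_len, max_dist):
    """Steps (f)-(g): search every window of the data string against every
    window of the estimated response and accept if the best match is
    within max_dist mismatches."""
    best = min(
        hamming(data[pos:pos + sub_len], estimated[off:off + sub_len])
        for pos in range(len(data) - sub_len + 1)
        for off in range(len(estimated) - sub_len + 1)
    )
    return best <= max_dist


# Usage: a verifier whose model reproduces the response exactly accepts.
rng = random.Random(0)
response = [rng.randrange(2) for _ in range(64)]
data = prover_build(response, sub_len=32, data_len=128, rng=rng)
accepted = verifier_check(data, response, sub_len=32, max_dist=0)
```

In practice the estimated response would differ from the true response in some bits, which is why the check tolerates up to `max_dist` mismatches rather than requiring an exact match.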

    Faster algorithms for computing maximal multirepeats in multiple sequences

    A repeat in a string is a substring that occurs more than once. A repeat is extendible if every occurrence of the repeat has an identical letter either on the left or on the right; otherwise, it is maximal. A multirepeat is a repeat that occurs at least m_min times (m_min ≥ 2) in each of at least q ≥ 1 strings in a given set of strings. In this paper, we describe a family of efficient algorithms based on suffix arrays to compute maximal multirepeats under various constraints. Our algorithms are faster, more flexible and much more space-efficient than algorithms recently proposed for this problem. The results extend recent work by two of the authors computing all maximal repeats in a single string.
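
The definitions of repeat, extendible and maximal can be checked directly on a single short string with a brute-force sketch. This is not the suffix-array approach described in the abstract and does not handle the multirepeat constraints; it enumerates every substring and is only meant to illustrate the definitions. It assumes that an occurrence touching the start or end of the string has no neighbouring letter on that side, so the repeat cannot be extended in that direction.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Return the set of maximal repeats of s by exhaustive enumeration."""
    n = len(s)
    occurrences = defaultdict(list)  # substring -> list of start positions
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    result = set()
    for sub, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat
        # Letters adjacent to each occurrence; None marks a string boundary.
        lefts = {s[i - 1] if i > 0 else None for i in starts}
        rights = {
            s[i + len(sub)] if i + len(sub) < n else None for i in starts
        }
        left_extendible = len(lefts) == 1 and None not in lefts
        right_extendible = len(rights) == 1 and None not in rights
        if not (left_extendible or right_extendible):
            result.add(sub)
    return result
```

On `"abcabd"` the repeats are `a`, `b` and `ab`; `a` is always followed by `b` and `b` is always preceded by `a`, so only `ab` is maximal.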