94,302 research outputs found

    Analysis of string-searching algorithms on biological sequence databases

    String-searching algorithms are used to find the occurrences of a search string in a given text. The advent of digital computers has stimulated the development of string-searching algorithms for various applications. Here, we report the performance of all string-searching algorithms on widely used biological sequence databases containing the building blocks of nucleotides (in the case of nucleic acid sequence databases) and amino acids (in the case of protein sequence databases). The biological sequence databases used in the present study are Protein Information Resource (PIR), SWISSPROT, and the amino acid and nucleotide sequences of all genomes available in the genome database. The average time taken over the different search-string lengths considered is used as the indicator of performance for comparing the various methods.
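    As a rough illustration of the measurement described above, the following Python sketch finds every occurrence of a search string in a text and averages the search time over a set of patterns. It is not the study's benchmark: the function names `find_all` and `mean_search_time` are hypothetical, and the built-in `str.find` stands in for whichever algorithms the authors compared.

```python
import time


def find_all(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are reported too.
        start = text.find(pattern, start + 1)
    return positions


def mean_search_time(text, patterns):
    """Average wall-clock seconds per pattern (hypothetical helper)."""
    begin = time.perf_counter()
    for pattern in patterns:
        find_all(text, pattern)
    return (time.perf_counter() - begin) / len(patterns)


if __name__ == "__main__":
    sequence = "ACGTACGTAC" * 1000  # toy stand-in for a database sequence
    queries = ["ACG", "GTAC", "ACGTACGT"]  # search strings of different lengths
    print(find_all("ACGTACGTAC", "ACG"))
    print(mean_search_time(sequence, queries))
```

    Grouping the queries by length and calling `mean_search_time` once per group would give one average per search-string length.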

    Comparison of Sequence Alignment Algorithms

    The fact that biological sequences can be represented as strings over a finite alphabet (A, C, G, and T for DNA) plays an important role in connecting biology to computer science. String representation allows researchers to apply the various string comparison techniques available in computer science. As a result, various applications have been developed that facilitate the task of sequence alignment. The problem of sequence alignment consists of finding the best match between two biological sequences. A best match can imply an evolutionary relationship and functional similarity. However, there is a lack of research on how reliable and efficient these applications are, especially when comparing two sequences that might not be highly similar (but could have common patterns that are small yet biologically significant). This study compares two biological sequence comparison packages, namely WuBlast2 and Fasta3, which implement the Blast and FastA algorithms, respectively. To do so, a framework was developed to facilitate data collection and create meaningful reports. Amino acid sequences corresponding to related proteins, as well as the DNA sequences encoding these proteins, were analyzed with matching parameters for each application. Observations showed a trend of increasing variation between the matches produced by the two applications with decreasing sequence similarity.

    Subsequences and Supersequences of Strings

    Stringology - the study of strings - is a branch of algorithmics which has been the subject of mounting interest in recent years. Very recently, two books [M. Crochemore and W. Rytter, Text Algorithms, Oxford University Press, 1995] and [G. Stephen, String Searching Algorithms, World Scientific, 1994] have been published on the subject, and at least two others are known to be in preparation. Problems on strings arise in information retrieval, version control, automatic spelling correction, and many other domains. However, the greatest motivation for recent work in stringology has come from the field of molecular biology. String problems occur, for example, in genetic sequence construction, genetic sequence comparison, and phylogenetic tree construction. In this thesis we study a variety of string problems from a theoretical perspective. In particular, we focus on problems involving subsequences and supersequences of strings.

    Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies

    Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI) which is, in principle, an objective and model-independent similarity measure. MI can be estimated by concatenating and zipping sequences, thereby yielding the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to get robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives for animal mitochondrial DNA estimates that are strikingly close to estimates obtained from the alignment-free methods mentioned above. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. Because it is not additive, normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments. Comment: 19 pages + 16 pages of supplementary material
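    The "concatenating and zipping" estimate mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration of the commonly used definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. The choice of `zlib` as the compressor, the helper names, and the random test sequences are all assumptions, and the sketch does not reproduce the alignment-based estimator or the additivity correction the paper proposes.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(length: int, seed: int) -> bytes:
    """Hypothetical helper: a reproducible random DNA string."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


if __name__ == "__main__":
    a = random_dna(2000, seed=1)
    b = random_dna(2000, seed=2)
    print("identical sequences:", ncd(a, a))
    print("unrelated sequences:", ncd(a, b))
```

    Identical inputs should score close to 0 and unrelated inputs close to 1. The exact values depend on the compressor, which is one source of the uncontrolled errors the abstract mentions.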

    Approximate Two-Party Privacy-Preserving String Matching with Linear Complexity

    Consider two parties who want to compare their strings, e.g., genomes, but do not want to reveal them to each other. We present a system for privacy-preserving matching of strings, which differs from existing systems by providing a deterministic approximation instead of an exact distance. It is efficient (linear complexity), non-interactive, and does not involve a third party, which makes it particularly suitable for cloud computing. We extend our protocol such that it mitigates iterated differential attacks proposed by Goodrich. Further, an implementation of the system is evaluated and compared against current privacy-preserving string matching algorithms. Comment: 6 pages, 4 figures