348 research outputs found

    A unifying framework for seed sensitivity and its application to subset seeds

    We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem: a set of target alignments, an associated probability distribution, and a seed model, all of which are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search, producing better results than ordinary spaced seeds.

    Motif Discovery with Compact Approaches - Design and Applications

    In the post-genomic era, the ability to predict the behavior, the function, or the structure of biological entities, as well as the interactions among them, plays a fundamental role in the discovery of information that helps biologists explain biological mechanisms. In this context, appropriate characterization of the structures under analysis and the exploitation of combinatorial properties of sequences are crucial steps towards the development of efficient algorithms and data structures for the analysis of biological sequences. Similarity is a fundamental concept in biology. Several functional and structural properties, and evolutionary mechanisms, can be predicted by comparing new elements with already classified elements, or by comparing elements with a similar structure or function to infer the common mechanism that underlies the observed similar behavior. Such elements are commonly called motifs. Comparison-based methods for sequence analysis find application in several biological contexts, such as the identification of transcription factor binding sites, the search for structural and functional similarities in proteins, and phylogeny. Therefore, the development of adequate methodologies for motif discovery is of paramount interest for several fields in computational biology. In motif discovery in biosequences, it is common to assume that statistically significant candidates are those likely to hide some biologically significant property. For this purpose, all possible candidates are ranked according to some statistics on words (frequency, over/under representation, etc.). They are then presented as output for further inspection by a biologist, who identifies the most promising subsequences and tests them in the laboratory to confirm their biological significance.
Therefore, when designing algorithms for motif discovery, besides the obvious aims of time and space efficiency, particular attention should be devoted to the output representation. In fact, even for fixed-length strings, the size of the candidate set becomes exponential if exhaustive enumeration is applied. This is already true when only exact matches are considered as candidate occurrences, and it worsens if some kind of variability is allowed (for example, a fixed number of mismatches). Alternatively, heuristics could be used, but without any guarantee of finding the optimal solution. The computational power of today's computers can partially reduce these effects, in particular for short candidates. However, if the output is too large to be analyzed by human inspection, the risk is to provide biologists with very fast but useless tools. A possible solution relies on compact approaches. Compact approaches are based on the partition of the search space into classes. The classes must be designed in such a way that the score used to rank the candidates has a monotone behavior within each class. This allows the identification of a representative of each class, which is the element with the highest score. Consequently, it suffices to compute, and report as output, the score only for the representatives. In fact, we are guaranteed that for each element that has not been ranked there is another one (the representative of the class it belongs to) that is at least equally significant. The final user can then be presented with an output that has the size of the partition, rather than the size of the candidate space, with obvious advantages for the human-based analysis that follows the computer-based filtering of the pattern discovery algorithm. Compact approaches find applications in both searching and discovery frameworks.
They can also be applied to several motif models: exact patterns, patterns with a given mismatch distribution, patterns with an unknown mismatch distribution, and profiles (i.e. matrices), under both i.i.d. and Markov distributions. The purpose of this chapter is to describe the basis of compact approaches and to provide readers with the conceptual tools for applying compact approaches to the design of their own algorithms for biosequence analysis. Moreover, examples of compact approaches that have been successfully developed for several motif models (e.g. exact words, co-occurrences, words with mismatches, etc.) will be explained, and experimental results illustrating their power will be presented.

    A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

    Spaced seeds have recently been shown not only to detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm these two results by independent experiments, and propose in this article to use a coverage criterion (Benson and Mak, 2008, Martin, 2013, Martin and Noé, 2014) to measure seed efficiency in both cases in order to design better seed patterns. We first show how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other frequently used criteria, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. Finally, for alignment-free distances, we propose an extension that adopts the coverage criterion, show how it performs, and indicate how it can be efficiently computed. Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
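To make the terms in this abstract concrete, here is a minimal sketch of a spaced-seed hit and of a coverage count on a single alignment. It rests on assumptions that are not taken from the paper: the alignment is encoded as a string of `1` (match) and `0` (mismatch), the seed as a string of `1` (must match) and `0` (don't care), and coverage is read as the number of alignment positions covered by a must-match position of at least one hit. The function names `seed_hits` and `coverage` are hypothetical, and the brute-force scan stands in for the automaton-based computation the authors describe.

```python
def seed_hits(alignment: str, seed: str) -> list[int]:
    """Return start positions where every '1' of the seed faces a '1' (match)."""
    span = len(seed)
    return [
        start
        for start in range(len(alignment) - span + 1)
        if all(alignment[start + k] == "1" for k, c in enumerate(seed) if c == "1")
    ]


def coverage(alignment: str, seed: str) -> int:
    """Count alignment positions covered by a must-match position of some hit."""
    covered = set()
    for start in seed_hits(alignment, seed):
        covered.update(start + k for k, c in enumerate(seed) if c == "1")
    return len(covered)


alignment = "1101101101"                  # hypothetical alignment, 1 = match
print(seed_hits(alignment, "1101"))       # spaced seed of weight 3
print(coverage(alignment, "1101"))
print(seed_hits(alignment, "111"))        # contiguous seed of the same weight
```

On this toy alignment the spaced seed `1101` hits several times while the contiguous seed `111` never does, which is the kind of difference that the single-hit and coverage criteria are meant to quantify.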

    Wavefront Longest Common Subsequence Algorithm on Multicore and GPGPU Platform

    String comparison is a central operation in numerous applications. It plays a critical role in tasks such as data mining, spelling error correction, and molecular biology (Tan et al., 2007; Michailidis and Margaritis, 2000).
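The abstract does not describe the algorithm itself. As an illustration of what "wavefront" usually means for the longest common subsequence (LCS) problem, the sketch below fills the standard dynamic-programming table one anti-diagonal at a time. This is the textbook ordering, not the authors' multicore or GPGPU implementation, and the function name `lcs_length_wavefront` is hypothetical. Cells on the same anti-diagonal depend only on the two previous anti-diagonals, so a parallel version could compute them concurrently; here they are computed sequentially.

```python
def lcs_length_wavefront(a: str, b: str) -> int:
    """Length of the LCS of a and b, filling the DP table by anti-diagonals."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):          # anti-diagonal index: i + j == d
        # Cells of one anti-diagonal are mutually independent.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]


print(lcs_length_wavefront("AGGTAB", "GXTXAYB"))
```

The inner loop is the part a multicore or GPU version would distribute across threads, with a synchronization point between consecutive anti-diagonals.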

    Data structures and algorithms for approximate string matching (Zvi Galil, Raffaele Giancarlo)

    This paper surveys techniques for designing efficient sequential and parallel approximate string matching algorithms. Special attention is given to methods for constructing data structures that efficiently support the primitive operations needed in approximate string matching.