
    A unifying framework for seed sensitivity and its application to subset seeds

    We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately the three components of the seed sensitivity problem -- a set of target alignments, an associated probability distribution, and a seed model -- each specified by a distinct finite automaton. The approach is then applied to a new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach and can then be used in similarity search, producing better results than ordinary spaced seeds.
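
    To make the notion of seed sensitivity concrete, the sketch below enumerates every binary alignment of a fixed length under an i.i.d. match probability and checks whether a single spaced seed hits at least once. It is only a brute-force illustration of the quantity being computed (exponential in the alignment length); the seed pattern, alignment length and match probability are arbitrary example values, and the automaton-based polynomial-time computation described in the paper is not reproduced here.

```python
from itertools import product

def seed_hits(seed, window):
    """True if the spaced seed ('1' = required match, '*' = don't care)
    matches the 0/1 alignment window."""
    return all(w == "1" for c, w in zip(seed, window) if c == "1")

def sensitivity(seed, length, p):
    """Probability that `seed` hits at least once in a random alignment of
    `length` positions, each a match with probability p (i.i.d. Bernoulli).
    Brute-force enumeration: exponential, for tiny examples only."""
    span = len(seed)
    total = 0.0
    for align in product("01", repeat=length):
        prob = 1.0
        for ch in align:
            prob *= p if ch == "1" else (1.0 - p)
        if any(seed_hits(seed, align[i:i + span]) for i in range(length - span + 1)):
            total += prob
    return total

# Example values only.
print(sensitivity("11*1", 10, 0.7))
```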

    A FAST ALGORITHM FOR COMPUTING HIGHLY SENSITIVE MULTIPLE SPACED SEEDS

    The main goal of homology search is to find similar segments, or local alignments, between two DNA or protein sequences. Since the dynamic programming algorithm of Smith-Waterman is too slow, heuristic methods have been designed to achieve both efficiency and accuracy. Seed-based methods were made well known by their use in BLAST, the most widely used software program in biological applications. The seed of BLAST trades sensitivity for speed, and spaced seeds were introduced in PatternHunter to achieve both. Several seeds are better than one, and near-perfect sensitivity can be obtained while maintaining speed. Therefore, multiple spaced seeds quickly became the state of the art in similarity search, being employed by many software programs. However, the quality of these seeds is crucial, and computing optimal multiple spaced seeds is NP-hard. All but one of the existing heuristic algorithms for computing good seeds are exponential. Our work has two main goals. First, we engineer the only existing polynomial-time heuristic algorithm to compute better seeds than any other program while running orders of magnitude faster. Second, we estimate its performance by comparing its seeds with the optimal seeds in a few practical cases. In order to make the computation feasible, a very fast implementation of the sensitivity function is provided.
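
    As a toy illustration of why several seeds are better than one, the sketch below checks whether at least one seed in a set hits a 0/1 alignment string (1 = match, 0 = mismatch). The seeds and the alignment are made-up examples chosen so that the single seed misses while the pair of seeds hits; this is not code from the thesis.

```python
def any_seed_hits(seeds, align):
    """True if at least one spaced seed ('1'/'*') hits the 0/1 alignment string."""
    return any(
        all(a == "1" for c, a in zip(seed, align[i:i + len(seed)]) if c == "1")
        for seed in seeds
        for i in range(len(align) - len(seed) + 1)
    )

# Made-up example: the single seed misses, the pair of seeds hits.
print(any_seed_hits(["1101"], "10110"))          # False
print(any_seed_hits(["1101", "1011"], "10110"))  # True
```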

    A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

    Spaced seeds have recently been shown not only to detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013; Horwege et al., 2014; Leimeister et al., 2014) and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm these two results by independent experiments, and propose in this article to use a coverage criterion (Benson and Mak, 2008; Martin, 2013; Martin and Noé, 2014) to measure seed efficiency in both cases, in order to design better seed patterns. We first show how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other frequently used criteria, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification or the true distance. Finally, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed. Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
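
    The sketch below shows one straightforward reading of a coverage-style criterion: count the alignment positions covered by the match positions of at least one seed hit. It is a simplified illustration rather than the exact definition used in the article, and the seed and alignment are arbitrary examples.

```python
def coverage(seed, align):
    """Number of positions of the 0/1 alignment string covered by a '1'
    position of at least one hit of the spaced seed ('1'/'*')."""
    covered = set()
    span = len(seed)
    for i in range(len(align) - span + 1):
        window = align[i:i + span]
        if all(w == "1" for c, w in zip(seed, window) if c == "1"):
            covered.update(i + j for j, c in enumerate(seed) if c == "1")
    return len(covered)

# Arbitrary example values.
print(coverage("11*1", "1101111011"))
```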

    Computation of Sensitive Multiple Spaced Seeds

    Similarity search is one of the most important problems in bioinformatics, with applications in read mapping, homology search, oligonucleotide design, etc. Similarity search is time and memory intensive, hence heuristic methods using multiple spaced seeds are commonly employed. A spaced seed is a string over 1 and *, where 1 represents a match position and * represents a don't-care position. Seeds are used to discover regions with identity; thus, it is imperative to design seeds of high sensitivity so as to maximize the number of hits. We present SpEED2, a software program to generate multiple spaced seeds of high sensitivity. It uses a novel seed optimization approach and outperforms all the leading programs for designing multiple spaced seeds, such as Iedera, AcoSeeD, and rasbhari. Our algorithm will benefit the many software tools that depend on good-quality seeds for their operation, such as PatternHunter for similarity search, SHRiMP and BFAST for read mapping, bestPrimer for primer design, and many more.
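
    Because evaluating sensitivity is the expensive step when designing seeds, a quick way to see what is being optimized is a Monte Carlo estimate: sample random alignments under an i.i.d. Bernoulli match model and count how often at least one seed hits. The sketch below only illustrates that objective; it is not the SpEED2 algorithm, and the seeds, alignment length, match probability and number of trials are arbitrary example values.

```python
import random

def hits(seed, align):
    """Does the spaced seed ('1'/'*') hit the 0/1 alignment string?"""
    span = len(seed)
    return any(
        all(align[i + j] == "1" for j, c in enumerate(seed) if c == "1")
        for i in range(len(align) - span + 1)
    )

def mc_sensitivity(seeds, length=64, p=0.7, trials=10000, rng=None):
    """Monte Carlo estimate of multiple-seed sensitivity under an i.i.d.
    Bernoulli(p) match model -- an illustrative stand-in for an exact
    sensitivity computation."""
    rng = rng or random.Random(0)
    hit_count = 0
    for _ in range(trials):
        align = "".join("1" if rng.random() < p else "0" for _ in range(length))
        if any(hits(seed, align) for seed in seeds):
            hit_count += 1
    return hit_count / trials

# Arbitrary example seeds.
print(mc_sensitivity(["111*1*11", "11*11**111"]))
```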

    On Improving Stochastic Simulation for Systems Biology

    Mathematical modeling and computer simulation are powerful approaches for understanding the complexity of biological systems. In particular, computer simulation represents a strong validation and fast hypothesis-verification tool. Over the years, several successful attempts have been made to simulate complex biological processes such as metabolic pathways, gene regulatory networks and cell signaling pathways. These processes are stochastic in nature and, furthermore, they are characterized by multiple time-scale evolutions and great variability in the population sizes of molecules. The best-known method to capture random time evolutions of well-stirred chemically reacting systems is Gillespie's Stochastic Simulation Algorithm (SSA). This Monte Carlo method generates exact realizations of the state of the system by stochastically determining when a reaction will occur and which reaction it will be. Most of the assumptions and hypotheses are clearly simplifications, but in many cases this method has proved useful to capture the randomness typical of realistic biological systems. Unfortunately, Gillespie's stochastic simulation method is often slow in practice. This posed a great challenge and a motivation toward the development of new efficient methods able to simulate stochastic and multiscale biological systems. In this thesis we address the problem of simulating metabolic experiments and develop efficient simulation methods for well-stirred chemically reacting systems. We showed how a Systems Biology approach can provide a cheap, fast and powerful method for validating models proposed in the literature. In the present case, we specified the model of the SRI photocycle proposed by Hoff et al. in a purpose-built simulator. This simulator was specifically designed to reproduce in silico the wet-lab experiments performed on metabolic networks, with several possible controls exerted on them by the operator. Thanks to this, we proved that the screened model is able to explain correctly many light responses, but unfortunately it was unable to explain some critical experiments, due to unresolvable time-scale problems. This confirms that our simulator is useful for simulating metabolic experiments. Furthermore, it can be downloaded at the URL http://sourceforge.net/projects/gillespie-qdc. In order to accelerate SSA simulation, we first proposed a data-parallel implementation, on General Purpose Graphics Processing Units, of a revised version of Gillespie's First Reaction Method. The simulations performed on a GeForce 8600M GS graphics card with 16 stream processors showed that the parallel computation halves the execution time, and this performance scales with the number of steps of the simulation. We also highlighted some specific problems of the programming environment when executing non-trivial general-purpose applications. In conclusion, we proved the extreme computational power of these low-cost and widespread technologies, but the limitations that emerged show that we are still far from a general-purpose application for GPUs. In our investigation we also attempted to achieve higher simulation speed by focusing on tau-leaping methods. We found that these methods share a common basic algorithmic convention: the pre-computation of the information necessary to estimate the size of the leap and the number of reactions that will fire in it. Often these pre-processing operations are used to avoid negative populations.
The computational cost of these operations is often proportional to the size of the model (i.e., the number of reactions), so larger models involve a larger computational cost. The pre-processing operations result in very efficient simulation when leaps are long and many reactions can fire, but they become a burden when leaps are short and few reactions occur. To deal efficiently with the latter cases, we proposed a method that departs from this trend. The SSALeaping method (SSAL for short) is a new method that lies midway between the direct method (DM) and tau-leaping. SSALeaping adaptively builds leaps and updates the system state stepwise. Unlike methods such as Modified tau-leaping (MTL), SSAL neither shifts from tau-leaping to DM nor pre-selects the largest leap time consistent with the leap condition. Additionally, whereas MTL prevents negative populations by separating critical and non-critical reactions, SSAL generates the reactions to fire sequentially, verifying the leap condition after each reaction selection. We proved that a reaction overdraws one of its reactants if and only if the leap condition is violated; this makes it impossible for populations to become negative, because SSAL stops the leap generation in advance. To test the accuracy and performance of our method we performed a large number of simulations on realistic biological models. The tests aimed to span the number of reactions fired in a leap and the number of reactions of the system as widely as possible, sometimes over orders of magnitude. Results showed that our method performs better than MTL in many of the tested cases, but not in all. To augment the number of models eligible to be simulated efficiently, we exploited the complementarity that emerged between SSAL and MTL and proposed a new adaptive method, called Adaptive Modified SSALeaping (AMS). During the simulation, this method switches between SSALeaping (SSAL) and Modified tau-leaping according to conditions on the number of reactions of the model and the predicted number of reactions firing in a leap. We determined, both theoretically and experimentally, how to estimate the number of reactions that will fire in a leap and the threshold that triggers the switch from one method to the other and vice versa. Results obtained from realistic biological models showed that in practice AMS performs better than SSAL and MTL, augmenting the number of models eligible to be simulated efficiently; in fact, the method correctly selects the better algorithm between SSAL and MTL according to the case. In this thesis we also investigated other new parallelization techniques. The parallelization of biological systems has stimulated the interest of many researchers because the nature of these systems is parallel and sometimes distributed; however, the nature of Gillespie's SSA is strictly sequential. We presented a novel exact formulation of the SSA based on the idea of partitioning the volume. We proved the equivalence between our method and DM, and we gave a simple test to show its accuracy in practice. We then proposed a variant of SSALeaping based on the partitioning of the volume, called Partitioned SSALeaping. The main feature we pointed out is that the dynamics of the system in a leap can be obtained by composing the dynamics processed by each sub-volume of the partition.
This form of independence gives a different view with respect to existing methods. We have only tested the method on a simple model, and we showed that it accurately matched the results of DM, independently of the number of sub-volumes in the partition. This confirmed that the method works and that the independence is effective. We have not yet provided a parallel implementation of this method, because this work is still in progress and much remains to be done. Nevertheless, Partitioned SSALeaping is a promising approach for a future parallelization on multi-core (e.g., GPU) or many-core (e.g., cluster) technologies.
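
    For readers unfamiliar with the SSA mentioned above, the sketch below implements Gillespie's direct method for a small mass-action system: draw an exponential waiting time from the total propensity, pick the next reaction with probability proportional to its propensity, and update the state. It is a minimal didactic version, not the thesis's simulator or any of its leaping variants, and the toy reaction and rate constant are arbitrary example values.

```python
import random

def gillespie_direct(x, reactions, rates, t_end, rng=None):
    """Gillespie's direct method (SSA) for a well-stirred system.
    x         : dict species -> copy number (modified in place)
    reactions : list of (reactants, products), each a dict of stoichiometries
    rates     : list of stochastic rate constants, one per reaction
    Returns the trajectory as a list of (time, state snapshot) pairs."""
    rng = rng or random.Random(0)
    t, traj = 0.0, [(0.0, dict(x))]
    while True:
        # Mass-action-style propensities a_j(x).
        props = []
        for (reactants, _), c in zip(reactions, rates):
            a = c
            for sp, n in reactants.items():
                for k in range(n):
                    a *= max(x[sp] - k, 0)
            props.append(a)
        a0 = sum(props)
        if a0 == 0.0:
            break  # no reaction can fire any more
        # "When": exponential waiting time; "which": proportional to propensity.
        tau = rng.expovariate(a0)
        if t + tau > t_end:
            break
        t += tau
        r, acc, j = rng.random() * a0, 0.0, 0
        while acc + props[j] < r:
            acc += props[j]
            j += 1
        reactants, products = reactions[j]
        for sp, n in reactants.items():
            x[sp] -= n
        for sp, n in products.items():
            x[sp] = x.get(sp, 0) + n
        traj.append((t, dict(x)))
    return traj

# Toy example: A + B -> C with rate constant 0.01, starting from 100 A and 100 B.
trajectory = gillespie_direct({"A": 100, "B": 100, "C": 0},
                              [({"A": 1, "B": 1}, {"C": 1})], [0.01], t_end=10.0)
print(len(trajectory), trajectory[-1])
```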

    Discovery of Unconventional Patterns for Sequence Analysis: Theory and Algorithms

    The biology community is collecting a large amount of raw data, such as the genome sequences of organisms, microarray data, and interaction data such as gene-protein and protein-protein interactions. This amount is rapidly increasing, and the process of understanding the data is lagging behind the process of acquiring it. An inevitable first step towards making sense of the data is to study their regularities, focusing on the non-random structures that appear surprisingly often in the input sequences: patterns. In this thesis we discuss three incarnations of the pattern discovery task, exploring three types of patterns that can model different regularities of the input dataset. While mask patterns have been designed to model short repeated biological sequences showing a high conservation of their content at some specific positions, permutation patterns have been designed to detect repeated patterns whose parts maintain their physical adjacency but not their ordering across the pattern occurrences. Transposons, instead, model mobile sequences in the input dataset, which can be discovered by comparing different copies of the same input string and detecting large insertions and deletions in their alignment.
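
    As a rough illustration of the mask-pattern idea (letters that must be conserved at some positions, anything at the others), the sketch below finds the occurrences of a pattern over DNA letters and '*' wildcards in a text. It is only an illustrative reading, not the thesis's formal definition or discovery algorithm, and the pattern and text are made-up examples.

```python
def matches_mask(pattern, window):
    """Letter positions of `pattern` must match exactly; '*' positions are free."""
    return len(pattern) == len(window) and all(
        p == "*" or p == c for p, c in zip(pattern, window)
    )

def occurrences(pattern, text):
    """All start positions at which the mask-style pattern occurs in `text`."""
    k = len(pattern)
    return [i for i in range(len(text) - k + 1) if matches_mask(pattern, text[i:i + k])]

# Made-up example: prints [0, 5, 9].
print(occurrences("AC*T", "ACGTTACTTACAT"))
```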

    Subset seed automaton

    We study the pattern matching automaton introduced in [1] for the purpose of seed-based similarity search. We show that our definition provides a compact automaton, much smaller than the one obtained by applying the Aho-Corasick construction. We study the properties of this automaton and present an efficient implementation of its construction. We also present some experimental results and show that this automaton can be successfully applied to more general situations.
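
    To give a feel for what a seed-matching automaton looks like, the sketch below builds, by an explicit subset construction, a deterministic automaton over the binary alignment alphabet that signals hits of an ordinary spaced seed while scanning left to right. It is purely didactic: it works on a 0/1 alphabet rather than the subset-seed alphabet of the paper, and it is far less compact than the automaton studied there.

```python
def build_seed_dfa(seed):
    """DFA over {'0','1'} for a spaced seed ('1' = match required, '*' = free).
    A state is the frozenset of seed-prefix lengths still compatible with the
    scanned suffix; a state containing len(seed) signals a hit."""
    def step(state, ch):
        nxt = {0}
        for q in state:
            if q < len(seed) and (seed[q] == "*" or ch == "1"):
                nxt.add(q + 1)
        return frozenset(nxt)

    start = frozenset({0})
    trans, states, stack = {}, {start}, [start]
    while stack:
        st = stack.pop()
        for ch in "01":
            nx = step(st, ch)
            trans[(st, ch)] = nx
            if nx not in states:
                states.add(nx)
                stack.append(nx)
    accepting = {st for st in states if len(seed) in st}
    return start, trans, accepting

def scan(seed, align):
    """Scan a 0/1 alignment string with the DFA and report whether the seed hits."""
    start, trans, accepting = build_seed_dfa(seed)
    st = start
    for ch in align:
        st = trans[(st, ch)]
        if st in accepting:
            return True
    return False

# Arbitrary example values.
print(scan("11*1", "10110111"))  # True
```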