
    A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

    Spaced seeds have recently been shown not only to detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013; Horwege et al., 2014; Leimeister et al., 2014) and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm these two results by independent experiments and propose to use a coverage criterion (Benson and Mak, 2008; Martin, 2013; Martin and Noé, 2014) to measure seed efficiency in both cases, in order to design better seed patterns. We first show how this coverage criterion can be measured directly by a fully automaton-based approach. We then illustrate how this criterion performs when compared with two other frequently used criteria, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification or the true distance. Finally, for alignment-free distances, we propose an extension that adopts the coverage criterion, show how it performs, and indicate how it can be computed efficiently. Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
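
The difference between the hit-based criteria and the coverage criterion named in this abstract can be illustrated with a minimal sketch. It assumes the usual binary encoding (seed: `1` = must match, `0` = don't care; alignment: `1` = match, `0` = mismatch) and takes coverage to mean the number of alignment positions covered by a must-match symbol of at least one seed hit. The function names are hypothetical, and this brute-force scan is not the automaton-based method of the paper.

```python
def seed_hits(seed: str, alignment: str) -> list[int]:
    """Return the start positions at which the spaced seed hits the alignment."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    return [
        start
        for start in range(len(alignment) - span + 1)
        if all(alignment[start + i] == "1" for i in required)
    ]


def coverage(seed: str, alignment: str) -> int:
    """Count alignment positions covered by a must-match symbol of some hit."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in seed_hits(seed, alignment):
        covered.update(start + i for i in required)
    return len(covered)


seed = "1101"
alignment = "1111011101"
hits = seed_hits(seed, alignment)
print("single-hit criterion satisfied:", len(hits) >= 1)
print("number of hits:", len(hits))
print("coverage:", coverage(seed, alignment))
```

For this toy input the seed hits at positions 0, 2 and 6, and the three hits together cover 8 of the 10 alignment positions, so the coverage value carries more information than the hit count alone.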

    Designing seeds for similarity search in genomic DNA

    Abstract: Large-scale comparison of genomic DNA is of fundamental importance in annotating functional elements of genomes. To perform large comparisons efficiently, BLAST (Methods: Companion Methods Enzymol 266 (1996) 460, J. Mol. Biol. 215 (1990) 403, Nucleic Acids Res. 25(17) (1997) 3389) and other widely used tools use seeded alignment, which compares only sequences that can be shown to share a common pattern or “seed” of matching bases. The literature suggests that the choice of seed substantially affects the sensitivity of seeded alignment, but designing and evaluating seeds is computationally challenging. This work addresses the problem of designing a seed to optimize performance of seeded alignment. We give a fast, simple algorithm based on finite automata for evaluating the sensitivity of a seed in a Markov model of ungapped alignments, along with extensions to mixtures and inhomogeneous Markov models. We give intuition and theoretical results on which seeds are good choices. Finally, we describe Mandala, a software tool for seed design, and show that it can be used to improve the sensitivity of alignment in practice.
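
As a rough illustration of what "sensitivity of a seed" means, the sketch below computes the probability that a seed hits a random ungapped alignment at least once. It is a deliberate simplification, not the algorithm of the paper: alignment columns are assumed independent with a fixed match probability (a zeroth-order model rather than a general Markov model), and the state is simply the last `span - 1` alignment bits rather than a minimized automaton. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def seed_sensitivity(seed: str, length: int, p_match: float) -> float:
    """Probability that `seed` hits an i.i.d. random alignment of `length` columns.

    seed: string over {'1', '0'}; '1' = must match, '0' = don't care.
    p_match: probability that a single alignment column is a match.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    if length < span:
        return 0.0
    # Maps the last (span - 1) bits of a prefix to the probability of reaching
    # that suffix without any seed hit so far.
    no_hit = {(): 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for suffix, prob in no_hit.items():
            for bit, p in ((1, p_match), (0, 1.0 - p_match)):
                window = suffix + (bit,)
                if len(window) == span and all(window[i] == 1 for i in required):
                    continue  # the seed hits here, so this mass leaves the no-hit set
                key = window[-(span - 1):] if span > 1 else ()
                nxt[key] += prob * p
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


print(seed_sensitivity("1101", 64, 0.7))
print(seed_sensitivity("111", 64, 0.7))
```

The state space grows as 2^(span - 1), so this form is only practical for short seeds; it is meant to show the idea of tracking "no hit yet" probability mass over alignment prefixes.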

    Improved hit criteria for DNA local alignment

    BACKGROUND: The hit criterion is a key component of heuristic local alignment algorithms. It specifies a class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. RESULTS: In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion, which combines the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds, which extend spaced seeds with the ability to distinguish transition and transversion mismatches. We provide analytical data as well as experimental results, obtained with the YASS software, supporting both improvements. CONCLUSIONS: The proposed algorithmic ideas yield a significant gain in the sensitivity of similarity search without an increase in execution time. The method has been implemented in YASS software available at
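
A transition-constrained seed can be sketched as a three-symbol pattern. The symbols used here are an illustrative assumption: `#` requires a match, `@` accepts a match or a transition mismatch (A<->G or C<->T), and `-` accepts anything. The function names are hypothetical and the code is not taken from YASS.

```python
# Transition mismatches: purine <-> purine or pyrimidine <-> pyrimidine.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def position_ok(symbol: str, x: str, y: str) -> bool:
    """Check one aligned base pair against one seed symbol."""
    if symbol == "#":
        return x == y
    if symbol == "@":
        return x == y or (x, y) in TRANSITIONS
    if symbol == "-":
        return True
    raise ValueError(f"unknown seed symbol: {symbol!r}")


def seed_matches(seed: str, s1: str, s2: str) -> bool:
    """True if two equal-length DNA windows satisfy the seed at every position."""
    if not (len(seed) == len(s1) == len(s2)):
        return False
    return all(position_ok(c, x, y) for c, x, y in zip(seed, s1, s2))


print(seed_matches("#@-#", "ACGT", "ATAT"))  # C/T is a transition
print(seed_matches("#@-#", "ACGT", "AGAT"))  # C/G is a transversion
```

The first call accepts the window because the only constrained mismatch is a transition under `@`; the second rejects it because the mismatch at that position is a transversion.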

    Invasion speeds for structured populations in fluctuating environments

    We live in a time where climate models predict future increases in environmental variability and biological invasions are becoming increasingly frequent. A key to developing effective responses to biological invasions in increasingly variable environments will be estimates of their rates of spatial spread and the associated uncertainty of these estimates. Using stochastic, stage-structured, integro-difference equation models, we show analytically that invasion speeds are asymptotically normally distributed with a variance that decreases in time. We apply our methods to a simple juvenile-adult model with stochastic variation in reproduction and to an illustrative example with published data for the perennial herb Calathea ovandensis. These examples, buttressed by additional analysis, reveal that increased variability in vital rates simultaneously slows down invasions and generates greater uncertainty about rates of spatial spread. Moreover, while temporal autocorrelations in vital rates inflate variability in invasion speeds, the effect of these autocorrelations on the average invasion speed can be positive or negative depending on life history traits and how well vital rates “remember” the past.

    Designing Efficient Spaced Seeds for SOLiD Read Mapping

    The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses, on the one hand, a faithful probabilistic model of read matches and, on the other hand, a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach with several seed designs and demonstrate their efficiency.

    Seed Ecology and Regeneration Process to Inform Seed-Based Wetland Restoration

    Wetlands provide immense value to wildlife and humans but have been degrading rapidly around the world. One major challenge is the loss of native plant species in wetlands, which limits the ability of wetlands to function as they should. Restoring wetlands requires a combination of removing the cause of degradation (such as invasive plant species) and, in many cases, actively returning native plants to the site, especially via seeding. Further, early life stages are the most vulnerable for plants and are often the time in which sown species die and fail to establish. Thus, understanding how and why seeds die or survive across species and environmental conditions can provide guidance for seed-based wetland restoration. Here, we sought to address these important knowledge gaps through a series of greenhouse and lab experiments. First, we sought to determine what native sowing rate was needed to maximize native plant performance across a gradient of invasive species seed density, environmental conditions, and timing of seed addition. Separately, we performed a lab and growth chamber experiment in which we measured important characteristics of seeds and seedlings (grown in different environmental conditions) to better understand (and ultimately predict) why some species do well and under what conditions. Finally, in a separate greenhouse experiment, we grew native and invasive wetland plants for eight weeks and tracked whether seeds germinated, survived, or died in order to quantify plant transitions through these early life stages. We also assessed ‘end-of-season’ percent cover and the rate of clonal production to gauge how early stages of plant growth contribute to invasion resistance. We found that native plant establishment increased with higher native sowing densities, especially when native seeds were sown early in the season.
However, the biggest driver of plant community composition following seeding was the density of invasive Phragmites australis seeds in the soil. Low water levels yielded higher native plant performance and more effectively suppressed P. australis growth. We also identified characteristics of seeds and seedlings that explained their germination and early growth patterns: species that had light seeds with thin seed coats and shallow seed dormancy had faster time to germination and higher growth rates, while species with heavy seeds had thick seed coats, deep seed dormancy, slower germination, and higher resource allocation to plant structures. Finally, we found that high water levels enhanced the probability of seed germination, and that high temperatures led to higher clonal development in seedlings. Overall, Phragmites australis was a superior performer in early life stages, but Distichlis spicata performed well due to high germination probabilities and Eleocharis palustris performed well due to extensive clonal production. As seed-based wetland restoration becomes increasingly necessary, the findings from this dissertation provide guidance on which native species should be used, where seeds should be sourced, and what environmental conditions should be targeted to maximize native plant establishment and restore wetland functions.

    Efficient Node Proximity and Node Significance Computations in Graphs

    Node proximity measures are commonly used to quantify how close or otherwise related two or more nodes in a graph are. Node significance measures are mainly used to determine how important nodes are in a graph. Measures of node proximity and significance have been highly effective in many prediction tasks and applications. Despite their effectiveness, however, they have several shortcomings. One is poor scalability, due to high computation costs on large graphs. Another is low accuracy when the significance of a node and its degree in the graph are not related. A third is reduced effectiveness when information about the graph is uncertain: for an uncertain graph, they require exponential computation cost to calculate ranking scores over all possible worlds. In this thesis, I first introduce Locality-sensitive, Re-use promoting, approximate Personalized PageRank (LR-PPR), an approximate personalized PageRank method that computes node rankings from locality information around the seeds without processing the entire graph, and that reuses precomputed locality information across different locality combinations. To identify locality information, I present Impact Neighborhood Indexing (INI), which finds impact neighborhoods by propagating nodes' fingerprints over the network. For the accuracy challenge, I introduce the Degree Decoupled PageRank (D2PR) technique to improve the effectiveness of PageRank-based knowledge discovery, especially considering the significance of neighbors and the degree of a given node.
To tackle the uncertainty challenge, I introduce Uncertain Personalized PageRank (UPPR) to approximately compute personalized PageRank values under uncertainty about edge existence, and Interval Personalized PageRank with Integration (IPPR-I) and Interval Personalized PageRank with Mean (IPPR-M) to compute ranking scores when uncertainty on edge weights is expressed as interval values. Dissertation/Thesis: Doctoral Dissertation, Computer Science 201
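
For orientation, the baseline that these techniques approximate or modify can be sketched as plain personalized PageRank by power iteration. This is the textbook full-graph computation, not LR-PPR, D2PR, UPPR or the interval variants; the function name, the damping value of 0.85 and the handling of dangling nodes (their mass is returned to the seeds) are assumptions made for the example.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Personalized PageRank by power iteration on the whole graph.

    adj: dict mapping every node to a list of its out-neighbors
         (every neighbor must itself be a key of adj).
    seeds: non-empty set of nodes that receive the restart probability.
    alpha: probability of following an edge rather than restarting.
    """
    nodes = list(adj)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) * restart[v] for v in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                share = alpha * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: send its mass back to the seed set.
                for v in nodes:
                    nxt[v] += alpha * rank[u] * restart[v]
        rank = nxt
    return rank


graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
scores = personalized_pagerank(graph, {"a"})
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Every iteration touches every edge of the graph, which is the cost that a locality-based approximation is designed to avoid.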

    Predicting epidemic risk from past temporal contact data

    Understanding how epidemics spread in a system is a crucial step to prevent and control outbreaks, with broad implications for the system's functioning, health, and associated costs. This can be achieved by identifying the elements at higher risk of infection and implementing targeted surveillance and control measures. One important ingredient to consider is the pattern of disease-transmission contacts among the elements; however, lack of data or delays in providing updated records may hinder its use, especially for time-varying patterns. Here we explore to what extent it is possible to use past temporal data of a system's pattern of contacts to predict the risk of infection of its elements during an emerging outbreak, in the absence of updated data. We focus on two real-world temporal systems: a livestock displacements trade network among animal holdings, and a network of sexual encounters in high-end prostitution. We define the node's loyalty as a local measure of its tendency to maintain contacts with the same elements over time, and uncover important non-trivial correlations with the node's epidemic risk. We show that a risk assessment analysis incorporating this knowledge and based on past structural and temporal pattern properties provides accurate predictions for both systems. Its generalizability is tested by introducing a theoretical model for generating synthetic temporal networks. High accuracy of our predictions is recovered across different settings, while the amount of possible predictions is system-specific. The proposed method can provide crucial information for the setup of targeted intervention strategies. Comment: 24 pages, 5 figures + SI (18 pages, 15 figures)
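
The abstract describes loyalty only as a tendency to keep the same contacts over time. One natural way to quantify that, assumed here for illustration, is the Jaccard overlap between a node's neighbor sets in two consecutive snapshots of the network. The function name and the toy data are hypothetical.

```python
def loyalty(prev_neighbors: set, curr_neighbors: set) -> float:
    """Jaccard overlap of a node's contacts in two consecutive snapshots.

    Returns 1.0 when the contacts are identical, 0.0 when they are disjoint,
    and 0.0 by convention when the node has no contacts in either snapshot.
    """
    union = prev_neighbors | curr_neighbors
    if not union:
        return 0.0
    return len(prev_neighbors & curr_neighbors) / len(union)


# Toy example: a holding trades with farms a, b, c one year and b, c, d the next.
print(loyalty({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```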