4,504 research outputs found
A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances
Spaced seeds have been recently shown to not only detect more alignments, but
also to give a more accurate measure of phylogenetic distances (Boden et al.,
2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower
misclassification rate when used with Support Vector Machines (SVMs) (Onodera
and Shibuya, 2013). We confirm these two results by independent experiments,
and propose in this article to use a coverage criterion (Benson and Mak, 2008,
Martin, 2013, Martin and Noé, 2014) to measure the seed efficiency in both
cases in order to design better seed patterns. We show first how this coverage
criterion can be directly measured by a full automaton-based approach. We then
illustrate how this criterion performs when compared with two other criteria
frequently used, namely the single-hit and multiple-hit criteria, through
correlation coefficients with the correct classification/the true distance.
Finally, for alignment-free distances, we propose an extension by adopting the
coverage criterion, show how it performs, and indicate how it can be
efficiently computed.
Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
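The three criteria compared in this abstract can be illustrated on a single ungapped alignment. The sketch below is not the authors' automaton-based method: it is a direct, brute-force count under the assumption that a seed is a string over `1` (must match) and `0` (don't care), that an alignment is a string of `1` (match) and `0` (mismatch), and that coverage means the number of alignment positions covered by a must-match position of at least one seed hit. The function names are hypothetical.

```python
def seed_hits(seed, aln):
    """Return start positions where the spaced seed hits the alignment.

    seed: string over '1' (must match) and '0' (don't care).
    aln:  string over '1' (match column) and '0' (mismatch column).
    """
    care = [j for j, c in enumerate(seed) if c == "1"]
    return [
        i
        for i in range(len(aln) - len(seed) + 1)
        if all(aln[i + j] == "1" for j in care)
    ]


def single_hit(seed, aln):
    """Single-hit criterion: at least one seed hit."""
    return len(seed_hits(seed, aln)) >= 1


def multiple_hit(seed, aln, k=2):
    """Multiple-hit criterion: at least k seed hits (k is an assumed parameter)."""
    return len(seed_hits(seed, aln)) >= k


def coverage(seed, aln):
    """Assumed coverage: positions covered by a must-match symbol of some hit."""
    care = [j for j, c in enumerate(seed) if c == "1"]
    covered = set()
    for i in seed_hits(seed, aln):
        covered.update(i + j for j in care)
    return len(covered)


hits = seed_hits("1101", "1111011011")
cov = coverage("1101", "1111011011")
```

Here the seed `1101` hits at positions 0, 2 and 5, and the hits overlap, so the coverage (7) is smaller than three times the seed weight (9).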
Designing seeds for similarity search in genomic DNA
Large-scale comparison of genomic DNA is of fundamental importance in annotating functional elements of genomes. To perform large comparisons efficiently, BLAST (Methods: Companion Methods Enzymol 266 (1996) 460, J. Mol. Biol. 215 (1990) 403, Nucleic Acids Res. 25(17) (1997) 3389) and other widely used tools use seeded alignment, which compares only sequences that can be shown to share a common pattern or "seed" of matching bases. The literature suggests that the choice of seed substantially affects the sensitivity of seeded alignment, but designing and evaluating seeds is computationally challenging. This work addresses the problem of designing a seed to optimize the performance of seeded alignment. We give a fast, simple algorithm based on finite automata for evaluating the sensitivity of a seed in a Markov model of ungapped alignments, along with extensions to mixtures and inhomogeneous Markov models. We give intuition and theoretical results on which seeds are good choices. Finally, we describe Mandala, a software tool for seed design, and show that it can be used to improve the sensitivity of alignment in practice.
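To make the notion of seed sensitivity concrete, here is a minimal sketch that computes the probability that a seed hits a random ungapped alignment. It is a simplification, not the algorithm from the paper: it assumes an i.i.d. (Bernoulli) model in which each column matches with probability `p`, rather than a general Markov model, and it tracks the last `span - 1` columns directly instead of building a minimized automaton. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def sensitivity(seed, length, p):
    """Probability that `seed` hits a random alignment of the given length.

    seed:   string over '1' (must match) and '0' (don't care).
    length: number of alignment columns.
    p:      assumed per-column match probability (i.i.d. model).
    """
    span = len(seed)
    care = [j for j, c in enumerate(seed) if c == "1"]
    # Each state is the tuple of the most recent columns (at most span - 1),
    # mapped to the probability of reaching it without any hit so far.
    states = {(): 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for hist, prob in states.items():
            for bit, weight in ((1, p), (0, 1.0 - p)):
                window = hist + (bit,)
                if len(window) == span and all(window[j] for j in care):
                    continue  # a hit occurred: drop this mass from "no hit"
                key = window[-(span - 1):] if span > 1 else ()
                nxt[key] += prob * weight
        states = nxt
    return 1.0 - sum(states.values())


contiguous = sensitivity("111", 64, 0.7)
spaced = sensitivity("1101", 64, 0.7)
```

The state space grows as 2^(span - 1), so this form is only practical for short seeds. Both seeds in the usage lines have weight 3, which allows a like-for-like comparison of a contiguous and a spaced pattern.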
Improved hit criteria for DNA local alignment
BACKGROUND: The hit criterion is a key component of heuristic local alignment algorithms. It specifies a class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. RESULTS: In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion, which combines the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds, which extend spaced seeds with the ability to distinguish transition and transversion mismatches. We provide analytical data as well as experimental results, obtained with the YASS software, supporting both improvements. CONCLUSIONS: The proposed algorithmic ideas yield a significant gain in the sensitivity of similarity search without an increase in execution time. The method has been implemented in YASS software available at
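A transition-constrained seed can be sketched as a pattern over three symbols. The notation below is an assumption for illustration, not taken from this abstract: `#` requires a match, `@` accepts a match or a transition (A<->G or C<->T), and `-` accepts any column. The helper names are hypothetical, and the sequences are assumed to be uppercase strings over A, C, G, T.

```python
PURINES = {"A", "G"}

# Column classes each seed symbol accepts (assumed three-symbol notation).
ALLOWED = {
    "#": {"match"},
    "@": {"match", "transition"},
    "-": {"match", "transition", "transversion"},
}


def column_class(a, b):
    """Classify an aligned nucleotide pair."""
    if a == b:
        return "match"
    # A mismatch within purines (A, G) or within pyrimidines (C, T).
    if (a in PURINES) == (b in PURINES):
        return "transition"
    return "transversion"


def seed_hits_at(seed, s, t, i=0):
    """True if the seed matches the ungapped alignment of s and t at offset i."""
    if i + len(seed) > min(len(s), len(t)):
        return False
    return all(
        column_class(s[i + j], t[i + j]) in ALLOWED[c]
        for j, c in enumerate(seed)
    )


hit = seed_hits_at("#@-#", "ACGT", "ATAT")   # C/T is a transition: accepted by '@'
miss = seed_hits_at("#@-#", "ACGT", "AGAT")  # C/G is a transversion: rejected by '@'
```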
Invasion speeds for structured populations in fluctuating environments
We live in a time where climate models predict future increases in
environmental variability and biological invasions are becoming increasingly
frequent. A key to developing effective responses to biological invasions in
increasingly variable environments will be estimates of their rates of spatial
spread and the associated uncertainty of these estimates. Using stochastic,
stage-structured, integro-difference equation models, we show analytically that
invasion speeds are asymptotically normally distributed with a variance that
decreases in time. We apply our methods to a simple juvenile-adult model with
stochastic variation in reproduction and an illustrative example with published
data for the perennial herb, Calathea ovandensis. These examples,
buttressed by additional analysis, reveal that increased variability in vital
rates simultaneously slows down invasions yet generates greater uncertainty about
rates of spatial spread. Moreover, while temporal autocorrelations in vital
rates inflate variability in invasion speeds, the effect of these
autocorrelations on the average invasion speed can be positive or negative
depending on life history traits and how well vital rates "remember" the
past.
Designing Efficient Spaced Seeds for SOLiD Read Mapping
The advent of high-throughput sequencing technologies constituted
a major advance in genomic studies, offering new prospects in a
wide range of applications. We propose a rigorous and flexible algorithmic
solution to mapping SOLiD color-space reads to a reference genome. The
solution relies on an advanced method of seed design that combines a faithful
probabilistic model of read matches with a novel
seeding principle especially adapted to read mapping. Our method can
handle both lossy and lossless frameworks and is able to distinguish, at
the level of seed design, between SNPs and reading errors. We illustrate
our approach with several seed designs and demonstrate their efficiency.
Seed Ecology and Regeneration Process to Inform Seed-Based Wetland Restoration
Wetlands provide immense value to wildlife and humans but have been degrading rapidly around the world. One major challenge is the loss of native plant species in wetlands, which limits the ability of wetlands to function as they should. Restoring wetlands requires a combination of removing the cause of degradation (such as invasive plant species) and, in many cases, actively returning native plants to the site, especially via seeding. Further, early life stages are the most vulnerable for plants and are often the time in which sown species die and fail to establish. Thus, understanding how and why seeds die or survive across species and environmental conditions can provide guidance for seed-based wetland restoration. Here, we sought to address these important knowledge gaps through a series of greenhouse and lab experiments. First, we sought to determine what native sowing rate was needed to maximize native plant performance across a gradient of invasive species seed density, environmental conditions, and timing of seed addition. Separately, we performed a lab and growth chamber experiment in which we measured important characteristics of seeds and seedlings (grown in different environmental conditions) to better understand (and ultimately predict) why some species do well and under what conditions. Finally, in a separate greenhouse experiment, we grew native and invasive wetland plants for eight weeks and tracked whether seeds germinated, survived, or died in order to quantify plant transitions through these early life stages. We also assessed 'end-of-season' percent cover and the rate of clonal production to gauge how early stages of plant growth contribute to invasion resistance. We found that native plant establishment increased with higher native sowing densities, especially when native seeds were sown early in the season.
However, the biggest driver of plant community composition following seeding was the density of invasive Phragmites australis seeds in the soil. Low water levels yielded higher native plant performance and more effectively suppressed P. australis growth. We also identified characteristics of seeds and seedlings that explained their germination and early growth patterns: species that had light seeds with thin seed coats and shallow seed dormancy had faster time to germination and higher growth rates, while species with heavy seeds had thick seed coats, deep seed dormancy, slower germination, and higher resource allocation to plant structures. Finally, we found that high water levels enhanced the probability of seed germination, and that high temperatures led to higher clonal development in seedlings. Overall, Phragmites australis was a superior performer in early life stages, but Distichlis spicata performed well due to high germination probabilities and Eleocharis palustris performed well due to extensive clonal production. As seed-based wetland restoration becomes increasingly necessary, the findings from this dissertation provide guidance on which native species should be used, where seeds should be sourced, and what environmental conditions should be targeted to maximize native plant establishment and restore wetland functions.
Efficient Node Proximity and Node Significance Computations in Graphs
Node proximity measures are commonly used to quantify how nearby or otherwise related two or more nodes in a graph are. Node significance measures are mainly used to find how important nodes are in a graph. Measures of node proximity/significance have been highly effective in many predictions and applications. Despite their effectiveness, however, they have various shortcomings. One is a scalability problem due to their high computation costs on large graphs; another is low accuracy when the significance of a node and its degree in the graph are not related. A third is that they are less effective when the information for a graph is uncertain: for an uncertain graph, they require exponential computation costs to calculate ranking scores when considering all possible worlds.
In this thesis, I first introduce Locality-sensitive, Re-use promoting, approximate Personalized PageRank (LR-PPR), an approximate personalized PageRank that calculates node rankings from the locality information of the seeds without computing over the entire graph, and reuses the precomputed locality information for different locality combinations. To identify locality information, I present Impact Neighborhood Indexing (INI), which finds impact neighborhoods by propagating nodes' fingerprints over the network. For the accuracy challenge, I introduce the Degree Decoupled PageRank (D2PR) technique to improve the effectiveness of PageRank-based knowledge discovery, especially considering the significance of neighbors and the degree of a given node. To tackle the uncertainty challenge, I introduce Uncertain Personalized PageRank (UPPR) to approximately compute personalized PageRank values under uncertainty of edge existence, and Interval Personalized PageRank with Integration (IPPR-I) and Interval Personalized PageRank with Mean (IPPR-M) to compute ranking scores when uncertainty exists on edge weights as interval values.
Dissertation/Thesis Doctoral Dissertation Computer Science 201
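For orientation, the exact personalized PageRank that the thesis approximates can be sketched with plain power iteration. This is the textbook baseline, not LR-PPR, D2PR, or UPPR. The graph is assumed to be a dict mapping every node to a list of its out-neighbors, and the damping value `alpha=0.85`, the iteration count, and the function name are assumptions chosen for illustration.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power-iteration personalized PageRank (baseline sketch).

    adj:   dict mapping each node to a list of out-neighbors; every
           neighbor must also appear as a key.
    seeds: non-empty set of seed nodes that receive the teleport mass.
    """
    nodes = list(adj)
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = dict(teleport)
    for _ in range(iters):
        # Restart step: a (1 - alpha) share always returns to the seeds.
        nxt = {v: (1.0 - alpha) * teleport[v] for v in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                share = alpha * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for v in nodes:
                    nxt[v] += alpha * rank[u] * teleport[v]
        rank = nxt
    return rank


graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
scores = personalized_pagerank(graph, {"a"})
```

Every iteration touches all edges, which is the cost that locality-based approximations such as the one described above aim to avoid.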
Predicting epidemic risk from past temporal contact data
Understanding how epidemics spread in a system is a crucial step to prevent
and control outbreaks, with broad implications on the system's functioning,
health, and associated costs. This can be achieved by identifying the elements
at higher risk of infection and implementing targeted surveillance and control
measures. One important ingredient to consider is the pattern of
disease-transmission contacts among the elements; however, lack of data or
delays in providing updated records may hinder its use, especially for
time-varying patterns. Here we explore to what extent it is possible to use
past temporal data of a system's pattern of contacts to predict the risk of
infection of its elements during an emerging outbreak, in the absence of updated
data. We focus on two real-world temporal systems: a livestock displacement
trade network among animal holdings, and a network of sexual encounters in
high-end prostitution. We define the node's loyalty as a local measure of its
tendency to maintain contacts with the same elements over time, and uncover
important non-trivial correlations with the node's epidemic risk. We show that
a risk assessment analysis incorporating this knowledge and based on past
structural and temporal pattern properties provides accurate predictions for
both systems. Its generalizability is tested by introducing a theoretical model
for generating synthetic temporal networks. High accuracy of our predictions is
recovered across different settings, while the amount of possible predictions
is system-specific. The proposed method can provide crucial information for the
setup of targeted intervention strategies.
Comment: 24 pages, 5 figures + SI (18 pages, 15 figures
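The abstract describes loyalty only as a node's tendency to keep the same contacts over time. One plausible formalization, assumed here rather than quoted from the paper, is the Jaccard overlap between a node's neighbor sets in two consecutive snapshots of the network. The function name and the snapshot representation are hypothetical.

```python
def loyalty(prev_neighbors, curr_neighbors):
    """Jaccard overlap of a node's neighbor sets in two consecutive snapshots.

    Returns a value in [0, 1]: 1 means the node kept exactly the same
    contacts, 0 means none were retained (or the node had no contacts).
    """
    prev, curr = set(prev_neighbors), set(curr_neighbors)
    union = prev | curr
    if not union:
        return 0.0  # assumed convention for an isolated node
    return len(prev & curr) / len(union)


# Hypothetical snapshots: a holding's trading partners in two periods.
value = loyalty({"farm_a", "farm_b", "farm_c"}, {"farm_b", "farm_c", "farm_d"})
```

With two of four distinct partners retained, `value` is 0.5.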