Search CORE

39,337 research outputs found

Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies

Author: A Kraskov
A Milosavljević
G Navarro
J Felsenstein
J Lake
J Rissanen
J Rissanen
J Thompson
J Varre
Konrad Scheffler
L Allison
M Brudno
M Brudno
M Cao
M Li
M Li
M Mahoney
M Nei
M Steel
Maya Paczuski
N Bray
N Bray
N Saitou
Orion Penner
P Buneman
P Lockhart
P Viola
Peter Grassberger
R Cilibrasi
R Durbin
S Altschul
S Altschul
S McGinnis
S Vinga
T Cover
T Lassmann
W Press
X Chen
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 19/08/2010
Field of study

Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI) which is, in principle, an objective and model independent similarity measure. MI can be estimated by concatenating and zipping sequences, yielding thereby the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to get robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives for animal mitochondrial DNA estimates that are strikingly close to estimates obtained from the alignment free methods mentioned above. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. Due to the fact that it is not additive, normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. Even a simplified version based on single letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments.Comment: 19 pages + 16 pages of supplementary materia

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

IMT Institutional Repository

DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences.

Author: Quang Daniel
Xie Xiaohui
Publication venue: eScholarship, University of California
Publication date: 15/04/2016
Field of study

Modeling the properties and functions of DNA sequences is an important, but challenging task in the broad field of genomics. This task is particularly difficult for non-coding DNA, the vast majority of which is still poorly understood in terms of function. A powerful predictive model for the function of non-coding DNA can have enormous benefit for both basic science and translational research because over 98% of the human genome is non-coding and 93% of disease-associated variants lie in these regions. To address this need, we propose DanQ, a novel hybrid convolutional and bi-directional long short-term memory recurrent neural network framework for predicting non-coding function de novo from sequence. In the DanQ model, the convolution layer captures regulatory motifs, while the recurrent layer captures long-term dependencies between the motifs in order to learn a regulatory 'grammar' to improve predictions. DanQ improves considerably upon other models across several metrics. For some regulatory markers, DanQ can achieve over a 50% relative improvement in the area under the precision-recall curve metric compared to related models. We have made the source code available at the github repository http://github.com/uci-cbcl/DanQ

PubMed Central

eScholarship - University of California

Geometric combinatorics and computational molecular biology: branching polytopes for RNA sequences

Author: Drellich Elizabeth
Gainer-Dewar Andrew
Harrington Heather A.
He Qijun
Heitsch Christine
Poznanović Svetlana
Publication venue
Publication date: 16/06/2016
Field of study

Questions in computational molecular biology generate various discrete optimization problems, such as DNA sequence alignment and RNA secondary structure prediction. However, the optimal solutions are fundamentally dependent on the parameters used in the objective functions. The goal of a parametric analysis is to elucidate such dependencies, especially as they pertain to the accuracy and robustness of the optimal solutions. Techniques from geometric combinatorics, including polytopes and their normal fans, have been used previously to give parametric analyses of simple models for DNA sequence alignment and RNA branching configurations. Here, we present a new computational framework, and proof-of-principle results, which give the first complete parametric analysis of the branching portion of the nearest neighbor thermodynamic model for secondary structure prediction for real RNA sequences.Comment: 17 pages, 8 figure

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Reliability analysis of reconstructing phylogenies under long branch attraction conditions

Author: Dissanayake Ranjan
Publication venue
Publication date: 01/05/2018
Field of study

Master's Project (M.S.) University of Alaska Fairbanks, 2018.In this simulation study we examined the reliability of three phylogenetic reconstruction techniques in a long branch attraction (LBA) situation: Maximum Parsimony (M P), Neighbor Joining (NJ), and Maximum Likelihood. Data were simulated under five DNA substitution models-JC, K2P, F81, HKY, and G T R-from four different taxa. Two branch length parameters of four taxon trees ranging from 0.05 to 0.75 with an increment of 0.02 were used to simulate DNA data under each model. For each model we simulated DNA sequences with 100, 250, 500 and 1000 sites with 100 replicates. When we have enough data the maximum likelihood technique is the most reliable of the three methods examined in this study for reconstructing phylogenies under LBA conditions. We also find that MP is the most sensitive to LBA conditions and that Neighbor Joining performs well under LBA conditions compared to MP

ScholarWorks@UA

Paradigms for computational nucleic acid design

Author: Dirks Robert M.
Lin Milo
Pierce Niles A.
Winfree Erik
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2004
Field of study

The design of DNA and RNA sequences is critical for many endeavors, from DNA nanotechnology, to PCR‐based applications, to DNA hybridization arrays. Results in the literature rely on a wide variety of design criteria adapted to the particular requirements of each application. Using an extensively studied thermodynamic model, we perform a detailed study of several criteria for designing sequences intended to adopt a target secondary structure. We conclude that superior design methods should explicitly implement both a positive design paradigm (optimize affinity for the target structure) and a negative design paradigm (optimize specificity for the target structure). The commonly used approaches of sequence symmetry minimization and minimum free‐energy satisfaction primarily implement negative design and can be strengthened by introducing a positive design component. Surprisingly, our findings hold for a wide range of secondary structures and are robust to modest perturbation of the thermodynamic parameters used for evaluating sequence quality, suggesting the feasibility and ongoing utility of a unified approach to nucleic acid design as parameter sets are refined further. Finally, we observe that designing for thermodynamic stability does not determine folding kinetics, emphasizing the opportunity for extending design criteria to target kinetic features of the energy landscape

CiteSeerX

Caltech Authors