Counting, generating and sampling tree alignments

Author: A Dress; A Torres; C Herrbach; CB Do; G Blin; H Andrade; HS Wilf; M Höchsmann; M Vingron; MS Waterman; P Flajolet; S Schirmer; T Jiang; Y Ponty
Publication date: 07/03/2016

Pairwise ordered tree alignments are combinatorial objects that appear in RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis. In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a context-free grammar by means of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees S and T. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the Gibbs-Boltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door for a probabilistic analysis of the space of suboptimal RNA secondary structure alignments.

Comment: ALCOB - 3rd International Conference on Algorithms for Computational Biology - 2016, Jun 2016, Trujillo, Spain. 201
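
The sampling idea in this abstract can be illustrated on a simpler problem. The sketch below is not the authors' tree-alignment algorithm: it assumes plain pairwise sequence alignment with a linear gap penalty, fills a partition-function table by dynamic programming, and then draws an alignment by stochastic backtracking, so that each edit path is drawn with probability proportional to exp(score / kT). The names `partition_table`, `sample_alignment`, `score`, `gap` and `kT` are hypothetical. This simplified scheme counts a deletion followed by an insertion and the reverse order as two distinct paths, which is the kind of ambiguity the paper avoids by working up to equivalence.

```python
import math
import random


def partition_table(a, b, score, gap, kT=1.0):
    """Z[i][j] = sum of exp(path_score / kT) over all edit paths aligning a[i:] with b[j:]."""
    n, m = len(a), len(b)
    Z = [[0.0] * (m + 1) for _ in range(n + 1)]
    Z[n][m] = 1.0  # the empty alignment has weight 1
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            Z[i][j] = sum(w for w, _, _, _ in _moves(a, b, i, j, score, gap, kT, Z))
    return Z


def _moves(a, b, i, j, score, gap, kT, Z):
    """Weighted continuations from cell (i, j): (weight, column, next_i, next_j)."""
    out = []
    if i < len(a) and j < len(b):  # align a[i] with b[j]
        w = math.exp(score(a[i], b[j]) / kT) * Z[i + 1][j + 1]
        out.append((w, (a[i], b[j]), i + 1, j + 1))
    if i < len(a):  # a[i] against a gap
        out.append((math.exp(gap / kT) * Z[i + 1][j], (a[i], "-"), i + 1, j))
    if j < len(b):  # gap against b[j]
        out.append((math.exp(gap / kT) * Z[i][j + 1], ("-", b[j]), i, j + 1))
    return out


def sample_alignment(a, b, score, gap, kT=1.0, rng=random):
    """Draw one edit path with probability proportional to its Boltzmann weight."""
    Z = partition_table(a, b, score, gap, kT)
    i = j = 0
    columns = []
    while i < len(a) or j < len(b):
        moves = _moves(a, b, i, j, score, gap, kT, Z)
        r = rng.random() * sum(w for w, _, _, _ in moves)
        for w, col, ni, nj in moves:
            r -= w
            if r <= 0:
                break
        # If rounding prevented a break, the last move is used.
        columns.append(col)
        i, j = ni, nj
    return columns
```

With all scores set to zero, `partition_table(a, b, ...)[0][0]` simply counts edit paths, which gives a quick sanity check of the recursion. Lowering `kT` concentrates the samples on high-scoring alignments, while raising it moves them toward a uniform draw over paths.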

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL-Polytechnique

Towards realistic benchmarks for multiple alignments of non-coding sequences

Author: A Loytynoja; A Prakash; A Siepel; AB Diallo; AG Clark; AP Dempster; AR Subramanian; AW Dress; B Paten; BG Hall; C Notredame; CM Bergman; D Karolchik; D Tian; DA Pollard; G Bejerano; G Landan; G Lunter; I Van Walle; J Felsenstein; J Kim; J Stoye; Jaebum Kim; JD Thompson; K Katoh; K Mizuguchi; L Chindelevitch; M Blanchette; M Brudno; MA Larkin; MS Rosenberg; N Bray; RA Cartwright; RC Edgar; RK Bradley; S Sinha; S Snir; Saurabh Sinha; TH Ogden; V Simossis; W Fletcher; W Huang; W Pirovano; X He; Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.

Results: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.

Conclusion: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Alignment uncertainty, regressive alignment and large scale deployment

Author: Floden Evan, 1985-
Publication venue: Universitat Pompeu Fabra
Publication date: 01/01/2018

A multiple sequence alignment (MSA) provides a description of the relationship between biological sequences where columns represent a shared ancestry through an implied set of evolutionary events. The majority of research in the field has focused on improving the accuracy of alignments within the progressive alignment framework and has allowed for powerful inferences including phylogenetic reconstruction, homology modelling and disease prediction. Notwithstanding this, when applied to modern genomics datasets - often comprising tens of thousands of sequences - new challenges arise in the construction of accurate MSA. These issues can be generalised to form three basic problems. Foremost, as the number of sequences increases, progressive alignment methodologies exhibit a dramatic decrease in alignment accuracy. Additionally, for any given dataset many possible MSA solutions exist, a problem which is exacerbated with an increasing number of sequences due to alignment uncertainty. Finally, technical difficulties hamper the deployment of such genomic analysis workflows - especially in a reproducible manner - often presenting a high barrier for even skilled practitioners. This work aims to address this trifecta of problems through a web server for fast homology extension based MSA, two new methods for improved phylogenetic bootstrap supports incorporating alignment uncertainty, a novel alignment procedure that improves large scale alignments termed regressive MSA and finally a workflow framework that enables the deployment of large scale reproducible analyses across clusters and clouds titled Nextflow. 
Together, this work can be seen to provide both conceptual and technical advances which deliver substantial improvements to existing MSA methods and the resulting inferences.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Tesis Doctorals en Xarxa

Statistical Methods for Conservation and Alignment Quality in Proteins

Author: Ahola Virpi
Publication venue: Annales Universitatis Turkuensis AII 228
Publication date: 07/11/2008

Construction of multiple sequence alignments is a fundamental task in Bioinformatics. Multiple sequence alignments are used as a prerequisite in many Bioinformatics methods, and subsequently the quality of such methods can be critically dependent on the quality of the alignment. However, automatic construction of a multiple sequence alignment for a set of remotely related sequences does not always provide biologically relevant alignments. Therefore, there is a need for an objective approach for evaluating the quality of automatically aligned sequences. The profile hidden Markov model is a powerful approach in comparative genomics. In the profile hidden Markov model, the symbol probabilities are estimated at each conserved alignment position. This can increase the dimension of parameter space and cause an overfitting problem. These two research problems are both related to conservation. We have developed statistical measures for quantifying the conservation of multiple sequence alignments. Two types of methods are considered, those identifying conserved residues in an alignment position, and those calculating positional conservation scores. The positional conservation score was exploited in a statistical prediction model for assessing the quality of multiple sequence alignments. The residue conservation score was used as part of the emission probability estimation method proposed for profile hidden Markov models. The predicted alignment quality scores correlated highly with the correct alignment quality scores, indicating that our method is reliable for assessing the quality of any multiple sequence alignment. The comparison of the emission probability estimation method with the maximum likelihood method showed that the number of estimated parameters in the model was dramatically decreased, while the same level of accuracy was maintained. To conclude, we have shown that conservation can be successfully used in the statistical model for alignment quality assessment and in the estimation of emission probabilities in the profile hidden Markov models.
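
As a rough illustration of what a positional conservation score can look like, the sketch below uses a normalized Shannon entropy per alignment column. This is an assumption chosen for simplicity, not the statistical measure developed in the thesis; the function names and the default `alphabet_size` of 20 (amino acids) are hypothetical.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20):
    """Return 1 - H/log(alphabet_size) for one column, ignoring gap characters.

    1.0 means a single residue type; values near 0 mean residues are spread
    evenly over the alphabet. An all-gap column is scored 0.0 by convention.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((k / n) * math.log(k / n) for k in Counter(residues).values())
    return 1.0 - entropy / math.log(alphabet_size)


def alignment_conservation(rows, alphabet_size=20):
    """Score every column of an alignment given as equal-length gapped strings."""
    return [column_conservation(col, alphabet_size) for col in zip(*rows)]
```

A per-column profile such as `alignment_conservation(["ACDE", "ACDF", "AC-G"])` could then serve as input features for a quality model, although which features the thesis actually uses is not stated in the abstract.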

UTUPub

Incorporating molecular data in fungal systematics: a guide for aspiring researchers

Author: Abarenkov Kessy; Bertrand Yann J. K.; Hartmann Martin; Hyde Kevin D.; Kauserud Håvard; Kristiansson Erik; Larsson Ellen; Manamgoda Dimuthu S.; Nilsson Henrik R.; Oxelman Bengt; Ryberg Martin; Tedersoo Leho; Udayanga Dhanushka
Publication date: 01/01/2013

The last twenty years have witnessed molecular data emerge as a primary research instrument in most branches of mycology. Fungal systematics, taxonomy, and ecology have all seen tremendous progress and have undergone rapid, far-reaching changes as disciplines in the wake of continual improvement in DNA sequencing technology. A taxonomic study that draws from molecular data involves a long series of steps, ranging from taxon sampling through the various laboratory procedures and data analysis to the publication process. All steps are important and influence the results and the way they are perceived by the scientific community. The present paper provides a reflective overview of all major steps in such a project, with the aim of assisting research students about to begin their first study using DNA-based methods. We also take the opportunity to discuss the role of taxonomy in biology and the life sciences in general in the light of molecular data. While the best way to learn molecular methods is to work side by side with someone experienced, we hope that the present paper will serve to lower the learning threshold for the reader.

Comment: Submitted to Current Research in Environmental and Applied Mycology - comments most welcome

arXiv.org e-Print Archive

Chalmers Research

Chalmers Publication Library

Measuring Global Credibility with Application to Local Sequence Alignment

Author: Andrey Rzhetsky; B-JM Webb; Bobbie-Jo M. Webb-Robertson; BP Carlin; C Webber; Charles E. Lawrence; D Naor; DJ Lipman; HS Booth; HT Mevissen; I Holmes; J Zhu; JP Comet; JS Liu; KA Perry; KM Chao; L Yu; LE Carvalho; Lee Ann McCue; M Kendall; M Schlosshauer; M Vingron; M Zuker; ME Dayhoff; ML Tress; MS Waterman; R Durbin; RL Ott; S Henikoff; S Karlin; S Miyazawa; SF Altschul; TF Smith; W Thompson; WR Pearson; Y Ding; YK Yu
Publication venue: Public Library of Science
Publication date: 01/05/2008

Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a 100(1−α)%, 0≤α≤1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing 100(1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.
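
The credibility limit defined in this abstract can be approximated from posterior samples. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: each alignment is represented as a set of aligned index pairs, the Hamming distance is taken as the size of the symmetric difference between two such sets, and the posterior is approximated by (optionally weighted) samples. The names `hamming` and `credibility_limit` are hypothetical.

```python
def hamming(x, y):
    """Hamming distance between two alignments given as sets of aligned (i, j) pairs."""
    return len(set(x) ^ set(y))


def credibility_limit(estimate, samples, alpha, weights=None):
    """Smallest radius around `estimate` holding at least (1 - alpha) of the sampled mass."""
    if not samples:
        raise ValueError("need at least one posterior sample")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if weights is None:
        weights = [1.0] * len(samples)
    target = (1.0 - alpha) * sum(weights)
    # Visit samples from nearest to farthest, accumulating posterior mass.
    by_distance = sorted((hamming(estimate, s), w) for s, w in zip(samples, weights))
    mass = 0.0
    for distance, w in by_distance:
        mass += w
        if mass >= target - 1e-12:
            return distance
    return by_distance[-1][0]
```

Comparing the value returned for a maximum similarity alignment with the value for a centroid alignment, on the same samples, mirrors the comparison the abstract reports.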

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

Author: A Dress; A Godzik; A Löytynoja; A Novák; A Sali; A Siepel; A Tramontano; Adrienn Szabó; AS Schwartz; B Dwivedi; B Knudsen; B Larget; B Misof; B Schwikowski; BD Redelings; BJM Webb; BP Blackburne; C Dessimoz; C Notredame; CB Do; CJ Challis; D Altschuh; D Chivian; D DeBlasio; D Lupyan; D Metzler; D Robinson; DA Morrison; DF Feng; E Levy Karin; G Jordan; G Landan; G Lunter; G Raghava; G Talavera; GA Churchill; GA Lunter; Hall B G; HT Mevissen; I Holmes; I Miklós; IL Dryden; IM Wallace; István Miklós; J Castresana; J Felsenstein; J Gatesy; J Hein; J Kim; J Zhu; JA Lake; JD Thompson; JL Thorne; Joseph L Herman; Jotun Hein; K Bucka-Lassen; K Liu; KM Wong; L Wang; L Yu; LE Carvalho; LS Wang; M Hamada; M Höhl; M Vingron; M Wu; M Zuker; MA Suchard; MJ Wise; MO Dayhoff; MP Simmons; MS Waterman; MSY Lee; O Gotoh; O Penn; P Ajawatanawong; P Arunapuram; P Collingridge; PJ Green; PP Gardner; R Durbin; R Satija; R Schwarzenbacher; RA Cartwright; RC Edgar; RJ Dickson; RK Bradley; Rune Lyngsø; S Capella-Gutiérrez; S Karlin; S Miyazawa; S Needleman; S Sinha; Silla-Martínez Capella-Gutiérrez S; SME Sahraeian; TA Hopf; TH Ogden; TL Blundell; U Roshan; V Ahola; W Fletcher; WC Wheeler; Y Liu; Y Ruffieux; Ádám Novák
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment.

Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased.

Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign.
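
The column-DAG idea can be sketched in a few lines. The code below is an illustrative toy, not the WeaveAlign implementation: it assumes each column is identified, per row, by how many residues that row has already consumed and whether the column holds a residue or a gap. That encoding is an assumption made here so that any path assembled from observed edges consumes every sequence consistently. The names `encode_columns`, `build_dag` and `count_paths` are hypothetical.

```python
from collections import Counter, defaultdict

START, END = "START", "END"


def encode_columns(alignment):
    """Turn equal-length gapped rows into hashable column keys (all-gap columns are skipped)."""
    consumed = [0] * len(alignment)
    keys = []
    for k in range(len(alignment[0])):
        key = []
        for r, row in enumerate(alignment):
            is_residue = row[k] != "-"
            key.append((consumed[r], is_residue))
            consumed[r] += is_residue
        if any(flag for _, flag in key):
            keys.append(tuple(key))
    return keys


def build_dag(samples):
    """Return (column counts, successor sets) for a list of sampled alignments."""
    counts = Counter()
    edges = defaultdict(set)
    for alignment in samples:
        path = [START] + encode_columns(alignment) + [END]
        counts.update(path[1:-1])
        for u, v in zip(path, path[1:]):
            edges[u].add(v)
    return counts, edges


def count_paths(edges):
    """Number of START-to-END paths, i.e. alignments encoded by the DAG."""
    memo = {END: 1}

    def walk(u):
        if u not in memo:
            memo[u] = sum(walk(v) for v in edges[u])
        return memo[u]

    return walk(START)
```

Dividing `counts[column]` by the number of samples gives the empirical column frequency mentioned in the abstract. When two samples share a column but differ on both sides of it, `count_paths` exceeds the number of samples, which is the enlargement of the encoded alignment set that the abstract describes.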

Crossref

SZTAKI Publication Repository

Springer - Publisher Connector

PubMed Central

Oxford University Research Archive