
10 research outputs found

Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

Author: A Dress
A Godzik
A Löytynoja
A Novák
A Sali
A Siepel
A Tramontano
Adrienn Szabó
AS Schwartz
B Dwivedi
B Knudsen
B Larget
B Misof
B Schwikowski
BD Redelings
BJM Webb
BP Blackburne
C Dessimoz
C Notredame
CB Do
CJ Challis
D Altschuh
D Chivian
D DeBlasio
D Lupyan
D Metzler
D Robinson
DA Morrison
DF Feng
E Levy Karin
G Jordan
G Landan
G Lunter
G Raghava
G Talavera
GA Churchill
GA Lunter
Hall B G
HT Mevissen
I Holmes
I Miklós
IL Dryden
IM Wallace
István Miklós
J Castresana
J Felsenstein
J Gatesy
J Hein
J Kim
J Zhu
JA Lake
JD Thompson
JL Thorne
Joseph L Herman
Jotun Hein
K Bucka-Lassen
K Liu
KM Wong
L Wang
L Yu
LE Carvalho
LS Wang
M Hamada
M Höhl
M Vingron
M Wu
M Zuker
MA Suchard
MJ Wise
MO Dayhoff
MP Simmons
MS Waterman
MSY Lee
O Gotoh
O Penn
P Ajawatanawong
P Arunapuram
P Collingridge
PJ Green
PP Gardner
R Durbin
R Satija
R Schwarzenbacher
RA Cartwright
RC Edgar
RJ Dickson
RK Bradley
Rune Lyngsø
S Capella-Gutiérrez
S Karlin
S Miyazawa
S Needleman
S Sinha
Silla-Martínez Capella-Gutiérrez S
SME Sahraeian
TA Hopf
TH Ogden
TL Blundell
U Roshan
V Ahola
W Fletcher
WC Wheeler
Y Liu
Y Ruffieux
Ádám Novák
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign
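
The alignment DAG idea in the abstract above can be illustrated with a minimal Python sketch. This is not the WeaveAlign implementation: the column encoding (each sequence contributes the number of residues consumed so far plus a gap flag) and the names `encode_columns`, `build_dag`, and `count_paths` are assumptions made for illustration only. The sketch builds a DAG from sampled alignments, estimates column frequencies, and counts the alignments the graph encodes.

```python
from collections import Counter, defaultdict

START, END = "START", "END"  # hypothetical sentinel nodes


def encode_columns(alignment):
    """Convert equal-length gapped strings into position-aware columns.

    Assumed encoding: for each sequence, (residues consumed so far, is_residue).
    Two columns are equal only if they align the same residue positions.
    """
    consumed = [0] * len(alignment)
    columns = []
    for chars in zip(*alignment):
        col = []
        for i, ch in enumerate(chars):
            is_residue = ch != "-"
            if is_residue:
                consumed[i] += 1
            col.append((consumed[i], is_residue))
        columns.append(tuple(col))
    return columns


def build_dag(samples):
    """Return (successor sets, empirical column frequencies) for the samples."""
    column_counts = Counter()
    successors = defaultdict(set)
    for alignment in samples:
        path = [START] + encode_columns(alignment) + [END]
        column_counts.update(path[1:-1])
        for a, b in zip(path, path[1:]):
            successors[a].add(b)
    freqs = {col: n / len(samples) for col, n in column_counts.items()}
    return successors, freqs


def count_paths(successors):
    """Count START-to-END paths, i.e. alignments encoded by the DAG."""
    memo = {END: 1}

    def paths_from(node):
        if node not in memo:
            memo[node] = sum(paths_from(nxt) for nxt in successors[node])
        return memo[node]

    return paths_from(START)


# Two sampled alignments of the sequences "ACG" and "TCA" that share
# only the middle column.
samples = [
    ["A-CG-", "-TC-A"],
    ["-AC-G", "T-CA-"],
]
successors, freqs = build_dag(samples)
print(count_paths(successors))
```

With these two samples the graph contains four paths, because either variant before the shared middle column can be combined with either variant after it. This is the sense in which the DAG encodes more alignments than were sampled.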

Crossref

SZTAKI Publication Repository

Springer - Publisher Connector

PubMed Central

Oxford University Research Archive

PC Mini-Grids for Prediction of Viral RNA Structure and Evolution Final Report

Author: Bardram Jakob Eyvind
Publication venue: IT-Universitetet i København
Publication date: 01/01/2013

The IT University of Copenhagen's Repository

Approaches to Efficient Multiple Sequence Alignment and Protein Search

Author: Szabó Adrienn
Publication date: 01/01/2016

ELTE Digital Institutional Repository (EDIT)

Using phylogenetics and model selection to investigate the evolution of RNA genes in genomic alignments.

Author: Allen James
Publication date: 01/08/2014

The University of Manchester - Institutional Repository

Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map

Author: Kiyoshi Ezawa
Publication venue: Springer Nature
Publication date: 01/01/2016

Springer - Publisher Connector

Robust Algorithms for Detecting Hidden Structure in Biological Data

Author: Sloutsky Roman
Publication venue: Washington University Open Scholarship
Publication date: 15/08/2017

Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects its underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe identifying hidden specificity determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor.
This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution.

Washington University St. Louis: Open Scholarship

Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map

Author: A Löytynoja
AB Diallo
AR Subramanian
B Marsden
B Paten
BP Blackburne
C Notredame
CB Do
CL Strope
D Feng
D Graur
D Gusfield
D Villar
DA Morrison
EA O’Brien
EL Braun
G Landan
G Lunter
GA Lunter
I Holmes
I Walle Van
J Felsenstein
J Kim
J Pei
JA Eisen
JD Thompson
JM Chang
JS Farris
K Arnold
K Ezawa
K Katoh
Kiyoshi Ezawa
KM Wong
KS Pollard
L Chindelevitch
L Wang
LA Stebbings
LM Wallace
M Lynch
MA Suchard
MP Berger
O Gotoh
O Penn
O Westesson
P Markova-Raina
PP Gardner
RA Cartwright
RC Edgar
RD Finn
RE Hickson
RK Bradley
S Guindon
S Kumar
S Nelesen
SB Needleman
SF Altschul
T Lassmann
TH Jukes
TH Ogden
U Roshan
W Fletcher
W Miller
Z Yang
Á Novák
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Statistical estimation problems in phylogenomics and applications in microbial ecology

Author: Nute Michael Gordon
Publication date: 01/08/2019

With the growing awareness of the potential for microbial communities to play a role in human health, environmental remediation and other important processes, the challenge of understanding such a complex population through the lens of high-throughput sequencing output has risen to the fore. For a de novo sequenced community, the first step to understanding the population involves comparing the sequences to a reference database in some form. In this dissertation, we consider some challenges and benefits of organizing the reference data according to evolution, with orthologous genes grouped together and stored as a multiple sequence alignment and phylogenetic tree. First we consider the related problem of estimating the population-level phylogeny of a group of species based on the alignments and phylogenies of several individual genes. Under one common model, species tree estimation is provably statistically consistent by several different methods, but those proofs rely on two separate and potentially shaky assumptions: that every species appears in the data for every gene (i.e., there is no missing data), and that since gene tree estimation is itself consistent, the gene trees used to compute the population-level tree are correct. Second, we explore some novel ways to use a Bayesian MCMC algorithm for jointly estimating alignment and phylogeny. The result is increased accuracy for large alignments, where the MCMC method alone would not be tractable. In the process, we identify a peculiar property of this Bayesian algorithm: it performs much differently on simulated sequences than on sequences from biological alignment benchmarks. No other alignment method tested showed the same divergence. Finally, we present two different practical applications of a reference database containing an alignment and tree for a group of gene families in the context of microbial ecology.
The first is an algorithm that uses the tree and alignment to construct an ensemble of profile hidden Markov models that improves remote homology detection. The second is a data visualization technique that generates an image of the community with a high density of data, but one that makes it naturally easy to compare many different samples at a time, potentially uncovering otherwise elusive patterns in the data.

Illinois Digital Environment for Access to Learning and Scholarship Repository

StatAlign 2.0: combining statistical alignment with RNA secondary structure prediction.

Author: Anderson JW
Arunapuram P
Edvardsson I
Golden M
Hein J
Novák A
Sükösd Z
Publication date: 01/01/2013

MOTIVATION: Comparative modeling of RNA is known to be important for making accurate secondary structure predictions. RNA structure prediction tools such as PPfold or RNAalifold use an aligned set of sequences in predictions. Obtaining a multiple alignment from a set of sequences is quite a challenging problem itself, and the quality of the alignment can affect the quality of a prediction. By implementing RNA secondary structure prediction in a statistical alignment framework, and predicting structures from multiple alignment samples instead of a single fixed alignment, it may be possible to improve predictions. RESULTS: We have extended the program StatAlign to make use of RNA-specific features, which include RNA secondary structure prediction from multiple alignments using either a thermodynamic approach (RNAalifold) or a Stochastic Context-Free Grammars (SCFGs) approach (PPfold). We also provide the user with scores relating to the quality of a secondary structure prediction, such as information entropy values for the combined space of secondary structures and sampled alignments, and a reliability score that predicts the expected number of correctly predicted base pairs. Finally, we have created RNA secondary structure visualization plugins and automated the process of setting up Markov Chain Monte Carlo runs for RNA alignments in StatAlign. AVAILABILITY AND IMPLEMENTATION: The software is available from http://statalign.github.com/statalign/

Oxford University Research Archive

StatAlign 2.0: combining statistical alignment with RNA secondary structure prediction

Author: Bernhart
Darty
Edgar
Griffiths-Jones
Hein
Ingolfur Edvardsson
James W. J. Anderson
Jotun Hein
Katoh
Knudsen
Michael Golden
Novak
Preeti Arunapuram
Sievers
Sükösd
Thorne
Zsuzsanna Sükösd
Ádám Novák
Publication venue: 'Oxford University Press (OUP)'

Crossref