GUIDANCE: a web server for assessing alignment confidence scores

Author: Castresana
D. Graur
E. Privman
G. Landan
Gatesy
Giribet
H. Ashkenazy
Katoh
Landau
Lassmann
Loytynoja
Neil
Nomaguchi
O. Penn
Poirot
Rambaut
Stoye
T. Pupko
Thompson
Wong
Publication venue: Oxford University Press

Evaluating the accuracy of multiple sequence alignment (MSA) is critical for virtually every comparative sequence analysis that uses an MSA as input. Here we present the GUIDANCE web-server, a user-friendly, open access tool for the identification of unreliable alignment regions. The web-server accepts as input a set of unaligned sequences. The server aligns the sequences and provides a simple graphic visualization of the confidence score of each column, residue and sequence of an alignment, using a color-coding scheme. The method is generic and the user is allowed to choose the alignment algorithm (ClustalW, MAFFT and PRANK are supported) as well as any type of molecular sequences (nucleotide, protein or codon sequences). The server implements two different algorithms for evaluating confidence scores: (i) the heads-or-tails (HoT) method, which measures alignment uncertainty due to co-optimal solutions; (ii) the GUIDANCE method, which measures the robustness of the alignment to guide-tree uncertainty. The server projects the confidence scores onto the MSA and points to columns and sequences that are unreliably aligned. These can be automatically removed in preparation for downstream analyses. GUIDANCE is freely available for use at http://guidance.tau.ac.il
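As a rough illustration of column-level confidence, the sketch below scores each column of a base alignment by the fraction of alternative alignments (for example, ones built from perturbed guide trees) that contain the same column. It is a simplified agreement measure under stated assumptions, not the scoring used by the GUIDANCE server; the names `columns` and `column_support` and the toy alignments are hypothetical.

```python
def columns(alignment):
    """Convert gapped rows into columns keyed by residue indices (None = gap)."""
    counters = [0] * len(alignment)  # next residue index for each sequence
    cols = []
    for chars in zip(*alignment):
        col = []
        for i, ch in enumerate(chars):
            if ch == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def column_support(base, alternatives):
    """Fraction of alternative alignments that reproduce each base column."""
    alt_cols = [set(columns(a)) for a in alternatives]
    return [sum(c in s for s in alt_cols) / len(alt_cols) for c in columns(base)]


# Toy data (assumed): two sequences, both "ACG", aligned in two ways.
base = ["ACG-", "AC-G"]
alternatives = [["ACG-", "AC-G"], ["A-CG", "-ACG"]]
print(column_support(base, alternatives))  # [0.5, 1.0, 0.5, 0.5]
```

Columns whose support falls below a chosen cutoff could then be flagged or removed before downstream analysis.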

Crossref

PubMed Central

Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

Author: A Löytynoja
A Löytynoja
B Sipos
BG Hall
BG Hall
BP Blackburne
C Chothia
C Dessimoz
C Kemena
C Kemena
C Notredame
CB Do
CL Strope
DA Dalquen
DA Morrison
DH Mathews
ER Mardis
G Blackshields
G Jordan
G Landan
GP Raghava
I Walle Van
J Kim
J Stoye
JD Thompson
JD Thompson
JD Thompson
JD Thompson
JD Thompson
JD Thompson
JH Havgaard
JP Huelsenbeck
K Mizuguchi
LA Stebbings
M Anisimova
M Pop
MR Aniba
P Gardner
RA Cartwright
RB Russell
RC Edgar
RC Edgar
SA Berger
SF Altschul
T Golubchik
T Koestler
T Lassmann
T Lassmann
T Lassmann
W Fletcher
Publication venue: Springer Science and Business Media LLC
Publication date: 09/11/2012

Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Alignment accuracy is therefore crucial to a vast range of analyses, often in ways that are difficult to assess within those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies (based on simulation, consistency, protein structure, and phylogeny) and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should choose a benchmark according to the context of application, with a keen awareness of the assumptions underlying each benchmarking strategy.

arXiv.org e-Print Archive

Crossref

UCL Discovery

Towards realistic benchmarks for multiple alignments of non-coding sequences

Author: A Loytynoja
A Prakash
A Prakash
A Siepel
AB Diallo
AG Clark
AP Dempster
AR Subramanian
AW Dress
B Paten
BG Hall
C Notredame
CM Bergman
D Karolchik
D Tian
DA Pollard
DA Pollard
G Bejerano
G Landan
G Landan
G Lunter
G Lunter
I Van Walle
J Felsenstein
J Kim
J Kim
J Stoye
Jaebum Kim
JD Thompson
K Katoh
K Mizuguchi
L Chindelevitch
M Blanchette
M Blanchette
M Brudno
MA Larkin
MS Rosenberg
N Bray
RA Cartwright
RC Edgar
RK Bradley
RK Bradley
S Sinha
S Snir
Saurabh Sinha
TH Ogden
V Simossis
W Fletcher
W Huang
W Pirovano
X He
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks. Results: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments. Conclusion: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination

Author: A Loytynoja
A Loytynoja
A Mangold
A Phillips
A Stamatakis
A Stamatakis
AF Zuur
AL Hipp
Alexandros Stamatakis
B Dwivedi
B McCune
B McCune
B Staiger
BD Redelings
BD Redelings
BG Hall
BG Hall
Brendan P Hodkinson
C Moritz
CW Cunningham
D González
DF Robinson
DL Aylor
DL Swofford
DM Hillis
DT Jones
EW Price
F Lutzoni
F Ronquist
G Didier
G Didier
G Landan
G Landan
G Lunter
G Talavera
GJ Olsen
GJ Olsen
J Gatesy
J Miadlikowska
JD Lawrey
JD Thompson
K Katoh
K Katoh
K Katoh
K Kjer
K Liu
M Kimura
MA Larkin
MJ Anderson
MSY Lee
O Penn
O Penn
P Legendre
P Legendre
PD Hebert
PR Minchin
R Development Core Team
R Fleissner
R Meier
RA Cartwright
RA Cartwright
RA Cartwright
RA Cartwright
RC Edgar
Reed A Cartwright
Robert Lücking
S Karlin
S Lehtonen
S Roch
SA Berger
SA Smith
TH Ogden
TH Ogden
W Fletcher
WC Wheeler
WC Wheeler
WC Wheeler
WC Wheeler
WC Wheeler
WP Maddison
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: We present a novel method to encode ambiguously aligned regions in fixed multiple sequence alignments by 'Pairwise Identity and Cost Scores Ordination' (PICS-Ord). The method works via ordination of sequence identity or cost score matrices by means of Principal Coordinates Analysis (PCoA). After identification of ambiguous regions, the method computes pairwise distances as sequence identities or cost scores, ordinates the resulting distance matrix by means of PCoA, and encodes the principal coordinates as ordered integers. Three biological and 100 simulated datasets were used to assess the performance of the new method. Results: Including ambiguous regions coded by means of PICS-Ord increased topological accuracy, resolution, and bootstrap support in real biological and simulated datasets compared to the alternative of excluding such regions from the analysis a priori. In terms of accuracy, PICS-Ord performs as well as or better than previously available methods of ambiguous region coding (e.g., INAASE), with the advantages of a practically unlimited alignment size, increased analytical speed, and the possibility of analyzing PICS-Ord scores together with DNA data in a partitioned maximum likelihood model. Conclusions: Advantages of PICS-Ord over step matrix-based ambiguous region coding with INAASE include a practically unlimited number of OTUs and seamless integration of PICS-Ord codes into phylogenetic datasets, as well as the increased speed of phylogenetic analysis. Contrary to word- and frequency-based methods, PICS-Ord maintains the advantage of pairwise sequence alignment to derive distances, and the method is flexible with respect to the calculation of distance scores. In addition to distance and maximum parsimony, PICS-Ord codes can be analyzed in a Bayesian or maximum likelihood framework. RAxML (version 7.2.6 or higher, which was developed for this study) allows up to 32-state ordered or unordered characters. A GTR, MK, or ORDERED model can be applied to analyse the PICS-Ord codes partition, with GTR performing slightly better than MK and ORDERED. Availability: An implementation of the PICS-Ord algorithm is available from http://scit.us/projects/ngila/wiki/PICS-Ord. It requires both the statistical software R (http://www.r-project.org) and the alignment software Ngila (http://scit.us/projects/ngila).

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Accounting For Alignment Uncertainty in Phylogenomics

Author: A Drummond
A Loytynoja
A Loytynoja
A Stamatakis
AS Schwartz
AS Schwartz
B Morgenstern
BD Redelings
BG Hall
C Dessimoz
C Notredame
CB Do
D Wu
DA Morrison
DJ States
G Landan
G Talavera
I Van Walle
J Castresana
J Felsenstein
J Pei
J Stoye
JA Lake
JD Thompson
JD Thompson
Jonathan A. Eisen
K Bucka-Lassen
K Katoh
K Liu
KM Kjer
KM Wong
M Steel
M Wu
Marco Salemi
Martin Wu
MO Dayhoff
MS Lee
MS Rosenberg
N Bray
O Penn
P Cammarano
P Kuck
R Durbin
RC Edgar
RC Edgar
RK Bradley
S Guindon
S Hartmann
Sourav Chatterji
T Lassmann
T Lassmann
T Pupko
TH Ogden
U Roshan
UW Hwang
WN Grundy
Publication venue: Public Library of Science
Publication date: 01/01/2012

Uncertainty in multiple sequence alignments has a large impact on phylogenetic analyses. Little has been done to evaluate the quality of individual positions in protein sequence alignments, which directly impact the accuracy of phylogenetic trees. Here we describe ZORRO, a probabilistic masking program that accounts for alignment uncertainty by assigning confidence scores to each alignment position. Using the BALIBASE database and simulation studies, we demonstrate that masking by ZORRO significantly reduces alignment uncertainty and improves tree accuracy.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

FigShare

A Method for the Simultaneous Estimation of Selection Intensities in Overlapping Genes

Author: A Narechania
A Pavesi
A Pavesi
AL Hughes
AL Hughes
AM Pedersen
BG Barrell
CE Jones
Dan Graur
DC Krakauer
EC Holmes
F Lillo
Giddy Landan
H Okamoto
HL Zaaijer
I Makalowska
IB Rogozin
J Hein
J Montoya
J Zhang
JC Obenauer
KR Sakharkar
KS Li
L Campitelli
M Nei
N Goldman
Niv Sabath
Oliver G. Pybus
P Pamilo
PK Keese
PR Cooper
R Belshaw
R Nielsen
RA Smith
S de Groot
S de Groot
S Guyader
S McCauley
S McCauley
S Normark
SB Needleman
T Miyata
WH Li
Y Bao
Y Suzuki
Z Yang
Z Yang
Z Yang
ZI Johnson
Publication venue: Public Library of Science
Publication date: 01/01/2008

Inferring the intensity of positive selection in protein-coding genes is important since it is used to shed light on the process of adaptation. Recently, it has been reported that overlapping genes, which are ubiquitous in all domains of life, seem to exhibit inordinate degrees of positive selection. Here, we present a new method for the simultaneous estimation of selection intensities in overlapping genes. We show that the appearance of positive selection is caused by assuming that selection operates independently on each gene in an overlapping pair, thereby ignoring the unique evolutionary constraints on overlapping coding regions. Our method uses an exact evolutionary model, thereby obviating the need for approximation or intensive computation. We test the method by simulating the evolution of overlapping genes of different types as well as under diverse evolutionary scenarios. Our results indicate that the independent estimation approach leads to the false appearance of positive selection even though the gene is in reality subject to negative selection. Finally, we use our method to estimate selection in two influenza A genes for which positive selection was previously inferred. We find no evidence for positive selection in either case.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

Author: A Dress
A Godzik
A Löytynoja
A Löytynoja
A Novák
A Novák
A Sali
A Siepel
A Tramontano
Adrienn Szabó
AS Schwartz
AS Schwartz
B Dwivedi
B Knudsen
B Larget
B Misof
B Schwikowski
BD Redelings
BD Redelings
BJM Webb
BP Blackburne
C Dessimoz
C Notredame
C Notredame
CB Do
CJ Challis
D Altschuh
D Chivian
D DeBlasio
D Lupyan
D Metzler
D Metzler
D Robinson
DA Morrison
DF Feng
E Levy Karin
G Jordan
G Landan
G Lunter
G Lunter
G Lunter
G Raghava
G Talavera
GA Churchill
GA Lunter
Hall B G
HT Mevissen
I Holmes
I Miklós
I Miklós
IL Dryden
IM Wallace
István Miklós
J Castresana
J Felsenstein
J Gatesy
J Hein
J Kim
J Zhu
JA Lake
JD Thompson
JD Thompson
JL Thorne
JL Thorne
JL Thorne
JL Thorne
Joseph L Herman
Jotun Hein
K Bucka-Lassen
K Liu
K Liu
KM Wong
L Wang
L Yu
LE Carvalho
LS Wang
M Hamada
M Hamada
M Hamada
M Höhl
M Vingron
M Vingron
M Wu
M Zuker
MA Suchard
MJ Wise
MO Dayhoff
MP Simmons
MS Waterman
MSY Lee
O Gotoh
O Penn
O Penn
O Penn
P Ajawatanawong
P Arunapuram
P Collingridge
PJ Green
PJ Green
PP Gardner
R Durbin
R Satija
R Satija
R Schwarzenbacher
RA Cartwright
RC Edgar
RJ Dickson
RJ Dickson
RK Bradley
Rune Lyngsø
S Capella-Gutiérrez
S Karlin
S Miyazawa
S Needleman
S Sinha
Silla-Martínez Capella-Gutiérrez S
SME Sahraeian
TA Hopf
TH Ogden
TL Blundell
U Roshan
V Ahola
W Fletcher
WC Wheeler
Y Liu
Y Ruffieux
Ádám Novák
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign
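The idea that a DAG of columns encodes more alignments than were sampled can be sketched in a few lines. The example below is a minimal illustration, not the WeaveAlign implementation: columns are keyed only by residue indices (a simplifying assumption), and the names `build_dag` and `count_paths` and the toy samples are hypothetical.

```python
from functools import lru_cache

START, END = "START", "END"  # sentinel nodes bracketing every alignment


def columns(alignment):
    """Convert gapped rows into columns keyed by residue indices (None = gap)."""
    counters = [0] * len(alignment)
    cols = []
    for chars in zip(*alignment):
        col = []
        for i, ch in enumerate(chars):
            if ch == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def build_dag(samples):
    """Map each column to the set of columns that follow it in any sample."""
    succ = {}
    for aln in samples:
        path = [START] + columns(aln) + [END]
        for a, b in zip(path, path[1:]):
            succ.setdefault(a, set()).add(b)
    return succ


def count_paths(succ):
    """Count START-to-END paths; each path spells out one alignment."""
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == END:
            return 1
        return sum(paths_from(nxt) for nxt in succ.get(node, ()))
    return paths_from(START)


# Two sampled alignments of the same pair of sequences (both "ACG").
samples = [["ACG-", "AC-G"], ["A-CG", "-ACG"]]
print(count_paths(build_dag(samples)))  # 4
```

Because both samples share the middle column, the two alternatives on either side of it can be combined freely, so two samples yield four paths.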

Crossref

SZTAKI Publication Repository

Springer - Publisher Connector

PubMed Central

Oxford University Research Archive

Informational Gene Phylogenies Do Not Support a Fourth Domain of Life for Nucleocytoplasmic Large DNA Viruses

Author: A Mooers
A Stamatakis
A Stamatakis
BL Scola
BR Holland
C Notredame
C Vossbrinck
CJ Cox
CR Woese
CR Woese
D Moreira
D Moreira
D Moreira
D Moreira
D Raoult
D Raoult
DT Jones
E Susko
Eva Heinz
F Abascal
FD Ciccarelli
G Landan
H Philippe
H Philippe
I Hrdy
IM Wallace
J Castresana
J Felsenstein
J Filée
Ja Lake
Ja Lake
JP Bollback
LS Quang
M Boyer
MC Rivera
MJ Phillips
MM Miyamoto
MO Dayhoff
N Lartillot
N Lartillot
N Lartillot
P Forterre
P Foster
PJ Lockhart
R Nielsen
RC Edgar
Rosie Redfield
RP Hirt
S Blanquart
S Guindon
S Whelan
SQ Le
SQ Le
T Dagan
T. Martin Embley
TM Embley
TM Embley
Tom A. Williams
V Hampl
Publication venue: Public Library of Science
Publication date: 01/01/2011

Mimivirus is a nucleocytoplasmic large DNA virus (NCLDV) with a genome size (1.2 Mb) and coding capacity (~1000 genes) comparable to that of some cellular organisms. Unlike other viruses, Mimivirus and its NCLDV relatives encode homologs of broadly conserved informational genes found in Bacteria, Archaea, and Eukaryotes, raising the possibility that they could be placed on the tree of life. A recent phylogenetic analysis of these genes showed the NCLDVs emerging as a monophyletic group branching between Eukaryotes and Archaea. These trees were interpreted as evidence for an independent “fourth domain” of life that may have contributed DNA processing genes to the ancestral eukaryote. However, the analysis of ancient evolutionary events is challenging, and tree reconstruction is susceptible to bias resulting from non-phylogenetic signals in the data. These include compositional heterogeneity and homoplasy, which can lead to the spurious grouping of compositionally similar or fast-evolving sequences. Here, we show that these informational gene alignments contain both significant compositional heterogeneity and homoplasy, which were not adequately modelled in the original analysis. When we use more realistic evolutionary models that better fit the data, the resulting trees are unable to reject a simple null hypothesis in which these informational genes, like many other NCLDV genes, were acquired by horizontal transfer from eukaryotic hosts. Our results suggest that a fourth domain is not required to explain the available sequence data.

Crossref

Directory of Open Access Journals

PubMed Central

Explore Bristol Research

Evolutionary Patterning: A Novel Approach to the Identification of Potential Drug Target Sites in Plasmodium falciparum

Malaria continues to be the most lethal protozoan disease of humans. Drug development programs exhibit a high attrition rate and parasite resistance to chemotherapeutic drugs exacerbates the problem. Strategies that limit the development of resistance and minimize host side-effects are therefore of major importance. In this study, a novel approach, termed evolutionary patterning (EP), was used to identify suitable drug target sites that would minimize the emergence of parasite resistance. EP uses the ratio of non-synonymous to synonymous substitutions (ω) to assess the patterns of evolutionary change at individual codons in a gene and to identify codons under the most intense purifying selection (ω≤0.1). The extreme evolutionary pressure to maintain these residues implies that resistance mutations are highly unlikely to develop, which makes them attractive chemotherapeutic targets. Method validation included a demonstration that none of the residues providing pyrimethamine resistance in the Plasmodium falciparum dihydrofolate reductase enzyme were under extreme purifying selection. To illustrate the EP approach, the putative P. falciparum glycerol kinase (PfGK) was used as an example. The gene was cloned and the recombinant protein was active in vitro, verifying the database annotation. Parasite and human GK gene sequences were analyzed separately as part of protozoan and metazoan clades, respectively, and key differences in the evolutionary patterns of the two molecules were identified. Potential drug target sites containing residues under extreme evolutionary constraints were selected. Structural modeling was used to evaluate the functional importance and drug accessibility of these sites, which narrowed down the number of candidates. The strategy of evolutionary patterning and refinement with structural modeling addresses the problem of targeting sites to minimize the development of drug resistance. This represents a significant advance for drug discovery programs in malaria and other infectious diseases.
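The codon-selection step of EP can be sketched as a simple threshold filter. The snippet below assumes per-codon ω estimates are already available from some codon-model analysis; the function name `constrained_codons` and the ω values are hypothetical and for illustration only.

```python
def constrained_codons(omega_by_codon, threshold=0.1):
    """Return codon positions whose omega is at or below the threshold."""
    return [pos for pos, omega in sorted(omega_by_codon.items())
            if omega <= threshold]


# Made-up per-codon omega estimates (codon position -> omega).
omega = {1: 0.05, 2: 0.40, 3: 0.10, 4: 1.20}
print(constrained_codons(omega))  # [1, 3]
```

In the approach described above, positions selected this way would still be filtered further by structural modeling.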

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map

Author: A Löytynoja
A Löytynoja
AB Diallo
AB Diallo
AR Subramanian
B Marsden
B Paten
BP Blackburne
C Notredame
C Notredame
CB Do
CL Strope
D Feng
D Graur
D Gusfield
D Villar
DA Morrison
EA O’Brien
EL Braun
G Landan
G Landan
G Landan
G Lunter
G Lunter
GA Lunter
I Holmes
I Holmes
I Walle Van
J Felsenstein
J Felsenstein
J Kim
J Kim
J Pei
JA Eisen
JD Thompson
JD Thompson
JD Thompson
JM Chang
JS Farris
K Arnold
K Ezawa
K Ezawa
K Ezawa
K Ezawa
K Ezawa
K Katoh
K Katoh
K Katoh
Kiyoshi Ezawa
KM Wong
KS Pollard
L Chindelevitch
L Wang
LA Stebbings
LM Wallace
M Lynch
MA Suchard
MP Berger
O Gotoh
O Gotoh
O Gotoh
O Penn
O Westesson
P Markova-Raina
PP Gardner
RA Cartwright
RA Cartwright
RC Edgar
RC Edgar
RD Finn
RE Hickson
RK Bradley
S Guindon
S Kumar
S Kumar
S Nelesen
SB Needleman
SF Altschul
T Lassmann
TH Jukes
TH Ogden
U Roshan
W Fletcher
W Miller
Z Yang
Z Yang
Á Novák
Publication venue: Springer Science and Business Media LLC

Crossref