
arXiv.org e-Print Archive

UCL Discovery

Accurate reconstruction of insertion-deletion histories by statistical phylogenetics

Author: A Heger
A Löytynoja
A Löytynoja
A Siepel
A Siepel
A Siepel
AG Clark
AM Moses
Art F. Y. Poon
B Knudsen
B Paten
B Rannala
Benedict Paten
C Lee
C Strope
DG Higgins
EF Moore
FA Matsen
FR Kschischang
G Lunter
Gerton Lunter
I Holmes
I Miklós
Ian Holmes
J Felsenstein
JD Thompson
JL Thorne
JL Thorne
JS Pedersen
K Katoh
K Liu
KM Wong
KS Pollard
L Gomez-Valero
L Zhu
M Larkin
M Mohri
MA Suchard
N de la Chaux
O Kamneva
O Westesson
Oscar Westesson
P Markova-Raina
R Mills
RA Cartwright
RC Edgar
RK Bradley
RK Bradley
S Nelesen
S Saccone
S Sinha
T Beissbarth
X Qu
Z Wang
Z Yang
Z Yang
Z Yang
Z Zhang
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2012

The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes.
Comment: 28 pages, 15 figures. arXiv admin note: text overlap with arXiv:1103.434
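The abstract above builds on Felsenstein's pruning algorithm for finite substitution models. As a minimal sketch of that starting point only (not of the paper's automata-based generalization), the following computes the likelihood of a single alignment column on a small tree. The Jukes-Cantor model, the nested-list tree encoding, and all function names are assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def jc69(t):
    """Jukes-Cantor transition probabilities P(b | a, t) as a nested dict (assumed model)."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}

def partial(node):
    """Return {base: P(leaves below node | node has this base)}.

    A node is either a leaf character ("A", "C", "G", "T") or a list of
    (child, branch_length) pairs. This encoding is hypothetical.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        p = jc69(t)
        below = partial(child)
        for a in BASES:
            # Sum over the child's state, then multiply across children.
            result[a] *= sum(p[a][b] * below[b] for b in BASES)
    return result

def site_likelihood(root):
    """Likelihood of one column, with a uniform prior on the root base."""
    return sum(0.25 * v for v in partial(root).values())

# Three leaves: ((A:0.1, A:0.1):0.05, G:0.2)
tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("G", 0.2)]
print(site_likelihood(tree))
```

The recursion visits each node once, which is what makes the likelihood tractable for a fixed alignment; handling indels requires the richer machinery the abstract describes.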

Oxford University Research Archive

The Francis Crick Institute

Evolutionary distances in the twilight zone -- a rational kernel approach

Author: A Keller
A Löytynoja
A Stamatakis
B Chor
B Schölkopf
Benjamin Merget
C Cortes
C Daskalakis
CB Do
E Rivas
F Bemm
Florian Markowetz
Frank Förster
G Talavera
HH Otu
I Ulitsky
J Felsenstein
J Friedrich
J Hein
JL Thorne
JL Thorne
Jörg Schultz
KM Wong
LS Wang
M Höhl
M Höhl
M Mohri
M Mohri
M Wolf
MA Buchheim
MA Suchard
Matthias Wolf
MJ Bishop
MK Kuhner
MS Waterman
N Goldman
N Higham
R Durbin
RC Edgar
RF Doolittle
Roland F. Schwarz
S Roch
S Whelan
SR Eddy
T Mailund
T Müller
TH Ogden
V Levenshtein
W Fletcher
W Fletcher
Wayne Delport
William Fletcher
Publication venue: Public Library of Science (PLoS)
Publication date: 23/11/2010

Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.
Comment: to appear in PLoS ON

arXiv.org e-Print Archive

MDC Repository

Vitellogenin Underwent Subfunctionalization to Acquire Caste and Behavioral Specific Expression in the Harvester Ant Pogonomyrmex barbatus

Author: A Bourke
A Dolezal
A Khila
A Li
A Löytynoja
A Stamatakis
A Toth
A Tóth
AFG Bourke
C Holt
C Kent
C Lucas
C Smith
C Smith
C Smith
CM Nelson
CS Moreau
D Bates
D Cardoen
D Gordon
DS Marco Antonio
EO Wilson
G Suen
GE Robinson
GV Amdam
GV Amdam
GV Amdam
H Havukainen
H Lin
J Hancock
J Wang
JD Thompson
Jianzhi Zhang
JT Jackson
K Crailsheim
K Ingram
K Ingram
K Katoh
KJ Livak
Laurent Keller
M Andersson
M Corona
M Hawkins
M Piulachs
M Scharf
MH Haydak
Miguel Corona
N Franks
N Goto
Oksana Riba-Grognuz
P Babin
R Acher
R Bonasio
R Gadagkar
R Page
Romain A. Studer
Romain Libbrecht
S Blank
S Camazine
S Capella-Gutiérrez
S Capella-Gutiérrez
S Cardinal
S Khalil
S Lewis
S Nygaard
T Cremonez
T Fujita
T Junier
T Trenczek
W Engels
W Rutz
Y Ben-Shahar
Y Wurm
Y Wurm
Yannick Wurm
Z Yang
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2013

PMCID: PMC3744404
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

CiteSeerX

Serveur académique lausannois

Queen Mary Research Online

The Francis Crick Institute

How reliably can we predict the reliability of protein structure predictions?

Author: A Drummond
A Krogh
A Löytynoja
A Löytynoja
B Knudsen
B Redelings
Balázs Dombai
D Gusfield
D Kneller
D Metzler
DF Feng
F Ronquist
G Lunter
G Lunter
H Zhou
I Holmes
I Holmes
I Holmes
I Holmes
I Miklós
István Miklós
J Felsenstein
J Garnier
J Kececioglu
J Skolnick
JL Thorne
JL Thorne
Jotun Hein
K Karplus
K Mizuguchi
K Mizuguchi
L Wang
M Dayhoff
M Suchard
M Waterman
M Waterman
N Goldman
N Metropolis
O Gotoh
P Hogeweg
R Bradley
R Durbin
R Fleissner
S Eddy
S Wu
SB Needleman
T Hubbard
TF Smith
W Hastings
W Press
Ádám Novák
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Comparative methods have been the standard techniques for in silico protein structure prediction. The prediction is based on a multiple alignment that contains both reference sequences with known structures and the sequence whose unknown structure is predicted. Intensive research has been devoted to improving the quality of multiple alignments, since misaligned parts of the multiple alignment yield misleading predictions. However, sometimes all methods fail to predict the correct alignment, because the evolutionary signal is too weak to find the homologous parts due to the large number of mutations that separate the sequences. Results: Stochastic sequence alignment methods define a posterior distribution of possible multiple alignments. They can highlight the most likely alignment and, beyond that, give posterior probabilities for each alignment column. We made a comprehensive study on the HOMSTRAD database of structural alignments, predicting secondary structures in four different ways. We showed that alignment posterior probabilities correlate with the reliability of secondary structure predictions, though the strength of the correlation differs between protocols. The correspondence between the reliability of secondary structure predictions and alignment posterior probabilities is closest to the identity function when the secondary structure posterior probabilities are calculated from the posterior distribution of multiple alignments. The largest deviation from the identity function was obtained when predicting secondary structures from a single optimal pairwise alignment. We also showed that alignment posterior probabilities correlate with the 3D distances between the Cα atoms of amino acids in superimposed tertiary structures. Conclusion: Alignment posterior probabilities can be used to detect errors a priori in comparative models at the sequence alignment level.
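The per-column posterior probabilities mentioned above can be approximated from a set of sampled alignments by counting how often each column recurs. The sketch below is a simplified illustration under that assumption; it does not reproduce the sampler or the protocols used in the study, and the function names and toy alignments are hypothetical.

```python
from collections import Counter

def columns(alignment):
    """Yield each column as a tuple of residue indices (None marks a gap).

    `alignment` is a list of equal-length gapped strings, one per sequence.
    Using residue indices makes columns comparable across alignments.
    """
    counters = [0] * len(alignment)
    for col in zip(*alignment):
        key = []
        for i, ch in enumerate(col):
            if ch == "-":
                key.append(None)
            else:
                key.append(counters[i])
                counters[i] += 1
        yield tuple(key)

def column_posteriors(samples):
    """Estimate P(column) as the fraction of sampled alignments containing it."""
    counts = Counter()
    for aln in samples:
        counts.update(set(columns(aln)))
    n = len(samples)
    return {col: c / n for col, c in counts.items()}

# Three toy samples of a pairwise alignment of "ACG" and "ACTG".
samples = [
    ["AC-G", "ACTG"],
    ["ACG-", "ACTG"],
    ["AC-G", "ACTG"],
]
posteriors = column_posteriors(samples)
print(posteriors[(0, 0)])  # column pairing the first residues
print(posteriors[(2, 3)])  # column pairing the two G residues
```

Columns with low estimated probability are the ones where, according to the abstract, downstream structure predictions are least reliable.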

CiteSeerX

SZTAKI Publication Repository

Springer - Publisher Connector

Oxford University Research Archive

ELTE Digital Institutional Repository (EDIT)

Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment

Author: A Löytynoja
A Löytynoja
A Siepel
AB Diallo
AJ Westphal
Alexander Kawrykow
Alfred Kam
AM Waterhouse
B Knudsen
B Paten
BN Chorley
C Notredame
Chu Wu
Clarence Leung
D Sankoff
Daniel Kwak
E Korpela
Eleyine Zarour
F Khatib
Gary Roumanis
GG Loots
J Amberger
JS Pedersen
Jérôme Waldispühl
K Land
K Lindblad-Toh
L Chindelevitch
L von Ahn
L von Ahn
L Wang
Luis Sarmenta
M Blanchette
M Blanchette
M Blanchette
M Brudno
M Gouy
M Kellis
M Shirts
Mathieu Blanchette
N Bray
PA Fujita
Pawel Michalak
PC Ng
S Cooper
S De
S Schwartz
SB Needleman
T Jiang
W Fletcher
W Miller
WM Fitch
Publication venue: Public Library of Science
Publication date: 07/03/2012

BACKGROUND: Comparative genomics, or the study of the relationships of genome structure and function across different species, offers a powerful tool for studying evolution, annotating genomes, and understanding the causes of various genetic disorders. However, aligning multiple sequences of DNA, an essential intermediate step for most types of analyses, is a difficult computational task. In parallel, citizen science, an approach that takes advantage of the fact that the human brain is exquisitely tuned to solving specific types of problems, is becoming increasingly popular. There, instances of hard computational problems are dispatched to a crowd of non-expert human game players and solutions are sent back to a central server. METHODOLOGY/PRINCIPAL FINDINGS: We introduce Phylo, a human-based computing framework applying "crowd sourcing" techniques to solve the Multiple Sequence Alignment (MSA) problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with minimal prior knowledge of the biological context. We applied this strategy to improve the alignment of the promoters of disease-related genes from up to 44 vertebrate species. Since the launch in November 2010, we have received more than 350,000 solutions submitted by more than 12,000 registered users. Our results show that the submitted solutions contributed to improving the accuracy of up to 70% of the alignment blocks considered. CONCLUSIONS/SIGNIFICANCE: We demonstrate that, combined with classical algorithms, crowd computing techniques can be successfully used to help improve the accuracy of MSA. More importantly, we show that an NP-hard computational problem can be embedded in a casual game that can be easily played by people without significant scientific training. This suggests that citizen science approaches can be used to exploit the billions of "human-brain peta-flops" of computation that are spent every day playing games.
Phylo is available at: http://phylo.cs.mcgill.ca

Predicting Bevirimat resistance of HIV-1 from genotype

Author: A Kernytsky
A Löytynoja
AD Sevin
C Cole
C Notredame
CS Adamson
CS Adamson
D Heider
D Nguyen
D Wang
Daniel Hoffmann
DK Worthylake
Dominik Heider
E Frank
ER Wright
F Li
F Li
F Wilcoxon
GC Cawley
HB Shen
IH Witten
J Demsar
J Kingston
J Kyte
J Thompson
J Verheyen
J Zhou
Jens Verheyen
K Salzwedel
K Salzwedel
KC Chou
KV Baelen
L Breiman
L Nanni
M Borschbach
M Miller
M Riedmiller
MA Accola
N Beerenwinkel
N Beerenwinkel
N Margot
N Morellet
R Development Core Team
R King
R Lathrop
RC Edgar
RE Banfield
RJ Murray
S Draghici
S McCallister
S Ong
S Tzafestas
SR Eddy
T Fawcett
T Sing
W Resch
WW Cohen
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Maturation inhibitors are a new class of antiretroviral drugs. Bevirimat (BVM) was the first substance in this class of inhibitors entering clinical trials. While the inhibitory function of BVM is well established, the molecular mechanisms of action and resistance are not well understood. It is known that mutations in the regions CS p24/p2 and p2 can cause phenotypic resistance to BVM. We have investigated a set of p24/p2 sequences of HIV-1 of known phenotypic resistance to BVM to test whether BVM resistance can be predicted from sequence, and to identify possible molecular mechanisms of BVM resistance in HIV-1. Results: We used artificial neural networks and random forests with different descriptors for the prediction of BVM resistance. Random forests with hydrophobicity as descriptor performed best and classified the sequences with an area under the Receiver Operating Characteristics (ROC) curve of 0.93 ± 0.001. For the collected data we find that p2 sequence positions 369 to 376 have the highest impact on resistance, with positions 370 and 372 being particularly important. These findings are in partial agreement with other recent studies. Apart from the complex machine learning models we derived a number of simple rules that predict BVM resistance from sequence with surprising accuracy. According to computational predictions based on the data set used, cleavage sites are usually not shifted by resistance mutations. However, we found that resistance mutations could shorten and weaken the α-helix in p2, which hints at a possible resistance mechanism. Conclusions: We found that BVM resistance of HIV-1 can be predicted well from the sequence of the p2 peptide, which may prove useful for personalized therapy if maturation inhibitors reach clinical practice. Results of secondary structure analysis are compatible with a possible route to BVM resistance in which mutations weaken a six-helix bundle discovered in recent experiments, and thus ease Gag cleavage by the retroviral protease.
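To make "hydrophobicity as descriptor" concrete, the sketch below encodes a peptide as a vector of per-residue hydropathy values that a classifier such as a random forest could consume. The choice of the Kyte-Doolittle scale is an assumption (the abstract does not name the scale), and the function names and the example peptide are hypothetical, not data from the study.

```python
# Kyte-Doolittle hydropathy index (assumed scale; the abstract does not specify one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode(peptide):
    """Map each residue to its hydropathy value; one feature per position."""
    return [KYTE_DOOLITTLE[aa] for aa in peptide.upper()]

def mean_hydropathy(peptide):
    """Average hydropathy, a single summary feature for the whole peptide."""
    values = encode(peptide)
    return sum(values) / len(values)

# Hypothetical peptide, used only to show the shape of the feature vector.
example = "AILKR"
print(encode(example))
print(mean_hydropathy(example))
```

Because all sequences in such a data set cover the same aligned positions, the position-wise vector keeps each feature tied to one site, which is what allows a model to single out individual positions as important.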

Springer - Publisher Connector

Functional opsin retrogene in nocturnal moth

Author: A Löytynoja
A Morris
AC Marques
AD Briscoe
BJ Eriksson
BQ Minh
C Trapnell
F Abascal
FA Kondrashov
FA Kondrashov
H Innan
H Kaessman
H Kaessman
J Fitzgibbon
J Neitz
J Zhang
LT Nguyen
M Liegertova
MV Han
N Lartillot
P Jeffs
P Xu
P Zhang
R Feuda
R Feuda
R Nielsen
RA Velarde
S Ohno
SG Solomon
T Zemojtel
TM Keane
W Qian
Y Bai
Y Nakane
Z Yang
Z Zhang
Z Zhang
Publication venue: Springer Science and Business Media LLC
Publication date

MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

Author: A Löytynoja
B Chevreux
B Morgenstern
C Notredame
CNS Pedersen
D Huchon
D Przybylski
D Sankoff
D Zheng
DG Higgins
E Dermitzakis
Emmanuel J. P. Douzery
F Abascal
F Delsuc
Frédéric Delsuc
H Philippe
H Zhao
J Hein
J Kececioglu
J Kececioglu
J Raes
JD Thompson
K Katoh
KM Wong
L Arvestad
L Salmela
M Dayhoff
M Gouy
M Kircher
M Margulies
M Suyama
MT Gilbert
N Galtier
OR Bininda-Emonds
P Sneath
PJ Farabaugh
R Wernersson
RC Edgar
RC Edgar
RK Bradley
RR Stocsits
RW Meredith
S Henikoff
S Needleman
SF Altschul
SF Altschul
SS Steiger
Sébastien Harispe
T Smith
TA Demere
TJ Hubbard
TJ Wheeler
V Ranwez
Vincent Ranwez
William J. Murphy
X Guan
X Huang
Y Van de Peer
Publication venue: Public Library of Science
Publication date: 01/01/2011

Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment
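The second pitfall described above can be seen in a few lines: translating in a fixed reading frame means one inserted nucleotide changes every downstream codon. This is a minimal sketch using the standard genetic code; the example sequences and function name are hypothetical, and the code illustrates the problem only, not the MACSE algorithm.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate in frame 0 from start to end, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

original = "ATGGCCAAGTTGGAC"            # hypothetical coding sequence
shifted = original[:3] + "T" + original[3:]  # one extra nucleotide after the first codon

print(translate(original))  # MAKLD
print(translate(shifted))   # MCQVG
```

Only the first codon survives the insertion; every later amino acid differs, so an alignment built on these translations would misplace the remainder of the sequence.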

Tidying Up International Nucleotide Sequence Databases: Ecological, Geographical and Sequence Quality Annotation of ITS Sequences of Mycorrhizal Fungi

Sequence analysis of the ribosomal RNA operon, particularly the internal transcribed spacer (ITS) region, provides a powerful tool for identification of mycorrhizal fungi. The sequence data deposited in the International Nucleotide Sequence Databases (INSD) are, however, unfiltered for quality and are often poorly annotated with metadata. To detect chimeric and low-quality sequences and assign the ectomycorrhizal fungi to phylogenetic lineages, fungal ITS sequences were downloaded from INSD, aligned within family-level groups, and examined through phylogenetic analyses and BLAST searches. By combining the fungal sequence database UNITE and the annotation and search tool PlutoF, we also added metadata from the literature to these accessions. Altogether 35,632 sequences belonged to mycorrhizal fungi or originated from ericoid and orchid mycorrhizal roots. Of these sequences, 677 were considered chimeric and 2,174 of low read quality. Information detailing country of collection, geographical coordinates, interacting taxon and isolation source were supplemented to cover 78.0%, 33.0%, 41.7% and 96.4% of the sequences, respectively. These annotated sequences are publicly available via UNITE (http://unite.ut.ee/) for downstream biogeographic, ecological and taxonomic analyses. In European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/), the annotated sequences have a special link-out to UNITE. We intend to expand the data annotation to additional genes and all taxonomic groups and functional guilds of fungi

Aberdeen University Research