
Caltech Authors

Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

Author: A Dress
A Godzik
A Löytynoja
A Löytynoja
A Novák
A Novák
A Sali
A Siepel
A Tramontano
Adrienn Szabó
AS Schwartz
AS Schwartz
B Dwivedi
B Knudsen
B Larget
B Misof
B Schwikowski
BD Redelings
BD Redelings
BJM Webb
BP Blackburne
C Dessimoz
C Notredame
C Notredame
CB Do
CJ Challis
D Altschuh
D Chivian
D DeBlasio
D Lupyan
D Metzler
D Metzler
D Robinson
DA Morrison
DF Feng
E Levy Karin
G Jordan
G Landan
G Lunter
G Lunter
G Lunter
G Raghava
G Talavera
GA Churchill
GA Lunter
Hall B G
HT Mevissen
I Holmes
I Miklós
I Miklós
IL Dryden
IM Wallace
István Miklós
J Castresana
J Felsenstein
J Gatesy
J Hein
J Kim
J Zhu
JA Lake
JD Thompson
JD Thompson
JL Thorne
JL Thorne
JL Thorne
JL Thorne
Joseph L Herman
Jotun Hein
K Bucka-Lassen
K Liu
K Liu
KM Wong
L Wang
L Yu
LE Carvalho
LS Wang
M Hamada
M Hamada
M Hamada
M Höhl
M Vingron
M Vingron
M Wu
M Zuker
MA Suchard
MJ Wise
MO Dayhoff
MP Simmons
MS Waterman
MSY Lee
O Gotoh
O Penn
O Penn
O Penn
P Ajawatanawong
P Arunapuram
P Collingridge
PJ Green
PJ Green
PP Gardner
R Durbin
R Satija
R Satija
R Schwarzenbacher
RA Cartwright
RC Edgar
RJ Dickson
RJ Dickson
RK Bradley
Rune Lyngsø
S Capella-Gutiérrez
S Karlin
S Miyazawa
S Needleman
S Sinha
Silla-Martínez Capella-Gutiérrez S
SME Sahraeian
TA Hopf
TH Ogden
TL Blundell
U Roshan
V Ahola
W Fletcher
WC Wheeler
Y Liu
Y Ruffieux
Ádám Novák
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment.
Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased.
Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign.
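
The DAG construction summarized in this abstract can be illustrated with a minimal Python sketch. It is not the WeaveAlign implementation: the column encoding (residues consumed so far in each sequence, plus a residue/gap flag), the function names `columns`, `build_dag` and `count_paths`, and the toy sequences are all assumptions made for illustration. Columns shared between samples become shared nodes, column probabilities are estimated as empirical frequencies, and counting paths shows how the graph encodes more alignments than were sampled.

```python
from collections import Counter, defaultdict


def columns(alignment):
    """Convert an alignment (equal-length gapped strings) into column keys.

    Assumed encoding: for each sequence, (residues consumed before this
    column, whether the column holds a residue). Total consumption grows
    with every column, so the resulting graph cannot contain cycles.
    """
    consumed = [0] * len(alignment)
    keys = []
    for col in zip(*alignment):
        keys.append(tuple((consumed[i], ch != "-") for i, ch in enumerate(col)))
        for i, ch in enumerate(col):
            if ch != "-":
                consumed[i] += 1
    return keys


def build_dag(samples):
    """Return (edges, freqs): adjacency sets and empirical column frequencies."""
    counts = Counter()
    edges = defaultdict(set)
    for aln in samples:
        path = ["START"] + columns(aln) + ["END"]
        counts.update(path[1:-1])
        for a, b in zip(path, path[1:]):
            edges[a].add(b)
    freqs = {col: n / len(samples) for col, n in counts.items()}
    return edges, freqs


def count_paths(edges):
    """Count START-to-END paths, i.e. alignments encoded by the DAG."""
    memo = {}

    def walk(node):
        if node == "END":
            return 1
        if node not in memo:
            memo[node] = sum(walk(nxt) for nxt in edges[node])
        return memo[node]

    return walk("START")


# Two hypothetical sampled alignments of the toy sequences ACGTT and AGT.
samples = [
    ["ACGTT", "A-GT-"],
    ["ACGTT", "-AG-T"],
]
edges, freqs = build_dag(samples)
print(count_paths(edges))  # 4 paths from only 2 sampled alignments
```

Both samples share the G/G column, so either prefix can be combined with either suffix; this is the sense in which the effective sample size exceeds the number of sampled MSAs.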

SZTAKI Publication Repository

Springer - Publisher Connector

Oxford University Research Archive

Probabilistic Phylogenetic Inference with Insertions and Deletions

Author: A Pang
A Siepel
A Siepel
A Stamatakis
AD Smith
B Boussau
B Knudsen
B Knudsen
B Knudsen
B Larget
B Mau
B Mau
B Qian
B Qian
B Qian
B Rannala
C Kosiol
C Moler
D Metzler
D Simon
David Haussler
DF Robinson
DG Hwang
DL Swofford
E Rivas
Elena Rivas
F Ronquist
G Lunter
G Lunter
G Lunter
G McGuire
GA Churchill
GJ Mitchison
GJ Mitchison
I Holmes
I Holmes
I Holmes
I Miklós
I Miklós
J Adachi
J Felsenstein
J Felsenstein
J Felsenstein
J Felsenstein
J Hein
J Hein
J Hein
J Kim
J Stoye
J Wang
JD McAuliffe
JJ Cannone
JL Thorne
JL Thorne
JL Thorne
JP Huelsenbeck
JS Pedersen
L Chindelevitch
L Coin
M Blanchette
M Dayhoff
M Gribskov
M Hasegawa
M Kimura
M Steel
MJ Bishop
MK Kuhner
MS Chang
N Goldman
P Liò
PD Keightley
R Durbin
R Fleissner
S Guindon
S Karlin
S Tavaré
S Whelan
Sean R. Eddy
SV Muse
TH Jukes
W Cai
Z Yang
Z Yang
Z Yang
Z Yang
Z Yang
Z Yang
Publication venue: Public Library of Science
Publication date: 01/01/2008

A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new “concordance test” benchmark on real ribosomal RNA alignments, we show that the extended program dnamlε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
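
To make the idea of peeling over a gap-augmented alphabet concrete, here is a minimal Python sketch of the Felsenstein peeling recursion on a five-state alphabet (A, C, G, T and the gap character). It is not the dnamlε model: the abstract describes a non-reversible birth-death process, whereas this sketch substitutes a toy equal-rates matrix whose transition probabilities have a closed form. The tree encoding, the parameter `alpha` and the uniform root distribution are likewise assumptions.

```python
import math

STATES = "ACGT-"  # nucleotides plus the gap character as a fifth state


def transition_prob(i, j, t, alpha=1.0):
    """P(state j after time t | state i) for an equal-rates k-state chain.

    Stand-in for the paper's rate matrix: every off-diagonal rate is alpha.
    """
    k = len(STATES)
    decay = math.exp(-k * alpha * t)
    if i == j:
        return 1 / k + (k - 1) / k * decay
    return 1 / k - decay / k


def partial_likelihoods(node):
    """Peeling step: L[s] = P(leaf characters below node | node is in state s).

    A leaf is a one-character string; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if s == node else 0.0 for s in STATES]
    result = [1.0] * len(STATES)
    for child, t in node:
        child_l = partial_likelihoods(child)
        for i in range(len(STATES)):
            result[i] *= sum(
                transition_prob(i, j, t) * child_l[j] for j in range(len(STATES))
            )
    return result


def column_likelihood(tree):
    """Likelihood of one alignment column, assuming a uniform root distribution."""
    return sum(p / len(STATES) for p in partial_likelihoods(tree))


# One column with a gap in the second sequence, on a hypothetical three-leaf tree.
tree = [([("A", 0.1), ("-", 0.1)], 0.05), ("A", 0.2)]
print(column_likelihood(tree))
```

Because the gap is treated as an ordinary state, the recursion visits each node once per column, which is the efficiency property the abstract refers to.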

CiteSeerX

Public Library of Science (PLOS)

A codon substitution model that incorporates the effect of the GC contents, the gene density and the density of CpG islands of human chromosomes

Author: A Varriale
AL Hughes
AP Bird
E Scarano
F Antequera
F Vogel
G Lunter
GA Huttley
J Felsenstein
J Sullivan
J Taylor
JC Walser
JL Leroy
K Katoh
K Misawa
K Misawa
K Misawa
K Misawa
Kazuharu Misawa
KJ Fryxell
KJ Fryxell
M Krawczak
M Nei
MA Larkin
R Development Core Team
R Grantham
RA Gibbs
S Horai
S Kaneko
S Tyekucheva
SF Altschul
T Miyata
TH Jukes
WH Li
Y Suzuki
Z Yang
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Developing a model for codon substitutions is essential for the analysis of protein sequences. Recent studies of mutation rates in non-coding regions have shown that CpG mutation rates in the human genome are negatively correlated with the local GC content and with the densities of functional elements. This study aimed to understand the effect of genomic features, namely GC content, gene density, and frequency of CpG islands, on the rates of codon substitution in human chromosomes.
Results: Codon substitution rates of CpG to TpG mutations, TpG to CpG mutations, and non-CpG transitions and transversions in humans were estimated by comparing the coding regions of thousands of human and chimpanzee genes and inferring their ancestral sequences by using macaque genes as the outgroup. Since the genomic features depend on one another, partial regression coefficients of these features were obtained.
Conclusion: The substitution rates of codons depend on gene densities of the chromosomes. Transcription-associated mutation is one such pressure. On the basis of these results, a model of codon substitutions that incorporates the effect of genomic features on codon substitution in human chromosomes was developed.

Springer - Publisher Connector

Relationship between amino acid composition and gene expression in the mouse genome

Background: Codon bias refers to differences in the frequencies of synonymous codons among genes. In many organisms, natural selection is considered to be a cause of codon bias because codon usage in highly expressed genes is biased toward optimal codons. Methods have previously been developed to predict the expression level of genes from their nucleotide sequences, based on the observation that synonymous codon usage shows an overall bias toward a few codons called major codons. However, the relationship between codon bias and gene expression level, as proposed by the translation-selection model, is less evident in mammals.
Findings: We investigated the correlations between the expression levels of 1,182 mouse genes and amino acid composition, as well as between gene expression and codon preference. We found that a weak but significant correlation exists between gene expression levels and amino acid composition in mouse. In total, less than 10% of the variation in expression levels is explained by amino acid composition. We found the effect of codon preference on gene expression was weaker than the effect of amino acid composition, because no significant correlations were observed with respect to codon preference.
Conclusion: These results suggest that it is difficult to predict expression level from amino acid composition or from codon bias in mouse.

Springer - Publisher Connector

Whole-chromosome hitchhiking driven by a male-killing endosymbiont.

Author: A Bankevich
A Duplouy
A Mackintosh
A Martin
A Mazo-Vargas
A Rambaut
AE Wright
AF Rives
AJ Mongue
B. Charlesworth
CA Clarke
CC Chang
CR Lee
D Bachtrog
D Bryant
D Charlesworth
D. Charlesworth
DAS Smith
DAS Smith
DAS Smith
DAS Smith
DAS Smith
DAS Smith
DAS Smith
DAS Smith
DAS Smith
DF Owen
DH Huson
E Idris
EA Hornett
EL Westerman
EM Leffler
FA Simão
FM Jiggins
FM Jiggins
FM Jiggins
G Lunter
G Lushai
GA Van der Auwera
GDD Hurst
GDD Hurst
Heliconius Genome Consortium T.
I Pala
IJ Gordon
IJ Gordon
J Kitano
JA Balbuena
JC Fay
JK Herren
JM Coughlan
JW Davey
JW Davey
K Kunte
KK Dasmahapatra
L Wilfert
L Zhang
LP Pryszcz
LZ Carabajal Paladino
M a DePristo
M Kirkpatrick
M Steinemann
MF Palopoli
MF Richardson
MJ Thompson
N Dierckxsens
NJ Nadeau
O Delaneau
O Delaneau
P Nguyen
PD Keightley
PJ Wittkopp
R Bouckaert
R Bürger
R Overbeek
RD Reed
RF Guerrero
RK Aziz
RR Bouckaert
RR Bracewell
S Guindon
S Huang
S Smith D a
S Zhan
SH Martin
SM Van Belleghem
V Ahola
W Traut
WR Rice
Publication venue: PLoS Biol
Publication date: 01/02/2020

Neo-sex chromosomes are found in many taxa, but the forces driving their emergence and spread are poorly understood. The female-specific neo-W chromosome of the African monarch (or queen) butterfly Danaus chrysippus presents an intriguing case study because it is restricted to a single 'contact zone' population, involves a putative colour patterning supergene, and co-occurs with infection by the male-killing endosymbiont Spiroplasma. We investigated the origin and evolution of this system using whole genome sequencing. We first identify the 'BC supergene', a broad region of suppressed recombination across nearly half a chromosome, which links two colour patterning loci. Association analysis suggests that the genes yellow and arrow in this region control the forewing colour pattern differences between D. chrysippus subspecies. We then show that the same chromosome has recently formed a neo-W that has spread through the contact zone within approximately 2,200 years. We also assembled the genome of the male-killing Spiroplasma, and find that it shows perfect genealogical congruence with the neo-W, suggesting that the neo-W has hitchhiked to high frequency as the male-killer has spread through the population. The complete absence of female crossing-over in the Lepidoptera causes whole-chromosome hitchhiking of a single neo-W haplotype, carrying a single allele of the BC supergene and dragging multiple non-synonymous mutations to high frequency. This has created a population of infected females that all carry the same recessive colour patterning allele, making the phenotypes of each successive generation highly dependent on uninfected male immigrants. Our findings show how hitchhiking can occur between the physically unlinked genomes of host and endosymbiont, with dramatic consequences.

Edinburgh Research Explorer

Open Research Exeter

Enlighten

Apollo (Cambridge)

Accelerated Evolution of the Prdm9 Speciation Gene across Diverse Metazoan Taxa

Author: A Daniel
AT Hamilton
BA Sullivan
CC Laurie
Chris P. Ponting
CJ Conroy
CM Wade
CR Darwin
CS Carlson
CT Ting
CT Ting
D Schmidt
D Vermaak
DA Hinds
DC Presgraves
DE Perez
E Mayr
F Pardo-Manuel de Villena
F Pardo-Manuel de Villena
F Tajima
GA Dover
Gerton Lunter
GP Smith
H Santos-Rosa
HA Orr
Harmit S. Malik
HG Yu
HJ Muller
HR Lee
HS Malik
HS Malik
I Fumasoni
I Letunic
J Chaline
J Coyne
JJ Bayes
JM Good
JM Good
Joshua J. Bayes
JP Masly
K Hayashi
K Sawamura
Kevin C. Roach
L Fishman
L Fishman
L Goodstadt
LD Hurst
Leo Goodstadt
M Buhler
M Nakano
M Nei
M Nei
M Shannon
M Vyskocilova
M Zofall
MG Schueler
Michael W. Nachman
MK Rudd
N Phadnis
NH Putnam
Nitin Phadnis
NJ Brideau
O Mihola
PC Sabeti
Peter L. Oliver
R Storchova
RC Edgar
RH Devlin
RO Emerson
S Guindon
S Henikoff
S Henikoff
S Irie
S Steppan
S Sun
SA Frank
Scott A. Beatson
SI Grewal
SR Eddy
T Massingham
T Miyamoto
T Ohta
TA Volpe
W Zhai
XJ Sun
Y Choo
Y Choo
Y Choo
Y Tao
Z Birtle
Z Trachtulec
Z Yang
Zoë Birtle
Publication venue: Public Library of Science
Publication date: 01/01/2009

The onset of prezygotic and postzygotic barriers to gene flow between populations is a hallmark of speciation. One of the earliest postzygotic isolating barriers to arise between incipient species is the sterility of the heterogametic sex in interspecies hybrids. Four genes that underlie hybrid sterility have been identified in animals: Odysseus, JYalpha, and Overdrive in Drosophila and Prdm9 (Meisetz) in mice. Mouse Prdm9 encodes a protein with a KRAB motif, a histone methyltransferase domain and several zinc fingers. The difference of a single zinc finger distinguishes Prdm9 alleles that cause hybrid sterility from those that do not. We find that concerted evolution and positive selection have rapidly altered the number and sequence of Prdm9 zinc fingers across 13 rodent genomes. The patterns of positive selection in Prdm9 zinc fingers imply that rapid evolution has acted on the interface between the Prdm9 protein and the DNA sequences to which it binds. Similar patterns are apparent for Prdm9 zinc fingers for diverse metazoans, including primates. Indeed, allelic variation at the DNA-binding positions of human PRDM9 zinc fingers shows significant association with decreased risk of infertility. Prdm9 thus plays a role in determining male sterility both between species (mouse) and within species (human). The recurrent episodes of positive selection acting on Prdm9 suggest that the DNA sequences to which it binds must also be evolving rapidly. Our findings do not identify the nature of the underlying DNA sequences, but argue against the proposed role of Prdm9 as an essential transcription factor in mouse meiosis. We propose a hypothetical model in which incompatibilities between Prdm9-binding specificity and satellite DNAs provide the molecular basis for Prdm9-mediated hybrid sterility. We suggest that Prdm9 should be investigated as a candidate gene in other instances of hybrid sterility in metazoans.

CiteSeerX

Public Library of Science (PLOS)

Oxford University Research Archive

University of Queensland eSpace

Sequencing and de novo assembly of 150 genomes from Denmark as a population reference

Author: A Helgason
A Kong
A Telenti
AD Børglum
Ali Syed
Anders D. Børglum
Anders E. Halager
Anders Krogh
Bent Petersen
BJ Stucky
Chen Ye
Christian N. S. Pedersen
Christian Theil Have
Christina M. Hultman
David Westergaard
DF Gudbjartsson
Esben Flindt
Francesco Lescai
G Lunter
GA Van der Auwera
GD Poznik
GM Cooper
H Cao
H Eiberg
H Kupfermann
H Li
H Li
H Li
Hans Eiberg
Hongzhi Cao
J Huddleston
Jacob Malte Jensen
Jakob Grove
Jette Bork-Jensen
Jihua Sun
Johan van Beusekom
Jonas Andreas Sibbesen
Jose M. G. Izarzugaza
JS Seo
JT Simpson
Jun Wang
Junhua Rao
K Katoh
K Tamura
Karsten Kristiansen
Kirstine Belling
KM Steinberg
L Paternoster
Lars Bolund
Lasse Maretty
Laurits Skov
LC Francioli
M Lek
M Nothnagel
M Oven
M Pendleton
MA Eberle
Maria Luisa Matey-Hernandez
Marie Grosjean
MC Frith
Mikkel Heide Schierup
MR Hoehe
Ning Li
Ole Lund
Ole Mors
Oluf Pedersen
P Rice
Palle Villesen
Patrick Sullivan
Peter Løngren
PH Sudmant
PL Auer
R Hubley
R Luo
Rachita Yadav
Ramneek Gupta
Ruiqi Xu
Rune M. Friborg
S Besenbacher
S Deorowicz
S Gnerre
S Liu
S Ripke
SF Altschul
Shengting Li
Shujia Huang
Simon Rasmussen
Siyang Liu
SM Kiełbasa
Stephanie Le Hellard
Søren Besenbacher
Søren Brunak
T Espeseth
T Magoč
Thomas D. Als
Thomas Espeseth
Thomas Mailund
Thomas Sicheritz-Pontén
Thorkild I. A. Sørensen
Torben Hansen
VA Schneider
Weijian Ye
WP Kloosterman
WS Wong
Xiaosen Guo
Xun Xu
Yuqi Chang
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

Hundreds of thousands of human genomes are now being sequenced to characterize genetic variation and use this information to augment association mapping studies of complex disorders and other phenotypic traits. Genetic variation is identified mainly by mapping short reads to the reference genome or by performing local assembly. However, these approaches are biased against discovery of structural variants and variation in the more complex parts of the genome. Hence, large-scale de novo assembly is needed. Here we show that it is possible to construct excellent de novo assemblies from high-coverage sequencing with mate-pair libraries extending up to 20 kilobases. We report de novo assemblies of 150 individuals (50 trios) from the GenomeDenmark project. The quality of these assemblies is similar to those obtained using the more expensive long-read technology. We use the assemblies to identify a rich set of structural variants including many novel insertions and demonstrate how this variant catalogue enables further deciphering of known association mapping signals. We leverage the assemblies to provide 100 completely resolved major histocompatibility complex haplotypes and to resolve major parts of the Y chromosome. Our study provides a regional reference genome that we expect will improve the power of future association mapping studies and hence pave the way for precision medicine initiatives, which now are being launched in many countries including Denmark.

Copenhagen University Research Information System

Carolina Digital Repository

Online Research Database In Technology

Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map

Author: A Löytynoja
A Löytynoja
AB Diallo
AB Diallo
AR Subramanian
B Marsden
B Paten
BP Blackburne
C Notredame
C Notredame
CB Do
CL Strope
D Feng
D Graur
D Gusfield
D Villar
DA Morrison
EA O’Brien
EL Braun
G Landan
G Landan
G Landan
G Lunter
G Lunter
GA Lunter
I Holmes
I Holmes
I Walle Van
J Felsenstein
J Felsenstein
J Kim
J Kim
J Pei
JA Eisen
JD Thompson
JD Thompson
JD Thompson
JM Chang
JS Farris
K Arnold
K Ezawa
K Ezawa
K Ezawa
K Ezawa
K Ezawa
K Katoh
K Katoh
K Katoh
Kiyoshi Ezawa
KM Wong
KS Pollard
L Chindelevitch
L Wang
LA Stebbings
LM Wallace
M Lynch
MA Suchard
MP Berger
O Gotoh
O Gotoh
O Gotoh
O Penn
O Westesson
P Markova-Raina
PP Gardner
RA Cartwright
RA Cartwright
RC Edgar
RC Edgar
RD Finn
RE Hickson
RK Bradley
S Guindon
S Kumar
S Kumar
S Nelesen
SB Needleman
SF Altschul
T Lassmann
TH Jukes
TH Ogden
U Roshan
W Fletcher
W Miller
Z Yang
Z Yang
Á Novák
Publication venue: Springer Science and Business Media LLC

Contributions of protein-coding and regulatory change to adaptive molecular evolution in murid rodents

Author: A Cox
A Doherty
A Eyre-Walker
A Eyre-Walker
A Kousathanas
A Kousathanas
A Siepel
AR Boyko
Athanasios Kousathanas
B Charlesworth
Bettina Harr
BM Peter
Bret A. Payseur
CB Lowe
D Graur
Daniel L. Halligan
David J. Adams
DG Torgerson
DL Halligan
DL Halligan
FC Jones
G Lunter
G McVicker
GA Wray
H Li
H Li
HE Hoekstra
JF Baines
JH McDonald
JJ Cai
JK Pritchard
JV Chamary
K Lindblad-Toh
L Eőry
LD Ward
Lél Eöry
M Carnerio
M Kimura
M Nordborg
M Phifer-Rixey
M Przeworski
M-C King
MI Jensen-Seaman
P Andolfatto
P Andolfatto
PD Keightley
PD Keightley
PD Keightley
PD Keightley
Peter D. Keightley
PW Messer
RD Hernandez
Rob W. Ness
S Sattath
SB Carroll
T Salcedo
THE Wiehe
Thomas M. Keane
WF Doolittle
Y Shen
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2013

The contribution of regulatory versus protein change to adaptive evolution has long been controversial. In principle, the rate and strength of adaptation within functional genetic elements can be quantified on the basis of an excess of nucleotide substitutions between species compared to the neutral expectation or from effects of recent substitutions on nucleotide diversity at linked sites. Here, we infer the nature of selective forces acting in proteins, their UTRs and conserved noncoding elements (CNEs) using genome-wide patterns of diversity in wild house mice and divergence to related species. By applying an extension of the McDonald-Kreitman test, we infer that adaptive substitutions are widespread in protein-coding genes, UTRs and CNEs, and we estimate that there are at least four times as many adaptive substitutions in CNEs and UTRs as in proteins. We observe pronounced reductions in mean diversity around nonsynonymous sites (whether or not they have experienced a recent substitution). This can be explained by selection on multiple, linked CNEs and exons. We also observe substantial dips in mean diversity (after controlling for divergence) around protein-coding exons and CNEs, which can also be explained by the combined effects of many linked exons and CNEs. A model of background selection (BGS) can adequately explain the reduction in mean diversity observed around CNEs. However, BGS fails to explain the wide reductions in mean diversity surrounding exons (encompassing ~100 Kb, on average), implying that there is a substantial role for adaptation within exons or closely linked sites. The wide dips in diversity around exons, which are hard to explain by BGS, suggest that the fitness effects of adaptive amino acid substitutions could be substantially larger than substitutions in CNEs. We conclude that although there appear to be many more adaptive noncoding changes, substitutions in proteins may dominate phenotypic evolution.
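
For orientation, the basic McDonald-Kreitman estimate of the proportion of adaptive substitutions can be sketched in a few lines of Python. The abstract states that the study applies an extension of the test, so this is the textbook form only, not the authors' method. The function name `mk_alpha` and the example counts are hypothetical.

```python
def mk_alpha(dn, ds, pn, ps):
    """Basic McDonald-Kreitman estimate of the fraction of adaptive substitutions.

    dn, ds: fixed differences between species at selected and neutral sites.
    pn, ps: polymorphisms within species at selected and neutral sites.
    alpha = 1 - (ds * pn) / (dn * ps)
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero")
    return 1.0 - (ds * pn) / (dn * ps)


# Hypothetical counts, not data from the study.
print(mk_alpha(dn=60, ds=100, pn=30, ps=100))  # 0.5
```

A value near zero means divergence at the selected sites is about what polymorphism predicts under neutrality; positive values indicate an excess of substitutions attributed to adaptation.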