FigShare

PhyloSim - Monte Carlo simulation of sequence evolution in the R statistical computing environment

Author: A Löytynoja
A Pang
A Rambaut
A Varadarajan
Botond Sipos
D Tian
DT Gillespie
E Paradis
Gregory E Jordan
H Bengtson
H Philippe
JL Oliver
JL Thorne
JP Huelsenbeck
KP Schliep
LJ Harmon
M Blanchette
M Kimura
MS Rosenberg
N de la Chaux
N Goldman
Nick Goldman
RA Cartwright
S Whelan
TG Clark
Tim Massingham
W Fletcher
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2011

Abstract

Background: The Monte Carlo simulation of sequence evolution is routinely used to assess the performance of phylogenetic inference methods and sequence alignment algorithms. Progress in the field of molecular evolution fuels the need for more realistic and hence more complex simulations, adapted to particular situations, yet current software makes unreasonable assumptions such as homogeneous substitution dynamics or a uniform distribution of indels across the simulated sequences. This calls for an extensible simulation framework written in a high-level functional language, offering new functionality and making it easy to incorporate further complexity.

Results: PhyloSim is an extensible framework for the Monte Carlo simulation of sequence evolution, written in R, using the Gillespie algorithm to integrate the actions of many concurrent processes such as substitutions, insertions and deletions. Uniquely among sequence simulation tools, PhyloSim can simulate arbitrarily complex patterns of rate variation and multiple indel processes, and allows for the incorporation of selective constraints on indel events. User-defined complex patterns of mutation and selection can be easily integrated into simulations, allowing PhyloSim to be adapted to specific needs.

Conclusions: Close integration with R and the wide range of features implemented offer unmatched flexibility, making it possible to simulate sequence evolution under a wide range of realistic settings. We believe that PhyloSim will be useful to future studies involving simulated alignments.
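The abstract above mentions the Gillespie algorithm as the way several concurrent processes are combined. The following is a minimal Python sketch of that general idea, not PhyloSim's R implementation: the function name `gillespie_events` and the rate values are hypothetical, and the rates are assumed constant over time, whereas a real sequence simulator would update them as the sequence changes.

```python
import random


def gillespie_events(rates, total_time, rng):
    """Sample (time, process) events for competing processes.

    rates: dict mapping a process name to its (assumed constant) rate.
    total_time: length of the simulated interval, e.g. a branch length.
    rng: a random.Random instance, passed in for reproducibility.
    """
    names = list(rates)
    weights = [rates[name] for name in names]
    total_rate = sum(weights)
    events = []
    if total_rate <= 0:
        return events  # nothing can happen if every rate is zero

    t = 0.0
    while True:
        # Waiting time to the next event of any kind is exponential
        # with rate equal to the sum of all process rates.
        t += rng.expovariate(total_rate)
        if t > total_time:
            break
        # The process that fires is chosen proportionally to its rate.
        name = rng.choices(names, weights=weights)[0]
        events.append((t, name))
    return events


# Hypothetical rates, chosen only for illustration.
example_rates = {"substitution": 1.0, "insertion": 0.1, "deletion": 0.1}
example_events = gillespie_events(example_rates, 10.0, random.Random(1))
```

Each returned tuple gives the time of an event and the process that produced it. Applying the event to an actual sequence (choosing a site, drawing an indel length) is left out of this sketch.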

Springer - Publisher Connector

Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories

Author: Arenas Busto Miguel
Posada González David
Publication venue: Xenómica e Biomedicina
Publication date: 28/12/2023

Genomic evolution can be highly heterogeneous. Here, we introduce a new framework to simulate genome-wide sequence evolution under a variety of substitution models that may change along the genome and the phylogeny, following complex multispecies coalescent histories that can include recombination, demographics, longitudinal sampling, population subdivision/species history, and migration. A key aspect of our simulation strategy is that the heterogeneity of the whole evolutionary process can be parameterized according to statistical prior distributions specified by the user. We used this framework to carry out a study of the impact of variable codon frequencies across genomic regions on the estimation of the genome-wide nonsynonymous/synonymous ratio. We found that both variable codon frequencies across genes and rate variation among sites and regions can lead to severe underestimation of the global dN/dS values. The program SGWE (Simulation of Genome-Wide Evolution) is freely available from http://code.google.com/p/sgwe-project/, including extensive documentation and detailed examples.

Ministerio de Ciencia e Innovación | Ref. JCI-2011-1045

Investigo

Simulation of Molecular Data under Diverse Evolutionary Scenarios

Author: A Carvajal-Rodriguez
A Luo
A Pang
A Rambaut
A Varadarajan
B Padhukasahasram
B Peng
B Sipos
BG Hall
BK Epperson
CC Spencer
CL Strope
CN Anderson
D Posada
DA Dalquen
DJ Wilson
DM Raup
EG DeChaine
F Calafell
Fran Lewitter
G Ewing
G Laval
GA McVean
J Novembre
J Stoye
J Sullivan
J Wakeley
JA Coombs
JL Kelley
K Bozek
L Arbiza
L Excoffier
LL Cavalli-Sforza
M Anisimova
M Arenas
M Navascues
M Nordborg
M Slatkin
M Wang
MA Beaumont
MH Schierup
Miguel Arenas
MK Kuhner
MS Rosenberg
N Ray
NC Grassly
O François
O Westesson
P Lemey
P Marjoram
R Ihaka
RA Cartwright
RD Hernandez
RG Beiko
RM Durbin
RR Hudson
S Biswas
S Guindon
S Hoban
S Kryazhimskiy
S Neuenschwander
SE Ramos-Onsins
SL Peck
T Gesell
TC Jones
U Bastolla
W Fletcher
WG Hill
Y Liu
Z Yang
Publication venue: Public Library of Science
Publication date: 01/05/2012

St Andrews Research Repository

Estimating empirical codon hidden Markov models

Author: De Maio Nicola
Holmes Ian
Kosiol Carolin
Schlötterer Christian
Publication venue: Oxford University Press (OUP)
Publication date: 09/02/2017

Empirical codon models (ECMs) estimated from a large number of globular protein families outperformed mechanistic codon models in their description of the general process of protein evolution. Among other factors, ECMs implicitly model the influence of amino acid properties and multiple nucleotide substitutions (MNS). However, the estimation of ECMs requires large quantities of data, and until recently, only a few suitable data sets were available. Here, we take advantage of several new Drosophila species genomes to estimate codon models from genome-wide data. The availability of large numbers of genomes over varying phylogenetic depths in the Drosophila genus allows us to explore various divergence levels. In consequence, we can use these data to determine the appropriate level of divergence for the estimation of ECMs, avoiding overestimation of MNS rates caused by saturation. To account for variation in evolutionary rates along the genome, we develop new empirical codon hidden Markov models (ecHMMs). These models significantly outperform previous ones with respect to maximum likelihood values, suggesting that they provide a better fit to the evolutionary process. Using ECMs and ecHMMs derived from genome-wide data sets, we devise new likelihood ratio tests (LRTs) of positive selection. We found classical LRTs very sensitive to the presence of MNSs, showing high false-positive rates, especially with small phylogenies. The new LRTs are more conservative than the classical ones, having acceptable false-positive rates and reduced power.
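For readers unfamiliar with likelihood ratio tests, the sketch below shows the generic calculation for two nested models in Python. It is not the test devised in the paper: the function name `lrt_pvalue_df1` is hypothetical, and the chi-square reference distribution with one degree of freedom is an assumption made here for illustration (the abstract does not state the null distribution used).

```python
import math


def lrt_pvalue_df1(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    lnl_null, lnl_alt: maximized log-likelihoods of the null and the
    alternative model. Assumes (for illustration only) that the statistic
    follows a chi-square distribution with 1 degree of freedom under the null.
    Returns (statistic, p_value).
    """
    # The statistic is twice the log-likelihood gain of the richer model;
    # small negative values from optimizer noise are clamped to zero.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of chi-square(1): P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, chosen only for illustration.
stat, p_value = lrt_pvalue_df1(-1001.9207294, -1000.0)
```

A false positive in this setting is a data set simulated without positive selection for which the p-value still falls below the chosen significance threshold.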

Fast Statistical Alignment

We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment (FSA) program is based on pair hidden Markov models, which approximate an insertion/deletion process on a tree, and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments that are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only with alignment programs that use computationally expensive Markov chain Monte Carlo approaches), yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation, which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.

Public Library of Science (PLOS)

Caltech Authors

Alignment and Prediction of cis-Regulatory Modules Based on a Probabilistic Model of Evolution

Author: A Bais
A Halpern
A Lifanov
A Moses
A Siepel
B Berman
B Knudsen
C Bergman
C Dewey
D Halligan
D Karolchik
D Pollard
D Raijman
E Berezikov
E Birney
E Blackwood
E Davidson
E Dermitzakis
F Gao
G Lunter
G Stormo
G Wray
I Holmes
I Miklos
J Berg
J Stone
J Thorne
J Warner
K Wong
M Brudno
M Frith
M Hasegawa
M Ludwig
M Noyes
O Hallikas
P Andolfatto
P Keightley
P Kheradpour
P Ray
P Tomancak
R Cartwright
R Durrett
R Satija
R Siddharthan
R Waterston
S Aerts
S Doniger
S Gallo
S MacArthur
S Sinha
Saurabh Sinha
V Mustonen
W Huang
W Wasserman
W Wong
Wyeth W. Wasserman
X Li
Xin He
Xu Ling
Z Hu
Publication venue: Public Library of Science
Publication date: 01/03/2009

Cross-species comparison has emerged as a powerful paradigm for predicting cis-regulatory modules (CRMs) and understanding their evolution. The comparison requires reliable sequence alignment, which remains a challenging task for less conserved noncoding sequences. Furthermore, the existing models of DNA sequence evolution generally do not explicitly treat the special properties of CRM sequences. To address these limitations, we propose a model of CRM evolution that captures different modes of evolution of functional transcription factor binding sites (TFBSs) and the background sequences. A particularly novel aspect of our work is a probabilistic model of gains and losses of TFBSs, a process increasingly recognized as an important part of regulatory sequence evolution. We present a computational framework that uses this model to solve the problems of CRM alignment and prediction. Our alignment method is similar to existing methods of statistical alignment but uses the conserved binding sites to improve alignment. Our CRM prediction method deals with the inherent uncertainties of binding site annotations and sequence alignment in a probabilistic framework. In simulated as well as real data, we demonstrate that our program is able to improve both alignment and prediction of CRM sequences over several state-of-the-art methods. Finally, we used alignments produced by our program to study binding site conservation in genome-wide binding data of key transcription factors in the Drosophila blastoderm, with two intriguing results: (i) the factor-bound sequences are under strong evolutionary constraints even if their neighboring genes are not expressed in the blastoderm and (ii) binding sites in distal bound sequences (relative to transcription start sites) tend to be more conserved than those in proximal regions. Our approach is implemented as software, EMMA (Evolutionary Model-based cis-regulatory Module Analysis), ready to be applied in a broad biological context.

Public Library of Science (PLOS)

Evolutionary Modeling and Prediction of Non-Coding RNAs in Drosophila

Author: A Siepel
A Stark
A Varadarajan
AG Clark
Andrew V. Uzilov
B Knudsen
B Paten
CN Dewey
D Rose
D St Johnston
DP Bartel
DS Parker
E Boyle
E Lécuyer
E Nawrocki
E Rivas
E Torarinsson
G McGuire
Ian Holmes
IL Hofacker
J Brennecke
J Pedersen
J Ruby
JL Thorne
JP Bachellerie
JR Manak
JS Pedersen
KS Pollard
Lars Barquist
M Crosby
M Mandal
M Pheasant
M Sprinzl
Mitchell E. Skinner
N Bray
N Goldman
PD Rijk
PS Klosterman
RD Dowell
RK Bradley
Robert Belshaw
Robert K. Bradley
S Griffiths-Jones
S Washietl
T Babak
T Elgavish
T Gesell
TM Lowe
V Ambros
WJ Bruno
YR Bendana
Yuri R. Bendaña
Z Wang
Z Yang
Publication venue: Public Library of Science
Publication date: 01/01/2009

We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3′ end of coding regions and, furthermore, depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA.

CiteSeerX