317 research outputs found

In silico segmentations of lentivirus envelope sequences

Author: Aurélia Boissin-Quillon
Caroline Leroux
Didier Piau
Publication venue: BMC Bioinformatics (BioMed Central)
Publication date: 01/01/2007

BACKGROUND: The gene encoding the envelope of lentiviruses exhibits a considerable plasticity, particularly the region which encodes the surface (SU) glycoprotein. Interestingly, mutations do not appear uniformly along the sequence of SU, but they are clustered in restricted areas, called variable (V) regions, which are interspersed with relatively more stable regions, called constant (C) regions. We look for specific signatures of C/V regions, using hidden Markov models constructed with SU sequences of the equine, human, small ruminant and simian lentiviruses. RESULTS: Our models yield clear and accurate delimitations of the C/V regions, when the test set and the training set were made up of sequences of the same lentivirus, but also when they were made up of sequences of different lentiviruses. Interestingly, the models predicted the different regions of lentiviruses such as the bovine and feline lentiviruses, not used in the training set. Models based on composite training sets produce accurate segmentations of sequences of all these lentiviruses. CONCLUSION: Our results suggest that each C/V region has a specific statistical oligonucleotide composition, and that the C (respectively V) regions of one of these lentiviruses are statistically more similar to the C (respectively V) regions of the other lentiviruses, than to the V (respectively C) regions of the same lentivirus
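
As a rough illustration of how a hidden Markov model can segment a sequence into constant and variable regions, the sketch below runs Viterbi decoding on a two-state model. It is a toy under stated assumptions, not the authors' models: the abstract describes signatures based on oligonucleotide composition, whereas this example emits single nucleotides, and every probability in it is invented for illustration.

```python
import math

# Hypothetical two-state model: "C" (constant) and "V" (variable) regions.
# All probabilities below are invented for illustration, not fitted values.
STATES = ("C", "V")
START = {"C": 0.5, "V": 0.5}
TRANS = {"C": {"C": 0.9, "V": 0.1}, "V": {"C": 0.1, "V": 0.9}}
EMIT = {
    "C": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "V": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty sequence."""
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][symbol])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Calling `viterbi` on a nucleotide string returns a string of the same length in which each character labels the corresponding position as `C` or `V`. Log probabilities are used so that long sequences do not underflow.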

CiteSeerX

Crossref

Hal - Université Grenoble Alpes

Springer

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

HAL Descartes

ProdInra

Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences

Author: Christophe Dessimoz
Manuel Gil
Adrian Schneider
Gaston H Gonnet
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The estimation of the difference between two evolutionary distances within a triplet of homologs is a common operation that is used for example to determine which of two sequences is closer to a third one. The most accurate method is currently maximum likelihood over the entire triplet. However, this approach is relatively time consuming. RESULTS: We show that an alternative estimator, based on pairwise estimates and therefore much faster to compute, has almost the same statistical power as the maximum likelihood estimator. We also provide a numerical approximation for its variance, which could otherwise only be estimated through an expensive re-sampling approach such as bootstrapping. An extensive simulation demonstrates that the approximation delivers precise confidence intervals. To illustrate the possible applications of these results, we show how they improve the detection of asymmetric evolution, and the identification of the closest relative to a given sequence in a group of homologs. CONCLUSION: The results presented in this paper constitute a basis for large-scale protein cross-comparisons of pairwise evolutionary distances
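
The idea of estimating the difference from two pairwise distances, and the bootstrap that the abstract names as the expensive way to get its variance, can be sketched as follows. This is a simplified illustration only: a Poisson-corrected distance stands in for the PAM/JTT distances of the paper, the sequences are assumed to be aligned and gap-free, and all function names are hypothetical.

```python
import math
import random


def poisson_distance(a, b):
    """Poisson-corrected distance between two aligned, gap-free sequences.

    This is a stand-in for a PAM/JTT distance; it fails if every
    position differs (p == 1), because the logarithm is then undefined.
    """
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    return -math.log(1.0 - p)


def distance_difference(a, b, c):
    """Estimate d(A, C) - d(B, C) from two independent pairwise distances."""
    return poisson_distance(a, c) - poisson_distance(b, c)


def bootstrap_variance(a, b, c, replicates=200, seed=0):
    """Sample variance of the difference under resampling of alignment columns."""
    rng = random.Random(seed)
    n = len(a)
    values = []
    for _ in range(replicates):
        # Draw n columns with replacement, using the same columns for all
        # three sequences so that the triplet stays aligned.
        cols = [rng.randrange(n) for _ in range(n)]
        ra = "".join(a[i] for i in cols)
        rb = "".join(b[i] for i in cols)
        rc = "".join(c[i] for i in cols)
        values.append(distance_difference(ra, rb, rc))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)
```

A negative value from `distance_difference(a, b, c)` suggests that A is closer to C than B is. The bootstrap recomputes both distances for every replicate, which is the cost a closed-form variance approximation avoids.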

Repository for Publications and Research Data

Crossref

Springer - Publisher Connector

PubMed Central

UCL Discovery

On the entropy of protein families

Author: Barton John
Chakraborty Arup
Cocco Simona
Jacquin Hugo
Monasson Rémi
Publication venue: Springer Science and Business Media LLC
Publication date: 26/12/2015

Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copying and transport. The protein sequences realizing a given function may vary widely across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the fixation of one particular amino acid on a specific site, and relate this notion to the escape probability of HIV. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed. Comment: to appear in Journal of Statistical Physics
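
To make the notions of family entropy and entropic cost concrete, here is a minimal sketch that uses only the 1-point statistics of an alignment, that is, a model in which sites are independent. This is a deliberate simplification and not the method of the paper, which also considers 2-point statistics and Hidden Markov Models; the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in nats) of the residues observed in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def independent_site_entropy(alignment):
    """Sum of column entropies: the entropy of a model with no couplings."""
    return sum(column_entropy(col) for col in zip(*alignment))


def entropic_cost_of_fixing(alignment, site):
    """Entropy lost when one site is fixed, under the independent-site model."""
    free = independent_site_entropy(alignment)
    # Fixing a site removes its column's contribution and leaves the rest.
    constrained = free - column_entropy([seq[site] for seq in alignment])
    return free - constrained
```

Under this independence assumption the cost of fixing a site equals the entropy of its column and does not depend on which amino acid is fixed. Models with couplings between sites would generally give a cost that depends on the fixed residue.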

arXiv.org e-Print Archive

DSpace@MIT

Hal-Diderot

In search of lost introns

Author: Miklós Csűrös
J. Andrew Holey
Igor B. Rogozin
Publication date: 03/02/2007

Many fundamental questions concerning the emergence and subsequent evolution of eukaryotic exon-intron organization are still unsettled. Genome-scale comparative studies, which can shed light on crucial aspects of eukaryotic evolution, require adequate computational tools. We describe novel computational methods for studying spliceosomal intron evolution. Our goal is to give a reliable characterization of the dynamics of intron evolution. Our algorithmic innovations address the identification of orthologous introns, and the likelihood-based analysis of intron data. We discuss a compression method for the evaluation of the likelihood function, which is noteworthy for phylogenetic likelihood problems in general. We prove that after $O(nL)$ preprocessing time, subsequent evaluations take $O(nL/\log L)$ time almost surely in the Yule-Harding random model of $n$-taxon phylogenies, where $L$ is the input sequence length. We illustrate the practicality of our methods by compiling and analyzing a data set involving 18 eukaryotes, more than in any other study to date. The study yields the surprising result that ancestral eukaryotes were fairly intron-rich. For example, the bilaterian ancestor is estimated to have had more than 90% as many introns as vertebrates do now.

arXiv.org e-Print Archive

Crossref

College of Saint Benedict and Saint John’s University: DigitalCommons@CSB/SJU

Topology identifies emerging adaptive mutations in SARS-CoV-2

Author: Bauer Ulrich
Bleher Michael
Carriere Mathieu
Hahn Lukas
Ott Andreas
Patino-Galindo Juan Angel
Rabadan Raul
Publication date: 14/06/2021

The COVID-19 pandemic has led to a worldwide effort to characterize its evolution through the mapping of mutations in the genome of the coronavirus SARS-CoV-2. Ideally, one would like to quickly identify new mutations that could confer adaptive advantages (e.g. higher infectivity or immune evasion) by leveraging the large number of genomes. One way of identifying adaptive mutations is by looking at convergent mutations, that is, mutations in the same genomic position that occur independently. However, the large number of currently available genomes precludes the efficient use of phylogeny-based techniques. Here, we establish a fast and scalable Topological Data Analysis approach for the early warning and surveillance of emerging adaptive mutations based on persistent homology. It identifies convergent events merely by their topological footprint and thus overcomes limitations of current phylogenetic inference techniques. This allows for an unbiased and rapid analysis of large viral datasets. We introduce a new topological measure for convergent evolution and apply it to the GISAID dataset as of February 2021, comprising 303,651 high-quality SARS-CoV-2 isolates collected since the beginning of the pandemic. We find that topologically salient mutations on the receptor-binding domain appear in several variants of concern and are linked with an increase in infectivity and immune escape, and for many adaptive mutations the topological signal precedes an increase in prevalence. We show that our method effectively identifies emerging adaptive mutations at an early stage. By localizing topological signals in the dataset, we extract geo-temporal information about the early occurrence of emerging adaptive mutations. The identification of these mutations can help to develop an alert system to monitor mutations of concern and guide experimentalists to focus the study of specific circulating variants.

arXiv.org e-Print Archive

Statistical methods for DNA sequences detection of recombination and distance estimation

Author: McGuire Grainne
Publication venue: The University of Edinburgh
Publication date: 01/01/1998

Edinburgh Research Archive

Incorporating regional context into pairwise alignments of biological sequences

Author: Sammut Raymond
Publication date: 13/09/2018

The Australian National University

Statistical Methods for Conservation and Alignment Quality in Proteins

Author: Ahola Virpi
Publication venue: Annales Universitatis Turkuensis AII 228
Publication date: 07/11/2008

Construction of multiple sequence alignments is a fundamental task in Bioinformatics. Multiple sequence alignments are used as a prerequisite in many Bioinformatics methods, and subsequently the quality of such methods can be critically dependent on the quality of the alignment. However, automatic construction of a multiple sequence alignment for a set of remotely related sequences does not always provide biologically relevant alignments. Therefore, there is a need for an objective approach for evaluating the quality of automatically aligned sequences. The profile hidden Markov model is a powerful approach in comparative genomics. In the profile hidden Markov model, the symbol probabilities are estimated at each conserved alignment position. This can increase the dimension of parameter space and cause an overfitting problem. These two research problems are both related to conservation. We have developed statistical measures for quantifying the conservation of multiple sequence alignments. Two types of methods are considered, those identifying conserved residues in an alignment position, and those calculating positional conservation scores. The positional conservation score was exploited in a statistical prediction model for assessing the quality of multiple sequence alignments. The residue conservation score was used as part of the emission probability estimation method proposed for profile hidden Markov models. The predicted alignment quality scores correlated highly with the correct alignment quality scores, indicating that our method is reliable for assessing the quality of any multiple sequence alignment. The comparison of the emission probability estimation method with the maximum likelihood method showed that the number of estimated parameters in the model was dramatically decreased, while the same level of accuracy was maintained.
To conclude, we have shown that conservation can be successfully used in the statistical model for alignment quality assessment and in the estimation of emission probabilities in the profile hidden Markov models. Transferred from Doria.

UTUPub

Analysis of among-site variation in substitution patterns

Author: A Reyes
AM Pedersen
C Lanave
DD Pollock
DL Swofford
DM Robinson
H Akaike
J Sullivan
JA Rice
JD Thompson
JJ Faith
JP Bielawski
JP Huelsenbeck
KP Burnham
LA Frederico
M Hasegawa
M Tanaka
MP Francino
R Nielsen
Z Yang
Publication venue: Biological Procedures Online
Publication date: 01/01/2004

Substitution patterns among nucleotides are often assumed to be constant in phylogenetic analyses. Although variation in the average rate of substitution among sites is commonly accounted for, variation in the relative rates of specific types of substitution is not. Here, we review details of methodologies used for detecting and analyzing differences in substitution processes among predefined groups of sites. We describe how such analyses can be performed using existing phylogenetic tools, and discuss how new phylogenetic analysis tools we have recently developed can be used to provide more detailed and sensitive analyses, including study of the evolution of mutation and substitution processes. As an example we consider the mitochondrial genome, for which two types of transition deaminations (C⇒T and A⇒G) are strongly affected by single-strandedness during replication, resulting in a strand asymmetric mutation process. Since time spent single-stranded varies along the mitochondrial genome, their differential mutational response results in very different substitution patterns in different regions of the genome

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Causes and consequences of purifying selection on SARS-CoV-2

Author: Cano Laura
Castillo Morales Atahualpa
Ho Alex
Hurst Laurence
Kudla Grzegorz
Mordstein Christine
Mühlhausen Stefanie
Rice Alan M.
Watson Samir
Young Bethan
Publication venue: Oxford University Press (OUP)
Publication date: 24/08/2021

Owing to a lag between a deleterious mutation’s appearance and its selective removal, gold-standard methods for mutation rate estimation assume no meaningful loss of mutations between parents and offspring. Indeed, from analysis of closely related lineages, in SARS-CoV-2, the Ka/Ks ratio was previously estimated as 1.008, suggesting no within-host selection. By contrast, we find a higher number of observed SNPs at 4-fold degenerate sites than elsewhere and, allowing for the virus’s complex mutational and compositional biases, estimate that the mutation rate is at least 49–67% higher than would be estimated based on the rate of appearance of variants in sampled genomes. Given the high Ka/Ks one might assume that the majority of such intrahost selection is the purging of nonsense mutations. However, we estimate that selection against nonsense mutations accounts for only ∼10% of all the “missing” mutations. Instead, classical protein-level selective filters (against chemically disparate amino acids and those predicted to disrupt protein functionality) account for many missing mutations. It is less obvious why, for an intracellular parasite, amino acid cost parameters, notably amino acid decay rate, are also significant. Perhaps most surprisingly, we also find evidence for real-time selection against synonymous mutations that move codon usage away from that of humans. We conclude that there is common intrahost selection on SARS-CoV-2 that acts on nonsense, missense, and possibly synonymous mutations. This has implications for methods of mutation rate estimation, for determining times to common ancestry and the potential for intrahost evolution including vaccine escape.

OPUS

PubMed Central

Edinburgh Research Explorer