107 research outputs found
A two-sample Bayesian t-test for microarray data
BACKGROUND: Determining whether a gene is differentially expressed in two different samples remains an important statistical problem. Prior work in this area has featured the use of t-tests with pooled estimates of the sample variance based on similarly expressed genes. These methods do not display consistent behavior across the entire range of pooling and can be biased when the prior hyperparameters are specified heuristically. RESULTS: A two-sample Bayesian t-test is proposed for use in determining whether a gene is differentially expressed in two different samples. The test method is an extension of earlier work that made use of point estimates for the variance. The method proposed here explicitly calculates in analytic form the marginal distribution for the difference in the mean expression of two samples, obviating the need for point estimates of the variance without recourse to posterior simulation. The prior distribution involves a single hyperparameter that can be calculated in a statistically rigorous manner, making clear the connection between the prior degrees of freedom and prior variance. CONCLUSION: The test is easy to understand and implement and application to both real and simulated data shows that the method has equal or greater power compared to the previous method and demonstrates consistent Type I error rates. The test is generally applicable outside the microarray field to any situation where prior information about the variance is available and is not limited to cases where estimates of the variance are based on many similar observations
Telomere-associated endonuclease-deficient Penelope-like retroelements in diverse eukaryotes
Author Posting. © The Author(s), 2007. This is the author's version of the work. It is posted here by permission of National Academy of Sciences of the USA for personal use, not for redistribution. The definitive version was published in Proceedings of the National Academy of Sciences of the United States of America 104 (2007): 9352-9357, doi:10.1073/pnas.0702741104.
The evolutionary origin of telomerases, enzymes that maintain the ends of linear
chromosomes in most eukaryotes, is a subject of debate. Penelope-like elements
(PLEs) are a recently described class of eukaryotic retroelements characterized by
a GIY-YIG endonuclease domain and by a reverse transcriptase domain with
similarity to telomerases and group II introns. Here we report that a subset of
PLEs found in bdelloid rotifers, basidiomycete fungi, stramenopiles, and plants,
representing four different eukaryotic kingdoms, lack the endonuclease domain
and are located at telomeres. The 5' truncated ends of these elements are telomere-oriented
and typically capped by species-specific telomeric repeats. Most of them
also carry several shorter stretches of telomeric repeats at or near their 3’ ends,
which could facilitate utilization of the telomeric G-rich 3’ overhangs to prime
reverse transcription. Many of these telomere-associated PLEs occupy a basal
phylogenetic position close to the point of divergence from the telomerase-PLE
common ancestor, and may descend from the missing link between early
eukaryotic retroelements and present-day telomerases.
Financial support from NIH and the
U.S. National Science Foundation (MCB-0614142
Selective Constraints on Amino Acids Estimated by a Mechanistic Codon Substitution Model with Multiple Nucleotide Changes
Empirical substitution matrices represent the average tendencies of
substitutions over various protein families by sacrificing gene-level
resolution. We develop a codon-based model, in which the mutational tendencies of
codons, the genetic code, and the strength of selective constraints against amino
acid replacements can be tailored to a given gene. First, selective constraints
averaged over proteins are estimated by maximizing the likelihood of each 1-PAM
matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution
matrices. Then, selective constraints specific to given proteins are
approximated as a linear function of those estimated from the empirical
substitution matrices.
Akaike information criterion (AIC) values indicate that a model allowing
multiple nucleotide changes fits the empirical substitution matrices
significantly better. Also, the ML estimates of transition-transversion bias
obtained from these empirical matrices are not so large as previously
estimated. The selective constraints are characteristic of proteins rather than
species. However, their relative strengths among amino acid pairs can be
approximated as depending mainly on the amino acid pair rather than on the protein family,
because the present model, in which selective constraints are approximated to
be a linear function of those estimated from the JTT/WAG/LG/KHG matrices, can
provide a good fit to other empirical substitution matrices including cpREV for
chloroplast proteins and mtREV for vertebrate mitochondrial proteins.
The present codon-based model with the ML estimates of selective constraints
and with adjustable nucleotide mutation rates would be useful as a simple
substitution model in ML and Bayesian inferences of molecular phylogenetic
trees, and enables us to obtain biologically meaningful information at both
nucleotide and amino acid levels from codon and protein sequences.
Comment: Table 9 in this article includes corrections for errata in the Table
9 published in 10.1371/journal.pone.0017244. Supporting information is
attached at the end of the article, and a computer-readable dataset of the ML
estimates of selective constraints is available from
10.1371/journal.pone.001724
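The AIC comparison mentioned in the abstract above can be illustrated with a minimal sketch. It uses only the standard definition AIC = 2k - 2 ln L. The function names and the numbers in the usage example are hypothetical and are not taken from the paper.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L. Lower values indicate a better fit."""
    return 2.0 * n_params - 2.0 * log_likelihood


def preferred_model(models: dict) -> str:
    """Return the name of the model with the lowest AIC.

    `models` maps a model name to a (log_likelihood, n_params) pair.
    """
    return min(models, key=lambda name: aic(*models[name]))


if __name__ == "__main__":
    # Hypothetical values for illustration only (not results from the paper).
    candidates = {
        "single_nucleotide_changes": (-1000.0, 10),
        "multiple_nucleotide_changes": (-990.0, 12),
    }
    for name, (lnL, k) in candidates.items():
        print(name, aic(lnL, k))
    print("preferred:", preferred_model(candidates))
```

A model with extra parameters is preferred only if its log-likelihood gain outweighs the penalty of 2 per added parameter.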
Genetic diversity of simian lentivirus in wild De Brazza’s monkeys (Cercopithecus neglectus) in Equatorial Africa
De Brazza’s monkeys (Cercopithecus neglectus) are non-human primates (NHP) living in Equatorial Africa from South Cameroon through the Congo Basin to Uganda. Like most NHP living in sub-Saharan Africa, they are naturally infected with their own simian lentivirus, SIVdeb. Previous studies confirmed this infection for De Brazza’s monkeys from East Cameroon and Uganda. In this report, we studied the genetic diversity of SIVdeb in De Brazza’s monkeys from different geographical areas in South Cameroon and from the Democratic Republic of Congo (DRC). SIVdeb strains from east, central and western equatorial Africa form a species-specific monophyletic lineage. Phylogeographic clustering was observed among SIVdeb strains from Cameroon, the DRC and Uganda, but also among primates from distinct areas in Cameroon. These observations suggest a longstanding virus–host co-evolution. SIVdeb prevalence is high in wild De Brazza’s populations and thus represents a current risk for humans exposed to these primates in central Africa.
The extraordinary evolutionary history of the reticuloendotheliosis viruses
The reticuloendotheliosis viruses (REVs) comprise several closely related amphotropic retroviruses isolated from birds. These viruses exhibit several highly unusual characteristics that have not so far been adequately explained, including their extremely close relationship to mammalian retroviruses, and their presence as endogenous sequences within the genomes of certain large DNA viruses. We present evidence for an iatrogenic origin of REVs that accounts for these phenomena. Firstly, we identify endogenous retroviral fossils in mammalian genomes that share a unique recombinant structure with REVs—unequivocally demonstrating that REVs derive directly from mammalian retroviruses. Secondly, through sequencing of archived REV isolates, we confirm that contaminated Plasmodium lophurae stocks have been the source of multiple REV outbreaks in experimentally infected birds. Finally, we show that both phylogenetic and historical evidence support a scenario wherein REVs originated as mammalian retroviruses that were accidentally introduced into avian hosts in the late 1930s, during experimental studies of P. lophurae, and subsequently integrated into the fowlpox virus (FWPV) and gallid herpesvirus type 2 (GHV-2) genomes, generating recombinant DNA viruses that now circulate in wild birds and poultry. Our findings provide a novel perspective on the origin and evolution of REV, and indicate that horizontal gene transfer between virus families can expand the impact of iatrogenic transmission events
Full-length genome sequence of a simian immunodeficiency virus (SIV) infecting a captive agile mangabey (Cercocebus agilis) is closely related to SIVrcm infecting wild red-capped mangabeys (Cercocebus torquatus) in Cameroon
Simian immunodeficiency viruses (SIVs) are lentiviruses that infect an extensive number of wild African primate species. Here we describe for the first time SIV infection in a captive agile mangabey (Cercocebus agilis) from Cameroon. Phylogenetic analysis of the full-length genome sequence of SIVagi-00CM312 showed that this novel virus fell into the SIVrcm lineage and was most closely related to a newly characterized SIVrcm strain (SIVrcm-02CM8081) from a wild-caught red-capped mangabey (Cercocebus torquatus) from Cameroon. In contrast to red-capped mangabeys, no 24 bp deletion in CCR5 has been observed in the agile mangabey. Further studies on wild agile mangabeys are needed to determine whether agile and red-capped mangabeys are naturally infected with the same SIV lineage, or whether this agile mangabey became infected with an SIVrcm strain in captivity. However, our study shows that agile mangabeys are susceptible to SIV infection
INDELible: A Flexible Simulator of Biological Sequence Evolution
Many methods exist for reconstructing phylogenies from molecular sequence data, but few phylogenies are known and can be used to check their efficacy. Simulation remains the most important approach to testing the accuracy and robustness of phylogenetic inference methods. However, current simulation programs are limited, especially concerning realistic models for simulating insertions and deletions. We implement a portable and flexible application, named INDELible, for generating nucleotide, amino acid and codon sequence data by simulating insertions and deletions (indels) as well as substitutions. Indels are simulated under several models of indel-length distribution. The program implements a rich repertoire of substitution models, including the general unrestricted model and nonstationary nonhomogeneous models of nucleotide substitution, mixture and partition models that account for heterogeneity among sites, and codon models that allow the nonsynonymous/synonymous substitution rate ratio to vary among sites and branches. With its many unique features, INDELible should be useful for evaluating the performance of many inference methods, including those for multiple sequence alignment, phylogenetic tree inference, and ancestral sequence or genome reconstruction.
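To make the idea of simulating substitutions together with indels concrete, here is a toy sketch. It is not INDELible's algorithm or interface: it applies a single discrete round of events per site, picks substitutions uniformly among the other three nucleotides, and draws indel lengths from a geometric distribution. All function names and default probabilities are assumptions chosen for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def geometric_length(rng: random.Random, p: float) -> int:
    """Draw an indel length >= 1 from a geometric distribution (success probability p)."""
    length = 1
    while rng.random() > p:
        length += 1
    return length


def evolve(seq: str, rng: random.Random, sub_prob: float = 0.05,
           indel_prob: float = 0.01, p_len: float = 0.5) -> str:
    """Apply one round of per-site substitutions, insertions and deletions.

    Toy discrete-step model: at each site an insertion or a deletion occurs
    with probability indel_prob / 2 each, a substitution with probability
    sub_prob, and otherwise the site is copied unchanged.
    """
    out = []
    i = 0
    while i < len(seq):
        r = rng.random()
        if r < indel_prob / 2:
            # Insertion of random nucleotides before the current site.
            n_ins = geometric_length(rng, p_len)
            out.extend(rng.choice(NUCLEOTIDES) for _ in range(n_ins))
            out.append(seq[i])
            i += 1
        elif r < indel_prob:
            # Deletion starting at the current site.
            i += geometric_length(rng, p_len)
        elif r < indel_prob + sub_prob:
            # Substitution to one of the three other nucleotides.
            out.append(rng.choice([b for b in NUCLEOTIDES if b != seq[i]]))
            i += 1
        else:
            out.append(seq[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    ancestor = "ACGT" * 10
    print(ancestor)
    print(evolve(ancestor, rng))
```

A realistic simulator would instead evolve sequences along the branches of a tree under a continuous-time substitution model. This sketch only shows how indel lengths and substitutions can be sampled side by side.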
A diversity of uncharacterized reverse transcriptases in bacteria
Retroelements are usually considered to be eukaryotic elements because of the large number and variety in eukaryotic genomes. By comparison, reverse transcriptases (RTs) are rare in bacteria, with only three characterized classes: retrons, group II introns and diversity-generating retroelements (DGRs). Here, we present the results of a bioinformatic survey that aims to define the landscape of RTs across eubacterial, archaeal and phage genomes. We identify and categorize 1021 RTs, of which the majority are group II introns (73%). Surprisingly, a plethora of novel RTs are found that do not belong to characterized classes. The RTs have 11 domain architectures and are classified into 20 groupings based on sequence similarity, phylogenetic analyses and open reading frame domain structures. Interestingly, group II introns are the only bacterial RTs to exhibit clear evidence for independent mobility, while five other groups have putative functions in defense against phage infection or promotion of phage infection. These examples suggest that additional beneficial functions will be discovered among uncharacterized RTs. The study lays the groundwork for experimental characterization of these highly diverse sequences and has implications for the evolution of retroelements
An Endogenous Foamy-like Viral Element in the Coelacanth Genome
Little is known about the origin and long-term evolutionary mode of retroviruses. Retroviruses can integrate into their hosts' genomes, providing a molecular fossil record for studying their deep history. Here we report the discovery of an endogenous foamy virus-like element, which we designate ‘coelacanth endogenous foamy-like virus’ (CoeEFV), within the genome of the coelacanth (Latimeria chalumnae). Phylogenetic analyses place CoeEFV basal to all known foamy viruses, strongly suggesting an ancient ocean origin of this major retroviral lineage, which had previously been known to infect only land mammals. The discovery of CoeEFV reveals the presence of foamy-like viruses in species outside the Mammalia. We show that foamy-like viruses have likely codiverged with their vertebrate hosts for more than 407 million years and underwent an evolutionary transition from water to land with their vertebrate hosts. These findings suggest an ancient marine origin of retroviruses and have important implications in understanding foamy virus biology
Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution
Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models
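The final step described above, forming an alignment-specific model as a weighted sum of basis matrices, can be sketched as follows. The sketch covers only that combination step and assumes the basis matrices and non-negative weights have already been learned; it does not implement the NNMF fitting itself. The function name and the tiny 2x2 matrices are hypothetical (real amino acid matrices would be 20x20).

```python
def combine_basis(basis, weights):
    """Return sum_k weights[k] * basis[k] for equally sized square matrices.

    `basis` is a list of matrices (lists of rows) and `weights` is a list of
    non-negative floats, one per basis matrix.
    """
    if len(basis) != len(weights):
        raise ValueError("need exactly one weight per basis matrix")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    n = len(basis[0])
    model = [[0.0] * n for _ in range(n)]
    for w, matrix in zip(weights, basis):
        for i in range(n):
            for j in range(n):
                model[i][j] += w * matrix[i][j]
    return model


if __name__ == "__main__":
    # Hypothetical 2x2 basis matrices, for illustration only.
    basis = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0]],
    ]
    print(combine_basis(basis, [0.5, 2.0]))
```

Because only the weights are specific to the alignment of interest, the number of free parameters equals the number of basis matrices rather than the number of matrix entries.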