241 research outputs found

    Exploring the relationship between sequence similarity and accurate phylogenetic trees

    © 2006 The Authors. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. The definitive version was published in Molecular Biology and Evolution 23 (2006): 2090-2100, doi:10.1093/molbev/msl080. We have characterized the relationship between accurate phylogenetic reconstruction and sequence similarity, testing whether high levels of sequence similarity can consistently produce accurate evolutionary trees. We generated protein families with known phylogenies using a modified version of the PAML/EVOLVER program that produces insertions and deletions as well as substitutions. Protein families were evolved over a range of 100–400 point accepted mutations; at these distances 63% of the families shared significant sequence similarity. Protein families were evolved using balanced and unbalanced trees, with ancient or recent radiations. In families sharing statistically significant similarity, about 60% of multiple sequence alignments were 95% identical to true alignments. To compare recovered topologies with true topologies, we used a score that reflects the fraction of clades that were correctly clustered. As expected, the accuracy of the phylogenies was greatest in the least divergent families. About 88% of phylogenies clustered over 80% of clades in families that shared significant sequence similarity, using Bayesian, parsimony, distance, and maximum likelihood methods. However, for protein families with short ancient branches (ancient radiation), only 30% of the most divergent (but statistically significant) families produced accurate phylogenies, and only about 70% of the second most highly conserved families, with median expectation values better than 10^-60, produced accurate trees. These values represent upper bounds on expected tree accuracy for sequences with a simple divergence history; proteins from 700 Giardia families, with a similar range of sequence similarities but considerably more gaps, produced much less accurate trees. For our simulated insertions and deletions, correct multiple sequence alignments did not perform much better than those produced by T-COFFEE, and including sequences with expressed sequence tag–like sequencing errors did not significantly decrease phylogenetic accuracy. In general, although less-divergent sequence families produce more accurate trees, the likelihood of estimating an accurate tree is most dependent on whether radiation in the family was ancient or recent. Accuracy can be improved by combining genes from the same organism when creating species trees or by selecting protein families with the best bootstrap values in comprehensive studies. This work was supported by National Institutes of Health grant AI1058054 to M. Sogin.
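
    The abstract above mentions a score reflecting the fraction of clades that were correctly clustered. The following is a minimal sketch of one way such a score could be computed, not the authors' implementation: it assumes rooted trees written as nested tuples of leaf names, treats a clade as the set of leaves below an internal node, and ignores the root clade. The function names `clades` and `clade_recovery` are hypothetical.

```python
def clades(tree):
    """Return the set of non-root clades (as frozensets of leaf names).

    Assumption: a tree is a nested tuple whose leaves are strings.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)  # the root clade is shared by every tree on these leaves
    return found


def clade_recovery(true_tree, recovered_tree):
    """Fraction of true clades that also appear in the recovered tree."""
    true_clades = clades(true_tree)
    if not true_clades:
        return 1.0
    return len(true_clades & clades(recovered_tree)) / len(true_clades)


# Illustrative trees only; the leaf names are made up.
true_tree = ((("A", "B"), "C"), ("D", "E"))
recovered_tree = ((("A", "C"), "B"), ("D", "E"))
score = clade_recovery(true_tree, recovered_tree)
```

    In this toy case the recovered tree keeps the clades {A, B, C} and {D, E} but misplaces {A, B}, so two of the three true clades are recovered.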

    Myelin tetraspan family proteins but no non-tetraspan family proteins are present in the ascidian (Ciona intestinalis) genome

    Author Posting. © Marine Biological Laboratory, 2005. This article is posted here by permission of Marine Biological Laboratory for personal use, not for redistribution. The definitive version was published in Biological Bulletin 209 (2005): 49-66. Several of the proteins used to form and maintain myelin sheaths in the central nervous system (CNS) and the peripheral nervous system (PNS) are shared among different vertebrate classes. These proteins include one-to-several alternatively spliced myelin basic protein (MBP) isoforms in all sheaths, proteolipid protein (PLP) and DM20 (except in amphibians) in tetrapod CNS sheaths, and one or two protein zero (P0) isoforms in fish CNS and in all vertebrate PNS sheaths. Several other proteins, including 2',3'-cyclic nucleotide 3'-phosphodiesterase (CNP), myelin and lymphocyte protein (MAL), plasmolipin, and peripheral myelin protein 22 (PMP22; prominent in PNS myelin), are localized to myelin and myelin-associated membranes, though class distributions are less well studied. Databases with known and identified sequences of these proteins from cartilaginous and teleost fishes, amphibians, reptiles, birds, and mammals were prepared and used to search for potential homologs in the basal vertebrate, Ciona intestinalis. Homologs of lipophilin proteins, MAL/plasmolipin, and PMP22 were identified in the Ciona genome. In contrast, no MBP, P0, or CNP homologs were found. These studies provide a framework for understanding how myelin proteins were recruited during evolution and how structural adaptations enabled them to play key roles in myelination. This work was supported by grant IBN-0402188 from the National Science Foundation (RMG).

    Characterisation of the subtelomeric regions of Giardia lamblia genome isolate WBC6

    Author Posting. © The Author(s), 2007. This is the author's version of the work. It is posted here by permission of Elsevier B.V. for personal use, not for redistribution. The definitive version was published in International Journal for Parasitology 37 (2007): 503-513, doi:10.1016/j.ijpara.2006.12.011. Giardia trophozoites are polyploid and have five chromosomes. The chromosome homologues demonstrate considerable size heterogeneity due to variation in the subtelomeric regions. We used clones from the genome project with telomeric sequence at one end to identify six subtelomeric regions in addition to previously identified subtelomeric regions, to study the telomeric arrangement of the chromosomes. The subtelomeric regions included two retroposons, one retroposon pseudogene, and two vsp genes, in addition to the previously identified subtelomeric regions that include ribosomal DNA repeats. The presence of vsp genes in a subtelomeric region suggests that telomeric rearrangements may contribute to the generation of vsp diversity. These studies of the subtelomeric regions of Giardia may contribute to our understanding of the factors that maintain stability, while allowing diversity in chromosome structure. This work was supported in part by NIH grant AI43273 to Mitchell L. Sogin. Additional support was provided by the G. Unger Vetlesen Foundation and LI-COR Biotechnology.

    DRISEE overestimates errors in metagenomic sequencing data

    © The Author(s), 2013. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Briefings in Bioinformatics 15 (2014): 783-787, doi:10.1093/bib/bbt010. The extremely high error rates reported by Keegan et al. in ‘A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE’ (PLoS Comput Biol 2012;8:e1002541) for many next-generation sequencing datasets prompted us to re-examine their results. Our analysis reveals that the presence of conserved artificial sequences, e.g. Illumina adapters, and other naturally occurring sequence motifs accounts for most of the reported errors. We conclude that DRISEE reports inflated levels of sequencing error, particularly for Illumina data. Tools offered for evaluating large datasets need scrupulous review before they are implemented. Funding: National Institutes of Health [1UH2DK083993 to M.L.S.]; National Science Foundation [BDI- 096026 to S.M.H.]

    Evolution of eukaryotic transcription : insights from the genome of Giardia lamblia

    Author Posting. © Cold Spring Harbor Laboratory Press, 2004. This article is posted here by permission of Cold Spring Harbor Laboratory Press for personal use, not for redistribution. The definitive version was published in Genome Research 14 (2004): 1537-1547, doi:10.1101/gr.2256604. The Giardia lamblia genome sequencing project affords us a unique opportunity to conduct comparative analyses of core cellular systems between early and late-diverging eukaryotes on a genome-wide scale. We report a survey to identify canonical transcription components in Giardia, focusing on RNA polymerase (RNAP) subunits and transcription-initiation factors. Our survey revealed that Giardia contains homologs to 21 of the 28 polypeptides comprising eukaryal RNAPI, RNAPII, and RNAPIII; six of the seven RNAP subunits without giardial homologs are polymerase specific. Components of only four of the 12 general transcription initiation factors have giardial homologs. Surprisingly, giardial TATA-binding protein (TBP) is highly divergent with respect to archaeal and higher eukaryotic TBPs, and a giardial homolog of transcription factor IIB was not identified. We conclude that Giardia represents a transition during the evolution of eukaryal transcription systems, exhibiting a relatively complete set of RNAP subunits and a rudimentary basal initiation apparatus for each transcription system. Most class-specific RNAP subunits and basal initiation factors appear to have evolved after the divergence of Giardia from the main eukaryotic line of descent. Consequently, Giardia is predicted to be unique in many aspects of transcription initiation with respect to paradigms derived from studies in crown eukaryotes. This work was supported in part by NIH grant AI43273 to M.L.S., by NIH grant AI51089 to A.G.M., and DOE grant DE-FG02-01ER63201 to G.J.O. Additional support was provided by the G. Unger Vetlesen Foundation and LI-COR Biotechnology.

    Optical map of the genotype A1 WB C6 Giardia lamblia genome isolate

    Author Posting. © The Author(s), 2011. This is the author's version of the work. It is posted here by permission of Elsevier B.V. for personal use, not for redistribution. The definitive version was published in Molecular and Biochemical Parasitology 180 (2011): 112-114, doi:10.1016/j.molbiopara.2011.07.008. The Giardia lamblia genome consists of 12 Mb divided among 5 chromosomes ranging in size from approximately 1 to 4 Mb. The assembled contigs of the genotype A1 isolate, WB, were previously mapped along the 5 chromosomes on the basis of hybridization of plasmid clones representing the contigs to chromosomes separated by PFGE. In the current report, we have generated an MluI optical map of the WB genome to improve the accuracy of the physical map. This has allowed us to correct several assembly errors and to better define the extent of the subtelomeric regions that are not included in the genome assembly. This work was funded in part by the Woods Hole Center for Oceans and Human Health, jointly funded by the National Science Foundation (OCE-0430724) and the National Institute for Environmental Health Sciences (P50 ES012742).

    The transcriptional response to encystation stimuli in Giardia lamblia is restricted to a small set of genes

    Author Posting. © The Author(s), 2010. This is the author's version of the work. It is posted here by permission of American Society for Microbiology for personal use, not for redistribution. The definitive version was published in Eukaryotic Cell 9 (2010): 1566-1576, doi:10.1128/EC.00100-10. The protozoan parasite Giardia lamblia undergoes stage-differentiation in the small intestine of the host to an environmentally resistant and infectious cyst. Encystation involves secretion of an extracellular matrix comprised of cyst wall proteins (CWPs) and a β(1-3)-GalNAc homopolymer. Upon induction of encystation, genes coding for CWPs are switched on, and mRNAs coding for a transcription factor Myb and enzymes involved in cyst wall glycan synthesis are upregulated. Encystation in vitro is triggered by several protocols, which call for changes in bile concentrations or availability of lipids, and elevated pH. However, the conditions for induction are not standardized and we predicted significant protocol-specific side effects. This makes reliable identification of encystation factors difficult. Here, we exploited the possibility to induce encystation with two different protocols, which we show to be equally effective, for a comparative mRNA profile analysis. The standard encystation protocol induced a bipartite transcriptional response with surprisingly minor involvement of stress genes. A comparative analysis revealed a core set of only 18 encystation genes and showed that a majority of genes was indeed upregulated as a side effect of inducing conditions. We also established a Myb binding sequence as a signature motif in encystation promoters, suggesting coordinated regulation of these factors. We acknowledge in particular the “Stiftung zur Förderung der Wissenschaftlichen Forschung an der Universität Zürich” for financial support for this project. C.S. was supported by the Roche and Novartis Foundation, and “Stiftung für Forschungsförderung” University of Zurich. Research in the Hehl laboratory is supported by the Swiss National Science Foundation (grant #31003A-125389).

    Accuracy and quality of massively parallel DNA pyrosequencing

    © 2007 Huse et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The definitive version was published in Genome Biology 8 (2007): R143, doi:10.1186/gb-2007-8-7-r143. Additional data file 1 is a fasta file of the 43 known sequences used. Additional data file 2 is a gzip-compressed fasta file of the sequences output by the GS20. These sequences correspond to those included in Additional data files 3, 4, 5 but include only the final sequence information. Additional data files 3, 4, 5 are three compressed text files representing the text translations of the original GS20 binary output (sff) files for all of the sequencing used in the analysis, including sequence, flowgram and other run information. GS20 data are reported by region of the PicoTiterPlate™; we sequenced three plate regions. Massively parallel pyrosequencing systems have increased the efficiency of DNA sequencing, although the published per-base accuracy of a Roche GS20 is only 96%. In genome projects, highly redundant consensus assemblies can compensate for sequencing errors. In contrast, studies of microbial diversity that catalogue differences between PCR amplicons of ribosomal RNA genes (rDNA) or other conserved gene families cannot take advantage of consensus assemblies to detect and minimize incorrect base calls. We performed an empirical study of the per-base error rate for the Roche GS20 system using sequences of the V6 hypervariable region from cloned microbial ribosomal DNA (tag sequencing). We calculated a 99.5% accuracy rate in unassembled sequences, and identified several factors that can be used to remove a small percentage of low-quality reads, improving the accuracy to 99.75% or better. By using objective criteria to eliminate low quality data, the quality of individual GS20 sequence reads in molecular ecological applications can surpass the accuracy of traditional capillary methods. This work was supported by National Aeronautics and Space Administration Astrobiology Institute Cooperative Agreement NNA04CC04A (to MLS), subcontracts from the Woods Hole Center for Oceans and Human Health from the National Institutes of Health and National Science Foundation (NIH/NIEHS 1 P50 ES012742-01 and NSF/OCE 0430724-J Stegeman PI to HGM and MLS), grants from the WM Keck Foundation and the G Unger Vetlesen Foundation (to MLS), and a National Research Council Research Associateship Award (to JAH).

    Minimum entropy decomposition : unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences

    © The Author(s), 2014. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in ISME Journal 9 (2015): 968–979, doi:10.1038/ismej.2014.195. Molecular microbial ecology investigations often employ large marker gene datasets, for example, ribosomal RNAs, to represent the occurrence of single-cell genomes in microbial communities. Massively parallel DNA sequencing technologies enable extensive surveys of marker gene libraries that sometimes include nearly identical sequences. Computational approaches that rely on pairwise sequence alignments for similarity assessment and de novo clustering with de facto similarity thresholds to partition high-throughput sequencing datasets constrain fine-scale resolution descriptions of microbial communities. Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into ‘MED nodes’, which represent homogeneous operational taxonomic units. By employing Shannon entropy, MED uses only the information-rich nucleotide positions across reads and iteratively partitions large datasets while omitting stochastic variation. When applied to analyses of microbiomes from two deep-sea cryptic sponges Hexadella dedritifera and Hexadella cf. dedritifera, MED resolved a key Gammaproteobacteria cluster into multiple MED nodes that are specific to different sponges, and revealed that these closely related sympatric sponge species maintain distinct microbial communities. MED analysis of a previously published human oral microbiome dataset also revealed that taxa separated by less than 1% sequence variation distributed to distinct niches in the oral cavity. The information theory-guided decomposition process behind the MED algorithm enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision. AME was supported by a G. Unger Vetlesen Foundation grant to the Marine Biological Laboratory and the Alfred P Sloan Foundation.
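
    The abstract above describes using Shannon entropy to find information-rich nucleotide positions and partitioning reads on them. The sketch below illustrates a single such partitioning step under simplifying assumptions: reads are equal-length, already aligned strings, and the split is made on the single highest-entropy column. It is not the published MED implementation, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def position_entropies(reads):
    """Entropy of every column; assumes all reads have the same length."""
    if len({len(read) for read in reads}) != 1:
        raise ValueError("reads must be aligned to the same length")
    return [column_entropy(column) for column in zip(*reads)]


def split_on_max_entropy(reads):
    """Partition reads by the nucleotide found at the highest-entropy position."""
    entropies = position_entropies(reads)
    position = max(range(len(entropies)), key=entropies.__getitem__)
    groups = defaultdict(list)
    for read in reads:
        groups[read[position]].append(read)
    return position, dict(groups)


# Toy reads, made up for illustration.
reads = ["ACGT", "ACGT", "ATGT", "ATGT"]
entropies = position_entropies(reads)
position, groups = split_on_max_entropy(reads)
```

    Here only the second column varies, so it carries all the entropy and the reads separate into a "C" group and a "T" group. An iterative scheme would repeat the step within each group until no column exceeds some entropy threshold, which is a parameter this sketch leaves out.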

    Brief report: Using global positioning system (GPS) enabled cell phones to examine adolescent travel patterns and time in proximity to alcohol outlets

    As adolescents gain freedom to explore new environments unsupervised, more time in proximity to alcohol outlets may increase risks for alcohol and marijuana use. This pilot study: 1) Describes variations in adolescents' proximity to outlets by time of day and day of the week, 2) Examines variations in outlet proximity by drinking and marijuana use status, and 3) Tests feasibility of obtaining real-time data to study adolescent proximity to outlets. U.S. adolescents (N = 18) aged 16–17 (50% female) carried GPS-enabled smartphones for one week with their locations tracked. The geographic areas where adolescents spend time, activity spaces, were created by connecting GPS points sequentially and adding spatial buffers around routes. Proximity to outlets was greater during after school and evening hours. Drinkers and marijuana users were in proximity to outlets 1½ to 2 times more than non-users. Findings provide information about where adolescents spend time and times of greatest risk, informing prevention efforts.
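
    The abstract above describes activity spaces built by connecting GPS points sequentially and buffering the resulting routes. A minimal sketch of that idea follows, assuming the points have already been projected to planar coordinates in metres and that "proximity" means lying within the buffer distance of any route segment. The buffer width, coordinates, and function names are illustrative assumptions, not values from the study.

```python
import math


def distance_to_segment(point, start, end):
    """Shortest planar distance from a point to the segment start-end."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: two identical GPS fixes
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def in_activity_space(point, route, buffer_m):
    """True if the point lies within buffer_m of the sequentially connected route."""
    return any(
        distance_to_segment(point, start, end) <= buffer_m
        for start, end in zip(route, route[1:])
    )


# Hypothetical route and outlet locations in projected metres.
route = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
near_outlet = in_activity_space((50.0, 30.0), route, buffer_m=50.0)
far_outlet = in_activity_space((300.0, 300.0), route, buffer_m=50.0)
```

    Raw latitude and longitude would need projecting first, since treating degrees as planar units distorts distances.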