
    Bioinformatics tools for analysing viral genomic data

    The field of viral genomics and bioinformatics is experiencing a strong resurgence due to high-throughput sequencing (HTS) technology, which enables the rapid and cost-effective sequencing and subsequent assembly of large numbers of viral genomes. In addition, the unprecedented power of HTS technologies has enabled the analysis of intra-host viral diversity and quasispecies dynamics in relation to important biological questions on viral transmission, vaccine resistance and host jumping. HTS also enables the rapid identification of both known and potentially new viruses from field and clinical samples, thus adding new tools to the fields of viral discovery and metagenomics. Bioinformatics has been central to the rise of HTS applications because new algorithms and software tools are continually needed to process and analyse the large, complex datasets generated in this rapidly evolving area. In this paper, the authors give a brief overview of the main bioinformatics tools available for viral genomic research, with a particular emphasis on HTS technologies and their main applications. They summarise the major steps in various HTS analyses, starting with quality control of raw reads and encompassing activities ranging from consensus and de novo genome assembly to variant calling and metagenomics, as well as RNA sequencing

    Finding needles in haystacks: linking scientific names, reference specimens and molecular data for Fungi

    DNA phylogenetic comparisons have shown that morphology-based species recognition often underestimates fungal diversity. Therefore, the need for accurate DNA sequence data, tied to both correct taxonomic names and clearly annotated specimen data, has never been greater. Furthermore, the growing number of molecular ecology and microbiome projects using high-throughput sequencing require fast and effective methods for en masse species assignments. In this article, we focus on selecting and re-annotating a set of marker reference sequences that represent each currently accepted order of Fungi. The particular focus is on sequences from the internal transcribed spacer region in the nuclear ribosomal cistron, derived from type specimens and/or ex-type cultures. Re-annotated and verified sequences were deposited in a curated public database at the National Center for Biotechnology Information (NCBI), namely the RefSeq Targeted Loci (RTL) database, and will be visible during routine sequence similarity searches with NR_prefixed accession numbers. A set of standards and protocols is proposed to improve the data quality of new sequences, and we suggest how type and other reference sequences can be used to improve identification of Fungi

    ViCTree: an automated framework for taxonomic classification from protein sequences

    Motivation: The increasing rate of submission of genetic sequences into public databases is providing a growing resource for classifying the organisms that these sequences represent. To aid viral classification, we have developed ViCTree, which automatically integrates the relevant sets of sequences in NCBI GenBank and transforms them into an interactive maximum likelihood phylogenetic tree that can be updated automatically. ViCTree incorporates ViCTreeView, which is a JavaScript-based visualisation tool that enables the tree to be explored interactively in the context of pairwise distance data. Results: To demonstrate utility, ViCTree was applied to subfamily Densovirinae of family Parvoviridae. This led to the identification of six new species of insect virus. Availability: ViCTree is open-source and can be run on any Linux- or Unix-based computer or cluster. A tutorial, the documentation and the source code are available under a GPL3 license, and can be accessed at http://bioinformatics.cvr.ac.uk/victree_web/

    Metagenomics for Bacteriology

    The study of bacteria, or bacteriology, has gone through transformative waves since its inception in the 1600s. It began with the visualization of bacteria using light microscopy by Antonie van Leeuwenhoek, when he first described “animalcules.” Direct cellular observation then evolved to use different wavelengths on novel platforms such as electron, fluorescence, and even near-infrared microscopy. Understanding the link between microbes and disease (pathogenicity) began with the ability to isolate and cultivate organisms through aseptic methodologies starting in the 1700s. These techniques became more prevalent in the following centuries with the work of famous scientists such as Louis Pasteur and Robert Koch, and many others since then. The relationship between bacteria and the host’s immune system was first inferred in the 1800s and is still being unravelled today. During the last century, researchers initiated the era of molecular genetics. The development of the first-generation sequencing technology, the Sanger method, and, later, the polymerase chain reaction propelled the molecular genetics field by greatly expanding knowledge of the relationship between gene structure and function. The rise of commercially available next-generation sequencing methodologies at the beginning of this century is allowing far larger amounts of information to be acquired, in a manner that opens the approach to democratization

    Bayesian Model-building in Phylogenetics

    DNA sequencing costs have decreased dramatically over recent decades, resulting in a flood of phylogenetic information available to researchers. While it is often assumed that additional data will lead to more accurate conclusions, it also raises a number of problems for phylogeneticists, including mundane computational issues such as data management and complex statistical problems such as obtaining a single species tree from multiple conflicting gene trees. Developing new methods to make better use of existing data and probe the causes of conflicting signal will be necessary to confidently resolve phylogenies in the genomic era. Here, we examine two current problems in statistical phylogenetics and attempt to address them in a Bayesian framework. The first problem involves inflated tree lengths in Bayesian phylogenies, which can be an order of magnitude longer than maximum likelihood estimates. We developed EmpPrior, a program which queries TreeBASE for datasets similar to the focal data, then estimates parameters from each dataset to inform priors on the focal data. This approach greatly improves the tree length credible intervals in four exemplar datasets and, when combined with other approaches such as the use of a compound Dirichlet prior on tree length, can nearly eliminate the problem of inflated trees. The second problem involves incongruence between morphological and molecular phylogenies in squamates. Here, we use posterior prediction with inferential test statistics to investigate whether systematic error may be biasing inference in the molecular data. While we detected some model violation in most of the 44 genes, the genes with the most model violation were more distant from the molecular phylogeny. This suggests that model violation is not a major source of error in the molecular data. Hence, the source of incongruence between the molecular and morphological squamate topologies remains unknown. 
    In both problems, we found that incorporating tools such as informed priors and posterior prediction from Bayesian statistical literature into phylogenetic analyses can improve results and help uncover why different datasets lead to conflicting topologies. As phylogenetic datasets continue to grow, using methodological best practices will only become more important if we want to have confidence in our conclusions

    iDNA from terrestrial haematophagous leeches as a wildlife surveying and monitoring tool - prospects, pitfalls and avenues to be developed

    Invertebrate-derived DNA (iDNA) from terrestrial haematophagous leeches has recently been proposed as a powerful non-invasive tool with which to detect vertebrate species and thus to survey their populations. However, to date little attention has been given to whether and how this, or indeed any other iDNA-derived data, can be combined with state-of-the-art analytical tools to estimate wildlife abundances, population dynamics and distributions. In this review, we discuss the challenges that face the application of existing analytical methods such as site-occupancy and spatial capture-recapture (SCR) models to terrestrial leech iDNA, in particular, possible violations of key assumptions arising from factors intrinsic to invertebrate parasite biology. Specifically, we review the advantages and disadvantages of terrestrial leeches as a source of iDNA and summarize the utility of leeches for presence, occupancy, and spatial capture-recapture models. The main source of uncertainty that attends species detections derived from leech gut contents is attributable to uncertainty about the spatio-temporal sampling frame, since leeches retain host-blood for months and can move after feeding. Subsequently, we briefly address how the analytical challenges associated with leeches may apply to other sources of iDNA. Our review highlights that despite the considerable potential of leech (and indeed any) iDNA as a new survey tool, further pilot studies are needed to assess how analytical methods can overcome or not the potential biases and assumption violations of the new field of iDNA. Specifically we argue that studies to compare iDNA sampling with standard survey methods such as camera trapping, and those to improve our knowledge on leech (and other invertebrate parasite) physiology, taxonomy, and ecology will be of immense future value

    Incorporating molecular data in fungal systematics: a guide for aspiring researchers

    The last twenty years have witnessed molecular data emerge as a primary research instrument in most branches of mycology. Fungal systematics, taxonomy, and ecology have all seen tremendous progress and have undergone rapid, far-reaching changes as disciplines in the wake of continual improvement in DNA sequencing technology. A taxonomic study that draws from molecular data involves a long series of steps, ranging from taxon sampling through the various laboratory procedures and data analysis to the publication process. All steps are important and influence the results and the way they are perceived by the scientific community. The present paper provides a reflective overview of all major steps in such a project, with the aim of assisting research students about to begin their first study using DNA-based methods. We also take the opportunity to discuss the role of taxonomy in biology and the life sciences in general in the light of molecular data. While the best way to learn molecular methods is to work side by side with someone experienced, we hope that the present paper will serve to lower the learning threshold for the reader. Comment: Submitted to Current Research in Environmental and Applied Mycology - comments most welcome

    Deflating trees: Improving bayesian branch-length estimates using informed priors

    © The Author(s) 2015. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. Prior distributions can have a strong effect on the results of Bayesian analyses. However, no general consensus exists for how priors should be set in all circumstances. Branch-length priors are of particular interest for phylogenetics, because they affect many parameters and biologically relevant inferences have been shown to be sensitive to the chosen prior distribution. Here, we explore the use of outside information to set informed branch-length priors and compare inferences from these informed analyses to those using default settings. For both the commonly used exponential and the newly proposed compound Dirichlet prior distributions, the incorporation of relevant outside information improves inferences for data sets that have produced problematic branch- and tree-length estimates under default settings. We suggest that informed priors are worthy of further exploration for phylogenetics
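
    As a rough illustration of what an informed exponential branch-length prior could look like, the sketch below pools branch lengths from outside datasets and uses the maximum-likelihood estimate of the exponential rate (one over the sample mean). This is an assumption made for illustration, not the procedure described in the abstract; the function name and the branch-length values are hypothetical.

```python
from statistics import mean


def informed_exponential_rate(outside_branch_lengths):
    """Return the rate of an exponential prior fitted to outside data.

    `outside_branch_lengths` is a list of datasets, each a list of
    branch lengths (substitutions per site). The exponential MLE for
    the rate is 1 / (sample mean), so the prior mean equals the mean
    branch length observed in the outside datasets.
    """
    pooled = [b for dataset in outside_branch_lengths for b in dataset]
    if not pooled:
        raise ValueError("need at least one branch length")
    if any(b <= 0 for b in pooled):
        raise ValueError("branch lengths must be positive")
    return 1.0 / mean(pooled)


# Hypothetical branch lengths from two outside datasets.
outside = [[0.02, 0.05, 0.03], [0.04, 0.06]]
rate = informed_exponential_rate(outside)
prior_mean = 1.0 / rate  # mean branch length implied by the prior
print(f"rate = {rate:.2f}, prior mean = {prior_mean:.3f}")
```

    Pooling every branch equally is the simplest possible choice; weighting datasets by similarity to the focal data would be a natural refinement.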

    Species-level functional profiling of metagenomes and metatranscriptomes.

    Functional profiles of microbial communities are typically generated using comprehensive metagenomic or metatranscriptomic sequence read searches, which are time-consuming, prone to spurious mapping, and often limited to community-level quantification. We developed HUMAnN2, a tiered search strategy that enables fast, accurate, and species-resolved functional profiling of host-associated and environmental communities. HUMAnN2 identifies a community's known species, aligns reads to their pangenomes, performs translated search on unclassified reads, and finally quantifies gene families and pathways. Relative to pure translated search, HUMAnN2 is faster and produces more accurate gene family profiles. We applied HUMAnN2 to study clinal variation in marine metabolism, ecological contribution patterns among human microbiome pathways, variation in species' genomic versus transcriptional contributions, and strain profiling. Further, we introduce 'contributional diversity' to explain patterns of ecological assembly across different microbial community types
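
    The tiered idea in this abstract can be sketched in a few lines: try the cheap, species-specific search first, fall back to the broader search only for reads that remain unclassified, then count gene families. The sketch below is a toy, not HUMAnN2's implementation or API. The two lookup tables are hypothetical stand-ins for real aligner output, and the initial species-identification step is assumed to have already produced the pangenome table.

```python
from collections import Counter


def tiered_profile(reads, pangenome_hits, translated_hits):
    """Count gene families per species using a two-tier lookup.

    pangenome_hits:  read -> (species, gene_family), standing in for
                     nucleotide alignment to detected species' pangenomes.
    translated_hits: read -> gene_family, standing in for translated
                     search, applied only to reads that tier 1 missed.
    Returns (counts, unmapped), where counts is keyed by
    (gene_family, species) and unmapped is the number of reads with
    no hit in either tier.
    """
    counts = Counter()
    unclassified = []

    # Tier 1: species-resolved assignment.
    for read in reads:
        hit = pangenome_hits.get(read)
        if hit is None:
            unclassified.append(read)
        else:
            species, family = hit
            counts[(family, species)] += 1

    # Tier 2: fallback search on the leftover reads only.
    unmapped = 0
    for read in unclassified:
        family = translated_hits.get(read)
        if family is None:
            unmapped += 1
        else:
            counts[(family, "unclassified")] += 1

    return counts, unmapped


# Hypothetical reads and hits.
reads = ["r1", "r2", "r3", "r4"]
pangenome_hits = {"r1": ("species_A", "famX"), "r2": ("species_B", "famX")}
translated_hits = {"r3": "famY"}
counts, unmapped = tiered_profile(reads, pangenome_hits, translated_hits)
print(dict(counts), unmapped)
```

    Because tier 2 sees only the reads that tier 1 could not place, the expensive search runs on a smaller input, which is the source of the speed-up the abstract describes.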