
    Specimens at the Center: An Informatics Workflow and Toolkit for Specimen-level analysis of Public DNA database data

    Major public DNA databases — NCBI GenBank, the DNA DataBank of Japan (DDBJ), and the European Molecular Biology Laboratory (EMBL) — are invaluable biodiversity libraries. Systematists and other biodiversity scientists commonly mine these databases for sequence data to use in phylogenetic studies, but such studies generally use only the taxonomic identity of the sequenced tissue, not the specimen identity. Thus studies that use DNA supermatrices to construct phylogenetic trees with species at the tips typically do not take advantage of the fact that for many individuals in the public DNA databases, several DNA regions have been sampled; and for many species, two or more individuals have been sampled. Thus these studies typically do not make full use of the multigene datasets in public DNA databases to test species coherence and select optimal sequences to represent a species. In this study, we introduce a set of tools developed in the R programming language to construct individual-based trees from NCBI GenBank data and present a set of trees for the genus Carex (Cyperaceae) constructed using these methods. For the more than 770 species for which we found sequence data, our approach recovered an average of 1.85 gene regions per specimen, up to seven for some specimens, and more than 450 species represented by two or more specimens. Depending on the subset of genes analyzed, we found up to 42% of species monophyletic. We introduce a simple tree statistic—the Taxonomic Disparity Index (TDI)—to assist in curating specimen-level datasets and provide code for selecting maximally informative (or, conversely, minimally misleading) sequences as species exemplars. While tailored to the Carex dataset, the approach and code presented in this paper can readily be generalized to constructing individual-level trees from large amounts of data for any species group
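    The specimen-level bookkeeping described in this abstract (gene regions per specimen, species represented by two or more specimens) can be sketched roughly as follows. This is a hedged illustration in Python, not the authors' R toolkit: the record layout, the function name `summarize`, and the toy records are all assumptions made for the example, and no real GenBank data is used.

```python
from collections import defaultdict

# Made-up toy records for illustration only: (species, specimen voucher, gene region).
# The species labels, vouchers, and gene names are placeholders, not data from the paper.
records = [
    ("Carex sp. A", "voucher_1", "ITS"),
    ("Carex sp. A", "voucher_1", "ETS"),
    ("Carex sp. A", "voucher_2", "ITS"),
    ("Carex sp. B", "voucher_3", "matK"),
]


def summarize(records):
    """Return (mean gene regions per specimen, species with >= 2 specimens)."""
    genes_by_specimen = defaultdict(set)
    specimens_by_species = defaultdict(set)
    for species, voucher, gene in records:
        # A specimen is identified here by its (species, voucher) pair.
        genes_by_specimen[(species, voucher)].add(gene)
        specimens_by_species[species].add(voucher)

    if not genes_by_specimen:
        return 0.0, []

    mean_genes = sum(len(g) for g in genes_by_specimen.values()) / len(genes_by_specimen)
    multi_specimen_species = sorted(
        species for species, vouchers in specimens_by_species.items() if len(vouchers) >= 2
    )
    return mean_genes, multi_specimen_species


mean_genes, multi_specimen_species = summarize(records)
```

    Keying on the specimen rather than the species is the point of the sketch: it keeps sequences from the same individual together, which is what allows species coherence to be tested later.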

    BioCloud Search EnGene: Surfing Biological Data on the Cloud

    The massive production and spread of biomedical data around the web introduces new challenges in identifying computational approaches that provide quality search and browsing of web resources. This paper presents BioCloud Search EnGene (BSE), a cloud application that facilitates searching and integration of the many layers of biological information offered by public large-scale genomic repositories. Grounded in the concept of dataspace, BSE is built on top of a cloud platform that severely curtails issues associated with scalability and performance. Like popular online gene portals, BSE adopts a gene-centric approach: researchers can find their information of interest by means of a simple “Google-like” query interface that accepts standard gene identifiers as keywords. We present the BSE architecture and functionality and discuss how our strategies contribute to successfully tackling big data problems in querying gene-based web resources. BSE is publicly available at: http://biocloud-unica.appspot.com/

    Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes

    BACKGROUND: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is embedded in the approach. RESULTS: We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. CONCLUSIONS: The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in a few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists and regular updating and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence to reduce the unfortunate spreading of errors through centralized data libraries).

    Biodiversity databanks and scientific exploration

    For several decades now, biologists have been developing digital databanks, which are remarkable scientific instruments allowing scientists to accelerate the development of biological knowledge. From the beginnings of the Human Genome Project (HGP) onwards, genetic databanks have been a major component of current biological knowledge, and biodiversity databanks have also been developed in the wake of the HGP. The purpose of this paper is to identify the specific features of biodiversity data and databanks, and to point out their contribution to biodiversity knowledge

    Integration of Biological Sources: Exploring the Case of Protein Homology

    Data integration is a key issue in the domain of bioinformatics, which deals with huge amounts of heterogeneous biological data that grow and change rapidly. This paper serves as an introduction to the field of bioinformatics and the biological concepts it deals with, and as an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioinformatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Uncertain databases are able to contain several possible worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration.

    Viral communities associated with healthy and bleaching corals

    The coral holobiont is the integrated assemblage of the coral animal, its symbiotic algae, protists, fungi and a diverse consortium of Bacteria and Archaea. Corals are a model system for the study of symbiosis, the breakdown of which can result in disease and mortality. Little is known, however, about viruses that infect corals and their symbionts. Here we present metagenomic analyses of the viral communities associated with healthy and partially bleached specimens of the Caribbean reef-building coral Diploria strigosa. Surprisingly, herpes-like sequences accounted for 4–8% of the total sequences in each metagenome; this abundance of herpes-like sequences is unprecedented in other marine viral metagenomes. Viruses similar to those that infect algae and plants were also present in the coral viral assemblage. Among the phage identified, cyanophages were abundant in both healthy and bleaching corals and vibriophages were also present. Therefore, coral-associated viruses could potentially infect all components of the holobiont – coral, algal and microbial. Thus, we expect viruses to figure prominently in the preservation and breakdown of coral health

    Screening of transgenic proteins expressed in transgenic food crops for the presence of short amino acid sequences identical to potential, IgE-binding linear epitopes of allergens

    BACKGROUND: Transgenic proteins expressed by genetically modified food crops are evaluated for their potential allergenic properties prior to marketing, among other things by identifying short identical amino acid sequences that occur in both the transgenic protein and allergenic proteins. A strategy is proposed in which the positive outcomes of the sequence comparison with a minimal length of six amino acids are further screened for the presence of potential linear IgE-epitopes. This double-track approach involves the use of literature data on IgE-epitopes and an antigenicity prediction algorithm. RESULTS: Thirty-three transgenic proteins have been screened for identities of at least six contiguous amino acids shared with allergenic proteins. Twenty-two transgenic proteins showed positive results, with identical stretches six or seven contiguous amino acids long. Only a limited number of identical stretches shared by transgenic proteins (papaya ringspot virus coat protein, acetolactate synthase GH50, and glyphosate oxidoreductase) and allergenic proteins could be identified as (part of) potential linear epitopes. CONCLUSION: Many transgenic proteins have identical stretches of six or seven amino acids in common with allergenic proteins. Most identical stretches are likely to be false positives. As shown in this study, identical stretches can be further screened for relevance by comparison with linear IgE-binding epitopes described in the literature. In the absence of literature data on epitopes, computer-based antigenicity prediction helps to select potential antibody binding sites, which will need verification of IgE binding by sera binding tests. Finally, the positive outcomes of this approach warrant further clinical testing for potential allergenicity.
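    The first screening step described in this abstract, finding stretches of at least six contiguous identical amino acids shared with allergenic proteins, can be illustrated with a minimal sketch. It assumes exact matching over a sliding window of length `k`; the function name `shared_stretches` and the toy sequences are hypothetical, and this is not the authors' pipeline. The later epitope and antigenicity screening steps are not covered.

```python
from collections import defaultdict


def shared_stretches(query: str, allergens: dict[str, str], k: int = 6):
    """Return (position in query, k-mer, allergen name) for every exact shared window."""
    # Index every k-residue window of each allergen sequence.
    index = defaultdict(set)
    for name, seq in allergens.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    # Slide the same window over the query protein and look each window up.
    hits = []
    for i in range(len(query) - k + 1):
        window = query[i:i + k]
        for name in sorted(index.get(window, ())):
            hits.append((i, window, name))
    return hits


# Made-up toy sequences, for illustration only (not real proteins).
toy_allergens = {"allergen_A": "MKTLLVAGGSDE", "allergen_B": "PQRSTVWYACDE"}
toy_query = "AAATLLVAGKKK"

hits = shared_stretches(toy_query, toy_allergens)
```

    Indexing the allergen windows once keeps each query lookup cheap, which matters when many transgenic proteins are compared against a large allergen set. As the abstract notes, most such hits are likely to be false positives, so the output is only a candidate list for further screening.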

    Pharmacognostic Investigation of the Leaves of Mentha cordifolia and Its DNA Fingerprints

    Objective: To perform a microscopic investigation of the Mentha cordifolia leaf and determine the DNA fingerprint of the plant. Methods: Tissues of M. cordifolia, M. arvensis var. piperascens and M. spicata were investigated under a light microscope. A DNA marker for M. cordifolia was established by amplification of the internal transcribed spacer (ITS) of the nuclear ribosomal RNA gene using conserved plant sequences as primers, followed by fragmentation with the restriction enzyme Bsm1. Results: The results showed the presence of uniseriate epidermal cells covered by a fine cuticle layer, glandular trichomes of multicellular type, capitate and peltate, and non-glandular trichomes. The fragmentation pattern of the ITS of the nuclear ribosomal RNA gene was applied as a DNA marker for discrimination of M. cordifolia from M. arvensis var. piperascens and M. spicata. Conclusion: Pharmacognostic investigation of the M. cordifolia leaf exhibited the characteristics of Mentha species, and the plant is distinguishable from M. arvensis var. piperascens and M. spicata by its DNA fingerprint pattern. Keywords: Mentha cordifolia, microscopic investigation, DNA marker, internal transcribed spacer