36 research outputs found

    Automatic categorization of diverse experimental information in the bioscience literature

    Background: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is thus usually time-consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers, based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance. Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction. Conclusions: Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. The system is completely automatic and thus can be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.

    Analysis of 14 BAC sequences from the Aedes aegypti genome: a benchmark for genome annotation and assembly

    In order to provide a set of manually curated and annotated sequences from the Aedes aegypti genome, mapped BAC clones encompassing 1.57 Mb were sequenced, assembled and manually annotated using computational gene-finding, EST matches, and comparative protein homology.

    The Evolution of the Anopheles 16 Genomes Project

    We report the imminent completion of a set of reference genome assemblies for 16 species of Anopheles mosquitoes. In addition to providing a generally useful resource for comparative genomic analyses, these genome sequences will greatly facilitate exploration of the capacity exhibited by some Anopheline mosquito species to serve as vectors for malaria parasites. A community analysis project will commence soon to perform a thorough comparative genomic investigation of these newly sequenced genomes. Completing this project with short next-generation sequence reads required innovation in both the bioinformatic and laboratory realms, and the knowledge gained could prove useful for genome sequencing projects targeting other unconventional genomes.

    Sizes of Long RNA Molecules Are Determined by the Branching Patterns of Their Secondary Structures

    Long RNA molecules are at the core of gene regulation across all kingdoms of life, whilst also serving as genomes in RNA viruses. Few studies have addressed the basic physical properties of long single-stranded RNAs. Long RNAs with non-repeating sequences usually adopt highly ramified secondary structures and are better described as branched polymers. In order to test whether a branched polymer model can estimate the overall sizes of large RNAs, we employed fluorescence correlation spectroscopy to examine the hydrodynamic radii of a broad spectrum of biologically important RNAs, ranging from viral genomes to long non-coding regulatory RNAs. The relative sizes of long RNAs measured at low ionic strength correspond well to those predicted by two theoretical approaches that treat the effective branching associated with secondary structure formation – one employing the Kramers theorem for calculating radii of gyration, and the other featuring the metric of “maximum ladder distance”. Upon addition of multivalent cations, most RNAs are found to be compacted as compared with their original, low-ionic-strength sizes. These results suggest that sizes of long RNA molecules are determined by the branching pattern of their secondary structures. They also experimentally validate the proposed computational approaches for estimating hydrodynamic radii of single-stranded RNAs, which use generic RNA structure prediction tools and thus can be universally applied to a wide range of long RNAs.

    VectorBase: a home for invertebrate vectors of human pathogens

    VectorBase is a web-accessible data repository for information about invertebrate vectors of human pathogens. VectorBase annotates and maintains vector genomes, providing an integrated resource for the research community. Currently, VectorBase contains genome information for two organisms: Anopheles gambiae, a vector for the Plasmodium protozoan agent causing malaria, and Aedes aegypti, a vector for the flaviviral agents causing yellow fever and dengue fever.

    Role of RNA Branchedness in the Competition for Viral Capsid Proteins

    To optimize bindingand packagingby their capsid proteins (CP), single-stranded (ss) RNA viral genomes often have local secondary/tertiary structures with high CP affinity, with these “packaging signals” serving as heterogeneous nucleation sites for the formation of capsids. Under typical <i>in vitro</i> self-assembly conditions, however, and in particular for the case of many ssRNA viruses whose CP have cationic N-termini, the adsorption of CP by RNA is nonspecific because the CP concentration exceeds the largest dissociation constant for CP–RNA binding. Consequently, the RNA is saturated by bound protein before lateral interactions between CP drive the homogeneous nucleation of capsids. But, before capsids are formed, the binding of protein remains reversible and introduction of another RNA specieswith a different length and/or sequenceis found experimentally to result in significant redistribution of protein. Here we argue that, for a given RNA mass, the sequence with the highest affinity for protein is the one with the most compact secondary structure arising from self-complementarity; similarly, a long RNA steals protein from an equal mass of shorter ones. In both cases, it is the lateral attractions between bound proteins that determines the relative CP affinities of the RNA templates, even though the individual binding sites are identical. We demonstrate this with Monte Carlo simulations, generalizing the Rosenbluth method for excluded-volume polymers to include branching of the polymers and their reversible binding by protein

    Characterization of Viral Capsid Protein Self-Assembly around Short Single-Stranded RNA

    For many viruses, the packaging of a single-stranded RNA (ss-RNA) genome is spontaneous, driven by capsid protein–capsid protein (CP) and CP–RNA interactions. Furthermore, for some multipartite ss-RNA viruses, copackaging of two or more RNA molecules is a common strategy. Here we focus on RNA copackaging in vitro by using cowpea chlorotic mottle virus (CCMV) CP and an RNA molecule that is short (500 nucleotides (nts)) compared to the lengths (≈3000 nts) packaged in wild-type virions. We show that the degree of cooperativity of virus assembly depends not only on the relative strength of the CP–CP and CP–RNA interactions but also on the RNA being short: a 500-nt RNA molecule cannot form a capsid by itself, so its packaging requires the aggregation of multiple CP–RNA complexes. By using fluorescence correlation spectroscopy (FCS), we show that at neutral pH and sufficiently low concentrations RNA and CP form complexes that are smaller than the wild-type capsid and that four 500-nt RNAs are packaged into virus-like particles (VLPs) only upon lowering the pH. Further, a variety of bulk-solution techniques confirm that fully ordered VLPs are formed only upon acidification. On the basis of these results, we argue that the observed high degree of cooperativity involves equilibrium between multiple CP/RNA complexes.