Automatic categorization of diverse experimental information in the bioscience literature
Background:
Curation of information from the bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. A critical first step in the biocuration process is to identify, from all published literature, the papers that contain results for the specific data type a curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is therefore usually time-consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers that contain these curation data types within a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been used in production for automatic categorization of ten different experimental data types in the biocuration process at WormBase for the past two years, and it is being adopted in the biocuration processes at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure that uses training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach is significant because, for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
Results:
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types: RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Conclusions:
Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and can thus be readily incorporated into the workflows of different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
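The SVM-based paper classification described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, kernel, or training procedure, so the bag-of-words features, the linear SVM without a bias term trained by sub-gradient descent on the hinge loss, the stopword list, and the four toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical stopword list; the abstract does not describe preprocessing.
STOPWORDS = {"the", "of", "in", "was", "a", "we"}

def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stopword tokens."""
    return [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]

def featurize(text, vocab):
    """L2-normalized bag-of-words vector over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=50):
    """Sub-gradient descent on the L2-regularized hinge loss (labels +1/-1).

    Simplification: no bias term and a constant step size.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # Regularization step: shrink the weights slightly.
            w = [wj * (1.0 - eta * lam) for wj in w]
            # Hinge-loss step: only examples inside the margin contribute.
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

# Made-up training papers: +1 = contains RNAi data, -1 = does not.
train_docs = [
    ("rnai knockdown of the gene caused embryonic lethality", 1),
    ("feeding rnai depleted the protein in the germline", 1),
    ("the crystal structure of the enzyme was solved", -1),
    ("phylogenetic analysis of ribosomal sequences", -1),
]

vocab = sorted({tok for text, _ in train_docs for tok in tokenize(text)})
X = [featurize(text, vocab) for text, _ in train_docs]
y = [label for _, label in train_docs]
w = train_linear_svm(X, y)

def classify(text):
    """Return +1 if the text is predicted to contain the data type, else -1."""
    score = sum(wj * xj for wj, xj in zip(w, featurize(text, vocab)))
    return 1 if score > 0 else -1
```

In a setting like the one the abstract describes, one such binary classifier would presumably be trained per data type, with full paper text and far larger training sets than this toy example.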
UIMA in the Biocuration Workflow: A coherent framework for cooperation between biologists and computational linguists
As collaborating partners, Barcelona Media Innovation Centre and GRIB (Universitat Pompeu Fabra) seek to combine strengths from Computational Linguistics and Biomedicine to produce a robust Text Mining system to generate data that will help biocurators in their daily work. The first version of this system will focus on the discovery of relationships between genes, SNPs (Single Nucleotide Polymorphisms) and diseases from the literature.

A first challenge we faced during the setup of this project is that most current tools supporting the curation workflow are complex, ad hoc applications, which can make interoperability and the sharing of results difficult between research groups from different and unrelated fields of expertise. Often, biologists (even computer-savvy ones) are hard pressed to use and adapt sophisticated Natural Language Processing systems, and computational linguists are challenged by the intricacies of biology when applying their processing pipelines to elicit knowledge from texts. The flow of knowledge (needed to develop a usable, practical tool) to and from the parties involved in the development of such systems is not always easy or straightforward.

The modular and versatile architecture of UIMA (Unstructured Information Management Architecture) provides a framework to address these challenges. UIMA is a component architecture and software framework implementation (including a UIMA SDK) for developing applications that analyse large volumes of unstructured information, and it has been increasingly adopted by a significant part of the BioNLP community that needs robust, industrial-grade applications to exploit the whole bibliome. Using UIMA to develop Text Mining applications for curation purposes allows the combination of diverse expertise that is beyond the individual know-how of biologists, computer scientists or linguists in isolation. Good synergy and circulation of knowledge between these experts is fundamental to the development of a successful curation tool.
The future of annotation/biocuration
Talk given at 3rd International Biocuration Conference. Berlin 200
Development of GCP Ontology for sharing crop information
Poster presented at 3rd International Biocuration Conference. Berlin (Germany), 17 Apr 200
Integrative biological simulation praxis: Considerations from physics, philosophy, and data/model curation practices
Integrative biological simulations have a varied and controversial history in
the biological sciences. From computational models of organelles, cells, and
simple organisms, to physiological models of tissues, organ systems, and
ecosystems, a diverse array of biological systems have been the target of
large-scale computational modeling efforts. Nonetheless, these research agendas
have yet to prove decisively their value among the broader community of
theoretical and experimental biologists. In this commentary, we examine a range
of philosophical and practical issues relevant to understanding the potential
of integrative simulations. We discuss the role of theory and modeling in
different areas of physics and suggest that certain sub-disciplines of physics
provide useful cultural analogies for imagining the future role of simulations
in biological research. We examine philosophical issues related to modeling
which consistently arise in discussions about integrative simulations and
suggest a pragmatic viewpoint that balances a belief in philosophy with the
recognition of the relative infancy of our state of philosophical
understanding. Finally, we discuss community workflow and publication practices
to allow research to be readily discoverable and amenable to incorporation into
simulations. We argue that incentives are aligned for the widespread adoption
of practices which will advance both the needs of integrative simulation
efforts and other contemporary trends in the biological sciences,
ranging from open science and data sharing to improving reproducibility.
Comment: 10 page
Systematic analysis of primary sequence domain segments for the discrimination between class C GPCR subtypes
G-protein-coupled receptors (GPCRs) are a large and diverse super-family of eukaryotic cell membrane proteins that play an important physiological role as transmitters of extracellular signals. In this paper, we investigate Class C, a member of this super-family that has attracted much attention in pharmacology. Because knowledge of the complete 3D crystal structure of Class C receptors is limited, their primary amino acid sequences must be used for analytical purposes. Here, we provide a systematic analysis of distinct receptor sequence segments, grouped according to their topological location in the extracellular, transmembrane, or intracellular domains, with regard to their ability to differentiate between seven class C GPCR subtypes. We build on results from previous research that provided preliminary evidence of the potential use of separated domains of complete class C GPCR sequences as the basis for subtype classification. The use of the extracellular N-terminus domain alone was shown to result in only a minor decrease in subtype discrimination in comparison with the complete sequence, despite discarding much of the sequence information. In this paper, we describe the use of Support Vector Machine-based classification models to evaluate the subtype-discriminating capacity of the specific topological sequence segments.
Peer Reviewed
Postprint (author's final draft)
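The idea of evaluating sequence segments by topological location can be sketched as a feature-extraction step that would precede the SVM models. This is a hedged illustration only: the abstract does not say how segments are encoded, so the amino acid composition features, the helper names `split_by_topology` and `composition`, and the toy sequence with its domain boundaries are all hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def split_by_topology(sequence, regions):
    """Concatenate residues per topological location.

    regions: list of (start, end, location) with 0-based, end-exclusive
    indices; a location may appear several times (e.g. multiple helices).
    """
    segments = {}
    for start, end, location in regions:
        segments[location] = segments.get(location, "") + sequence[start:end]
    return segments

def composition(segment):
    """Relative frequency of each standard amino acid in a segment."""
    counts = Counter(segment)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

# Made-up sequence and boundaries, not taken from any real receptor.
toy_sequence = "MKTLLAGSWVLIVAFLLGRKRSE"
toy_regions = [
    (0, 8, "extracellular"),
    (8, 17, "transmembrane"),
    (17, 23, "intracellular"),
]

segments = split_by_topology(toy_sequence, toy_regions)
# One 20-dimensional feature vector per topological location.
features = {loc: composition(seg) for loc, seg in segments.items()}
```

Each per-location vector could then be passed to a separate classifier, so that the subtype-discriminating capacity of the extracellular, transmembrane, and intracellular segments can be compared against that of the complete sequence.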