9 research outputs found

    Facilitating the development of controlled vocabularies for metabolomics technologies with text mining

    BACKGROUND: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and non-trivial to construct these resources manually. RESULTS: We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms, respectively. A total of 5,699 and 2,612 new terms, respectively, were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections), rather than paper abstracts, are the major source of technology-specific terms. CONCLUSIONS: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.

    Ontologies as bridges between data sources and user queries: the KNOWMAK project experience

    This paper describes ongoing work in the KNOWMAK project, which aims to develop a web-based tool providing interactive visualisations and state-of-the-art indicators on knowledge co-creation in the European research area. One of the main novel developments in this work is the use of ontologies to act as a bridge between the data sources (research projects, patents and publications) and user queries. This addresses the problem of mapping between heterogeneous data sources with different vocabularies while still maintaining the level of standardization necessary for summarising the information required to provide informative views of the highly dynamic S&T landscape.

    EnvMine: A text-mining system for the automatic extraction of contextual information

    BACKGROUND: For ecological studies, it is crucial to have adequate descriptions of the environments and samples being studied. Such a description must be given in terms of their physicochemical characteristics, allowing a direct comparison between different environments that would otherwise be difficult. The characterization must also include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently, there is no schema for annotating these environmental features, and these data have to be extracted from textual sources (published articles). So far, this has had to be performed by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieving contextual information (physicochemical variables and geographical locations) from textual sources of any kind. RESULTS: EnvMine is capable of retrieving the physicochemical variables cited in the text by accurately identifying their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved) of 92% with less than 1% error. A Bayesian classifier was also tested for distinguishing parts of the text describing environmental characteristics from others dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location also includes the determination of its exact coordinates (latitude and longitude), thus allowing the calculation of the distance between individual locations. CONCLUSION: EnvMine is a very efficient method for extracting contextual information from different text sources, such as published articles or web pages. This tool can help in determining the precise location and physicochemical variables of sampling sites, thus facilitating ecological analyses. EnvMine can also help in the development of standards for the annotation of environmental features.

    Bridging the food security gap: an information-led approach to connect dietary nutrition, food composition and crop production

    © 2019 Society of Chemical Industry. BACKGROUND: Food security is recognized as a major global challenge, yet human food-chain systems are inherently not geared towards nutrition, with decisions on crop and cultivar choice not informed by dietary composition. Currently, food compositional tables and databases (FCT/FCDB) are the primary information sources for decisions relating to dietary intake. However, these only present single mean values representing major components. Establishment of a systematic controlled vocabulary to fill this gap requires representation of a more complex set of semantic relationships between terms used to describe nutritional composition and dietary function. RESULTS: We carried out a survey of 11 FCT/FCDB and 177 peer-reviewed papers describing variation in nutritional composition and dietary function for food crops to identify a comprehensive set of terms to construct a controlled vocabulary. We used this information to generate a Crop Dietary Nutrition Data Framework (CDN-DF), which incorporates controlled vocabularies systematically organized into major classes representing nutritional components and dietary functions. We demonstrate the value of the CDN-DF for comparison of equivalent components between crop species or cultivars, for identifying data gaps and potential for formal meta-analysis. The CDN-DF also enabled us to explore relationships between nutritional components and the functional attributes of food. CONCLUSION: We have generated a structured crop dietary nutrition data framework, which is generally applicable to the collation and comparison of data relevant to crop researchers, breeders, and other stakeholders, and will facilitate dialogue with nutritionists. It is currently guiding the establishment of a more robust formal ontology.

    Facilitating the development of controlled vocabularies for metabolomics technologies with text mining-3

    Copyright information: Taken from "Facilitating the development of controlled vocabularies for metabolomics technologies with text mining" (http://www.biomedcentral.com/1471-2105/9/S5/S5). BMC Bioinformatics 2008;9(Suppl 5):S5-S5. Published online 29 Apr 2008. PMCID: PMC2367623.

    Facilitating the development of controlled vocabularies for metabolomics technologies with text mining-2

    Copyright information: Taken from "Facilitating the development of controlled vocabularies for metabolomics technologies with text mining" (http://www.biomedcentral.com/1471-2105/9/S5/S5). BMC Bioinformatics 2008;9(Suppl 5):S5-S5. Published online 29 Apr 2008. PMCID: PMC2367623.

    Facilitating the development of controlled vocabularies for metabolomics technologies with text mining-4

    Copyright information: Taken from "Facilitating the development of controlled vocabularies for metabolomics technologies with text mining" (http://www.biomedcentral.com/1471-2105/9/S5/S5). BMC Bioinformatics 2008;9(Suppl 5):S5-S5. Published online 29 Apr 2008. PMCID: PMC2367623.

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies (controlled, hierarchical vocabularies) are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods that can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high-quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child-ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because they contained a specific term or because of the automatic classification. The Go3R search engine is available online at www.Go3R.org.