5,437 research outputs found

    Knowledge Rich Natural Language Queries over Structured Biological Databases

    Increasingly, keyword, natural language and NoSQL queries are being used for information retrieval from traditional as well as non-traditional databases such as web, document, image, GIS, legal, and health databases. While their popularity is undeniable for obvious reasons, their engineering is far from simple. For the most part, semantics- and intent-preserving mapping of a well-understood natural language query expressed over a structured database schema to a structured query language is still a difficult task, and research to tame the complexity is intense. In this paper, we propose a multi-level knowledge-based middleware to facilitate such mappings that separates the conceptual level from the physical level. We augment these multi-level abstractions with a concept reasoner and a query strategy engine to dynamically link arbitrary natural language querying to well-defined structured queries. We demonstrate the feasibility of our approach by presenting a Datalog-based prototype system, called BioSmart, that can compute responses to arbitrary natural language queries over arbitrary databases once a syntactic classification of the natural language query is made.

    Improved ontology for eukaryotic single-exon coding sequences in biological databases

    Indexed in: Scopus. Efficient extraction of knowledge from biological data requires the development of structured vocabularies to unambiguously define biological terms. This paper proposes descriptions and definitions to disambiguate the term 'single-exon gene'. Eukaryotic Single-Exon Genes (SEGs) have been defined as genes that do not have introns in their protein coding sequences. They have been studied not only to determine their origin and evolution but also because their expression has been linked to several types of human cancer and neurological/developmental disorders, and many exhibit tissue-specific transcription. Unfortunately, the term 'SEGs' is rife with ambiguity, leading to biological misinterpretations. In the classic definition, no distinction is made between SEGs that harbor introns in their untranslated regions (UTRs) and those that do not. This distinction is important because the presence of introns in UTRs affects transcriptional regulation and post-transcriptional processing of the mRNA. In addition, recent whole-transcriptome shotgun sequencing has led to the discovery of many examples of single-exon mRNAs that arise from alternative splicing of multi-exon genes; these single-exon isoforms are being confused with SEGs despite their clearly different origin. The increasing expansion of RNA-seq datasets makes it imperative to distinguish the different SEG types before annotation errors become indelibly propagated in biological databases. This paper develops a structured vocabulary for their disambiguation, allowing a major reassessment of their evolutionary trajectories, regulation, RNA processing and transport, and provides the opportunity to improve the detection of gene associations with disorders including cancers, neurological and developmental diseases. © The Author(s) 2018. Published by Oxford University Press. https://academic.oup.com/database/article/doi/10.1093/database/bay089/509943

    Distribution of biological databases over low-bandwidth networks

    Databases are an integral part of bioinformatics and need to be accessed frequently, so downloading and updating them on a regular basis is critical. Establishing a bioinformatics research facility is a challenge for developing countries, as they suffer from inherently low-bandwidth and unreliable internet connections. Therefore, identifying techniques that support download and automatic synchronization of large biological databases at low bandwidth is of utmost importance. In the current study, two protocols (FTP and BitTorrent) were evaluated, and the utility of a BitTorrent-based peer-to-peer (btP2P) file distribution model for automatic synchronization and distribution of large datasets at our facility in Pakistan is discussed.

    Creating NoSQL Biological Databases with Ontologies for Query Relaxation

    The complexity of building biological databases is well known, and ontologies play an extremely important role in biological databases. However, much of the emphasis on the role of ontologies in biological databases has been on the construction of databases. In this paper, we explore a somewhat overlooked aspect of ontologies in biological databases, namely, how ontologies can be used to assist database retrieval. In particular, we show how ontologies can be used to revise user-submitted queries for query relaxation. In addition, since our research is conducted in today's “big data” era, our investigation is centered on NoSQL databases, which serve as “representatives” of big data. This paper contains two major parts: first we describe our methodology for building two NoSQL application databases (MongoDB and AllegroGraph) using the GO ontology, and then we discuss how to achieve query relaxation through the GO ontology. We report our experiments and show sample queries and results. Our research on query relaxation on NoSQL databases is complementary to existing work in big data and in biological databases and deserves further exploration.
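
    As a rough illustration of the query relaxation idea described in this abstract, the sketch below widens a query to the parent term of an is-a hierarchy when the original term returns nothing. It is not the authors' implementation: the term names, the `PARENT` and `RECORDS` tables, and both functions are hypothetical stand-ins, and plain Python dictionaries are used in place of MongoDB or AllegroGraph.

```python
# Toy is-a hierarchy: child term -> parent term.
# These are hypothetical labels, not real GO identifiers.
PARENT = {
    "term:kinase_activity": "term:catalytic_activity",
    "term:phosphatase_activity": "term:catalytic_activity",
    "term:catalytic_activity": "term:molecular_function",
}

# Toy annotation store: record id -> annotated term (assumed data).
RECORDS = {
    "geneA": "term:phosphatase_activity",
    "geneB": "term:molecular_function",
}


def descendants(term):
    """Return the term itself plus every term below it in the hierarchy."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def relaxed_query(term):
    """Look up records under `term`; on an empty result, retry one level up."""
    current = term
    while current is not None:
        scope = descendants(current)
        hits = sorted(rid for rid, t in RECORDS.items() if t in scope)
        if hits:
            return current, hits
        current = PARENT.get(current)  # relax: move to the parent term
    return None, []
```

    Calling `relaxed_query("term:kinase_activity")` finds no direct match, so the search is repeated with the parent term and its descendants; the function returns the term that finally matched together with the matching record ids, so a caller can tell how far the query was relaxed.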

    Publishing Interactive Articles: Integrating Journals And Biological Databases

    In collaboration with the journal GENETICS, we've developed and launched a pipeline by which interactive full-text HTML/PDF journal articles are published with named entities linked to corresponding resource pages in WormBase (WB; http://www.wormbase.org/). Our interactive articles allow a reader to click on over ten different data type objects (gene, protein, transgene, etc.) and be directed to the relevant webpage. This seamless connection from the article to summaries of data types promotes a deeper level of understanding for the naïve reader, and incisive evaluation for the sophisticated reader. Further, this collaboration allows us to identify and collect information before the publication of the article. The pipeline uses automated recognition scripts to identify entities that already exist in the database, and a self-reporting form we created at WB, sent to the author by GENETICS, for submitting entities that do not already exist in our database. We include a manual quality control step to make sure ambiguous links are corrected, and that all new entities have been reported and linked properly. The automated entity recognition scripts allow us to potentially link any object found in a database as well as to expand this pipeline to other databases. We have already adapted this pipeline for linking Saccharomyces cerevisiae GENETICS articles to the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) and are currently expanding this pipeline for linking genes in Drosophila articles to FlyBase (http://flybase.org/). By integrating journals and databases, we are integrating the major modes of communication in the biological sciences, which will undoubtedly increase the pace of discovery.

    Multi-Faceted Search and Navigation of Biological Databases

    bdbms -- A Database Management System for Biological Data

    Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) Annotation and provenance management, including storage, indexing, manipulation, and querying of annotation and provenance as first-class objects in bdbms, (2) Local dependency tracking to track the dependencies and derivations among data items, (3) Update authorization to support data curation via content-based authorization, in contrast to identity-based authorization, and (4) New access methods and their supporting operators that support pattern matching on various types of compressed biological data types. This paper presents the design of bdbms along with the techniques proposed to support these functionalities, including an extension to SQL. We also outline some open issues in building bdbms. Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, US.

    Toward an interactive article: integrating journals and biological databases.

    BACKGROUND: Journal articles and databases are two major modes of communication in the biological sciences, and thus integrating these critical resources is of urgent importance to increase the pace of discovery. Projects focused on bridging the gap between journals and databases have been on the rise over the last five years and have resulted in the development of automated tools that can recognize entities within a document and link those entities to a relevant database. Unfortunately, automated tools cannot resolve ambiguities that arise from one term being used to signify entities that are quite distinct from one another. Instead, resolving these ambiguities requires some manual oversight. Finding the right balance between the speed and portability of automation and the accuracy and flexibility of manual effort is a crucial goal to making text markup a successful venture. RESULTS: We have established a journal article mark-up pipeline that links GENETICS journal articles and the model organism database (MOD) WormBase. This pipeline uses a lexicon built with entities from the database as a first step. The entity markup pipeline results in links from over nine classes of objects including genes, proteins, alleles, phenotypes and anatomical terms. New entities and ambiguities are discovered and resolved by a database curator through a manual quality control (QC) step, along with help from authors via a web form that is provided to them by the journal. New entities discovered through this pipeline are immediately sent to an appropriate curator at the database. Ambiguous entities that do not automatically resolve to one link are resolved by hand, ensuring an accurate link. This pipeline has been extended to other databases, namely Saccharomyces Genome Database (SGD) and FlyBase, and has been implemented in marking up a paper with links to multiple databases. CONCLUSIONS: Our semi-automated pipeline hyperlinks articles published in GENETICS to model organism databases such as WormBase. Our pipeline results in interactive articles that are data rich with high accuracy. The use of a manual quality control step sets this pipeline apart from other hyperlinking tools and results in benefits to authors, journals, readers and databases. RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.
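
    The lexicon-first markup step and the hand-off of ambiguous terms to a curator, as described in this abstract, might be sketched as follows. This is an assumption-laden toy and not the pipeline's actual code: the `LEXICON` contents, the example.org URLs, the choice of which term is ambiguous, and the `mark_up` function are all hypothetical.

```python
import re

# Hypothetical lexicon: surface term -> list of (entity class, database URL).
# A term with more than one candidate is treated as ambiguous.
LEXICON = {
    "unc-22": [("gene", "https://example.org/gene/unc-22")],
    "pharynx": [("anatomy", "https://example.org/anatomy/pharynx")],
    "daf-2": [
        ("gene", "https://example.org/gene/daf-2"),
        ("protein", "https://example.org/protein/daf-2"),
    ],
}


def mark_up(text):
    """Split lexicon matches into automatic links and items for manual QC."""
    linked, needs_review = [], []
    for term, candidates in LEXICON.items():
        # Match the term only when it is not part of a longer word or identifier.
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        for match in re.finditer(pattern, text):
            if len(candidates) == 1:
                linked.append((match.start(), term, candidates[0][1]))
            else:
                needs_review.append((match.start(), term))
    # Sort by position in the text so results follow reading order.
    return sorted(linked), sorted(needs_review)
```

    Terms that resolve to exactly one database entry are linked automatically, while anything with several candidates is returned separately, mirroring the manual quality control step that the abstract credits for the pipeline's accuracy.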