115 research outputs found

    Literature classification for semi-automated updating of biological knowledgebases

    Get PDF
    BACKGROUND: As the output of biological assays increase in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. RESULTS: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. CONCLUSION: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases

    Chi-square-based scoring function for categorization of MEDLINE citations

    Full text link
    Objectives: Text categorization has been used in biomedical informatics for identifying documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood of MEDLINE citations containing genetic relevant topic. Methods: Our procedure requires construction of a genetic and a nongenetic domain document corpus. We used MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared frequencies of MeSH descriptors between two corpora applying chi-square test. A MeSH descriptor was considered to be a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all the citations, with the highest score given to those citations containing MeSH descriptors typical for the genetic domain. Results: Validation was done on a set of 734 manually annotated MEDLINE citations. It achieved predictive accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine learning algorithms (support vector machines, decision trees, na\"ive Bayes). Although the differences were not statistically significantly different, results showed that our chi-square scoring performs as good as compared machine learning algorithms. Conclusions: We suggest that the chi-square scoring is an effective solution to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for gene symbol disambiguation process.Comment: 34 pages, 2 figure

    Classification of the Universe of Immune Epitope Literature: Representation and Knowledge Gaps

    Get PDF
    A significant fraction of the more than 18 million scientific articles currently indexed in the PubMed database are related to immune responses to various agents, including infectious microbes, autoantigens, allergens, transplants, cancer antigens and others. The Immune Epitope Database (IEDB) is an online repository that catalogs immune epitope reactivity data derived from articles listed in the National Library of Medicine PubMed database. The IEDB is maintained and continually updated by monitoring PubMed for new, potentially relevant references.Herein we detail the classification of all epitope-specific literature in over 100 different immunological domains representing Infectious Diseases and Microbes, Autoimmunity, Allergy, Transplantation and Cancer. The relative number of references in each category reflects past and present areas of research on immune reactivities. In addition to describing the overall landscape of data distribution, this particular characterization of the epitope reference data also allows for the exploration of possible correlations with global disease morbidity and mortality data.While in most cases diseases associated with high morbidity and mortality rates were amongst the most studied, a number of high impact diseases such as dengue, Schistosoma, HSV-2, B. pertussis and Chlamydia trachoma, were found to have very little coverage. The data analyzed in this fashion represents the first estimate of how reported immunological data corresponds to disease-related morbidity and mortality, and confirms significant discrepancies in the overall research foci versus disease burden, thus identifying important gaps to be pursued by future research. These findings may also provide a justification for redirecting a portion of research funds into some of the underfunded, critical disease areas

    MScanner: a classifier for retrieving Medline citations

    Get PDF
    BACKGROUND: Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains. RESULTS: MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92. CONCLUSION: MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu

    BioReader: a text mining tool for performing classification of biomedical literature

    Get PDF
    Abstract Background Scientific data and research results are being published at an unprecedented rate. Many database curators and researchers utilize data and information from the primary literature to populate databases, form hypotheses, or as the basis for analyses or validation of results. These efforts largely rely on manual literature surveys for collection of these data, and while querying the vast amounts of literature using keywords is enabled by repositories such as PubMed, filtering relevant articles from such query results can be a non-trivial and highly time consuming task. Results We here present a tool that enables users to perform classification of scientific literature by text mining-based classification of article abstracts. BioReader (Biomedical Research Article Distiller) is trained by uploading article corpora for two training categories - e.g. one positive and one negative for content of interest - as well as one corpus of abstracts to be classified and/or a search string to query PubMed for articles. The corpora are submitted as lists of PubMed IDs and the abstracts are automatically downloaded from PubMed, preprocessed, and the unclassified corpus is classified using the best performing classification algorithm out of ten implemented algorithms. Conclusion BioReader supports data and information collection by implementing text mining-based classification of primary biomedical literature in a web interface, thus enabling curators and researchers to take advantage of the vast amounts of data and information in the published literature. BioReader outperforms existing tools with similar functionalities and expands the features used for mining literature in database curation efforts. The tool is freely available as a web service at http://www.cbs.dtu.dk/services/BioReade

    Design and utilization of epitope-based databases and predictive tools

    Get PDF
    In the last decade, significant progress has been made in expanding the scope and depth of publicly available immunological databases and online analysis resources, which have become an integral part of the repertoire of tools available to the scientific community for basic and applied research. Herein, we present a general overview of different resources and databases currently available. Because of our association with the Immune Epitope Database and Analysis Resource, this resource is reviewed in more detail. Our review includes aspects such as the development of formal ontologies and the type and breadth of analytical tools available to predict epitopes and analyze immune epitope data. A common feature of immunological databases is the requirement to host large amounts of data extracted from disparate sources. Accordingly, we discuss and review processes to curate the immunological literature, as well as examples of how the curated data can be used to generate a meta-analysis of the epitope knowledge currently available for diseases of worldwide concern, such as influenza and malaria. Finally, we review the impact of immunological databases, by analyzing their usage and citations, and by categorizing the type of citations. Taken together, the results highlight the growing impact and utility of immunological databases for the scientific community

    Automated systems to identify relevant documents in product risk management

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Product risk management involves critical assessment of the risks and benefits of health products circulating in the market. One of the important sources of safety information is the primary literature, especially for newer products which regulatory authorities have relatively little experience with. Although the primary literature provides vast and diverse information, only a small proportion of which is useful for product risk assessment work. Hence, the aim of this study is to explore the possibility of using text mining to automate the identification of useful articles, which will reduce the time taken for literature search and hence improving work efficiency. In this study, term-frequency inverse document-frequency values were computed for predictors extracted from the titles and abstracts of articles related to three tumour necrosis factors-alpha blockers. A general automated system was developed using only general predictors and was tested for its generalizability using articles related to four other drug classes. Several specific automated systems were developed using both general and specific predictors and training sets of different sizes in order to determine the minimum number of articles required for developing such systems.</p> <p>Results</p> <p>The general automated system had an area under the curve value of 0.731 and was able to rank 34.6% and 46.2% of the total number of 'useful' articles among the first 10% and 20% of the articles presented to the evaluators when tested on the generalizability set. However, its use may be limited by the subjective definition of useful articles. For the specific automated system, it was found that only 20 articles were required to develop a specific automated system with a prediction performance (AUC 0.748) that was better than that of general automated system.</p> <p>Conclusions</p> <p>Specific automated systems can be developed rapidly and avoid problems caused by subjective definition of useful articles. Thus the efficiency of product risk management can be improved with the use of specific automated systems.</p

    Biomedical text mining applied to document retrieval and semantic indexing

    Get PDF
    In Biomedical research, the ability to retrieve the adequate information from the ever growing literature is an extremely important asset. This work provides an enhanced and general purpose approach to the process of document retrieval that enables the filtering of PubMed query results. The system is based on semantic indexing providing, for each set of retrieved documents, a network that links documents and relevant terms obtained by the annotation of biological entities (e.g. genes or proteins). This network provides distinct user perspectives and allows navigation over documents with similar terms and is also used to assess document relevance. A network learning procedure, based on previous work from e-mail spam filtering, is proposed, receiving as input a training set of manually classified documents

    Enhancing navigation in biomedical databases by community voting and database-driven text classification

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries and to efficiently retrieve them.</p> <p>Results</p> <p>Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly.</p> <p>Conclusion</p> <p>Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases.</p> <p>The system can be accessed at <url>http://pepbank.mgh.harvard.edu</url>.</p
    corecore