
    Improving the Performance of Text Categorization Models used for the Selection of High Quality Articles

    Objectives: Machine learning systems can considerably reduce the time and effort experts need to perform new systematic reviews (SRs). This study investigates whether categorization models trained on a combination of included and commonly excluded articles can improve performance in identifying high-quality articles for new procedure or drug SRs. Methods: Test collections were built from the annotated reference files of 19 procedure and 15 drug systematic reviews. Support vector machine classification models were trained on the evenly combined data of the other topics, excluding the target topic. Training on the combination of included and commonly excluded articles was compared with training on the combination of included and excluded articles, using accuracy as the measure of comparison. Results: On average, performance improved by about 15% on the procedure topics and 11% on the drug topics when the classification models were trained on the combination of included and commonly excluded articles. The system using included and commonly excluded articles outperformed the one using included and excluded articles on all of the procedure topics. Conclusions: Rigorous automatic article classification using machine learning can reduce experts' workload when they perform systematic reviews and topic-specific data are scarce. In particular, when the combinatio…
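
    As a concrete illustration of the leave-one-topic-out training scheme described above, here is a minimal sketch in Python with scikit-learn. The data layout (a `topics` mapping from topic name to texts and 0/1 inclusion labels) and the TF-IDF featurization are assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical sketch: train an SVM on the pooled articles of every topic
# except the target one, then measure accuracy on the held-out topic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def evaluate_topic(topics, target):
    """`topics`: dict mapping topic name -> (texts, labels), where
    label 1 = included article and 0 = commonly excluded article."""
    train_texts, train_labels = [], []
    for name, (texts, labels) in topics.items():
        if name != target:                     # hold out the desired topic
            train_texts.extend(texts)
            train_labels.extend(labels)

    vec = TfidfVectorizer(stop_words="english")
    clf = LinearSVC()
    clf.fit(vec.fit_transform(train_texts), train_labels)

    test_texts, test_labels = topics[target]
    preds = clf.predict(vec.transform(test_texts))
    return accuracy_score(test_labels, preds)  # accuracy, as in the study
```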

    Semantic Approaches for Knowledge Discovery and Retrieval in Biomedicine


    GeneRIF indexing: sentence selection based on machine learning


    MScanner: a classifier for retrieving Medline citations

    BACKGROUND: Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from the literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline to classify Medline records, so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains. RESULTS: MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs, and returns results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results and cross-validated performance evaluation of the relevant input against a random subset of Medline. We describe the classifier implementation, cross-validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross-validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of the relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92. CONCLUSION: MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open-source code used to obtain the results in this paper are available online and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu
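
    The abstract does not spell out the classifier internals, so the following is a hedged sketch of how a naive-Bayes-style scorer over the concise document representation (MeSH headings plus journal of publication) might look. The feature encoding, function names, and smoothing constants are illustrative assumptions, not MScanner's actual code.

```python
# Sketch of a log-likelihood-ratio scorer over set-valued features.
import math
from collections import Counter

def train_weights(relevant_docs, background_docs):
    """Each doc is a set of features such as {'MeSH:Humans', 'J:Nature'}.
    Relevant examples come from the user's PubMed IDs; the background
    is a random Medline sample."""
    rel, bg = Counter(), Counter()
    for d in relevant_docs:
        rel.update(d)
    for d in background_docs:
        bg.update(d)
    n_rel, n_bg = len(relevant_docs), len(background_docs)
    weights = {}
    for f in set(rel) | set(bg):
        p_rel = (rel[f] + 0.5) / (n_rel + 1.0)  # smoothed P(f | relevant)
        p_bg = (bg[f] + 0.5) / (n_bg + 1.0)     # smoothed P(f | background)
        weights[f] = math.log(p_rel / p_bg)     # log likelihood ratio
    return weights

def score(doc, weights):
    """Sum of log likelihood ratios; higher means more probably relevant."""
    return sum(weights.get(f, 0.0) for f in doc)
```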

    CiteFinder: a System to Find and Rank Medical Citations

    This thesis presents CiteFinder, a system to find relevant citations for clinicians' written content. Including citations in clinical information content makes the content more reliable by providing scientific articles as references, and enables clinicians to easily update their written content using new information. The proposed approach splits the content into sentences, identifies the sentences that need to be supported with citations by applying classification algorithms, and uses information retrieval and ranking techniques to extract and rank relevant citations from MEDLINE for any given sentence. Additionally, the system extracts snippets from the retrieved articles. We assessed our approach on 3,699 MEDLINE papers on the subject of heart failure. We implemented multi-level and weight-based ranking algorithms to rank the citations. This study shows that using journal priority and study-design type significantly improves results obtained with the traditional approach of using only the text of articles, by approximately 63%. We also show that using the full text, rather than just the abstract, leads to extraction of higher-quality snippets.
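
    One way to picture the weight-based ranking that combines text similarity with journal priority and study-design type is the toy sketch below; the weight tables, field names, and fallback values are invented for illustration and are not taken from the thesis.

```python
# Assumed weight tables -- illustrative values only.
JOURNAL_PRIORITY = {"NEJM": 1.0, "Lancet": 0.9}
DESIGN_WEIGHT = {"RCT": 1.0, "cohort study": 0.7, "case report": 0.3}

def rank_citations(candidates):
    """`candidates`: dicts with 'text_score' (retrieval similarity),
    'journal', and 'design' (study-design type) fields."""
    def combined(c):
        return (c["text_score"]
                + JOURNAL_PRIORITY.get(c["journal"], 0.5)  # journal priority
                + DESIGN_WEIGHT.get(c["design"], 0.5))     # study design
    return sorted(candidates, key=combined, reverse=True)
```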

    Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models

    Background: In this paper we present the approaches and methods employed to deal with a large-scale multi-label semantic indexing task for biomedical papers. This work was implemented mainly within the context of the BioASQ challenge of 2014. Methods: The main contribution of this work is a multi-label ensemble method that incorporates a McNemar statistical significance test to validate the combination of the constituent machine learning algorithms. Secondary contributions include a study of the temporal aspects of the BioASQ corpus (the observations also apply to BioASQ's super-set, the PubMed article collection) and the adaptation of the algorithms used to deal with this challenging classification task. Results: The ensemble method we developed is compared to other approaches in experimental scenarios with subsets of the BioASQ corpus, with positive results. During the BioASQ 2014 challenge we obtained first place in the first batch and third place in the two following batches. Our success in the BioASQ challenge shows that a fully automated machine-learning approach, which does not use any heuristics or rule-based approaches, can be highly competitive and outperform other approaches in similar challenging contexts.
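
    The McNemar test used to validate the ensemble combination is standard; the sketch below shows how it could be applied to per-instance correctness flags of two constituent classifiers on one label. The function name and inputs are illustrative; the 3.841 threshold is the chi-square critical value at p = 0.05 with one degree of freedom.

```python
def mcnemar_differs(correct_a, correct_b, critical=3.841):
    """Return True if classifiers A and B disagree significantly.
    `correct_a` / `correct_b`: per-instance booleans (prediction correct?)."""
    b = sum(a and not b_ for a, b_ in zip(correct_a, correct_b))
    c = sum(b_ and not a for a, b_ in zip(correct_a, correct_b))
    if b + c == 0:
        return False                           # classifiers behave identically
    stat = (abs(b - c) - 1) ** 2 / (b + c)     # continuity-corrected statistic
    return stat > critical
```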

    Answering clinical questions with knowledge-based and statistical techniques

    The combination of recent developments in question-answering research and the availability of unparalleled resources developed specifically for automatic semantic processing of text in the medical domain provides a unique opportunity to explore complex question answering in the domain of clinical medicine. This article presents a system designed to satisfy the information needs of physicians practicing evidence-based medicine. We have developed a series of knowledge extractors, which employ a combination of knowledge-based and statistical techniques, for automatically identifying clinically relevant aspects of MEDLINE abstracts. These extracted elements serve as the input to an algorithm that scores the relevance of citations with respect to structured representations of information needs, in accordance with the principles of evidence-based medicine. Starting with an initial list of citations retrieved by PubMed, our system can bring relevant abstracts into higher ranking positions, and from these abstracts generate responses that directly answer physicians' questions. We describe three separate evaluations: one focused on the accuracy of the knowledge extractors, one conceptualized as a document-reranking task, and finally an evaluation of answers by two physicians. Experiments on a collection of real-world clinical questions show that our approach significantly outperforms the already competitive PubMed baseline.
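
    As a rough illustration of scoring citations against a structured representation of an information need, consider the simplified sketch below; the PICO-style slots and set-overlap scoring are stand-ins for the paper's knowledge extractors and EBM scoring algorithm, not the authors' actual method.

```python
def score_citation(question_frame, extracted_frame, strength_of_evidence):
    """Both frames map PICO slots ('problem', 'intervention', 'comparison',
    'outcome') to sets of concepts; `strength_of_evidence` in [0, 1]
    reflects the study type, per evidence-based medicine."""
    overlap = 0.0
    for slot, wanted in question_frame.items():
        found = extracted_frame.get(slot, set())
        if wanted:
            overlap += len(wanted & found) / len(wanted)
    return overlap + strength_of_evidence  # higher = ranked earlier

# Example (hypothetical): a question about beta-blockers for heart failure.
# score_citation({'problem': {'heart failure'},
#                 'intervention': {'beta-blocker'}}, extracted, 0.9)
```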