5 research outputs found

    Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes

    PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools for monitoring and prioritizing the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of germline genetic mutations. METHODS: We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated dataset for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM) which learns a linear decision rule based on the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN) which learns a complex nonlinear decision rule based on the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence. RESULTS: For penetrance classification, we annotated 3740 paper titles and abstracts and used 60% for training the model, 20% for tuning the model, and 20% for evaluating the model. The SVM model achieves 89.53% accuracy (percentage of papers that were correctly classified) while the CNN model achieves 88.95% accuracy. For prevalence classification, we annotated 3753 paper titles and abstracts. The SVM model achieves 89.14% accuracy while the CNN model achieves 89.13% accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene-cancer associations and keep the knowledge bases for clinical decision support tools up to date.
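    The abstract above names three ingredients: a bag-of-ngrams representation, a 60/20/20 split, and accuracy as the percentage of correctly classified papers. The following is a minimal sketch of those three pieces only, not the authors' implementation. The function names, the whitespace tokenization, the choice of unigrams plus bigrams, and the random seed are all assumptions made for illustration; the abstract does not specify them, and the SVM itself is not implemented here.

```python
import random
from collections import Counter


def bag_of_ngrams(text, max_n=2):
    """Count word n-grams (n = 1..max_n) in a title or abstract.

    Assumption: lowercase, whitespace tokenization; the abstract does not
    say how the text was tokenized or which n-gram range was used.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def split_60_20_20(items, seed=0):
    """Shuffle and split items into 60% train, 20% tune, 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10  # integer arithmetic avoids float rounding
    n_tune = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]
    return train, tune, test


def accuracy(predicted, gold):
    """Percentage of papers whose predicted label matches the annotation."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Usage: split the 3740 annotated penetrance papers (represented here by ids).
train, tune, test = split_60_20_20(range(3740))
print(len(train), len(tune), len(test))
```

    In a full pipeline, the n-gram counts would be turned into feature vectors for a linear classifier, the tuning portion would be used to select hyperparameters, and accuracy would be reported on the held-out test portion.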

    Doctor of Philosophy

    Medical knowledge learned in medical school can become quickly outdated given the tremendous growth of the biomedical literature. It is the responsibility of medical practitioners to continuously update their knowledge with recent, best available clinical evidence to make informed decisions about patient care. However, clinicians often have little time to spend on reading the primary literature even within their narrow specialty. As a result, they often rely on systematic evidence reviews developed by medical experts to fulfill their information needs. At present, systematic reviews of clinical research are manually created and updated, which is expensive, slow, and unable to keep up with the rapidly growing pace of medical literature. This dissertation research aims to enhance the traditional systematic review development process using computer-aided solutions. The first study investigates query expansion and scientific quality ranking approaches to enhance literature search on clinical guideline topics. The study showed that unsupervised methods can improve retrieval performance of a popular biomedical search engine (PubMed). The proposed methods improve the comprehensiveness of literature search and increase the ratio of finding relevant studies with reduced screening effort. The second and third studies aim to enhance the traditional manual data extraction process. The second study developed a framework to extract and classify texts from PDF reports. This study demonstrated that a rule-based multipass sieve approach is more effective than a machine-learning approach in categorizing document-level structures and that classifying and filtering publication metadata and semistructured texts enhances the performance of an information extraction system. The proposed method could serve as a document processing step in any text mining research on PDF documents. The third study proposed a solution for computer-aided data extraction by recommending relevant sentences and key phrases extracted from publication reports. This study demonstrated that using a machine-learning classifier to prioritize sentences for specific data elements performs as well as or better than an abstract screening approach, and might save time and reduce errors in the full-text screening process. In summary, this dissertation showed that there are promising opportunities for technology enhancement to assist in the development of systematic reviews. In this modern age when computing resources are getting cheaper and more powerful, the failure to apply computer technologies to assist and optimize the manual processes is a lost opportunity to improve the timeliness of systematic reviews. This research provides methodologies and tests hypotheses, which can serve as the basis for further large-scale software engineering projects aimed at fully realizing the prospect of computer-aided systematic reviews.

    The Adoption and Effectiveness of Automation in Health Evidence Synthesis

    Background: Health systems worldwide are often informed by evidence-based guidelines which in turn rely heavily on systematic reviews. Systematic reviews are currently hindered by the increasing volume of new research and by its variable quality. Automation has the potential to alleviate this problem but is not widely used in health evidence synthesis. This thesis sought to address the following: why is automation adopted (or not), and what effects does it have when it is put into use? / Methods: Rogers' Diffusion of Innovations theory, as a well-established and widely used framework, informed the study design and analysis. Adoption barriers and facilitators were explored through a thematic analysis of guideline developers' opinions towards automation, and by mapping the adoption journey of a machine learning (ML) tool among Cochrane Information Specialists (CISs). A randomised trial of ML assistance in Risk of Bias (RoB) assessments and a cost-effectiveness analysis of a semi-automated workflow in the maintenance of a living evidence map each evaluated the effects of automation in practice. / Results: Adoption decisions are most strongly informed by the professional cultural expectations of health evidence synthesis. The stringent expectations of systematic reviewers and their users must be met before any other characteristic of an automation technology is considered by potential adopters. Ease-of-use increases in importance as a tool becomes more diffused across a population. Results of the randomised trial showed that ML-assisted RoB assessments were non-inferior to assessments completed entirely by human researcher effort. The cost-effectiveness analysis showed that a semi-automated workflow identified more relevant studies than the manual workflow and was less costly. / Conclusions: Automation can have substantial benefits when integrated into health evidence workflows. Wider adoption of automation tools will be facilitated by ensuring they are aligned with professional values of the field and limited in technical complexity.