11,877 research outputs found
Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes
PURPOSE: The medical literature relevant to germline genetics is growing
exponentially. Clinicians need tools monitoring and prioritizing the literature
to understand the clinical implications of the pathogenic genetic variants. We
developed and evaluated two machine learning models to classify abstracts as
relevant to the penetrance (risk of cancer for germline mutation carriers) or
prevalence of germline genetic mutations. METHODS: We conducted literature
searches in PubMed and retrieved paper titles and abstracts to create an
annotated dataset for training and evaluating the two machine learning
classification models. Our first model is a support vector machine (SVM) which
learns a linear decision rule based on the bag-of-ngrams representation of each
title and abstract. Our second model is a convolutional neural network (CNN)
which learns a complex nonlinear decision rule based on the raw title and
abstract. We evaluated the performance of the two models on the classification
of papers as relevant to penetrance or prevalence. RESULTS: For penetrance
classification, we annotated 3740 paper titles and abstracts and used 60% for
training the model, 20% for tuning the model, and 20% for evaluating the model.
The SVM model achieves 89.53% accuracy (percentage of papers that were
correctly classified) while the CNN model achieves 88.95 % accuracy. For
prevalence classification, we annotated 3753 paper titles and abstracts. The
SVM model achieves 89.14% accuracy while the CNN model achieves 89.13 %
accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts
as relevant to penetrance or prevalence. By facilitating literature review,
this tool could help clinicians and researchers keep abreast of the burgeoning
knowledge of gene-cancer associations and keep the knowledge bases for clinical
decision support tools up to date
The GUIDES checklist: development of a tool to improve the successful use of guideline-based computerised clinical decision support
Background: Computerised decision support (CDS) based on trustworthy clinical guidelines is a key component
of a learning healthcare system. Research shows that the effectiveness of CDS is mixed.
Multifaceted context, system, recommendation and implementation factors may potentially affect the success of
CDS interventions. This paper describes the development of a checklist that is intended to support professionals to
implement CDS successfully.
Methods: We developed the checklist through an iterative process that involved a systematic review of evidence
and frameworks, a synthesis of the success factors identified in the review, feedback from an international expert
panel that evaluated the checklist in relation to a list of desirable framework attributes, consultations with patients
and healthcare consumers and pilot testing of the checklist.
Results: We screened 5347 papers and selected 71 papers with relevant information on success factors for
guideline-based CDS. From the selected papers, we developed a 16-factor checklist that is divided in four domains,
i.e. the CDS context, content, system and implementation domains. The panel of experts evaluated the checklist
positively as an instrument that could support people implementing guideline-based CDS across a wide range of
settings globally. Patients and healthcare consumers identified guideline-based CDS as an important quality
improvement intervention and perceived the GUIDES checklist as a suitable and useful strategy.
Conclusions: The GUIDES checklist can support professionals in considering the factors that affect the success of
CDS interventions. It may facilitate a deeper and more accurate understanding of the factors shaping CDS
effectiveness. Relying on a structured approach may prevent that important factors are missed
Topic detection using paragraph vectors to support active learning in systematic reviews
AbstractSystematic reviews require expert reviewers to manually screen thousands of citations in order to identify all relevant articles to the review. Active learning text classification is a supervised machine learning approach that has been shown to significantly reduce the manual annotation workload by semi-automating the citation screening process of systematic reviews. In this paper, we present a new topic detection method that induces an informative representation of studies, to improve the performance of the underlying active learner. Our proposed topic detection method uses a neural network-based vector space model to capture semantic similarities between documents. We firstly represent documents within the vector space, and cluster the documents into a predefined number of clusters. The centroids of the clusters are treated as latent topics. We then represent each document as a mixture of latent topics. For evaluation purposes, we employ the active learning strategy using both our novel topic detection method and a baseline topic model (i.e., Latent Dirichlet Allocation). Results obtained demonstrate that our method is able to achieve a high sensitivity of eligible studies and a significantly reduced manual annotation cost when compared to the baseline method. This observation is consistent across two clinical and three public health reviews. The tool introduced in this work is available from https://nactem.ac.uk/pvtopic/
Prioritising references for systematic reviews with RobotAnalyst: A user study
Screening references is a time-consuming step necessary for systematic reviews and guideline development. Previous studies have shown that human effort can be reduced by using machine learning software to prioritise large reference collections such that most of the relevant references are identified before screening is completed. We describe and evaluate RobotAnalyst, a Web-based software system that combines text-mining and machine learning algorithms for organising references by their content and actively prioritising them based on a relevancy classification model trained and updated throughout the process. We report an evaluation over 22 reference collections (most are related to public health topics) screened using RobotAnalyst with a total of 43 610 abstract-level decisions. The number of references that needed to be screened to identify 95% of the abstract-level inclusions for the evidence review was reduced on 19 of the 22 collections. Significant gains over random sampling were achieved for all reviews conducted with active prioritisation, as compared with only two of five when prioritisation was not used. RobotAnalyst's descriptive clustering and topic modelling functionalities were also evaluated by public health analysts. Descriptive clustering provided more coherent organisation than topic modelling, and the content of the clusters was apparent to the users across a varying number of clusters. This is the first large-scale study using technology-assisted screening to perform new reviews, and the positive results provide empirical evidence that RobotAnalyst can accelerate the identification of relevant studies. The results also highlight the issue of user complacency and the need for a stopping criterion to realise the work savings
Supporting systematic reviews using LDA-based document representations
BACKGROUND: Identifying relevant studies for inclusion in a systematic review (i.e. screening) is a complex, laborious and expensive task. Recently, a number of studies has shown that the use of machine learning and text mining methods to automatically identify relevant studies has the potential to drastically decrease the workload involved in the screening phase. The vast majority of these machine learning methods exploit the same underlying principle, i.e. a study is modelled as a bag-of-words (BOW). METHODS: We explore the use of topic modelling methods to derive a more informative representation of studies. We apply Latent Dirichlet allocation (LDA), an unsupervised topic modelling approach, to automatically identify topics in a collection of studies. We then represent each study as a distribution of LDA topics. Additionally, we enrich topics derived using LDA with multi-word terms identified by using an automatic term recognition (ATR) tool. For evaluation purposes, we carry out automatic identification of relevant studies using support vector machine (SVM)-based classifiers that employ both our novel topic-based representation and the BOW representation. RESULTS: Our results show that the SVM classifier is able to identify a greater number of relevant studies when using the LDA representation than the BOW representation. These observations hold for two systematic reviews of the clinical domain and three reviews of the social science domain. CONCLUSIONS: A topic-based feature representation of documents outperforms the BOW representation when applied to the task of automatic citation screening. The proposed term-enriched topics are more informative and less ambiguous to systematic reviewers. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13643-015-0117-0) contains supplementary material, which is available to authorized users
- …