5 research outputs found

    Using of Natural Language Processing Techniques in Suicide Research

    Get PDF
    It is estimated that each year many people, most of whom are teenagers and young adults die by suicide worldwide. Suicide receives special attention with many countries developing national strategies for prevention. Since, more medical information is available in text, Preventing the growing trend of suicide in communities requires analyzing various textual resources, such as patient records, information on the web or questionnaires. For this purpose, this study systematically reviews recent studies related to the use of natural language processing techniques in the area of people’s health who have completed suicide or are at risk. After electronically searching for the PubMed and ScienceDirect databases and studying articles by two reviewers, 21 articles matched the inclusion criteria. This study revealed that, if a suitable data set is available, natural language processing techniques are well suited for various types of suicide related research

    FuzzyBIO: a proposal for fuzzy representation of discontinuous entities

    Get PDF
    Algorithms and the Foundations of Software technolog

    Clinical text data in machine learning: Systematic review

    Get PDF
    Background: Clinical narratives represent the main form of communication within healthcare providing a personalized account of patient history and assessments, offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. Objective: The main aim of this study is to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigate the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. Methods: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multi-faceted interface, to perform a literature search against MEDLINE. We identified a total of 110 relevant studies and extracted information about the text data used to support machine learning, the NLP tasks supported and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation and any relevant statistics. Results: The vast majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing predictive performance of the model. Supervised learning was successfully used where clinical codes integrated with free text notes into electronic health records were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable due to sensitive nature of data considered. Beside the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The vast majority of studies focused on the task of text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management and surveillance. Conclusions: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as a way of saving the annotation efforts. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which does not require data annotation

    Improving Syntactic Parsing of Clinical Text Using Domain Knowledge

    Get PDF
    Syntactic parsing is one of the fundamental tasks of Natural Language Processing (NLP). However, few studies have explored syntactic parsing in the medical domain. This dissertation systematically investigated different methods to improve the performance of syntactic parsing of clinical text, including (1) Constructing two clinical treebanks of discharge summaries and progress notes by developing annotation guidelines that handle missing elements in clinical sentences; (2) Retraining four state-of-the-art parsers, including the Stanford parser, Berkeley parser, Charniak parser, and Bikel parser, using clinical treebanks, and comparing their performance to identify better parsing approaches; and (3) Developing new methods to reduce syntactic ambiguity caused by Prepositional Phrase (PP) attachment and coordination using semantic information. Our evaluation showed that clinical treebanks greatly improved the performance of existing parsers. The Berkeley parser achieved the best F-1 score of 86.39% on the MiPACQ treebank. For PP attachment, our proposed methods improved the accuracies of PP attachment by 2.35% on the MiPACQ corpus and 1.77% on the I2b2 corpus. For coordination, our method achieved a precision of 94.9% and a precision of 90.3% for the MiPACQ and i2b2 corpus, respectively. To further demonstrate the effectiveness of the improved parsing approaches, we applied outputs of our parsers to two external NLP tasks: semantic role labeling and temporal relation extraction. The experimental results showed that performance of both tasks’ was improved by using the parse tree information from our optimized parsers, with an improvement of 3.26% in F-measure for semantic role labelling and an improvement of 1.5% in F-measure for temporal relation extraction
    corecore