333 research outputs found

    Toward a Neural Semantic Parsing System for EHR Question Answering

    Clinical semantic parsing (SP) is an important step toward identifying the exact information need (as a machine-understandable logical form) from a natural language query aimed at retrieving information from electronic health records (EHRs). Current approaches to clinical SP are largely based on traditional machine learning and require hand-building a lexicon. Recent advancements in neural SP show promise for building a robust and flexible semantic parser without much human effort. Thus, in this paper, we aim to systematically assess the performance of two such neural SP models for EHR question answering (QA). We found that the performance of these advanced neural models on two clinical SP datasets is promising given their ease of application and generalizability. Our error analysis surfaces the common types of errors made by these models and has the potential to inform future research into improving the performance of neural SP models for EHR QA. Comment: Accepted at the AMIA Annual Symposium 2022 (10 pages, 5 tables, 1 figure).
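To make the target representation concrete, the sketch below shows the kind of question-to-logical-form mapping clinical SP aims at. The predicate names and the toy pattern-based parser are illustrative assumptions; the paper itself evaluates neural parsers, not rules.

```python
# Hypothetical sketch of a clinical semantic parser's input/output contract.
# The logical-form predicates and the regex "parser" are illustrative only.
import re

def toy_parse(question: str) -> str:
    """Map a natural-language EHR question to a machine-readable logical form."""
    m = re.match(r"what is the latest (?P<lab>[\w ]+) of (?P<pt>[\w ]+)\?", question.lower())
    if m:
        return f"latest(lab_result(patient='{m.group('pt')}', test='{m.group('lab')}'))"
    raise ValueError("unparsed question")

print(toy_parse("What is the latest hemoglobin A1c of John Doe?"))
# -> latest(lab_result(patient='john doe', test='hemoglobin a1c'))
```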

    LeafAI: query generator for clinical cohort discovery rivaling a human programmer

    Objective: Identifying study-eligible patients within clinical databases is a critical step in clinical research. However, accurate query design typically requires extensive technical and biomedical expertise. We sought to create a system capable of generating data model-agnostic queries while also providing novel logical reasoning capabilities for complex clinical trial eligibility criteria. Materials and Methods: The task of query creation from eligibility criteria requires solving several text-processing problems, including named entity recognition and relation extraction, sequence-to-sequence transformation, normalization, and reasoning. We incorporated hybrid deep learning and rule-based modules for these, as well as a knowledge base of the Unified Medical Language System (UMLS) and linked ontologies. To enable data-model agnostic query creation, we introduce a novel method for tagging database schema elements using UMLS concepts. To evaluate our system, called LeafAI, we compared its ability to identify patients who had been enrolled in 8 clinical trials conducted at our institution against that of a human database programmer. We measured performance by the number of actual enrolled patients matched by generated queries. Results: LeafAI matched a mean 43% of enrolled patients with 27,225 eligible across 8 clinical trials, compared to 27% matched and 14,587 eligible in queries by a human database programmer. The human programmer spent 26 total hours crafting queries compared to several minutes by LeafAI. Conclusions: Our work contributes a state-of-the-art data model-agnostic query generation system capable of conditional reasoning using a knowledge base. We demonstrate that LeafAI can rival a human programmer in finding patients eligible for clinical trials.
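A minimal sketch of the schema-tagging idea described above: once schema elements are tagged with UMLS concept unique identifiers (CUIs), the same eligibility concept can be resolved against different data models. The table names, columns, and query text below are hypothetical; LeafAI's actual mapping and reasoning modules are far richer.

```python
# Hypothetical sketch: resolve an eligibility concept to whatever table/column a
# target data model exposes, via a shared UMLS concept identifier (CUI).
# Table names, columns, and the example CUI mapping are illustrative assumptions.

SCHEMA_TAGS = {
    # data model A (e.g., a local EHR warehouse)
    "warehouse": {"C0011849": ("dx_table", "icd_code")},      # C0011849 ~ Diabetes Mellitus
    # data model B (e.g., an OMOP-style model)
    "omop_like": {"C0011849": ("condition_occurrence", "condition_concept_id")},
}

def build_query(data_model: str, cui: str) -> str:
    table, column = SCHEMA_TAGS[data_model][cui]
    return f"SELECT person_id FROM {table} WHERE {column} IN (/* codes mapped from {cui} */)"

print(build_query("warehouse", "C0011849"))
print(build_query("omop_like", "C0011849"))
```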

    emrQA: A large corpus for question answering on electronic medical records

    We propose a novel methodology to generate domain-specific large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks. We demonstrate an instance of this methodology in generating a large-scale QA dataset for electronic medical records by leveraging existing expert annotations on clinical notes for various NLP tasks from the community-shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question-to-logical-form and question-to-answer mapping.
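A rough sketch of the re-purposing idea: an existing expert annotation is slotted into a question template to yield a question, a logical form, and answer evidence. The annotation record and template below are simplified assumptions, not the emrQA generation framework itself.

```python
# Simplified sketch of template-based QA generation from existing annotations.
# The annotation record and templates are invented for illustration; emrQA uses
# expert-written templates over i2b2 annotations and also emits logical forms.
annotation = {
    "type": "medication",
    "value": "metformin",
    "evidence": "Pt continues metformin 500mg BID.",
}

templates = {
    "medication": ("Is the patient taking {value}?",
                   "medication_taken(patient, '{value}')"),
}

question_tpl, lf_tpl = templates[annotation["type"]]
qa_pair = {
    "question": question_tpl.format(value=annotation["value"]),
    "logical_form": lf_tpl.format(value=annotation["value"]),
    "answer_evidence": annotation["evidence"],
}
print(qa_pair)
```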

    An Approach to Select Cost-Effective Risk Countermeasures Exemplified in CORAS

    Risk is unavoidable in business, and risk management is needed, among other things, to set up good security policies. Once the risks are evaluated, the next step is to decide how they should be treated. This involves managers making decisions on proper countermeasures to be implemented to mitigate the risks. A countermeasure's cost, together with its ability to mitigate risks, is a factor that affects this selection. While many approaches have been proposed to perform risk analysis, there has been less focus on delivering the prescriptive and specific information that managers require to select cost-effective countermeasures. This paper proposes a generic approach to integrate cost assessment into risk analysis to aid such decision making. The approach makes use of a risk model that has been annotated with potential countermeasures and estimates of their cost and effect. A calculus is then employed to reason about this model in order to support decision making in terms of decision diagrams. We exemplify the instantiation of the generic approach in the CORAS method for security risk analysis. Comment: 33 pages.
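As a rough illustration of the kind of cost/effect trade-off such a calculus reasons about, the sketch below ranks countermeasures by expected loss averted minus cost. The numbers and the net-benefit formula are illustrative assumptions, not the CORAS calculus itself.

```python
# Illustrative sketch: rank countermeasures by (expected loss averted - cost).
# Likelihoods, impacts, and effects are invented numbers, not CORAS estimates.
risks = {"data breach": {"likelihood": 0.10, "impact": 500_000}}   # per year

countermeasures = [
    {"name": "encrypt backups", "cost": 20_000, "risk": "data breach", "likelihood_reduction": 0.06},
    {"name": "extra audits",    "cost": 45_000, "risk": "data breach", "likelihood_reduction": 0.04},
]

for cm in countermeasures:
    averted = cm["likelihood_reduction"] * risks[cm["risk"]]["impact"]
    cm["net_benefit"] = averted - cm["cost"]

for cm in sorted(countermeasures, key=lambda c: c["net_benefit"], reverse=True):
    print(f"{cm['name']}: net benefit {cm['net_benefit']:+,.0f}")
```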

    Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

    The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process with existing natural language analysis tools, since they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
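To make the annotation target concrete, shallow syntactic chunks and named entities over clinical text are commonly encoded with BIO-style tags, roughly as below. The example tokens and label set are illustrative, not taken from the Harvey corpus.

```python
# Illustrative BIO-style encoding of chunks and entities over a telegraphic clinical note.
# Tokens and labels are invented; the Harvey corpus defines its own annotation scheme.
tokens = ["pt", "c/o", "chest", "pain", "2", "days"]
chunk_tags = ["B-NP", "B-VP", "B-NP", "I-NP", "B-NP", "I-NP"]
entity_tags = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "B-DURATION", "I-DURATION"]

for tok, chunk, entity in zip(tokens, chunk_tags, entity_tags):
    print(f"{tok:<6} {chunk:<6} {entity}")
```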

    Health systems data interoperability and implementation

    Objective: The objective of this study was to use machine learning and health standards to address the problem of clinical data interoperability across healthcare institutions. Addressing this problem has the potential to make clinical data comparable, searchable and exchangeable between healthcare providers. Data sources: Structured and unstructured data were used to conduct the experiments in this study. The data were collected from two disparate sources, MIMIC-III and NHANES. The MIMIC-III database stores data from two electronic health record systems, CareVue and MetaVision. The data stored in these systems were not recorded with the same standards and were therefore not directly comparable: some values conflicted, one system would store an abbreviation of a clinical concept while the other stored the full concept name, and some attributes contained missing information. These issues make this data a good candidate for this study. From the identified data sources, laboratory, physical examination, vital signs, and behavioural data were used. Methods: This research employed the CRISP-DM framework as a guideline for all stages of data mining. Two sets of classification experiments were conducted, one for the classification of structured data and the other for unstructured data. For the first experiment, edit distance, TF-IDF and Jaro-Winkler were used to calculate similarity weights between two datasets, one coded with the LOINC terminology standard and one not coded. Similar sets of data were classified as matches, while dissimilar sets were classified as non-matching. The Soundex indexing method was then used to reduce the number of potential comparisons. Thereafter, three classification algorithms were trained and tested, and the performance of each was evaluated through the ROC curve. The second experiment was aimed at extracting patients' smoking status from a clinical corpus. A sequence-oriented classification algorithm, CRF, was used for learning the related concepts from the corpus, with word embedding, random indexing, and word shape features used to capture its meaning. Results: Having optimized all the model's parameters through v-fold cross validation on a sampled training set of structured data, only 8 of the 24 features were selected for the classification task. RapidMiner was used to train and test all the classification algorithms. On the final run of the classification process, the last contenders were SVM and the decision tree classifier. SVM yielded an accuracy of 92.5% once its parameters had been tuned. These results were obtained after more relevant features were identified, having observed that the classifiers were biased on the initial data. The unstructured data was annotated via the UIMA Ruta scripting language, then trained through CRFSuite, which comes with the CLAMP toolkit. The CRF classifier obtained an F-measure of 94.8% for the “nonsmoker” class, 83.0% for “currentsmoker”, and 65.7% for “pastsmoker”. It was observed that as more relevant data was added, the performance of the classifier improved. The results show that there is a need for the use of FHIR resources for exchanging clinical data between healthcare institutions. FHIR is free and uses profiles to extend coding standards, a RESTful API to exchange messages, and JSON, XML and Turtle for representing messages. Data could be stored in JSON format in a NoSQL database such as CouchDB, which makes it available for further post-extraction exploration. Conclusion: This study has provided a method for learning a clinical coding standard with a computer algorithm and then applying that learned standard to unstandardized data, so that the data becomes easily exchangeable, comparable and searchable, ultimately achieving data interoperability. Although this study was applied on a limited scale, future work would explore the standardization of patients' long-lived data from multiple sources using the SHARPn open-source tools and data scaling platforms. (Information Science, M.Sc. (Computing))
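A minimal sketch of the structured-data matching step described above: block candidate pairs with Soundex, then score string similarity between a coded and an uncoded lab name. The lab names, the similarity function, and the fixed threshold are illustrative assumptions; the study used edit distance, TF-IDF and Jaro-Winkler with trained classifiers rather than a simple cutoff.

```python
# Illustrative record-linkage sketch: Soundex blocking followed by string similarity.
import difflib

def soundex(word: str) -> str:
    """Basic American Soundex code, used here only to block comparisons."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return "0000"
    out, prev = word[0].upper(), codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":          # h and w do not reset the previous code
            prev = code
    return (out + "000")[:4]

coded = "Hemoglobin A1c"            # name from the LOINC-coded source
uncoded = "hemoglobin a1c (hba1c)"  # name from the uncoded source
if soundex(coded) == soundex(uncoded):                               # blocking step
    score = difflib.SequenceMatcher(None, coded.lower(), uncoded.lower()).ratio()
    print("match" if score > 0.7 else "non-match", round(score, 2))  # illustrative threshold
```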

    Implementing electronic scales to support standardized phenotypic data collection - the case of the Scale for the Assessment and Rating of Ataxia (SARA)

    The main objective of this doctoral thesis was to facilitate the integration of the semantics required to automatically interpret collections of standardized clinical data. To address this objective, we combined the strengths of clinical archetypes, guidelines and ontologies to develop an electronic prototype for the Scale for the Assessment and Rating of Ataxia (SARA), which is broadly used in neurology. A scaled-down version of the Human Phenotype Ontology was automatically extracted and used as a backbone to normalize the content of the SARA through clinical archetypes. The knowledge required to support reasoning over the SARA data was modeled as separate information-processing units interconnected via the defined archetypes. Based on this approach, we implemented a prototype named SARA Management System, to be used both for the assessment of cerebellar syndrome and for the production of a clinical synopsis. For validation purposes, we used recorded SARA data from 28 anonymous subjects affected by SCA36. Our results reveal a substantial degree of agreement between the results achieved by the prototype and those of human experts, confirming that the combination of archetypes, ontologies and guidelines is a good solution for automating the extraction of relevant phenotypic knowledge from plain scores of rating scales.
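A toy sketch of the score-to-synopsis step such a prototype automates: aggregate SARA item scores and emit a coarse clinical summary with a phenotype term. The item names, severity cut-offs, and rule are invented assumptions, not the SARA Management System's actual archetype- and guideline-driven logic.

```python
# Illustrative sketch: derive a coarse clinical synopsis from SARA item scores.
# Item names, severity cut-offs, and the HPO mapping are invented for illustration only.
sara_scores = {"gait": 3, "stance": 2, "speech": 1, "finger_chase": 1}

total = sum(sara_scores.values())
severity = "mild" if total < 10 else "moderate" if total < 20 else "severe"
affected = [item for item, score in sara_scores.items() if score > 0]

synopsis = {
    "total_score": total,
    "severity": severity,
    "phenotype_terms": ["HP:0001251 (Ataxia)"] if total > 0 else [],
    "affected_domains": affected,
}
print(synopsis)
```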

    Doctor of Philosophy

    Clinical research plays a vital role in producing knowledge valuable for understanding human disease and improving healthcare quality. Human subject protection is an obligation essential to the clinical research endeavor, much of which is governed by federal regulations and rules. Institutional Review Boards (IRBs) are responsible for overseeing human subject research to protect individuals from harm and to preserve their rights. Researchers are required to submit and maintain an IRB application, which is an important component of the clinical research process that can significantly affect the timeliness and ethical quality of the study. As clinical research has expanded in both volume and scope over recent years, IRBs are facing increasing challenges in providing efficient and effective oversight. The Clinical Research Informatics (CRI) domain has made significant efforts to support various aspects of clinical research through developing information systems and standards. However, information technology use by IRBs has not received much attention from the CRI community. This dissertation project analyzed over 100 IRB application systems currently used at major academic institutions in the United States. The variety of system types and the lack of standardized application forms across institutions are discussed in detail, and the need for building an IRB domain analysis model is identified. In this dissertation, I developed an IRB domain analysis model with a special focus on promoting interoperability among CRI systems to streamline the clinical research workflow. The model was evaluated by a comparison with five real-world IRB application systems. Finally, a prototype implementation of the model was demonstrated by the integration of an electronic IRB system with a health data query system. This dissertation project fills a gap in the research of information technology use for the IRB oversight domain. Adoption of the IRB domain analysis model has the potential to enhance efficient and high-quality ethics oversight and to streamline the clinical research workflow.
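As a rough sketch of the kind of interoperability the domain analysis model targets, the example below shows a health data query system checking a shared IRB approval record before running a cohort query. Every field name and rule here is a hypothetical illustration, not the dissertation's actual model.

```python
# Hypothetical sketch: a minimal shared representation of an IRB approval that a
# health data query system could consult before running a cohort query.
# All field names and the authorization rule are illustrative assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class IrbApproval:
    protocol_id: str
    approved_data_elements: frozenset
    expiration: date

def query_allowed(approval: IrbApproval, requested_elements: set, today: date) -> bool:
    """Allow the query only if the protocol is current and covers every requested element."""
    return today <= approval.expiration and requested_elements <= approval.approved_data_elements

approval = IrbApproval("IRB-2024-001", frozenset({"diagnosis", "lab_result"}), date(2026, 1, 1))
print(query_allowed(approval, {"diagnosis"}, date(2025, 6, 1)))       # True
print(query_allowed(approval, {"genomic_data"}, date(2025, 6, 1)))    # False
```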

    Enhancing identification of opioid-involved health outcomes using National Hospital Care Survey data

    Purpose: This report documents the development of the 2016 National Hospital Care Survey (NHCS) Enhanced Opioid Identification Algorithm, an algorithm that can be used to identify opioid-involved and opioid overdose hospital encounters. Additionally, the algorithm can be used to identify opioids and opioid antagonists that can be used to reverse opioid overdose (naloxone) and to treat opioid use disorder (naltrexone). Methods: The Enhanced Opioid Identification Algorithm improves the methodology for identifying opioids in hospital records using natural language processing (NLP), including machine learning techniques, and medical codes captured in the 2016 NHCS. Before the development of the Enhanced Opioid Identification Algorithm, opioid-involved hospital encounters were identified solely by coded diagnosis fields. Diagnosis codes provide limited information about context in the hospital encounters and can miss opioid-involved encounters that are embedded in free text data, like hospital clinical notes. Results: In the 2016 NHCS data, the enhanced algorithm identified 1,370,827 encounters involving the use of opioids and selected opioid antagonists. Approximately 20% of those encounters were identified exclusively by the NLP algorithm. Suggested citation: White DG, Adams NB, Brown AM, O'Jiaku-Okorie A, Badwe R, Shaikh S, Adegboye A. Enhancing identification of opioid-involved health outcomes using National Hospital Care Survey data. National Center for Health Statistics. Vital Health Stat 2(188). 2021.
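A simplified sketch of how coded and free-text signals can be combined to flag opioid-involved encounters. The keyword list and regex matching stand in for the report's NLP and machine-learning components, which are considerably more sophisticated.

```python
# Simplified sketch: flag an encounter as opioid-involved if either its diagnosis
# codes or its clinical note text suggests opioid involvement. The keyword list is
# illustrative; T40.0-T40.4 and T40.6 are ICD-10-CM opioid/narcotic poisoning ranges.
import re

OPIOID_CODE_PREFIXES = ("T40.0", "T40.1", "T40.2", "T40.3", "T40.4", "T40.6")
OPIOID_TERMS = re.compile(r"\b(heroin|fentanyl|oxycodone|naloxone|narcan)\b", re.I)

def opioid_involved(diagnosis_codes, note_text):
    by_code = any(code.startswith(OPIOID_CODE_PREFIXES) for code in diagnosis_codes)
    by_text = bool(OPIOID_TERMS.search(note_text))
    return by_code or by_text

print(opioid_involved(["J18.9"], "Pt given Narcan after suspected fentanyl overdose."))  # True
```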