35 research outputs found

    A review of automatic phenotyping approaches using electronic health records

    Electronic Health Records (EHRs) are a rich repository of valuable clinical information held in primary and secondary care databases. To utilize EHRs for medical observational research, a range of algorithms for automatically identifying individuals with a specific phenotype have been developed. This review summarizes and critically evaluates the literature on the development of EHR phenotyping systems, describing systems and techniques based on both structured and unstructured EHR data. Articles published on PubMed and Google Scholar between 2013 and 2017 were reviewed, using search terms derived from Medical Subject Headings (MeSH). The use of Natural Language Processing (NLP) techniques to extract features from narrative text has grown in popularity, driven by the availability of open-source NLP algorithms combined with improvements in their accuracy. Concept extraction is the most popular NLP technique in this review, used by more than 50% of the reviewed papers to extract features from EHRs. High-throughput phenotyping systems using unsupervised machine learning have also gained popularity, owing to their ability to extract a phenotype efficiently and automatically with minimal human effort.
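
    Since concept extraction dominates the reviewed NLP pipelines, a minimal sketch may help fix the idea. The snippet below is naive dictionary-based concept extraction in Python; the three-term lexicon and the concept IDs are illustrative assumptions, and production phenotyping systems typically rely on open-source tools such as cTAKES or MetaMap backed by UMLS vocabularies.

```python
# Minimal sketch of dictionary-based concept extraction from a clinical note.
# The lexicon and UMLS-style concept IDs below are hypothetical stand-ins;
# real pipelines use tools such as cTAKES or MetaMap with full vocabularies.
import re

LEXICON = {
    "type 2 diabetes": "C0011860",
    "hypertension":    "C0020538",
    "metformin":       "C0025598",
}

def extract_concepts(note: str) -> list[tuple[str, str]]:
    """Return (surface form, concept ID) pairs found in a narrative note."""
    found = []
    lowered = note.lower()
    for term, cui in LEXICON.items():
        # Whole-word match so "hypertension" does not fire inside other terms.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.append((term, cui))
    return found

note = "Patient with hypertension, started on metformin for type 2 diabetes."
print(extract_concepts(note))
# [('type 2 diabetes', 'C0011860'), ('hypertension', 'C0020538'), ('metformin', 'C0025598')]
```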

    Deepr: A Convolutional Net for Medical Records

    Feature engineering remains a major bottleneck when creating predictive systems from electronic medical records. At present, an important missing element is the detection of predictive, regular clinical motifs in irregular episodic records. We present Deepr (short for Deep record), a new end-to-end deep learning system that learns to extract features from medical records and predict future risk automatically. Deepr transforms a record into a sequence of discrete elements separated by coded time gaps and hospital transfers. On top of this sequence, a convolutional neural net detects and combines predictive local clinical motifs to stratify risk. Deepr permits transparent inspection and visualization of its inner workings. We validate Deepr on hospital data by predicting unplanned readmission after discharge; Deepr achieves superior accuracy compared to traditional techniques, detects meaningful clinical motifs, and uncovers the underlying structure of the disease and intervention space.
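
    A minimal PyTorch sketch of the architecture the abstract describes, written as an assumed reconstruction rather than the authors' code; the dimensions and the binary readmission head are illustrative. Coded events (including special tokens for time gaps and transfers) are embedded, a 1-D convolution scans for local motifs, and max pooling keeps each filter's strongest response.

```python
# Deepr-style model sketch (assumed re-implementation, not the authors' code):
# embed discrete medical codes, detect local "clinical motifs" with a 1-D
# convolution, max-pool the motif activations, and emit a risk logit.
import torch
import torch.nn as nn

class DeeprLike(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64,
                 n_filters: int = 100, kernel_size: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Each filter responds to a motif spanning kernel_size consecutive codes.
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size)
        self.out = nn.Linear(n_filters, 1)  # risk logit (e.g. readmission)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        x = self.embed(codes)           # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)           # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))    # motif detections along the record
        x = x.max(dim=2).values         # max-pool: strongest hit per filter
        return self.out(x).squeeze(-1)  # unnormalized risk score

model = DeeprLike(vocab_size=5000)
batch = torch.randint(1, 5000, (8, 40))  # 8 records of 40 coded events each
print(model(batch).shape)                # torch.Size([8])
```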

    Discovery of Type 2 Diabetes Trajectories from Electronic Health Records

    University of Minnesota Ph.D. dissertation. September 2020. Major: Health Informatics. Advisor: Gyorgy Simon. 1 computer file (PDF); xiii, 110 pages.
    Type 2 diabetes (T2D) is one of the fastest growing public health concerns in the United States. There were 30.3 million patients (9.4% of the US population) suffering from diabetes in 2015. Diabetes, the seventh leading cause of death in the United States, is a non-reversible (incurable) chronic disease leading to severe complications, including chronic kidney disease, amputation, blindness, and various cardiac and vascular diseases. Early identification of patients at high risk is regarded as the most effective clinical tool to prevent or delay the development of diabetes, allowing patients to change their lifestyle or to receive medication earlier. In turn, these interventions can decrease the risk of diabetes by 30-60%. Many studies have aimed at the early identification of high-risk patients in clinical settings, but they typically consider only the patient's current state at the time of assessment and do not fully utilize available information such as the patient's medical history. Past history is important: laboratory results and vital signs have been shown to differ between diabetic and non-diabetic patients as early as 15-20 years before the onset of diabetes. We have also shown that the order in which patients develop diabetes-related comorbidities is predictive of their diabetes risk even after adjusting for the severity of those comorbidities. In this thesis, we develop multiple novel methods to discover T2D trajectories from Electronic Health Records (EHR). We define a trajectory as the order in which diseases develop. We aim to discover typical and atypical trajectories, where typical trajectories represent predominant patterns of progression and atypical trajectories refer to the rest. Revealing trajectories allows us to divide patients into subpopulations that can uncover the underlying etiology of diabetes. More importantly, by assessing risk correctly and better understanding the heterogeneity of diabetes, we can provide better care. Since EHR data pose several challenges to identifying trajectories directly, we devise four specific studies to address them: first, we propose a new knowledge-driven representation for clinical data mining; second, we demonstrate a method for estimating the onset time of slow-onset diseases from intermittently observable laboratory results in the specific context of T2D; third, we present a method to infer trajectories, the sequences of comorbidities potentially leading up to a particular disease of interest; and finally, we propose a novel method to discover multiple trajectories from EHR data. The patterns discovered in these four studies address clinical issues, are clinically verifiable, and are amenable to deployment in practice to improve the quality of individual patient care and promote public health in the United States.
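
    To make the notion of a trajectory concrete, here is an illustrative Python sketch, not the dissertation's method: each patient's trajectory is taken as the order of first occurrence of each condition in their dated diagnosis records, and the orderings are tallied to separate typical from atypical paths. The patient records and condition names are hypothetical.

```python
# Illustrative sketch of extracting a patient "trajectory" -- the order in
# which diseases first appear -- from dated diagnosis records, then tallying
# which orderings are typical. Input data are hypothetical.
from collections import Counter
from datetime import date

# Hypothetical input: patient id -> list of (diagnosis date, condition)
records = {
    "p1": [(date(2005, 3, 1), "hypertension"),
           (date(2009, 6, 9), "hyperlipidemia"),
           (date(2014, 1, 5), "T2D")],
    "p2": [(date(2011, 2, 2), "hyperlipidemia"),
           (date(2012, 8, 1), "hypertension"),
           (date(2016, 4, 3), "T2D")],
    "p3": [(date(2008, 5, 7), "hypertension"),
           (date(2010, 9, 2), "hyperlipidemia"),
           (date(2015, 7, 8), "T2D")],
}

def trajectory(visits):
    """Order of first occurrence of each condition in a patient's record."""
    seen, order = set(), []
    for _, dx in sorted(visits):   # chronological order
        if dx not in seen:
            seen.add(dx)
            order.append(dx)
    return tuple(order)

counts = Counter(trajectory(v) for v in records.values())
for path, n in counts.most_common():
    print(" -> ".join(path), f"({n} patients)")
# hypertension -> hyperlipidemia -> T2D (2 patients)
# hyperlipidemia -> hypertension -> T2D (1 patients)
```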

    Learning by Fusing Heterogeneous Data

    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives, which often gives rise to data represented in entirely different input spaces. A basic premise behind learning from heterogeneous data is that in many such cases there exists some correspondence among certain input dimensions of the different input spaces. In our work we found that a key bottleneck preventing us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities, and tasks. We develop accurate data fusion methods for predictive modeling that reduce or entirely eliminate some of the basic feature engineering steps previously needed when inferring prediction models from disparate data. Our work has a wide range of applications, of which we focus on those from molecular and systems biology: predicting gene functions, forecasting the pharmacological actions of small chemicals, prioritizing genes for further study, mining disease associations, detecting drug toxicity, and modeling cancer patient survival data. Another important aspect of our research is the study of latent factor models. We design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms retain the relational structure of a data system during model inference, which turns out to be vital for good data fusion performance in certain applications. Our recent work includes the study of network inference from many potentially non-identical data distributions and its application to cancer genomic data. We also model epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of this thesis is the analysis of large data compendia, since predictions about certain phenomena, such as associations between diseases or the involvement of genes in a certain phenotype, are only possible when dealing with large amounts of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model to date. We also make observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets.
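
    As an illustration of the collective latent factor idea, assuming it resembles joint matrix factorization with shared factors (a toy sketch, not the thesis's algorithm): two hypothetical relation matrices that share the gene dimension are factorized together, so the shared gene factors receive signal from both data views.

```python
# Toy sketch of collective latent-factor data fusion (an assumed illustration,
# not the thesis's algorithm). Two relation matrices sharing the gene
# dimension are factorized jointly; the shared factor matrix G receives
# gradient signal from both views, which is the fusion step.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_diseases, n_drugs, k = 30, 12, 8, 4

# Hypothetical observed relations: gene-disease and gene-drug associations.
R_gd = rng.random((n_genes, n_diseases))
R_gc = rng.random((n_genes, n_drugs))

G = rng.random((n_genes, k))     # shared gene factors
D = rng.random((n_diseases, k))  # disease factors
C = rng.random((n_drugs, k))     # drug factors

lr = 0.01
for _ in range(2000):
    E_gd = R_gd - G @ D.T            # residual of each view
    E_gc = R_gc - G @ C.T
    G += lr * (E_gd @ D + E_gc @ C)  # fused update for the shared factors
    D += lr * (E_gd.T @ G)
    C += lr * (E_gc.T @ G)

loss = (np.linalg.norm(R_gd - G @ D.T) ** 2
        + np.linalg.norm(R_gc - G @ C.T) ** 2)
print(f"joint reconstruction loss: {loss:.3f}")
```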