8 research outputs found
Recommended from our members
Learning and validating clinically meaningful phenotypes from electronic health data
The ever-growing adoption of electronic health records (EHRs) to record patients' health journeys has resulted in vast amounts of heterogeneous, complex, and unwieldy information [Hripcsak and Albers, 2013]. Distilling this raw data into clinical insights presents great opportunities and challenges for the research and medical communities. One approach to this distillation is called computational phenotyping. Computational phenotyping is the process of extracting clinically relevant and interesting characteristics from a set of clinical documentation, such as that recorded in EHRs. It can be viewed as a form of dimensionality reduction in which a set of phenotypes forms a latent space. Clinicians can use computational phenotyping to reason about populations, identify patients for randomized case-control studies, and extrapolate patient disease trajectories. In recent years, high-throughput computational approaches have made strides in extracting potentially clinically interesting phenotypes from data contained in EHR systems.
Tensor factorization methods have shown particular promise in deriving phenotypes. However, phenotyping methods via tensor factorization have the following weaknesses: 1) the extracted phenotypes can lack diversity, which makes them more difficult for clinicians to reason about and utilize in practice, 2) many of the tensor factorization methods are unsupervised and do not utilize side information that may be available about the population or about the relationships between the clinical characteristics in the data (e.g., diagnoses and medications), and 3) validating the clinical relevance of the extracted phenotypes requires domain training and expertise. This dissertation addresses all three of these limitations. First, we present tensor factorization methods that discover sparse and concise phenotypes in unsupervised, supervised, and semi-supervised settings. Second, via two tools we built, we show how to leverage domain expertise in the form of publicly available medical articles to evaluate the clinical validity of the discovered phenotypes. Third, we combine tensor factorization and the phenotype validation tools to guide the discovery process to more clinically relevant phenotypes.
Computational Science, Engineering, and Mathematics
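To illustrate the general idea of tensor-factorization phenotyping described in this abstract, the following is a minimal sketch of a plain nonnegative CP (CANDECOMP/PARAFAC) decomposition fitted with multiplicative updates. It is not the dissertation's method, which adds sparsity, supervision, and validation steps not reproduced here. The three-mode layout (patients x diagnoses x medications), the function names, and all parameters are assumptions chosen for illustration.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product: (I x R), (J x R) -> (I*J x R)."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)

def unfold(x, mode):
    """Matricize a tensor along `mode`, keeping the other modes in order."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def reconstruct(factors):
    """Rebuild the 3-mode tensor from its CP factor matrices."""
    a, b, c = factors
    return np.einsum("ir,jr,kr->ijk", a, b, c)

def nonneg_cp(x, rank, n_iter=200, eps=1e-9, seed=0):
    """Nonnegative CP of a 3-mode tensor via multiplicative updates.

    Hypothetical layout: x[p, d, m] counts how often patient p has
    diagnosis d recorded together with medication m. Each of the `rank`
    components is one candidate phenotype.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in x.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(x, mode) @ kr
            denom = factors[mode] @ (kr.T @ kr) + eps
            # The ratio is nonnegative, so the factors stay nonnegative.
            factors[mode] *= numer / denom
    return factors
```

Under these assumptions, column r of the diagnosis factor and column r of the medication factor together describe candidate phenotype r, and the corresponding column of the patient factor gives each patient's membership weight in it.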
Phenotyping with Partially Labeled, Partially Observed Data
Identifying a group of individuals that share a common set of characteristics is a conceptually simple task that is often difficult in practice. Such phenotyping problems emerge in various settings, including the analysis of clinical data. In this setting, phenotyping is often stymied by persistent data quality issues. These include a lack of reliable labels to indicate the presence or absence of characteristics of interest, and significant missingness in observed variables.
This dissertation introduces methods for learning phenotypes when the data contain missing values (partially observed) and labels are scarce (partially labeled). Aim 1 utilizes an unsupervised probabilistic graphical model to learn phenotypes from partially observed data. Aim 2 introduces a related semi-supervised probabilistic graphical model for learning phenotypes from partially labeled clinical data. Finally, Aim 3 describes a method for training deep generative models when the training data contain missing values. The algorithm is then applied in a semi-supervised setting, where it also accounts for partially labeled data.
Patient Record Summarization Through Joint Phenotype Learning and Interactive Visualization
Complex patients are becoming more and more of a challenge to the health care system, given the amount of care they require and the amount of documentation needed to keep track of their state of health and treatment. Record keeping using the EHR makes this easier, but mounting amounts of patient data also mean that clinicians are faced with information overload. Information overload has been shown to have deleterious effects on care, with increased safety concerns due to missed information. Patient record summarization has been a promising mitigator for information overload. Consequently, a lot of research has been dedicated to record summarization since the introduction of EHRs. In this dissertation we examine whether unsupervised inference methods can derive problem-oriented patient summaries that are robust across different patients. By grounding our experiments in HIV patients, we leverage the data of a group of patients who are similar in that they share one common disease (HIV) but who also exhibit complex histories of diverse comorbidities. Using a user-centered, iterative design process, we design an interactive, longitudinal patient record summarization tool that leverages automated inferences about the patient's problems. We find that unsupervised, joint learning of problems using correlated topic models, adapted to handle the multiple data types (structured and unstructured) of the EHR, is successful in identifying the salient problems of complex patients. Interactive visualization that exposes inference results to users enables them to make sense of a patient's problems over time and to answer questions about a patient more accurately and faster than with the EHR alone.
Computational Algorithms for Multi-omics and Electronic Health Records Data
Real-world data have enhanced healthcare research, improving our understanding of disease progression, aiding in diagnosis, and enabling the development of personalized and targeted treatments. In recent years, multi-omics data and electronic health record (EHR) data have become increasingly available, providing researchers with a wealth of information to analyze. The use of machine learning methods with EHR and multi-omics data has emerged as a promising approach to extract valuable insights from these complex data sources. This dissertation focuses on the development of supervised and unsupervised learning methods, as well as their applications to EHR and multi-omics data, with a particular emphasis on early detection of clinical outcomes and identification of novel cancer subtypes.
The first part of the dissertation centers on developing a risk prediction tool using EHR data that enables early disease detection, so that preventive treatments can be taken to better manage the disease. To this end, we developed a similarity-based supervised learning method and applied it to predict end-stage kidney disease (ESKD) and aortic stenosis (AS). In the second part of the dissertation, we expanded our goal to a phenome-wide prediction task and developed a deep learning method based on patient representations that is able to predict phenotypes across the phenome. Through a weighting scheme, this approach performs tailored disease phenotype prediction in a computationally efficient manner with good prediction performance. In the final part of the dissertation, we shifted the focus to identifying clinically meaningful novel disease subtypes from multi-omics data using unsupervised learning methods. We tackled this goal by integrating multiple patient graphs generated from multiple omics data with molecular-level features for improved disease subtyping.
This dissertation has significantly contributed to the development of data-driven approaches to healthcare and biomedical research using EHR and multi-omics data. The new methodologies, with applications in multiple diseases, advance our knowledge of disease diagnosis and the identification of vulnerable groups, and can ultimately improve patient care.