Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record
The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a large variety of reasons ranging from individual input styles to differences in clinical decision making, for example, which lab tests to issue. Few patients are annotated at a research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases but many machine learning algorithms require a snapshot, at a single time point, to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the Electronic Health Record (EHR). In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning
The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/154448/1/sim8445_am.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/154448/2/sim8445.pd
Prospect patents, data markets, and the commons in data-driven medicine : openness and the political economy of intellectual property rights
Scholars who point to political influences and the regulatory function of patent courts in the USA have long questioned the courts' subjective interpretation of what "things" can be claimed as inventions. The present article sheds light on a different but related facet: the role of the courts in regulating knowledge production. I argue that the recent cases decided by the US Supreme Court and the Federal Circuit, which made diagnostics and software very difficult to patent and which attracted criticism for a wealth of different reasons, are fine case studies of the current debate over the proper role of the state in regulating the marketplace and knowledge production in the emerging information economy. The article explains that these patents are prospect patents that may be used by a monopolist to collect data that everybody else needs in order to compete effectively. As such, they raise familiar concerns about failure of coordination emerging as a result of a monopolist controlling a resource such as datasets that others need and cannot replicate. In effect, the courts regulated the market, primarily focusing on ensuring the free flow of data in the emerging marketplace very much in the spirit of the "free the data" language in various policy initiatives, yet at the same time with an eye to boosting downstream innovation. In doing so, these decisions essentially endorse practices of personal information processing which constitute a new type of public domain: a source of raw materials which are there for the taking and which have become most important inputs to commercial activity. From this vantage point, the legal interpretation of the private and the shared legitimizes a model of data extraction from individuals, the raw material of information capitalism, that will fuel the next generation of data-intensive therapeutics in the field of data-driven medicine.
Generating Reliable and Responsive Observational Evidence: Reducing Pre-analysis Bias
A growing body of evidence generated from observational data has demonstrated the potential to influence decision-making and improve patient outcomes. For observational evidence to be actionable, however, it must be generated reliably and in a timely manner. Large distributed observational data networks enable research on diverse patient populations at scale and develop new sound methods to improve reproducibility and robustness of real-world evidence. Nevertheless, the problems of generalizability, portability and scalability persist and compound. As analytical methods only partially address bias, reliable observational research (especially in networks) must address the bias at the design stage (i.e., pre-analysis bias) including the strategies for identifying patients of interest and defining comparators.
This thesis synthesizes and enumerates a set of challenges to addressing pre-analysis bias in observational studies and presents mixed-methods approaches and informatics solutions for overcoming a number of those obstacles. We develop frameworks, methods and tools for scalable and reliable phenotyping including data source granularity estimation, comprehensive concept set selection, index date specification, and structured data-based patient review for phenotype evaluation. We cover the research on potential bias in the unexposed comparator definition including systematic background rates estimation and interpretation, and definition and evaluation of the unexposed comparator.
We propose that the use of standardized approaches and methods as described in this thesis not only improves reliability but also increases responsiveness of observational evidence. To test this hypothesis, we designed and piloted a Data Consult Service - a service that generates new on-demand evidence at the bedside. We demonstrate that it is feasible to generate reliable evidence to address clinicians' information needs in a robust and timely fashion and provide our analysis of the current limitations and future steps needed to scale such a service.
Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions
Indiana University-Purdue University Indianapolis (IUPUI)
Phenotyping definitions are essential in cohort identification when conducting
clinical research, but they become an obstacle when they are not readily available.
Developing new definitions manually requires expert involvement that is labor-intensive,
time-consuming, and unscalable. Moreover, automated approaches rely mostly on
electronic health records' data that suffer from bias, confounding, and incompleteness.
Limited efforts have been made to utilize text-mining and data-driven approaches to automate
extraction and literature-based knowledge discovery of phenotyping definitions and to
support their scalability. In this dissertation, we proposed a text-mining pipeline combining
rule-based and machine-learning methods to automate retrieval, classification, and
extraction of phenotyping definitions' information from literature. To achieve this, we first
developed an annotation guideline with ten dimensions to annotate sentences with evidence
of phenotyping definitions' modalities, such as phenotypes and laboratories. Two
annotators manually annotated a corpus of sentences (n=3,971) extracted from full-text
observational studies' methods sections (n=86). Percent and Kappa statistics showed high
inter-annotator agreement on sentence-level annotations. Second, we constructed two
validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level.
We applied the abstract-level classifier on a large-scale biomedical literature of over
20 million abstracts published between 1975 and 2018 to classify positive abstracts
(n=459,406). After retrieving their full-texts (n=120,868), we extracted sentences from
their methods sections and used the full-text sentence-level classifier to extract positive
sentences (n=2,745,416). Third, we performed a literature-based discovery utilizing the
positively classified sentences. Lexica-based methods were used to recognize medical
concepts in these sentences (n=19,423). Co-occurrence and association methods were used
to identify and rank phenotype candidates that are associated with a phenotype of interest.
We derived 12,616,465 associations from our large-scale corpus. Our literature-based
associations and large-scale corpus contribute to building new data-driven phenotyping
definitions and expanding existing definitions with minimal expert involvement.
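The abstract above mentions co-occurrence and association methods for ranking phenotype candidates but does not name the measure used. As a minimal sketch, assuming pointwise mutual information (PMI) as the association score and sentence-level co-occurrence as the counting unit, the ranking step might look like the following. The function name `rank_associations` and the toy concepts are hypothetical illustrations, not the dissertation's pipeline.

```python
import math
from collections import Counter
from itertools import combinations


def rank_associations(sentences, target):
    """Rank concepts by PMI with `target`, using sentence-level co-occurrence.

    `sentences` is a list of concept lists, one list per sentence.
    PMI is an assumed association measure; the source does not specify one.
    """
    n = len(sentences)
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in sentences:
        unique = set(concepts)  # count each concept once per sentence
        concept_counts.update(unique)
        pair_counts.update(
            frozenset(pair) for pair in combinations(sorted(unique), 2)
        )

    scores = {}
    for pair, joint in pair_counts.items():
        if target not in pair:
            continue
        (other,) = pair - {target}
        # PMI = log( P(target, other) / (P(target) * P(other)) )
        scores[other] = math.log(
            (joint * n) / (concept_counts[target] * concept_counts[other])
        )
    # Highest association first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input: each inner list holds the concepts recognized in one sentence.
toy_sentences = [
    ["diabetes", "hba1c"],
    ["diabetes", "hba1c"],
    ["diabetes", "metformin"],
    ["asthma", "metformin"],
    ["asthma", "albuterol"],
]
ranked = rank_associations(toy_sentences, "diabetes")
```

In this toy input, `hba1c` co-occurs with `diabetes` more often than chance would predict, so it ranks above `metformin`. A real pipeline would also need frequency thresholds, since PMI tends to overrate rare concepts.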
Advancing Precision Medicine: Unveiling Disease Trajectories, Decoding Biomarkers, and Tailoring Individual Treatments
Chronic diseases are not only prevalent but also exert a considerable strain on the healthcare system, individuals, and communities. Nearly half of all Americans suffer from at least one chronic disease, and that share is still growing. The development of machine learning has brought new directions to chronic disease analysis. Many data scientists have devoted themselves to understanding how a disease progresses over time, which can lead to better patient management, identification of disease stages, and targeted interventions. However, due to the slow progression of chronic disease, symptoms are barely noticed until the disease is advanced, challenging early detection. Meanwhile, chronic diseases often have diverse underlying causes and can manifest differently among patients. Besides the external factors, the development of chronic disease is also influenced by internal signals. The DNA sequence-level differences have been proven responsible for constant predisposition to chronic diseases. Given these challenges, data must be analyzed at various scales, ranging from single nucleotide polymorphisms (SNPs) to individuals and populations, to better understand disease mechanisms and provide precision medicine. Therefore, this research aimed to develop an automated pipeline from building predictive models and estimating individual treatment effects based on the structured data of general electronic health records (EHRs) to identifying genetic variations (e.g., SNPs) associated with diseases to unravel the genetic underpinnings of chronic diseases. First, we used structured EHRs to uncover chronic disease progression patterns and assess the dynamic contribution of clinical features. In this step, we employed causal inference methods (constraint-based and functional causal models) for feature selection and utilized Markov chains, attention long short-term memory (LSTM), and Gaussian process (GP).
SHapley Additive exPlanations (SHAP) and local interpretable model-agnostic explanations (LIME) further extended the work to identify important clinical features. Next, we developed a novel counterfactual-based method to predict individual treatment effects (ITE) from observational data. To discern a "balanced" representation so that treated and control distributions look similar, we disentangled the doctor's preference from the covariance and rebuilt the representation of the treated and control groups. We used integral probability metrics to measure distances between distributions. The expected ITE estimation error of a representation was the sum of the standard generalization error of that representation and the distance between the distributions induced. Finally, we performed genome-wide association studies (GWAS) based on the stage information we extracted from our unsupervised disease progression model to identify the biomarkers and explore the genetic correlation between the disease and its phenotypes.
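The abstract above measures the distance between treated and control representation distributions with integral probability metrics but does not say which one. As a minimal sketch, assuming the maximum mean discrepancy (MMD) with a Gaussian kernel, which is one common integral probability metric, the distance could be estimated from two samples as follows. The function names, the bandwidth default, and the toy vectors are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(a: Vector, b: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two representation vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y), diagonal included."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(
    treated: Sequence[Vector], control: Sequence[Vector], bandwidth: float = 1.0
) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        mean_kernel(treated, treated, bandwidth)
        + mean_kernel(control, control, bandwidth)
        - 2.0 * mean_kernel(treated, control, bandwidth)
    )


# Toy representations: the "near" control group overlaps the treated group,
# while the "far" control group does not.
treated = [(0.0, 0.0), (1.0, 0.0)]
control_near = [(0.1, 0.0), (1.1, 0.0)]
control_far = [(5.0, 5.0), (6.0, 5.0)]

near_distance = mmd_squared(treated, control_near)
far_distance = mmd_squared(treated, control_far)
```

A balanced representation in this sense is one where the estimate stays small, as with `control_near`. In a training loop such a term would typically be added to the prediction loss as a penalty. This sketch only computes the distance.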
Knowledge-based Biomedical Data Science 2019
Knowledge-based biomedical data science (KBDS) involves the design and
implementation of computer systems that act as if they knew about biomedicine.
Such systems depend on formally represented knowledge in computer systems,
often in the form of knowledge graphs. Here we survey the progress in the last
year in systems that use formally represented knowledge to address data science
problems in both clinical and biological domains, as well as on approaches for
creating knowledge graphs. Major themes include the relationships between
knowledge graphs and machine learning, the use of natural language processing,
and the expansion of knowledge-based approaches to novel domains, such as
Chinese Traditional Medicine and biodiversity.
Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages with 3 tables