Automatic Labeling of Special Diagnostic Mammography Views from Images and DICOM Headers.
Applying state-of-the-art machine learning techniques to medical images requires thorough selection and normalization of input data. One such step in digital mammography screening for breast cancer is the labeling and removal of special diagnostic views, in which diagnostic tools or magnification are applied to assist in the assessment of suspicious initial findings. Because a common task in medical informatics is the prediction of disease and its stage, these special diagnostic views, which are enriched only among the cohort of diseased cases, will bias machine learning disease predictions. To automate this process, we developed a machine learning pipeline that uses both DICOM headers and images to predict such views automatically, allowing for their removal and the generation of unbiased datasets. We achieve an AUC of 99.72% in predicting special mammogram views when combining both types of models. Finally, we apply these models to clean a dataset of about 772,000 images with an expected sensitivity of 99.0%. The pipeline presented in this paper can be applied to other datasets to obtain high-quality image sets suitable for training disease detection algorithms.
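The abstract does not specify how the header-based and image-based predictions are combined; a minimal sketch, assuming a simple average of the two models' probabilities and a hypothetical 0.5 decision threshold (both illustrative assumptions, not the paper's method), might look like:

```python
def combine_view_predictions(header_prob, image_prob, threshold=0.5):
    """Flag an image as a special diagnostic view when the average of the
    header-model and image-model probabilities meets the threshold.

    The averaging rule and the 0.5 threshold are illustrative assumptions,
    not the paper's actual combination method.
    """
    combined = (header_prob + image_prob) / 2.0
    return combined >= threshold, combined


# Example: the header model is confident, the image model less so;
# the averaged score still flags the image for removal.
flagged, score = combine_view_predictions(0.95, 0.70)
```

Images flagged this way would then be dropped before assembling a training set, so the special views cannot act as a proxy label for disease.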
Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes.
There is a great and growing need to ascertain the exact state of a patient, in terms of disease progression, actual care practices, pathology, adverse events, and much more, beyond the paucity of data available in structured medical records. Ascertaining these harder-to-reach data elements is now critical for the accurate phenotyping of complex traits, detection of adverse outcomes, assessment of the efficacy of off-label drug use, and longitudinal patient surveillance. Clinical notes often contain the most detailed and relevant digital information about individual patients, the nuances of their diseases, the treatment strategies selected by physicians, and the resulting outcomes. However, notes remain largely unused for research because they contain Protected Health Information (PHI), which is synonymous with individually identifying data. Previous clinical note de-identification approaches have been rigid and too inaccurate to see substantial real-world use, primarily because they were trained on medical text corpora that were too small. To build a new de-identification tool, we created the largest manually annotated clinical note corpus for PHI and developed customizable open-source de-identification software called Philter ("Protected Health Information filter"). Here we describe the design and evaluation of Philter and show how it offers substantial real-world improvements over prior methods.
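As a toy illustration of the pattern-based redaction idea that such tools build on (not Philter's actual pattern set, word lists, or algorithm), a couple of hypothetical regular-expression filters with placeholder tags could be sketched as:

```python
import re

# Two illustrative patterns only -- a real de-identification tool applies
# far larger pattern and word-list sets, plus safeguards against misses.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),   # SSN-like numbers
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),  # MM/DD/YYYY dates
]


def redact(note: str) -> str:
    """Replace each pattern match with its placeholder tag."""
    for pattern, tag in PHI_PATTERNS:
        note = pattern.sub(tag, note)
    return note


print(redact("Seen on 01/15/2020, SSN 123-45-6789."))
# -> Seen on [DATE], SSN [SSN].
```

In practice the hard part is exactly what this sketch omits: names, addresses, and institution-specific identifiers that regular expressions alone catch poorly, which is why large annotated corpora matter for evaluation.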
DEEP LEARNING IN PERSONALIZED MEDICINE: APPLICATIONS IN PATIENT SIMILARITY, PROGNOSIS, AND OPTIMAL TREATMENT SELECTION
Two information technology revolutions are colliding in medicine. The first has been the digitization of health data, specifically Electronic Health Records (EHRs). These records contain the details of who we are as patients: our ailments, treatments, and outcomes. Tragically, despite billions of dollars in investment from the US government, hardly any of this data is being used to better understand medicine or improve healthcare. This is largely because the data is voluminous, sparse, complex, and poorly formatted, making it unsuitable for traditional analytics methods. However, the second revolution, modern Artificial Intelligence, specifically deep learning, provides tools, in the form of algorithms, to address exactly these problems. The primary difference between these modern algorithms and older ones is that the former are able to learn, more or less on their own, how to transform large, complex data into a format that makes it easier to use and learn from. In this dissertation, I have developed methods to apply deep learning to digital health data. In doing so, I have shown that we can predict the future health of individual patients with highly complex diseases, produced approaches to understand and leverage what these complex models are learning, and provided a framework for how healthcare systems of the near future could automatically learn to improve care daily. For the first time in history, we are in a position to learn from the combined knowledge of tens of thousands of physicians and their experience caring for hundreds of millions of patients. The potential transformations to healthcare are difficult to fully fathom, but they certainly include safer, more powerful, and more efficient medicine, and a rapid acceleration of new medical discoveries and treatments. Despite this promise, we must proceed carefully, balancing the great need to collectively use our data for better medicine against the individual right to privacy.
Developing the Total Health Profile, a Generalizable Unified Set of Multimorbidity Risk Scores Derived From Machine Learning for Broad Patient Populations: Retrospective Cohort Study
Background: Multimorbidity clinical risk scores allow clinicians to quickly assess their patients' health for decision making, often for recommendation to care management programs. However, these scores are limited by several issues: existing multimorbidity scores (1) are generally limited to one data group (eg, diagnoses, labs) and may be missing vital information, (2) are usually limited to specific demographic groups (eg, age), and (3) do not formally provide any granularity in the form of more nuanced multimorbidity risk scores to direct clinician attention.
Objective: Using diagnosis, lab, prescription, procedure, and demographic data from electronic health records (EHRs), we developed a physiologically diverse and generalizable set of multimorbidity risk scores.
Methods: Using EHR data from a nationwide cohort of patients, we developed the total health profile, a set of six integrated risk scores reflecting five distinct organ systems and overall health. We selected the occurrence of an inpatient hospital visit over a 2-year follow-up window, attributable to specific organ systems, as our risk endpoint. Using a physician-curated set of features, we trained six machine learning models on 794,294 patients to predict the calibrated probability of the aforementioned endpoint, producing risk scores for heart, lung, neuro, kidney, and digestive functions and a sixth score for combined risk. We evaluated the scores using a held-out test cohort of 198,574 patients.
Results: Study patients closely matched national census averages, with a median age of 41 years, a median income of $66,829, and racial averages by zip code of 73.8% White, 5.9% Asian, and 11.9% African American. All models were well calibrated and demonstrated strong performance, with areas under the receiver operating characteristic curve (AUROCs) of 0.83 for the total health score (THS), 0.89 for heart, 0.86 for lung, 0.84 for neuro, 0.90 for kidney, and 0.83 for digestive functions. Performance was consistent across sexes, diverse patient ages, and zip code income levels. Each model learned to generate predictions by focusing on appropriate, clinically relevant patient features, such as heart-related hospitalizations and chronic hypertension diagnoses for the heart model. The THS outperformed the other commonly used multimorbidity scoring systems, specifically the Charlson Comorbidity Index (CCI) and the Elixhauser Comorbidity Index (ECI), overall (AUROCs: THS=0.823, CCI=0.735, ECI=0.649) as well as for every age, sex, and income bracket. Performance improvements were most pronounced for middle-aged and lower-income subgroups. Ablation tests using only diagnosis, prescription, social determinants of health, and lab feature groups, while retaining procedure-related features, showed that the combination of feature groups has the best predictive performance, though only marginally better than the diagnosis-only model on at-risk groups.
Conclusions: Massive retrospective EHR datasets have made it possible to use machine learning to build practical multimorbidity risk scores that are highly predictive, personalizable, intuitive to explain, and generalizable across diverse patient populations.
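For reference, the AUROC figures quoted above equal the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case (ties counted as half). A minimal, self-contained computation of that quantity (not the study's evaluation code) is:

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U)
    formulation: the fraction of positive/negative pairs in which the
    positive case outscores the negative one, counting ties as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Small illustrative example: 3 of the 4 positive/negative pairs are
# correctly ordered by score, so the AUROC is 0.75.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```

The pairwise definition also explains why AUROC can be compared across subgroups of different sizes, as the study does for age, sex, and income brackets.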
Victims of human trafficking and exploitation in the healthcare system: a retrospective study using a large multi-state dataset and ICD-10 codes
Trafficking and exploitation for sex or labor affect millions of persons worldwide. To improve healthcare for these patients, new ICD-10 medical diagnosis codes were implemented in the US in late 2018. These 13 codes include diagnoses of adult and child sexual exploitation, adult and child labor exploitation, and history of exploitation. Here we report on a database search of a large US health insurer covering approximately 47.1 million patients and 0.9 million provider organizations, not limited to large medical systems. We report on any diagnosis with the new codes between 2018-09-01 and 2022-09-01. The dataset contained 5,262 instances of the ICD-10 codes. Regression analysis found a 5.8% annual increase in the uptake of these codes, representing a decline relative to the 6.7% annual increase in the data overall. The codes were used by 1,810 different providers (0.19% of the total) for 2,793 patients. Of the patients, 1,248 were recently trafficked, while the remainder had a personal history of exploitation. Of the recent cases, 86% experienced sexual exploitation, 14% labor exploitation, and 0.8% both types. These patients were predominantly female (83%) with a median age of 20 (interquartile range: 15–35). The patients were characterized by a persistently high prevalence of mental health conditions (including anxiety: 21%, post-traumatic stress disorder: 20%, major depression: 18%), sexually transmitted infections, and high utilization of the emergency department (ED). The patients' first report of trafficking occurred most often outside of a hospital or emergency setting (55%), primarily during office and psychiatric visits.
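The abstract does not specify the regression model behind the annual-increase figures. One common way to estimate an annual percentage increase from yearly counts is a log-linear least-squares fit, sketched here with hypothetical counts (the numbers below are not from the study):

```python
import math


def annual_growth_rate(yearly_counts):
    """Fit log(count) = a + b * year by ordinary least squares and return
    the implied fractional change per year, exp(b) - 1.
    """
    years = list(range(len(yearly_counts)))
    logs = [math.log(c) for c in yearly_counts]
    n = len(years)
    mean_y = sum(years) / n
    mean_l = sum(logs) / n
    slope = sum((y - mean_y) * (l - mean_l) for y, l in zip(years, logs)) \
        / sum((y - mean_y) ** 2 for y in years)
    return math.exp(slope) - 1


# Hypothetical yearly counts growing by exactly 10% per year,
# so the fitted rate recovers 0.10.
rate = annual_growth_rate([1000, 1100, 1210, 1331])
```

Under a model like this, comparing the fitted rate for the trafficking codes (5.8%) against the fitted rate for overall record volume (6.7%) is what supports the abstract's conclusion that relative uptake declined.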