Using Electronic Health Record Data for Public Health Surveillance of Diabetes among Young Adults

Abstract

There is growing interest in using electronic health records (EHRs) for the surveillance of chronic diseases because these data contain a wealth of timely clinical information on large samples of individuals. However, because these data are collected for clinical purposes, they may be prone to a number of biases that could affect their utility for public health practitioners or researchers. First, these data represent convenience samples of individuals who are in care. These samples may differ from the target populations for public health surveillance activities (e.g., the general population within a city or jurisdiction) with respect to factors such as demographics or health status, which could affect the generalizability of results. Second, key variables, including demographics or disease status, are often susceptible to missing data or misclassification, which could affect estimation of disease prevalence or risk factor associations. The goal of this integrated learning experience (ILE) was to assess the application of EHR data for chronic disease surveillance, focusing on the potential impact of selection and information biases, using diabetes as a case study. The first aim characterized the existing literature on defining diabetes status and type using EHRs from a population health perspective. The second aim externally validated diabetes prevalence estimates generated using EHR data from a large academic medical center in New York City (NYC) against traditional surveillance estimates from a local health survey. Various statistical methods, including raking, post-stratification, and multilevel regression with post-stratification, were applied to these real-world data and to simulated data to assess their ability to mitigate selection biases.
Finally, the third aim externally validated EHR-based associations with potential diabetes risk factors (i.e., race/ethnicity and asthma) compared to estimates from national surveillance systems, including the Behavioral Risk Factor Surveillance System and National Health and Nutrition Examination Survey. Methods from the missing data and causal inference literature were then applied to assess the ability to control for misclassification of health outcomes in the EHR data. Results from the literature review demonstrated that while there was no gold standard for defining diabetes using EHR data, definitions that prioritized sensitivity over specificity may be preferable for population health purposes. Based on this review, a flexible definition that searched for evidence of diabetes across diagnoses, medications, and lab results was used for the second and third aims. In the second aim, using statistical methods to account for demographic differences between the EHR sample and general population helped to remediate biases observed in the crude diabetes prevalence estimates. However, simulation results demonstrated that these methods may be insufficient when data are lacking for variables that are strong predictors of selection into the EHR sample. In the third aim, applying missing data or causal inference methods to control for misclassification of health outcomes greatly reduced the strength of the association between asthma and diabetes status compared to naïve associations observed within the EHR sample, in alignment with observations from national health survey data. Overall, the findings of this ILE suggest that naïve EHR analyses may yield biased estimates of diabetes prevalence or measures of association, driven in part by differences in healthcare utilization patterns across the population. However, applying epidemiologic frameworks can help control for and, importantly, characterize residual biases in these estimates. 
Future research is needed to assess the potential for selection and information biases across a variety of health outcomes, geographies, and EHR data sources to further inform the utility of these data for population health surveillance.
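
As a rough illustration of the post-stratification adjustment named in the abstract, the sketch below reweights stratum-specific prevalence from an EHR sample to a reference population's age distribution. It is not the analysis performed in this work: the age groups, counts, and population shares are invented for illustration, and the function names are hypothetical.

```python
# Minimal post-stratification sketch. All numbers below are made up
# for illustration; they are not results from the study.

def crude_prevalence(strata):
    """Unadjusted prevalence: total cases divided by total sample size."""
    cases = sum(s["cases"] for s in strata.values())
    n = sum(s["n"] for s in strata.values())
    return cases / n


def poststratified_prevalence(strata, population_shares):
    """Weight each stratum's prevalence by its share of the target population."""
    total_share = sum(population_shares.values())
    return sum(
        (population_shares[group] / total_share)
        * (strata[group]["cases"] / strata[group]["n"])
        for group in strata
    )


# Hypothetical EHR sample that over-represents the oldest age group.
ehr_sample = {
    "18-24": {"n": 200, "cases": 4},
    "25-34": {"n": 300, "cases": 12},
    "35-44": {"n": 500, "cases": 40},
}

# Hypothetical age distribution of the target (general) population.
population_shares = {"18-24": 0.30, "25-34": 0.40, "35-44": 0.30}

crude = crude_prevalence(ehr_sample)
adjusted = poststratified_prevalence(ehr_sample, population_shares)

print(f"Crude prevalence:           {crude:.3f}")
print(f"Post-stratified prevalence: {adjusted:.3f}")
```

In this toy example the crude estimate is higher than the adjusted one because the sample skews toward the age group with the highest prevalence. The adjustment only corrects for variables used to define the strata, which is consistent with the abstract's caution that such methods may fall short when strong predictors of selection are not available.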
