High-throughput Characterization of Diagnosis Disparities Across Conditions and Observational Datasets


Health disparities are preventable differences in health status and outcomes that adversely affect certain populations and are generally attributable to unjust social or environmental influences. Mitigating health disparities is crucial to preventing unnecessary and avoidable human suffering, and as such there has been a significant increase in health disparities research and funding. However, existing health disparities publications are geographically constrained to specific institutions or populations, and often rely on disease definitions that cannot be easily applied elsewhere. While more recent publications have begun identifying differences using larger datasets, for most diseases, differences in prevalence, age of onset, and time to diagnosis remain unstudied and unknown. This dissertation leverages informatics solutions built atop observational health datasets to enable high-throughput, reproducible assessments of disparities across subgroups, conditions, and datasets. In the first aim, this dissertation examines the literature to identify how health disparities in disease diagnosis are measured, computed, and reported. It then proposes an iterative approach for generating fair phenotype definitions that are more inclusive of subgroups of interest by using algorithmic fairness measurements translated into epidemiological measures. In the second aim, this dissertation conducts large-scale characterizations of disease diagnosis patterns across subgroups (gender and race), conditions, and datasets. In particular, it conducts a prevalence-based assessment of disease diagnosis by computing prevalence differences, risk ratios, and age of onset differences across diseases and datasets. It then conducts a scalable assessment of time to diagnosis differences across 122 disease phenotypes. Finally, in the third aim, this dissertation moves from quantifying differences to identifying disparities in diagnosis. To do so, it applies a framework for causal fairness to decompose observed time to diagnosis differences into direct, indirect, and spurious effects. In conclusion, this dissertation's primary contribution is a systematic, scalable approach for identifying health differences and then quantifying health disparities across large-scale observational health datasets. The dissertation (1) proposes an iterative approach for systematically assessing the fairness of phenotypes used in observational health research, (2) systematically characterizes differential patterns of disease diagnosis across diseases and observational datasets, and (3) causally decomposes differences into quantifiable effects that suggest the presence of potential health disparities.


Columbia University Academic Commons


This paper was published in Columbia University Academic Commons.
