To enhance the quality of medical services, Machine Learning (ML) techniques have been widely applied to model Electronic Health Records (EHRs). Nevertheless, clinical data present two significant challenges, data heterogeneity and complex causality, that hinder the further application of ML models. The first challenge stems from the complexity of the data structure: EHRs may combine information from various sources, much of it in an unstructured format. One viable approach is to transform the raw EHR data into knowledge graphs (KGs) and apply Graph Neural Networks (GNNs). However, given the imbalanced distribution and inherent heterogeneity of EHR data, more robust GNNs tailored to the specifics of EHRs are needed. In this thesis, we introduce two models, HSGNN and MHDP, custom-designed to handle specific EHR data and tackle this challenge.
The second challenge is rooted in the intricate latent structure of EHRs and the potential for algorithmic bias. Given the high cost of collecting EHR data, EHRs may suffer from inherent selection bias and from data that are missing not at random. These issues can in turn cause algorithmic bias in ML models, especially when data volume is low. Furthermore, as demographics in EHRs often act as confounders, deep learning models may exhibit confounding bias when working with observational data. To tackle these issues, we draw on causal inference theory, including the use of a deconfounder, to mitigate health disparities and enhance the generalization capabilities of our models. In this thesis, we introduce two models, PriMeD and FLMD, to achieve fairer predictions and more generalizable models.