HIGH-DIMENSIONAL DATA ANALYSIS PROBLEMS IN INFECTIOUS DISEASE STUDIES

Abstract

Recent technological developments give researchers the opportunity to obtain large, informative datasets when studying infectious diseases. Such datasets are often high-dimensional, which presents challenges for classical multivariate analysis methods. It is critical to develop novel methods that can solve problems arising in infectious disease studies when the data are high-dimensional or have complex structure. In the first project, we focus on a Plasmodium vivax malaria infection study. A standard competing risks set-up requires both the time to event and the cause of failure to be fully observable for all subjects. In practice, however, the cause of failure may not always be observable, which impedes risk assessment. In some extreme cases, none of the causes of failure is observable. In the case of a recurrent episode of Plasmodium vivax malaria following treatment, the patient may have suffered a relapse from a previous infection or acquired a new infection from a mosquito bite. Because the two causes cannot be told apart, the time to relapse cannot be modeled in the presence of the competing risk of a new infection. The efficacy of a treatment for preventing relapse from a previous infection may be underestimated when the true cause of infection cannot be classified. Therefore, we developed a novel method for classifying the latent cause of failure under a competing risks set-up, which uses not only time-to-event information but also transition likelihoods between covariates at baseline and at the time of event occurrence. Our classifier shows superior performance under various scenarios in simulation experiments. The method was applied to Plasmodium vivax infection data to classify recurrent infections of malaria. In the second project, we investigate data collected from a Chlamydia trachomatis genital tract infection study. Many biomedical studies collect data on variables of mixed types from multiple groups of subjects.
Some of these studies aim to find the group-specific and the common variation among all these variables. Although previous work has studied similar problems, the existing methods rely mainly on the Pearson correlation, which cannot handle mixed data. To address this issue, we propose a Latent Mixed Gaussian Copula model that can quantify the correlations among binary, categorical, continuous, and truncated variables in a unified framework. We also provide a tool to decompose the variation into the group-specific and the common variation over multiple groups by solving a regularized M-estimation problem. We conduct extensive simulation studies to show the advantage of our proposed method over Pearson correlation-based methods. We also demonstrate that jointly solving the M-estimation problem over multiple groups performs better than decomposing the variation group by group. We apply our method to a Chlamydia trachomatis genital tract infection study to demonstrate how it can be used to discover informative biomarkers that differentiate patients. When performing variance decomposition for data collected from the Chlamydia trachomatis genital tract infection study, we considered only subjects with complete data for all data modalities and removed subjects with missing values. Because not all subjects have complete data from all data modalities, the mixed-type data has a block-wise missing structure. Simply removing subjects with block-wise missing values would greatly reduce the sample size and thereby lose valuable information. To use as much data as possible when the mixed-type data has a block-wise missing structure, in the third project we propose imputing the block-wise missing values with the Latent Mixed Gaussian Copula model, using the underlying correlations between fully observed and partially observed variables.
The proposed method can be applied to multi-modal data with various data types. We performed extensive simulation experiments to examine the effect of the true latent correlation, the missingness mechanism, and the dimensionality on the performance of our proposed method, and compared it with three other popular approaches. Our method shows superior performance for imputing mixed-type data compared with the other methods under different scenarios. We also applied the method to the multi-modal data collected from a Chlamydia trachomatis genital tract infection study to impute missing endometrial infection status, endometrial diagnosis results, and truncated cytokine values.

Doctor of Philosophy
