Statistical modeling and statistical learning for disease prediction and classification

Abstract

This dissertation studies prediction and classification models for disease risk through semiparametric modeling and statistical learning. It consists of three parts. In the first part, we propose several survival models to analyze the Cooperative Huntington's Observational Research Trial (COHORT) study data, accounting for the missing mutation status in relative participants (Kieburtz and Huntington Study Group, 1996a). Huntington's disease (HD) is a progressive neurodegenerative disorder caused by an expansion of cytosine-adenine-guanine (CAG) repeats at the IT15 gene. Individuals with 36 or more CAG repeats are defined as mutation carriers, and carriers will eventually show symptoms if not censored by other events. There is an inverse relationship between the age-at-onset of HD and the CAG repeat length: the greater the CAG expansion, the earlier the age-at-onset. Accurate estimation of age-at-onset based on CAG repeat length is important for genetic counseling and the design of clinical trials for HD. Participants in COHORT (denoted as probands) undergo a genetic test and their CAG repeat number is determined. Family members of the probands do not undergo the genetic test, and their HD onset information is provided by the probands. Several methods have been proposed in the literature to model the age-specific cumulative distribution function (CDF) of HD onset as a function of the CAG repeat length. However, none of the existing methods can be directly used to analyze COHORT proband and family data because family members' mutation status is not always known. In this work, we treat the presence or absence of an expanded CAG repeat in first-degree family members as missing data and use the expectation-maximization (EM) algorithm to carry out maximum likelihood estimation for the COHORT proband and family data jointly.
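To make the missing-data idea concrete, the following is a minimal toy sketch of an EM iteration for relatives whose carrier status is unobserved. It is not the semiparametric model of the dissertation: it assumes, purely for illustration, an exponential onset time for carriers, no onset for non-carriers, a known prior carrier probability of 0.5 for first-degree relatives, and no dependence on CAG repeat length. The function names `simulate_family_data` and `em_onset_rate` are hypothetical.

```python
import math
import random


def simulate_family_data(n, rate, prior_carrier=0.5, seed=0):
    """Simulate (time, event) pairs for relatives with unobserved carrier status.

    Toy assumptions: carriers have exponential onset ages, non-carriers never
    develop the disease, and follow-up is censored at a uniform age.
    """
    rng = random.Random(seed)
    times, events = [], []
    for _ in range(n):
        carrier = rng.random() < prior_carrier
        censor_age = rng.uniform(20.0, 80.0)
        onset_age = rng.expovariate(rate) if carrier else math.inf
        times.append(min(onset_age, censor_age))
        events.append(1 if onset_age <= censor_age else 0)
    return times, events


def em_onset_rate(times, events, prior_carrier=0.5, max_iter=500, tol=1e-10):
    """EM estimate of the carrier onset rate when carrier status is missing."""
    # Starting value: pretend every relative is a carrier.
    rate = sum(events) / sum(times)
    weights = [1.0] * len(times)
    for _ in range(max_iter):
        # E-step: posterior probability that each relative is a carrier.
        weights = []
        for t, d in zip(times, events):
            if d == 1:
                # An observed onset implies carrier status under the toy model.
                weights.append(1.0)
            else:
                surv = math.exp(-rate * t)  # carrier still onset-free at age t
                weights.append(
                    prior_carrier * surv
                    / (prior_carrier * surv + 1.0 - prior_carrier)
                )
        # M-step: weighted exponential maximum likelihood update.
        new_rate = sum(events) / sum(w * t for w, t in zip(weights, times))
        converged = abs(new_rate - rate) < tol
        rate = new_rate
        if converged:
            break
    return rate, weights


times, events = simulate_family_data(n=2000, rate=0.05, seed=1)
rate_hat, carrier_weights = em_onset_rate(times, events)
```

In this sketch a censored relative contributes to the estimate only in proportion to the posterior probability of being a carrier, so the EM estimate is never smaller than the naive estimate that treats every relative as a carrier.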
We perform simulation studies to examine the finite sample performance of the proposed methods and apply these methods to estimate the CDF of HD age-at-onset from the combined COHORT proband and family data. Our results show a slightly lower estimated cumulative risk of HD with the combined data than with the proband data alone. We then extend the approach to predict the cumulative risk of disease, accommodating predictors with time-varying effects and outcomes subject to censoring. We model the time-specific effect through a nonparametric varying-coefficient function and handle censoring through self-consistency equations that redistribute the probability mass of censored outcomes to the right. The computational procedure is simple and can be implemented with standard software. We prove large sample properties of the proposed estimator and evaluate its finite sample performance through simulation studies. We apply the method to estimate the cumulative risk of developing HD among mutation carriers in the COHORT data and illustrate an inverse relationship between the cumulative risk of HD and the length of CAG repeats at the IT15 gene.

In the second part of the dissertation, we develop methods to accurately predict whether pre-symptomatic individuals are at risk of a disease based on their marker profiles, which offers an opportunity for early intervention well before definitive clinical diagnosis. For many diseases, the existing clinical literature may suggest that the risk of disease varies with markers of biological and etiological importance, for example, age. To identify effective prediction rules using nonparametric decision functions, standard statistical learning approaches treat markers with clear biological importance (e.g., age) and other markers without prior knowledge of disease etiology interchangeably as input variables.
Therefore, these approaches may be inadequate in singling out and preserving the effects of the biologically important variables, especially in the presence of potential noise markers. Using age as an example of a salient marker that receives special care in the analysis, we propose a local smoothing large margin classifier, implemented with support vector machines, to construct effective age-dependent classification rules. The method adaptively adjusts for the age effect and tunes age and the other markers separately to achieve optimal performance. We derive the asymptotic risk bound of the local smoothing support vector machine and perform extensive simulation studies to compare it with standard approaches. We apply the proposed method to two studies of premanifest HD subjects and controls to construct age-sensitive predictive scores for the risk of HD and the risk of receiving an HD diagnosis during the study period.

In the third part of the dissertation, we develop a novel statistical learning method for longitudinal data. Predicting disease risk and progression is one of the main goals of many clinical studies. Cohort studies on the natural history and etiology of chronic diseases span years, and data are collected at multiple visits. Although kernel-based statistical learning methods have proven powerful for a wide range of disease prediction problems, they have been well studied only for independent data, not for longitudinal data. It is thus important to develop time-sensitive prediction rules that make use of the longitudinal nature of the data. We develop a statistical learning method for longitudinal data by introducing subject-specific long-term and short-term latent effects through designed kernels to account for the within-subject correlation of longitudinal measurements.
Since the presence of multiple sources of data is increasingly common, we embed our method in a multiple kernel learning framework and propose regularized multiple kernel statistical learning with random effects to construct effective nonparametric prediction rules. Our method allows easy integration of heterogeneous data sources and takes advantage of the correlation among longitudinal measures to increase prediction power. We use a different kernel for each data source to take advantage of the distinctive features of each data modality, and then optimally combine data across modalities. We apply the developed methods to two large epidemiological studies, one on Huntington's disease and the other on Alzheimer's disease (the Alzheimer's Disease Neuroimaging Initiative, ADNI), where we explore a unique opportunity to combine imaging and genetic data to predict the conversion from mild cognitive impairment to dementia, and show a substantial gain in performance while accounting for the longitudinal nature of the data.
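As a rough illustration of how modality-specific kernels and subject-specific latent effects can enter a single Gram matrix, the sketch below builds a weighted sum of one kernel per data source and adds two within-subject terms: a constant block for a long-term effect and a time-decaying block for a short-term effect. The specific choices here (a Gaussian kernel for imaging, a linear kernel for genetic markers, an exponential decay in time, and fixed modality weights) are assumptions made for the example and are not taken from the dissertation, where the combination is learned. The names `combined_gram`, `tau_long`, `tau_short`, and `rho` are hypothetical.

```python
import math


def rbf(u, v, gamma):
    """Gaussian kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def linear(u, v):
    """Linear kernel between two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def combined_gram(rows, weights, tau_long=1.0, tau_short=1.0, rho=1.0, gamma=0.5):
    """Gram matrix over visits: weighted modality kernels plus subject terms.

    Each row is one visit with keys "subject", "time", "imaging", "genetic".
    Non-negative weights and variances keep the sum positive semidefinite,
    because every summand is itself a valid kernel.
    """
    n = len(rows)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a, b = rows[i], rows[j]
            # One kernel per data source, combined with fixed weights here.
            k = weights["imaging"] * rbf(a["imaging"], b["imaging"], gamma)
            k += weights["genetic"] * linear(a["genetic"], b["genetic"])
            if a["subject"] == b["subject"]:
                # Long-term effect: shared by all visits of the same subject.
                k += tau_long
                # Short-term effect: correlation decays with the time gap.
                k += tau_short * math.exp(-abs(a["time"] - b["time"]) / rho)
            gram[i][j] = k
    return gram


rows = [
    {"subject": "A", "time": 0.0, "imaging": [0.2, 1.1], "genetic": [1.0, 0.0]},
    {"subject": "A", "time": 1.0, "imaging": [0.3, 1.0], "genetic": [1.0, 0.0]},
    {"subject": "B", "time": 0.0, "imaging": [0.2, 1.1], "genetic": [1.0, 0.0]},
]
K = combined_gram(rows, weights={"imaging": 0.6, "genetic": 0.4})
```

A matrix built this way could be handed to any kernel method that accepts a precomputed Gram matrix. In the example, two visits with identical markers are more similar when they belong to the same subject than when they belong to different subjects, which is how the within-subject correlation enters the prediction rule.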
