    Haplotype-Based Pharmacogenetic Analysis for Longitudinal Quantitative Traits in the Presence of Dropout

    We propose several methods based on generalized estimating equations to address issues encountered in haplotype-based pharmacogenetic analysis, including the analysis of longitudinal data with outcome-dependent dropout and the overall evaluation of high-dimensional haplotype and haplotype-drug interaction effects. We use inverse probability weights to handle outcome-dependent dropout under the missing-at-random assumption, and incorporate a weighted L1 penalty to select important main and interaction effects in high dimensions. The proposed methods are easy to implement, computationally efficient, and provide an optimal balance between false positives and false negatives in detecting genetic effects.
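The inverse probability weighting idea can be sketched for the simplest case: a linear mean model with an independence working correlation, where each observed response is weighted by the inverse of its probability of being observed. This is a minimal illustration, not the proposed method itself. The function name `ipw_linear_gee` is hypothetical, the observation probabilities are assumed to be known (in practice they would be estimated from a dropout model), and the weighted L1 penalty is omitted.

```python
import numpy as np


def ipw_linear_gee(X, y, observed, prob_observed):
    """Solve sum_i (R_i / pi_i) * x_i * (y_i - x_i' beta) = 0 for beta.

    X             : (n, p) design matrix
    y             : (n,) responses; entries for unobserved rows may be NaN
    observed      : (n,) indicators R_i (1 = observed, 0 = dropped out)
    prob_observed : (n,) probabilities pi_i of being observed (assumed known)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.asarray(observed, dtype=float)
    pi = np.asarray(prob_observed, dtype=float)

    # Inverse probability weights; unobserved rows get weight zero.
    w = r / pi

    # Replace unobserved responses by 0 so that NaN does not propagate.
    y_filled = np.where(r > 0, y, 0.0)

    # With an identity link and independence working correlation, the
    # weighted estimating equation reduces to weighted least squares.
    A = X.T @ (w[:, None] * X)
    b = X.T @ (w * y_filled)
    return np.linalg.solve(A, b)
```

Subjects who are unlikely to remain in the study but do remain receive large weights, so they stand in for similar subjects who dropped out.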

    Statistical Methods for Multi-State Analysis of Incomplete Longitudinal Data

    Analyses of longitudinal categorical data are typically based on semiparametric models in which covariate effects are expressed on marginal probabilities and estimation is carried out using generalized estimating equations (GEE). Methods based on GEE are motivated in part by the lack of tractable models for clustered categorical data. However, such marginal methods may not yield fully efficient estimates, and may not yield consistent estimates when data are missing. In the first part of the thesis I develop a Markov model for the analysis of longitudinal categorical data which facilitates modeling of both marginal and conditional structures. A likelihood formulation is employed for inference, so the resulting estimators enjoy properties such as consistency and optimal efficiency, and remain consistent when data are missing at random. Simulation studies demonstrate that the proposed method performs well under a variety of situations. An application to data from a smoking prevention study illustrates the utility of the model and the interpretation of covariate effects. Incomplete data arise in many areas of research and are especially common in longitudinal data on the disease history of subjects. Progressive models provide a convenient framework for characterizing disease processes; they arise, for example, when the state represents the degree of irreversible damage incurred by the subject. Problems arise if the mechanism leading to the missing data is related to the response process, in which case a naive analysis might lead to biased results and invalid inferences. The second part of this thesis begins with an investigation of progressive multi-state models for longitudinal studies with incomplete observations. Maximum likelihood estimation is carried out using an EM algorithm, and variance estimation is based on Louis' method. In general, maximum likelihood estimates are valid when the data are missing completely at random or missing at random. Here we provide a likelihood-based method in which the parameters are identifiable regardless of the missing data mechanism. Simulation studies demonstrate that the proposed method works well under a variety of situations. In practice, we often face data with missing values in both the response and the covariates, and sometimes the missingness of the response is associated with the missingness of the covariates. Proper analysis of this type of data requires taking this association into account. The impact of attrition in longitudinal studies depends on the correlation between the missing response and missing covariate, and ignoring such correlation can bias the statistical inference. We study a method that incorporates the association between the missingness of the response and that of the covariate through inverse probability weighted generalized estimating equations. Simulations illustrate that the proposed method yields a consistent estimator, while the method that ignores the association yields an inconsistent estimator. Many analyses of incomplete longitudinal data focus on the impact of covariates on the mean responses. However, little attention has been directed to the impact of missing covariates on the association parameters in clustered longitudinal studies. The last part of this thesis addresses this problem. Weighted first- and second-order estimating equations are constructed to obtain consistent estimates of the mean and association parameters.
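The structure of a progressive model can be illustrated with a small discrete-time sketch: from each state a subject either stays or advances to the next state, and the last state is absorbing, so the transition matrix is upper triangular. This is an assumption-laden illustration only. The function names, the discrete time scale, and the one-step-forward restriction are chosen here for simplicity and may differ from the models developed in the thesis.

```python
import numpy as np


def progressive_transition_matrix(stay_probs):
    """Build a one-step transition matrix for a progressive chain.

    stay_probs[k] is the probability of remaining in state k for one step;
    otherwise the subject moves to state k + 1. The final state is absorbing,
    which encodes the irreversibility of the damage.
    """
    stay = np.asarray(stay_probs, dtype=float)
    n_states = stay.size + 1
    P = np.zeros((n_states, n_states))
    for k, p in enumerate(stay):
        P[k, k] = p            # remain in the current state
        P[k, k + 1] = 1.0 - p  # progress to the next state
    P[-1, -1] = 1.0            # absorbing final state
    return P


def state_distribution(P, initial, n_steps):
    """Distribution over states after n_steps, given an initial distribution."""
    return np.asarray(initial, dtype=float) @ np.linalg.matrix_power(P, n_steps)
```

Because no transition moves backward, every entry below the diagonal is zero, which is what makes likelihood contributions from partially observed paths comparatively easy to enumerate.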

    Estimating Risk-adjusted Process Performance with a Bias/Variance Trade-off

    Decision makers responsible for managing the performance of a process commonly base their decisions on an estimate of present performance, a comparison of estimates across multiple streams, and the trend in performance estimates over time. Their decisions are well-informed when the risk-adjusted estimates of the performance measure (or parameter) are accurate and precise. This work is motivated by three applications in which a parameter must be estimated at the present time from a stream of data, where the parameter drifts slowly and unpredictably over time. It is common practice to estimate its value using either present time data only or both present and historical data. When sample sizes by time period are small, an estimate based on present time data is imprecise and can lead to uninformative or misleading conclusions. We can instead estimate the parameter using an aggregate of historical and present time data, but when the parameter is drifting over time this choice trades more bias for less variability. We propose to regulate the bias/variance trade-off using estimating equations that down-weight past data. We derive approximations for the variance of the estimator and for the distribution of a hypothesis test statistic involving the estimator through known asymptotic properties of the estimating functions. We study the proposed approach relative to current practices with real or realistic data from each application. We offer simulations and analytic examples to generalize the comparisons and validate the approximations. We explore considerations related to implementing the proposed approach. We suggest future work to extend the applicability of this work.
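A down-weighted estimating equation can be sketched for the simplest parameter, a mean. The example assumes geometric weights, so that data from a period `a` steps in the past receive weight `decay ** a`; this weighting scheme and the function name `downweighted_mean` are illustrative assumptions rather than the specific choices made in the thesis, and risk adjustment is omitted.

```python
def downweighted_mean(period_samples, decay):
    """Solve sum_t decay**(T - t) * sum_i (y_ti - theta) = 0 for theta.

    period_samples : list of lists of observations, oldest period first;
                     the last entry is the present period T
    decay          : value in [0, 1]; 0 uses present data only,
                     1 pools all periods with equal weight
    """
    last = len(period_samples) - 1
    numerator = 0.0
    denominator = 0.0
    for t, samples in enumerate(period_samples):
        weight = decay ** (last - t)  # older periods get smaller weights
        numerator += weight * sum(samples)
        denominator += weight * len(samples)
    return numerator / denominator
```

For example, `downweighted_mean([[1, 1], [3]], 0.5)` lies between the present-only estimate (3) and the pooled estimate (5/3), which is the bias/variance compromise the weights are meant to control.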

    Longitudinal Data Analysis with Composite Likelihood Methods

    Longitudinal data arise commonly in many fields, including public health studies and survey sampling. Valid inference methods for longitudinal data are of great importance in scientific research. In longitudinal studies, data collection is often designed to record all the information of interest on individuals at scheduled times. The analysis usually focuses on how the data change over time and how they are associated with certain risk factors or covariates. Various statistical models and methods have been developed over the past few decades. However, these methods can become invalid when data possess additional features. First of all, incompleteness of data presents considerable complications for standard modeling and inference methods. Although we hope that each individual completes all of the scheduled measurements, missing observations occur commonly in longitudinal studies. It has been documented that biased results can arise if such a feature is not properly accounted for in the analysis. There is a large body of methods in the literature for handling missingness in either the response or the covariates, but relatively little attention has been directed to addressing missingness in both simultaneously. The scarcity of research on this topic may be attributed to the substantially increased modeling complexity and computational difficulty. In Chapter 2 and Chapter 3 of the thesis, I develop methods to handle incomplete longitudinal data using the pairwise likelihood formulation. The proposed methods can handle longitudinal data with missing observations in both response and covariate variables. A unified framework is invoked to accommodate various types of missing data patterns. The performance of the proposed methods is carefully assessed under a variety of circumstances. In particular, issues of efficiency and robustness are investigated. Longitudinal survey data from the National Population Health Study are analyzed with the proposed methods. The other difficulty in longitudinal data analysis is model selection. Incorporating a large number of irrelevant covariates into the model may cause difficulties in computation, interpretation, and prediction, so selecting a parsimonious model is typically desirable. The penalized likelihood method is commonly employed for this purpose. However, applying the penalized likelihood approach in longitudinal studies may involve high-dimensional integrals that are computationally expensive. We propose an alternative method using the composite likelihood formulation, which requires only a partial structure of the correlated data, such as marginal or pairwise distributions. This strategy makes model selection tractable and computationally inexpensive. Therefore, in Chapter 4 of this thesis, I propose a composite likelihood approach with a penalty function to handle the model selection issue. In practice, the model selection problem involves choosing not only the covariates for the regression predictor but also the random effects components. Furthermore, the specification of the random effects distribution can be crucial to maintaining the validity of statistical inference. Thus, selection of both covariates and random effects, as well as misspecification of the random effects distribution, is also discussed in Chapter 4. Chapter 5 of this thesis addresses missingness and model selection jointly. I propose a specific composite likelihood method to handle this issue. A notable advantage of the approach is that the inference procedure does not require explicit assumptions about the missing data process or estimation of nuisance parameters.
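For reference, a generic penalized pairwise likelihood objective can be written as follows. The notation is assumed here for illustration: subjects i = 1, ..., n, responses y_{ij}, bivariate densities f, regression coefficients \beta_s, and a penalty function p_\lambda. The thesis may use different weights, penalties, or notation.

```latex
\ell_{p}(\theta) = \sum_{i=1}^{n} \sum_{j<k} \log f(y_{ij}, y_{ik}; \theta),
\qquad
Q(\theta) = \ell_{p}(\theta) - n \sum_{s=1}^{d} p_{\lambda}\bigl(|\beta_{s}|\bigr).
```

Only bivariate densities appear in the first expression, so at most low-dimensional integrals are needed even when the full joint distribution of a subject's responses would require a high-dimensional one.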

    Analysis of Correlated Data with Measurement Error in Responses or Covariates

    Correlated data frequently arise in epidemiological studies, especially familial and longitudinal studies. Longitudinal designs are used to investigate changes in certain characteristics over time at the individual level, as well as how potential factors influence those changes. Familial studies are often designed to investigate the dependence of health conditions among family members. Various models have been developed for this type of multivariate data, and a wide variety of estimation techniques have been proposed. However, data collected from observational studies are often far from perfect, as measurement error may arise from different sources such as defective measuring systems, diagnostic tests without a gold standard, and self-reports. Under such scenarios only rough surrogate variables are measured. Measurement error in covariates has been discussed extensively in the literature for various regression models. It is well known that naive approaches ignoring covariate error often lead to inconsistent estimators of model parameters. In this thesis, we develop inferential procedures for analyzing correlated data with response measurement error. We consider three scenarios: (i) likelihood-based inferences for generalized linear mixed models when the continuous response is subject to nonlinear measurement errors; (ii) estimating equations methods for binary responses with misclassifications; and (iii) estimating equations methods for ordinal responses when the response variable and categorical/ordinal covariates are subject to misclassifications. The first problem arises when the continuous response variable is difficult to measure. When the true response is defined as the long-term average of measurements, a single measurement is considered an error-contaminated surrogate. We focus on generalized linear mixed models with nonlinear response error and study the induced bias in naive estimates. We propose likelihood-based methods that can yield consistent and efficient estimators for both fixed-effects and variance parameters. Results of simulation studies and an analysis of a data set from the Framingham Heart Study are presented. Marginal models have been widely used for correlated binary, categorical, and ordinal data. The regression parameters characterize the marginal mean of a single outcome, without conditioning on other outcomes or unobserved random effects. The generalized estimating equations (GEE) approach, introduced by Liang and Zeger (1986), models only the first two moments of the responses, with associations treated as nuisance characteristics. For some clustered studies, especially familial studies, however, the association structure may be of scientific interest. For binary data, Prentice (1988) proposed additional estimating equations that allow one to model pairwise correlations. We consider marginal models for correlated binary data with misclassified responses. We develop “corrected” estimating equations approaches that can yield consistent estimators for both mean and association parameters. The idea is related to the approach of Nakamura (1990), which was originally developed for correcting bias induced by additive covariate measurement error under generalized linear models. Our approaches can also handle correlated misclassifications, rather than only a simple misclassification process as considered by Neuhaus (2002) for clustered binary data under generalized linear mixed models. We extend our methods and further develop marginal approaches for the analysis of longitudinal ordinal data with misclassification in both responses and categorical covariates. Simulation studies show that our proposed methods perform very well under a variety of scenarios. Results from application of the proposed methods to real data are presented. Measurement error can be coupled with many other features in the data, e.g., complex survey designs, that can complicate inferential procedures. We explore combining survey weights and misclassification in ordinal covariates in logistic regression analyses. We propose an approach that incorporates survey weights into estimating equations to yield design-based unbiased estimators. In the final part of the thesis we outline some directions for future work, such as transition models and semiparametric models for longitudinal data with both incomplete observations and measurement error. Missing data are another common feature in applications, and developing statistical techniques that handle both missing data and measurement error would be beneficial.
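The effect of response misclassification, and the reason a correction is possible, can be seen in the simplest setting of a single marginal mean. If a binary response is recorded with known sensitivity and specificity, the naive mean of the recorded values is a linear function of the true mean and can be inverted. The sketch below assumes independent, non-differential misclassification with known rates; the function name `corrected_mean` is hypothetical, and this is not the corrected estimating equations approach of the thesis, which also covers regression and association parameters.

```python
def corrected_mean(y_star, sensitivity, specificity):
    """Correct the mean of a misclassified binary response.

    Under the assumed misclassification model,
        E[Y*] = (1 - specificity) + (sensitivity + specificity - 1) * E[Y],
    so the true mean is recovered by inverting this linear relation.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    naive = sum(y_star) / len(y_star)  # biased estimate from recorded values
    return (naive - (1.0 - specificity)) / denom
```

With sensitivity 0.9 and specificity 0.8, a recorded proportion of 0.55 corresponds to a true proportion of about 0.5, which shows how far a naive estimate can drift even with moderately accurate classification.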