    Perturbation and scaled Cook's distance

    Cook's distance [Technometrics 19 (1977) 15-18] is one of the most important diagnostic tools for detecting influential individual observations or subsets of observations in linear regression for cross-sectional data. However, for many complex data structures (e.g., longitudinal data), no rigorous approach has been developed to address a fundamental issue: deleting subsets with different numbers of observations introduces different degrees of perturbation to the current model fitted to the data, and the magnitude of Cook's distance is associated with the degree of the perturbation. The aim of this paper is to address this issue in general parametric models with complex data structures. We propose a new quantity for measuring the degree of the perturbation introduced by deleting a subset. We use stochastic ordering to quantify the stochastic relationship between the degree of the perturbation and the magnitude of Cook's distance. We develop several scaled Cook's distances to resolve the comparison of Cook's distance for different subset deletions. Theoretical and numerical examples are examined to highlight the broad spectrum of applications of these scaled Cook's distances in a formal influence analysis. Comment: Published at http://dx.doi.org/10.1214/12-AOS978 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
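
For readers unfamiliar with the quantity discussed in this abstract, the following is a minimal sketch of the classical, unscaled Cook's distance for deleting a subset of rows from an ordinary least squares fit. It is not the scaled version proposed in the paper; the function name `cooks_distance_subset` and the simulated data are assumptions made purely for illustration.

```python
import numpy as np

def cooks_distance_subset(X, y, subset):
    """Unscaled Cook's distance for deleting the rows in `subset` from an OLS fit.

    D = (b - b_del)' X'X (b - b_del) / (p * s^2), where b is the full-data
    estimate, b_del the estimate without the subset, p the number of
    coefficients, and s^2 the full-data residual variance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Full-data least squares fit and residual variance.
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_full
    s2 = resid @ resid / (n - p)

    # Refit with the chosen rows removed.
    keep = np.ones(n, dtype=bool)
    keep[list(subset)] = False
    beta_del, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)

    diff = beta_full - beta_del
    return float(diff @ (X.T @ X) @ diff / (p * s2))

# Hypothetical data: intercept plus one covariate.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=30)

print("delete 1 case :", cooks_distance_subset(X, y, [0]))
print("delete 5 cases:", cooks_distance_subset(X, y, range(5)))
```

Deleting more rows generally perturbs the fit more, so the two printed values are not directly comparable; that mismatch is the issue the abstract says the scaled distances are meant to resolve.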

    Perturbation selection and influence measures in local influence analysis

    Cook's [J. Roy. Statist. Soc. Ser. B 48 (1986) 133-169] local influence approach based on normal curvature is an important diagnostic tool for assessing the local influence of minor perturbations to a statistical model. However, no rigorous approach has been developed to address two fundamental issues: the selection of an appropriate perturbation and the development of influence measures for objective functions at a point with a nonzero first derivative. The aim of this paper is to develop a differential-geometrical framework of a perturbation model (called the perturbation manifold) and utilize the associated metric tensor and affine curvatures to resolve these issues. We will show that the metric tensor of the perturbation manifold provides important information about selecting an appropriate perturbation of a model. Moreover, we will introduce new influence measures that are applicable to objective functions at any point. Examples including linear regression models and linear mixed models are examined to demonstrate the effectiveness of using the new influence measures for the identification of influential observations. Comment: Published at http://dx.doi.org/10.1214/009053607000000343 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Bayesian Inference for Multivariate Survival Data with a Cure Fraction

    We develop Bayesian methods for right censored multivariate failure time data for populations with a cure fraction. We propose a new model, called the multivariate cure rate model, and provide a natural motivation and interpretation of it. To create the correlation structure between the failure times, we introduce a frailty term, which is assumed to have a positive stable distribution. The resulting correlation structure induced by the frailty term is quite appealing and leads to a nice characterization of the association between the failure times. Several novel properties of the model are derived. First, conditional on the frailty term, it is shown that the model has a proportional hazards structure with the covariates depending naturally on the cure rate. Second, we establish mathematical relationships between the marginal survivor functions of the multivariate cure rate model and the more standard mixture model for modelling cure rates. With the introduction of latent variables, we show that the new model is computationally appealing, and novel computational Markov chain Monte Carlo (MCMC) methods are developed to sample from the posterior distribution of the parameters. Specifically, we propose a modified version of the collapsed Gibbs technique (J. S. Liu, 1994, J. Amer. Statist. Assoc. 89, 958–966) to sample from the posterior distribution. This development leads to an efficient Gibbs sampling procedure for a posterior that would otherwise be extremely difficult to sample from. We characterize the propriety of the joint posterior distribution of the parameters using a class of noninformative improper priors. A real dataset from a melanoma clinical trial is presented to illustrate the methodology.

    A generalized linear mixed model for longitudinal binary data with a marginal logit link function

    Longitudinal studies of a binary outcome are common in the health, social, and behavioral sciences. In general, a feature of random effects logistic regression models for longitudinal binary data is that the marginal functional form, when integrated over the distribution of the random effects, is no longer of logistic form. Recently, Wang and Louis [Biometrika 90 (2003) 765-775] proposed a random intercept model in the clustered binary data setting where the marginal model has a logistic form. An acknowledged limitation of their model is that it allows only a single random effect that varies from cluster to cluster. In this paper we propose a modification of their model to handle longitudinal data, allowing separate, but correlated, random intercepts at each measurement occasion. The proposed model allows for a flexible correlation structure among the random intercepts, where the correlations can be interpreted in terms of Kendall's τ. For example, the marginal correlations among the repeated binary outcomes can decline with increasing time separation, while the model retains the property of having matching conditional and marginal logit link functions. Finally, the proposed method is used to analyze data from a longitudinal study designed to monitor cardiac abnormalities in children born to HIV-infected women. Comment: Published at http://dx.doi.org/10.1214/10-AOAS390 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
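
The abstract's starting point, that integrating a random effects logistic model over the random effects does not give a logistic marginal model with the same coefficients, can be checked numerically. The sketch below assumes a normally distributed random intercept, which is a common choice but not the distribution used by Wang and Louis or by this paper; the helper names `marginal_prob` and `logit` are hypothetical.

```python
import numpy as np

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds of a probability."""
    return np.log(p / (1.0 - p))

def marginal_prob(eta, sigma, n_nodes=40):
    """P(Y = 1) after integrating expit(eta + b) over b ~ N(0, sigma^2).

    Uses Gauss-Hermite quadrature: for b = sqrt(2) * sigma * x,
    E[f(b)] = (1 / sqrt(pi)) * sum_i w_i * f(sqrt(2) * sigma * x_i).
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights * expit(eta + b)) / np.sqrt(np.pi))

# Conditional linear predictors versus the implied marginal log-odds,
# assuming a random-intercept standard deviation of 2.
for eta in (0.5, 1.0, 2.0):
    print(eta, logit(marginal_prob(eta, sigma=2.0)))
```

Under this assumption the marginal log-odds come out smaller in magnitude than the conditional linear predictor, so the conditional and marginal link functions do not match; the abstract describes a model constructed so that they do.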

    A statistical model to assess (allele-specific) associations between gene expression and epigenetic features using sequencing data

    Sequencing techniques have been widely used to assess gene expression (i.e., RNA-seq) or the presence of epigenetic features (e.g., DNase-seq to identify open chromatin regions). In contrast to traditional microarray platforms, sequencing data are typically summarized in the form of discrete counts, and they are able to delineate allele-specific signals, which are not available from microarrays. The presence of epigenetic features is often associated with gene expression, and both have been shown to be affected by DNA polymorphisms. However, joint models with the flexibility to assess interactions between gene expression, epigenetic features and DNA polymorphisms are currently lacking. In this paper, we develop a statistical model to assess the associations between gene expression and epigenetic features using sequencing data, while explicitly modeling the effects of DNA polymorphisms in either an allele-specific or nonallele-specific manner. We show that in doing so we provide the flexibility to detect associations between gene expression and epigenetic features, as well as conditional associations given DNA polymorphisms. We evaluate the performance of our method using simulations and apply our method to study the association between gene expression and the presence of DNase I Hypersensitive sites (DHSs) in HapMap individuals. Our model can be generalized to exploring the relationships between DNA polymorphisms and any two types of sequencing experiments, a useful feature as the variety of sequencing experiments continues to expand.