Perturbation and scaled Cook's distance
Cook's distance [Technometrics 19 (1977) 15-18] is one of the most important
diagnostic tools for detecting influential individual observations or subsets of
observations in linear regression for cross-sectional data. However, for many
complex data structures (e.g., longitudinal data), no rigorous approach has
been developed to address a fundamental issue: deleting subsets with different
numbers of observations introduces different degrees of perturbation to the
current model fitted to the data, and the magnitude of Cook's distance is
associated with the degree of the perturbation. The aim of this paper is to
address this issue in general parametric models with complex data structures.
We propose a new quantity for measuring the degree of the perturbation
introduced by deleting a subset. We use stochastic ordering to quantify the
stochastic relationship between the degree of the perturbation and the
magnitude of Cook's distance. We develop several scaled Cook's distances to
resolve the comparison of Cook's distance for different subset deletions.
Theoretical and numerical examples are examined to highlight the broad spectrum
of applications of these scaled Cook's distances in a formal influence
analysis.
Comment: Published at http://dx.doi.org/10.1214/12-AOS978 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
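
The comparison problem raised in this abstract can be seen with the classical, unscaled Cook's distance for subset deletion in ordinary least squares. The sketch below is not the scaled measure proposed in the paper; it only computes the standard quantity D_I = (b - b_(I))' X'X (b - b_(I)) / (p s^2) for a subset I. The function name `cooks_distance_subset` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def cooks_distance_subset(X, y, idx):
    """Classical Cook's distance for deleting the rows in `idx` from an OLS fit.

    Hypothetical helper for illustration; not the paper's scaled version.
    """
    n, p = X.shape
    # Full-data least-squares fit and residual variance estimate
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_full
    s2 = resid @ resid / (n - p)
    # Refit with the subset removed
    keep = np.ones(n, dtype=bool)
    keep[np.asarray(idx, dtype=int)] = False
    beta_del = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    # Squared shift in the estimate, measured in the X'X metric
    diff = beta_full - beta_del
    return float(diff @ (X.T @ X) @ diff / (p * s2))

# Simulated data (assumed, for illustration only)
rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

d_single = cooks_distance_subset(X, y, [0])             # delete one observation
d_block = cooks_distance_subset(X, y, list(range(10)))  # delete ten observations
print(d_single, d_block)
```

Deleting ten rows perturbs the fit more than deleting one, so the two values are not on a common footing, which is the issue the abstract says the scaled distances are meant to resolve.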
Perturbation selection and influence measures in local influence analysis
Cook's [J. Roy. Statist. Soc. Ser. B 48 (1986) 133–169] local influence
approach based on normal curvature is an important diagnostic tool for
assessing local influence of minor perturbations to a statistical model.
However, no rigorous approach has been developed to address two fundamental
issues: the selection of an appropriate perturbation and the development of
influence measures for objective functions at a point with a nonzero first
derivative. The aim of this paper is to develop a differential-geometrical
framework of a perturbation model (called the perturbation manifold) and
utilize associated metric tensor and affine curvatures to resolve these issues.
We will show that the metric tensor of the perturbation manifold provides
important information about selecting an appropriate perturbation of a model.
Moreover, we will introduce new influence measures that are applicable to
objective functions at any point. Examples including linear regression models
and linear mixed models are examined to demonstrate the effectiveness of using
new influence measures for the identification of influential observations.
Comment: Published at http://dx.doi.org/10.1214/009053607000000343 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org
A generalized linear mixed model for longitudinal binary data with a marginal logit link function
Longitudinal studies of a binary outcome are common in the health, social,
and behavioral sciences. In general, a feature of random effects logistic
regression models for longitudinal binary data is that the marginal functional
form, when integrated over the distribution of the random effects, is no longer
of logistic form. Recently, Wang and Louis [Biometrika 90 (2003) 765–775]
proposed a random intercept model in the clustered binary data setting where
the marginal model has a logistic form. An acknowledged limitation of their
model is that it allows only a single random effect that varies from cluster to
cluster. In this paper we propose a modification of their model to handle
longitudinal data, allowing separate, but correlated, random intercepts at each
measurement occasion. The proposed model allows for a flexible correlation
structure among the random intercepts, where the correlations can be
interpreted in terms of Kendall's tau. For example, the marginal
correlations among the repeated binary outcomes can decline with increasing
time separation, while the model retains the property of having matching
conditional and marginal logit link functions. Finally, the proposed method is
used to analyze data from a longitudinal study designed to monitor cardiac
abnormalities in children born to HIV-infected women.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS390 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
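
The remark that a random effects logistic model loses its logistic form after integrating out the random effects can be checked numerically. The sketch below assumes a normally distributed random intercept, which is a common choice but is not stated in the abstract, and it is not the model proposed in the paper. It averages the conditional probability over the random intercept with Gauss-Hermite quadrature and looks at the slope of the marginal logit. The name `marginal_prob` and the value sigma = 2 are illustrative assumptions.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def marginal_prob(eta, sigma, n_nodes=60):
    """Marginal P(Y=1) when logit P(Y=1 | b) = eta + b and b ~ N(0, sigma^2).

    Hypothetical helper; the normal random intercept is an assumption.
    """
    # Nodes and weights for the weight function exp(-x^2 / 2)
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    cond = expit(eta[:, None] + sigma * nodes[None, :])
    # Dividing by sqrt(2*pi) turns the weights into standard normal weights
    return (cond * weights).sum(axis=1) / np.sqrt(2.0 * np.pi)

eta = np.linspace(-3.0, 3.0, 7)           # conditional linear predictor
marg_logit = logit(marginal_prob(eta, sigma=2.0))
slopes = np.diff(marg_logit) / np.diff(eta)
print(slopes)
```

On the conditional scale the slope is exactly 1. The marginal slopes printed here are below 1 and change with eta, so the marginal model is not a logistic regression in the same linear predictor.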
Bayesian Inference for Multivariate Survival Data with a Cure Fraction
We develop Bayesian methods for right-censored multivariate failure time data for populations with a cure fraction. We propose a new model, called the multivariate cure rate model, and provide a natural motivation and interpretation of it. To create the correlation structure between the failure times, we introduce a frailty term, which is assumed to have a positive stable distribution. The resulting correlation structure induced by the frailty term is quite appealing and leads to a nice characterization of the association between the failure times. Several novel properties of the model are derived. First, conditional on the frailty term, it is shown that the model has a proportional hazards structure with the covariates depending naturally on the cure rate. Second, we establish mathematical relationships between the marginal survivor functions of the multivariate cure rate model and the more standard mixture model for modelling cure rates. With the introduction of latent variables, we show that the new model is computationally appealing, and novel computational Markov chain Monte Carlo (MCMC) methods are developed to sample from the posterior distribution of the parameters. Specifically, we propose a modified version of the collapsed Gibbs technique (J. S. Liu, 1994, J. Amer. Statist. Assoc. 89, 958–966) to sample from the posterior distribution. This development leads to an efficient Gibbs sampling procedure where sampling would otherwise be extremely difficult. We characterize the propriety of the joint posterior distribution of the parameters using a class of noninformative improper priors. A real dataset from a melanoma clinical trial is presented to illustrate the methodology.
A statistical model to assess (allele-specific) associations between gene expression and epigenetic features using sequencing data
Sequencing techniques have been widely used to assess gene expression (i.e., RNA-seq) or the presence of epigenetic features (e.g., DNase-seq to identify open chromatin regions). In contrast to traditional microarray platforms, sequencing data are typically summarized in the form of discrete counts, and they are able to delineate allele-specific signals, which are not available from microarrays. The presence of epigenetic features is often associated with gene expression, and both have been shown to be affected by DNA polymorphisms. However, joint models with the flexibility to assess interactions between gene expression, epigenetic features and DNA polymorphisms are currently lacking. In this paper, we develop a statistical model to assess the associations between gene expression and epigenetic features using sequencing data, while explicitly modeling the effects of DNA polymorphisms in either an allele-specific or nonallele-specific manner. We show that in doing so we provide the flexibility to detect associations between gene expression and epigenetic features, as well as conditional associations given DNA polymorphisms. We evaluate the performance of our method using simulations and apply our method to study the association between gene expression and the presence of DNase I hypersensitive sites (DHSs) in HapMap individuals. Our model can be generalized to explore the relationships between DNA polymorphisms and any two types of sequencing experiments, a useful feature as the variety of sequencing experiments continues to expand.