Exploiting Gene-Environment Independence for Analysis of Case-Control Studies: An Empirical Bayes Approach to Trade Off between Bias and Efficiency
Standard prospective logistic regression analysis of case-control data often leads to very imprecise estimates of gene-environment interactions due to small numbers of cases or controls in cells of crossing genotype and exposure. In contrast, modern "retrospective" methods, including the celebrated "case-only" approach, can estimate the interaction parameters much more precisely, but they can be seriously biased when the underlying assumption of gene-environment independence is violated. In this article, we propose a novel approach to analyze case-control data that can relax the gene-environment independence assumption using an empirical Bayes (EB) framework. In the special case involving a binary gene and a binary exposure, the framework leads to an estimator of the odds-ratio interaction parameter in a simple closed form that corresponds to a weighted average of the standard case-only and case-control estimators. We also describe a general approach for deriving the EB estimator and its variance within the retrospective maximum-likelihood framework developed by Chatterjee and Carroll (2005). We conduct simulation studies to investigate the mean-squared error of the proposed estimator in both fixed and random parameter settings. We also illustrate the application of this methodology using two real data examples. Both simulated and real data examples suggest that the proposed estimator strikes an excellent balance between bias and efficiency depending on the true nature of the gene-environment association and the sample size for a given study.
Exploiting Gene-Environment Independence for Analysis of Case-Control Studies: An Empirical Bayes Approach to Trade Off between Bias and Efficiency
Standard prospective logistic regression analysis of case-control data often leads to very imprecise estimates of gene-environment interactions due to small numbers of cases or controls in cells of crossing genotype and exposure. In contrast, under the assumption of gene-environment independence, modern “retrospective” methods, including the “case-only” approach, can estimate the interaction parameters much more precisely, but they can be seriously biased when the underlying assumption of gene-environment independence is violated. In this article, we propose a novel approach to analyze case-control data that can relax the gene-environment independence assumption using an empirical Bayes framework. In the special case involving a binary gene and a binary exposure, the framework leads to an estimator of the odds-ratio interaction parameter in a simple closed form that corresponds to a weighted average of the standard case-only and case-control estimators. We also describe a general approach for deriving the empirical Bayes estimator and its variance within the retrospective maximum-likelihood framework developed by Chatterjee and Carroll (2005). We conduct simulation studies to investigate the mean-squared error of the proposed estimator in both fixed and random parameter settings. We also illustrate the application of this methodology using two real data examples. Both simulated and real data examples suggest that the proposed estimator strikes an excellent balance between bias and efficiency depending on the true nature of the gene-environment association and the sample size for a given study.
A note on bias due to fitting prospective multivariate generalized linear models to categorical outcomes ignoring retrospective sampling schemes
Outcome-dependent sampling designs are commonly used in economics, market research and epidemiological studies. The case-control sampling design is a classic example of outcome-dependent sampling, where exposure information is collected on subjects conditional on their disease status. In many situations, the outcome under consideration may have multiple categories instead of a simple dichotomization. For example, in a case-control study, there may be disease sub-classification among the “cases” based on progression of the disease, or in terms of other histological and morphological characteristics of the disease. In this note, we investigate the issue of fitting prospective multivariate generalized linear models to such multiple-category outcome data, ignoring the retrospective nature of the sampling design. We first provide a set of necessary and sufficient conditions for the link functions that will allow for equivalence of prospective and retrospective inference for the parameters of interest. We show that for categorical outcomes, prospective–retrospective equivalence does not hold beyond the generalized multinomial logit link. We then derive an approximate expression for the bias incurred when link functions outside this class are used. Most popular models for ordinal response fall outside the multiplicative intercept class, and one should be cautious when performing a naive prospective analysis of such data, as the bias could be substantial. We illustrate the extent of bias through a real data example, based on the ongoing Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial by the National Cancer Institute. The simulations based on the real study illustrate that the bias approximations work well in practice.
A note on bias due to fitting prospective multivariate generalized linear models to categorical outcomes ignoring retrospective sampling schemes
Outcome-dependent sampling designs are commonly used in economics, market research and epidemiological studies. The case-control sampling design is a classic example of outcome-dependent sampling, where exposure information is collected on subjects conditional on their disease status. In many situations, the outcome under consideration may have multiple categories instead of a simple dichotomization. For example, in a case-control study, there may be disease sub-classification among the “cases” based on progression of the disease, or in terms of other histological and morphological characteristics of the disease. In this note, we investigate the issue of fitting prospective multivariate generalized linear models to such multiple-category outcome data, ignoring the retrospective nature of the sampling design. We first provide a set of necessary and sufficient conditions for the link functions that will allow for equivalence of prospective and retrospective inference for the parameters of interest. We show that for categorical outcomes, prospective-retrospective equivalence does not hold beyond the generalized multinomial logit link. We then derive an approximate expression for the bias incurred when link functions outside this class are used. We illustrate the extent of bias through a real data example, based on the ongoing Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial by the National Cancer Institute.
Exploiting Gene-Environment Independence for Analysis of Case–Control Studies: An Empirical Bayes-Type Shrinkage Estimator to Trade-Off between Bias and Efficiency
Standard prospective logistic regression analysis of case–control data often leads to very imprecise estimates of gene-environment interactions due to small numbers of cases or controls in cells of crossing genotype and exposure. In contrast, under the assumption of gene-environment independence, modern “retrospective” methods, including the “case-only” approach, can estimate the interaction parameters much more precisely, but they can be seriously biased when the underlying assumption of gene-environment independence is violated. In this article, we propose a novel empirical Bayes-type shrinkage estimator to analyze case–control data that can relax the gene-environment independence assumption in a data-adaptive fashion. In the special case involving a binary gene and a binary exposure, the method leads to an estimator of the interaction log odds ratio parameter in a simple closed form that corresponds to a weighted average of the standard case-only and case–control estimators. We also describe a general approach for deriving the new shrinkage estimator and its variance within the retrospective maximum-likelihood framework developed by Chatterjee and Carroll (2005, Biometrika 92, 399–418). Both simulated and real data examples suggest that the proposed estimator strikes a balance between bias and efficiency depending on the true nature of the gene-environment association and the sample size for a given study.
Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/65511/1/j.1541-0420.2007.00953.x.pd
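The weighted-average idea for a binary gene and a binary exposure can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the weights, so the sketch assumes the case-only estimate receives weight var_cc / (theta^2 + var_cc), where theta is the gene-environment log odds ratio among controls and var_cc is the Woolf variance of the case–control estimate. The function names and the example counts are hypothetical.

```python
import math


def log_odds_ratio(n00, n01, n10, n11):
    """Log odds ratio of a 2x2 table of counts n_ge and its Woolf variance."""
    estimate = math.log((n00 * n11) / (n01 * n10))
    variance = 1 / n00 + 1 / n01 + 1 / n10 + 1 / n11
    return estimate, variance


def shrinkage_interaction(cases, controls):
    """Combine case-only and case-control interaction estimates.

    cases, controls: tuples (n00, n01, n10, n11) of counts indexed by
    (gene, exposure). The weighting below is an assumed form.
    """
    # Case-only estimator: gene-exposure log odds ratio among cases.
    beta_co, var_cases = log_odds_ratio(*cases)
    # Gene-exposure log odds ratio among controls (zero under independence).
    theta, var_controls = log_odds_ratio(*controls)
    # Case-control estimator: difference of the two log odds ratios.
    beta_cc = beta_co - theta
    var_cc = var_cases + var_controls
    # Assumed weight: near 1 when controls show little G-E association.
    weight_co = var_cc / (theta ** 2 + var_cc)
    beta_eb = weight_co * beta_co + (1 - weight_co) * beta_cc
    return {"case_only": beta_co, "case_control": beta_cc,
            "weight_case_only": weight_co, "shrinkage": beta_eb}


# Hypothetical counts, ordered (G=0,E=0), (G=0,E=1), (G=1,E=0), (G=1,E=1).
result = shrinkage_interaction(cases=(100, 50, 40, 40),
                               controls=(120, 60, 50, 40))
print(result)
```

When the controls show no gene-environment association (theta near zero), the weight on the case-only estimate approaches one and the precise estimator dominates; as the observed association grows, the combination moves toward the case–control estimate, which does not rely on independence.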
Bayesian semiparametric analysis for two-phase studies of gene-environment interaction
The two-phase sampling design is a cost-efficient way of collecting expensive
covariate information on a judiciously selected subsample. It is natural to
apply such a strategy for collecting genetic data in a subsample enriched for
exposure to environmental factors for gene-environment interaction (G x E)
analysis. In this paper, we consider two-phase studies of G x E interaction
where phase I data are available on exposure, covariates and disease status.
Stratified sampling is done to prioritize individuals for genotyping at phase
II conditional on disease and exposure. We consider a Bayesian analysis based
on the joint retrospective likelihood of phases I and II data. We address
several important statistical issues: (i) we consider a model with multiple
genes, environmental factors and their pairwise interactions. We employ a
Bayesian variable selection algorithm to reduce the dimensionality of this
potentially high-dimensional model; (ii) we use the assumption of gene-gene and
gene-environment independence to trade off between bias and efficiency for
estimating the interaction parameters through use of hierarchical priors
reflecting this assumption; (iii) we posit a flexible model for the joint
distribution of the phase I categorical variables using the nonparametric Bayes
construction of Dunson and Xing [J. Amer. Statist. Assoc. 104 (2009)
1042-1051].
Comment: Published at http://dx.doi.org/10.1214/12-AOAS599 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Design Issues for Generalized Linear Models: A Review
Generalized linear models (GLMs) have been used quite effectively in the
modeling of a mean response under nonstandard conditions, where discrete as
well as continuous data distributions can be accommodated. The choice of design
for a GLM is a very important task in the development and building of an
adequate model. However, one major problem that handicaps the construction of a
GLM design is its dependence on the unknown parameters of the fitted model.
Several approaches have been proposed in the past 25 years to solve this
problem. These approaches, however, have provided only partial solutions that
apply in only some special cases, and the problem, in general, remains largely
unresolved. The purpose of this article is to focus attention on the
aforementioned dependence problem. We provide a survey of various existing
techniques dealing with the dependence problem. This survey includes
discussions concerning locally optimal designs, sequential designs, Bayesian
designs and the quantile dispersion graph approach for comparing designs for
GLMs.
Comment: Published at http://dx.doi.org/10.1214/088342306000000105 in
Statistical Science (http://www.imstat.org/sts/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Semiparametric Bayesian Analysis of Case–Control Data under Conditional Gene-Environment Independence
In case–control studies of gene-environment association with disease, when genetic and environmental exposures can be assumed to be independent in the underlying population, one may exploit the independence in order to derive more efficient estimation techniques than the traditional logistic regression analysis (Chatterjee and Carroll, 2005, Biometrika 92, 399–418). However, covariates that stratify the population, such as age, ethnicity and the like, could potentially lead to nonindependence. In this article, we provide a novel semiparametric Bayesian approach to model stratification effects under the assumption of gene-environment independence in the control population. We illustrate the methods by applying them to data from a population-based case–control study on ovarian cancer conducted in Israel. A simulation study is conducted to compare our method with other popular choices. The results indicate that the semiparametric Bayesian model allows incorporation of key scientific evidence in the form of a prior and offers a flexible, robust alternative when standard parametric model assumptions do not hold.
Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/65893/1/j.1541-0420.2007.00750.x.pd
Regression inference for multiple populations by integrating summary-level data using stacked imputations
There is a growing need for flexible general frameworks that integrate
individual-level data with external summary information for improved
statistical inference. This paper proposes an imputation-based methodology
where the goal is to fit an outcome regression model with all available
variables in the internal study while utilizing summary information from
external models that may have used only a subset of the predictors. The method
allows for heterogeneity of covariate effects across the external populations.
The proposed approach generates synthetic outcome data in each population, uses
stacked multiple imputation to create a long dataset with complete covariate
information, and finally analyzes the imputed data with weighted regression.
This flexible and unified approach attains the following four objectives: (i)
incorporating supplementary information from a broad class of externally fitted
predictive models or established risk calculators which could be based on
parametric regression or machine learning methods, as long as the external
model can generate outcome values given covariates; (ii) improving statistical
efficiency of the estimated coefficients in the internal study; (iii) improving
predictions by utilizing even partial information available from models that
use a subset of the full set of covariates used in the internal study; and
(iv) providing valid statistical inference for the external population with
potentially different covariate effects from the internal population.
Applications include prostate cancer risk prediction models using novel
biomarkers that are measured only in the internal study.