    A note on bias due to fitting prospective multivariate generalized linear models to categorical outcomes ignoring retrospective sampling schemes

    Outcome-dependent sampling designs are commonly used in economics, market research and epidemiological studies. Case-control sampling design is a classic example of outcome-dependent sampling, where exposure information is collected on subjects conditional on their disease status. In many situations, the outcome under consideration may have multiple categories instead of a simple dichotomization. For example, in a case-control study, there may be disease sub-classification among the “cases” based on progression of the disease, or in terms of other histological and morphological characteristics of the disease. In this note, we investigate the issue of fitting prospective multivariate generalized linear models to such multiple-category outcome data, ignoring the retrospective nature of the sampling design. We first provide a set of necessary and sufficient conditions for the link functions that will allow for equivalence of prospective and retrospective inference for the parameters of interest. We show that for categorical outcomes, prospective–retrospective equivalence does not hold beyond the generalized multinomial logit link. We then derive an approximate expression for the bias incurred when link functions outside this class are used. Most popular models for ordinal response fall outside the multiplicative intercept class and one should be cautious while performing a naive prospective analysis of such data as the bias could be substantial. We illustrate the extent of bias through a real data example, based on the ongoing Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial by the National Cancer Institute. The simulations based on the real study illustrate that the bias approximations work well in practice.
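The equivalence described in this abstract can be illustrated in the simpler binary case. The sketch below is not taken from the paper: the intercept, slope, sampling fractions and covariate values are all hypothetical. It applies Bayes' rule to obtain the case probability among sampled subjects. On the logit scale, it shows that outcome-dependent sampling shifts only the intercept, by the log ratio of the sampling fractions. On the probit scale the shift varies with the covariate, so the slope would be distorted.

```python
import math
from statistics import NormalDist

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds."""
    return math.log(p / (1.0 - p))

def sampled_case_prob(p_case, s_case, s_control):
    """P(D=1 | x, sampled) by Bayes' rule, given the population P(D=1 | x)
    and the selection probabilities for cases and controls."""
    num = s_case * p_case
    return num / (num + s_control * (1.0 - p_case))

# Hypothetical population model and sampling fractions (assumptions).
ALPHA, BETA = -3.0, 0.8
S_CASE, S_CONTROL = 1.0, 0.05
XS = [-1.0, 0.0, 1.0, 2.0]
ND = NormalDist()

def link_shifts(inverse_link, link):
    """For each x, the difference on the link scale between the sampled
    and the population linear predictor."""
    shifts = []
    for x in XS:
        eta = ALPHA + BETA * x
        p_sampled = sampled_case_prob(inverse_link(eta), S_CASE, S_CONTROL)
        shifts.append(link(p_sampled) - eta)
    return shifts

logit_shifts = link_shifts(expit, logit)          # constant in x
probit_shifts = link_shifts(ND.cdf, ND.inv_cdf)   # varies with x

print("logit shifts: ", [round(s, 4) for s in logit_shifts])
print("probit shifts:", [round(s, 4) for s in probit_shifts])
```

A constant shift means a prospective logistic fit to the sampled data recovers the slope and misstates only the intercept. A shift that changes with x means no single slope on that link scale describes the sampled data.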

    The analysis of data where response or selection is dependent on the variable of interest

    In surveys of sensitive subjects non-response may be dependent on the variable of interest, both at the unit and item levels. In some clinical and epidemiological studies, units are selected for entry on the basis of the outcome variable of interest. Both of these scenarios pose problems for statistical analysis, and standard techniques may be invalid or inefficient, except in some special cases. A new approach to the analysis of surveys of sensitive topics is developed, central to which is at least one variable which represents the enthusiasm to participate. This variable is included along with demographic variables in the calculation of a response propensity score. The score is derived as the fitted probabilities of item non-response to the question of interest. The distribution of the score for the unit non-responders is assumed equal to that of item non-responders. Response is assumed independent of the variable of interest, conditional on the score. Weights based on the score can be used to derive unbiased estimates of the distribution of the variable of interest. The bootstrap is recommended for confidence interval construction. The technique is applied to data from the National Survey of Sexual Attitudes and Lifestyles. A simplification of the technique is developed that does not use the bootstrap, and which enables users to analyse the data without knowledge of the factors affecting non-response, and using standard statistical software. To analyse the time from an initiating event to illness, a prospective study may be regarded as the optimal design. However, additional data from those already with the illness and still alive may also be available. A standard technique would be to ignore the additional data, and left-truncate the times to illness at study entry. We develop a full likelihood approach, and a weighted pseudo-likelihood approach, and compare these with the standard truncated data approach. The techniques are used to fit simple models of time to illness based on data from a study of time to AIDS from HIV seroconversion.
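The weighting idea in this abstract can be sketched with generic inverse-probability weighting. This is a stand-in technique, not the thesis's own estimator: the responses and the estimated response probabilities below are made-up values, and the function name is hypothetical.

```python
def ipw_mean(values, response_probs):
    """Weighted mean of observed values, with each responder weighted by
    the inverse of their estimated probability of responding."""
    weights = [1.0 / p for p in response_probs]
    total = sum(w * v for w, v in zip(weights, values))
    return total / sum(weights)

# Hypothetical responders: a binary answer and an estimated response probability.
answers = [1, 0, 1, 1]
probs = [0.5, 0.25, 0.5, 1.0]

naive = sum(answers) / len(answers)   # ignores differential non-response
weighted = ipw_mean(answers, probs)   # up-weights under-represented responders

print(naive, weighted)
```

Here the responder with the lowest response probability stands in for the most non-responders, so their answer carries the largest weight.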

    Accuracy of logistic models and receiver operating characteristic curves

    The accuracy of prediction is a commonly studied topic in modern statistics. The performance of a predictor is becoming increasingly important as real-life decisions are made on the basis of prediction. In this thesis we investigate the prediction accuracy of logistic models from two different approaches. Logistic regression is often used to discriminate between two groups or populations based on a number of covariates. The receiver operating characteristic (ROC) curve is a commonly used tool (especially in medical statistics) to assess the performance of such a score or test. By using the same data to fit the logistic regression and calculate the ROC curve we overestimate the performance that the score would give if validated on a sample of future cases. This overestimation is studied and we propose a correction for the ROC curve and the area under the curve. The methods are illustrated by way of two medical examples and a simulation study, and we show that the overestimation can be quite substantial for small sample sizes. The idea of shrinkage pertains to the notion that by including some prior information about the data under study we can improve prediction. Until now, the study of shrinkage has almost exclusively been concentrated on continuous measurements. We propose a methodology to study shrinkage for logistic regression modelling of categorical data with a binary response. Categorical data with a large number of levels is often grouped for modelling purposes, which discards useful information about the data. By using this information we can apply Bayesian methods to update model parameters and show through examples and simulations that in some circumstances the updated estimates are better predictors than the model
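For orientation, the area under the ROC curve discussed in this abstract equals the proportion of case-control pairs in which the case receives the higher score, with ties counted as one half. The sketch below computes that quantity directly. The scores are hypothetical, and the code does not implement the thesis's proposed correction for overestimation.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve as the fraction of case-control pairs
    ranked correctly, counting ties as half."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical fitted scores for three cases and three controls.
cases = [0.9, 0.8, 0.4]
controls = [0.5, 0.3, 0.2]

apparent_auc = auc(cases, controls)
print(apparent_auc)
```

When the scores come from a model fitted to the same subjects, this apparent value is the one the abstract describes as overestimating performance on future cases.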

    Statistical inferences for outcome dependent sampling design with multivariate outcomes

    An outcome-dependent sampling (ODS) design has been shown to be a cost-effective sampling scheme. In the ODS design with a continuous outcome variable, one observes the exposure with a probability, possibly unknown, depending on the outcome. In practice, multivariate data arise in many contexts, such as longitudinal data or cluster units. While the ODS design has been of interest in the statistical and applied literature, statistical inference procedures for such designs in multivariate settings remain undeveloped. We develop a general sampling design and inference methods using the ODS under continuous multivariate settings (Multivariate-ODS). The standard estimation methods for multivariate data ignoring the Multivariate-ODS design will yield biased and inconsistent estimates. Therefore, new statistical methods are needed to reap the benefits of a Multivariate-ODS design. In this dissertation, we propose three commonly occurring ODS sampling strategies and study the new semiparametric methods for estimating regression parameters. We allow a simple random sample (SRS) in all three sampling strategies and the difference is how the supplemental samples are selected. The first design, the Multivariate-ODS with a maximum selection criterion, selects the supplemental sample based on whether the maximum value of the outcomes from an individual exceeds a known cutpoint; the second design, the Multivariate-ODS with a summation criterion, draws the supplemental sample based on whether the sums of the outcome values are above a given cutpoint; the third design, the Multivariate-ODS with a general criterion, is a more general design where the selection of the supplemental samples is based on each individual's responses, instead of on the aggregate of the outcomes. The proposed estimators are semiparametric in the sense that the underlying distributions of covariates are left unspecified and modeled nonparametrically using the empirical likelihood methods. We show that the proposed estimators are consistent and asymptotically normal. Simulation studies illustrate that the proposed estimators are more efficient than other competing estimators. We also apply the proposed methods to a real data study. The results of these applications support our claim that substantial efficiency gains can be achieved by the Multivariate-ODS design, which provides an efficient alternative for conducting multivariate studies.

    Estimating equations approaches to nuisance parameters and outcome-dependent sampling problems for marginal regression models and generalized linear mixed models when outcomes are correlated

    For marginal regression models having cluster-specific intercepts, the number of model parameters grows with the sample size so that GEE is not feasible. A solution is to impose a mixing distribution on the intercepts which leads to generalized linear mixed models (GLMMs) whose parameters have different interpretations than marginal models. When GLMM assumptions are not met, parameter estimates are generally biased. A simple procedure for constructing estimating equations is proposed that enables consistent estimation of parameters associated with cluster-varying covariates and is applicable regardless of whether the cluster-specific intercept is treated as fixed or random. The proposed procedure is shown to work for the identity and log links but not for the logit link. Connections to conditional likelihoods, the Cox model, projected score, and adjusted profile likelihoods are discussed. It is shown that our estimating equations can be implemented with minimal programming effort using existing software. We show that a connection exists between biased sampling based on cluster totals and regression models with cluster-specific intercepts. This connection leads naturally to our estimation procedure. Regression parameters associated with cluster-varying covariates can be consistently estimated using our estimating function even when sampling rates are unknown. An estimation procedure based on the double-pair design and an estimating function for a 1-1 matching design are shown to be special cases of our procedure. Risk ratio estimation is possible for case-control studies when family members are chosen as controls

    Bayesian Modeling of Epidemiologic Data under Complex Sampling Schemes.

    Case-control studies are dominant analytic tools in epidemiologic research for identifying potential risk factors of a disease. We explore three atypical data situations under a case-control sampling framework. We adopt the Bayesian paradigm as the inferential strategy. The first problem we consider is modeling disease subtypes in matched case-control studies using the stereotype regression model. The stereotype regression model (Anderson, 1984) is a relatively unexplored class of models for categorical outcomes, and can be adapted to model both ordered and unordered categorical outcomes. Classical frequentist inference for this model is problematic due to non-linearity and lack of identifiability in the parameter space. We propose a general Bayesian analysis and then extend it to deal with non-ignorable missingness in the covariates. We illustrate our methods by modeling cancer stages in studies of prostate and colorectal cancer. The second problem involves modeling gene-environment interactions under a two-phase sampling design. We consider the situation where the Phase I sample contains basic demographic and environmental covariates. The Phase II sample is selected by stratified sampling conditional on case-control status and environmental exposures. Genotype data with potential non-monotone missingness is available only on Phase II samples. We build a semi-parametric Bayesian model that data-adaptively relaxes both the gene-gene and gene-environment independence assumptions. We introduce a variable selection strategy that can simultaneously handle multiple genes, environmental covariates and their interactions. We compare the Bayesian methodology with weighted and pseudo-likelihood approaches. The third problem is motivated by a serial case-control study on diarrheal disease incidence in Ecuador where disease outcomes were recorded with geographical coordinates in a sample of 21 villages over a period of six years. We propose a Bayesian two-stage spatio-temporal point process model to explain variation in diarrheal case patterns by using a log Gaussian Cox process with spatial and temporal components. Beyond estimation of model parameters, we also consider the problem of predicting the number of diarrheal cases at unsampled communities and compare our prediction with that obtained by a standard Kriging approach.
    Ph.D., Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89687/1/jaeil_1.pd

    Statistical Modelling of Breastfeeding Data

    This thesis addresses some key methodological problems in statistical modelling of breastfeeding data. Meta-analysis techniques were used to analyse aggregated breastfeeding data. A generalised linear mixed model and an extended Cox model with time-varying exposures were used to analyse longitudinal and time-to-event breastfeeding data, respectively. Shared frailty models were applied to correlated breastfeeding duration data, controlling for heterogeneity. A novel two-part mixed-effects model was proposed for modelling clustered time-to-event breastfeeding data with clumping at zero.