82 research outputs found

    Contributions aux modèles de régression avec réponses manquantes : risques concurrents et données longitudinales [Contributions to regression models with missing outcomes: competing risks and longitudinal data]

    Missing data are a common occurrence in medical studies. In regression modeling, missing outcomes limit our capability to draw inferences about the covariate effects of medical interest, which are those describing the distribution of the entire set of planned outcomes. Beyond the loss of precision, the validity of any method used to draw inferences from the observed data requires that some assumption about the mechanism leading to missing outcomes holds. Rubin (1976, Biometrika, 63:581-592) called the missingness mechanism MAR (for “missing at random”) if the probability of an outcome being missing does not depend on missing outcomes when conditioning on the observed data, and MNAR (for “missing not at random”) otherwise. This distinction has important implications regarding the modeling requirements to draw valid inferences from the available data, but generally it is not possible to assess from these data whether the missingness mechanism is MAR or MNAR. Hence, sensitivity analyses should be routinely performed to assess the robustness of inferences to assumptions about the missingness mechanism. In the field of incomplete multivariate data, in which the outcomes are gathered in a vector for which some components may be missing, MAR methods are widely available and increasingly used, and several MNAR modeling strategies have also been proposed. On the other hand, although some sensitivity analysis methodology has been developed, this is still an active area of research. The first aim of this dissertation was to develop a sensitivity analysis approach for continuous longitudinal data with drop-outs, that is, continuous outcomes that are ordered in time and completely observed for each individual up to a certain time-point, at which the individual drops out so that all the subsequent outcomes are missing.
The proposed approach consists in assessing the inferences obtained across a family of MNAR pattern-mixture models indexed by a so-called sensitivity parameter that quantifies the departure from MAR. The approach was prompted by a randomized clinical trial investigating the benefits of a treatment for sleep-maintenance insomnia, from which 22% of the individuals had dropped out before the study end. The second aim was to build on the existing theory for incomplete multivariate data to develop methods for competing risks data with missing causes of failure. The competing risks model is an extension of the standard survival analysis model in which failures from different causes are distinguished. Strategies for modeling competing risks functionals, such as the cause-specific hazards (CSH) and the cumulative incidence function (CIF), generally assume that the cause of failure is known for all patients, but this is not always the case. Some methods for regression with missing causes under the MAR assumption have already been proposed, especially for semi-parametric modeling of the CSH. But other useful models have received little attention, and MNAR modeling and sensitivity analysis approaches have never been considered in this setting. We propose a general framework for semi-parametric regression modeling of the CIF under MAR using inverse probability weighting and multiple imputation ideas. Also under MAR, we propose a direct likelihood approach for parametric regression modeling of the CSH and the CIF. Furthermore, we consider MNAR pattern-mixture models in the context of sensitivity analyses. In the competing risks literature, a starting point for methodological developments for handling missing causes was a stage II breast cancer randomized clinical trial in which 23% of the deceased women had missing cause of death. We use these data to illustrate the practical value of the proposed approaches.
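The idea of indexing a family of pattern-mixture models by a sensitivity parameter can be illustrated with a deliberately simplified sketch. This is not the dissertation's method: it assumes a single outcome per individual (no longitudinal structure, no covariates) and assumes that the missing outcomes follow the observed-outcome distribution shifted by a constant `delta`, so that `delta = 0` corresponds to no departure from the ignorable case. The function names and the example data are hypothetical.

```python
import numpy as np

def pattern_mixture_mean(y, delta):
    """Overall mean under a two-pattern mixture (observed vs. missing).

    Assumption (illustrative only): the mean of the missing outcomes equals
    the mean of the observed outcomes plus the sensitivity parameter delta.
    Missing outcomes are encoded as NaN.
    """
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    obs_mean = y[observed].mean()
    p_missing = 1.0 - observed.mean()
    # Mixture of the two patterns, weighted by their observed proportions.
    return (1.0 - p_missing) * obs_mean + p_missing * (obs_mean + delta)

def sensitivity_curve(y_treated, y_control, deltas):
    """Treatment-effect estimate (difference in means) for each delta."""
    return [
        (d, pattern_mixture_mean(y_treated, d) - pattern_mixture_mean(y_control, d))
        for d in deltas
    ]

if __name__ == "__main__":
    treated = [5.0, 6.0, 7.0, np.nan]        # hypothetical data, 25% missing
    control = [4.0, 5.0, np.nan, np.nan]     # hypothetical data, 50% missing
    for delta, effect in sensitivity_curve(treated, control, [-2.0, 0.0, 2.0]):
        print(f"delta={delta:+.1f}  estimated effect={effect:.3f}")
```

Reporting the estimate over a grid of `delta` values shows how far the missingness assumption would have to be violated before the conclusion changes.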

    On the uses and abuses of regression models: a call for reform of statistical practice and teaching

    When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. More broadly, statistics is widely understood to provide a body of techniques for "modelling data", underpinned by what we describe as the "true model myth", according to which the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective leads to a range of problems in the application of regression methods, including misguided "adjustment" for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline an alternative approach to the teaching and application of regression methods, which begins by focussing on clear definition of the substantive research question within one of three distinct types: descriptive, predictive, or causal. The simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of "input" variables, but their conceptualisation and usage should follow from the purpose at hand. Comment: 24 pages main document including 3 figures, plus 15 pages supplementary material. Based on plenary lecture (President's Invited Speaker) delivered to ISCB43, Newcastle, UK, August 2022. Submitted for publication 12-Sep-2

    Confounding-adjustment methods for the causal difference in medians

    Background: With continuous outcomes, the average causal effect is typically defined using a contrast of expected potential outcomes. However, in the presence of skewed outcome data, the expectation (population mean) may no longer be meaningful. In practice the typical approach is to continue defining the estimand this way or transform the outcome to obtain a more symmetric distribution, although neither approach may be entirely satisfactory. Alternatively the causal effect can be redefined as a contrast of median potential outcomes, yet discussion of confounding-adjustment methods to estimate the causal difference in medians is limited. In this study we described and compared confounding-adjustment methods to address this gap. Methods: The methods considered were multivariable quantile regression, an inverse probability weighted (IPW) estimator, weighted quantile regression (another form of IPW) and two little-known implementations of g-computation for this problem. Methods were evaluated within a simulation study under varying degrees of skewness in the outcome and applied to an empirical study using data from the Longitudinal Study of Australian Children. Results: Simulation results indicated the IPW estimator, weighted quantile regression and g-computation implementations minimised bias across all settings when the relevant models were correctly specified, with g-computation additionally minimising the variance. Multivariable quantile regression, which relies on a constant-effect assumption, consistently yielded biased results. Application to the empirical study illustrated the practical value of these methods. Conclusion: The presented methods provide appealing avenues for estimating the causal difference in medians. Peer reviewed
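As a rough illustration of what an IPW estimator of the causal difference in medians can look like, the sketch below weights each individual by the inverse probability of the exposure actually received and takes weighted medians in each exposure group. It is a minimal sketch, not the paper's implementation: it assumes a binary exposure, assumes the propensity scores are supplied (for example from a separately fitted logistic model), and uses one common definition of the weighted median. The function names and example data are hypothetical.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative normalised weight reaches 0.5."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights) / weights.sum()
    # First index at which the cumulative weight is at least one half.
    return values[np.searchsorted(cumulative, 0.5)]

def ipw_median_difference(y, a, propensity):
    """Difference in IPW-weighted medians, exposed minus unexposed.

    y: outcomes; a: binary exposure (0/1);
    propensity: assumed P(a = 1 | confounders) for each individual.
    """
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=int)
    p = np.asarray(propensity, dtype=float)
    # Inverse probability of the exposure level actually received.
    w = np.where(a == 1, 1.0 / p, 1.0 / (1.0 - p))
    median_exposed = weighted_median(y[a == 1], w[a == 1])
    median_unexposed = weighted_median(y[a == 0], w[a == 0])
    return median_exposed - median_unexposed

if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]   # hypothetical outcomes
    a = [0, 0, 0, 1, 1, 1]                  # hypothetical exposure
    ps = [0.5] * 6                          # assumed propensity scores
    print(ipw_median_difference(y, a, ps))
```

The estimate is only as good as the propensity model: as the abstract notes, bias is minimised when the relevant models are correctly specified.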

    Impact of early intervention on the population prevalence of common mental disorders: 20-year prospective study

    BACKGROUND: The potential for early interventions to reduce the later prevalence of common mental disorders (CMD) first experienced in adolescence is unclear. AIMS: To examine the course of CMD and evaluate the extent to which the prevalence of CMD could be reduced by preventing adolescent CMD, or by intervening to change four young adult processes, between the ages of 20 and 29 years, that could be mediating the link between adolescent and adult disorder. METHOD: This was a prospective cohort study of 1923 Australian participants assessed repeatedly from adolescence (wave 1, mean age 14 years) to adulthood (wave 10, mean age 35 years). Causal mediation analysis was undertaken to evaluate the extent to which the prevalence of CMD at age 35 years in those with adolescent CMD could be reduced by either preventing adolescent CMD, or by intervening on four young adult mediating processes: the occurrence of young adult CMD, frequent cannabis use, parenting a child by age 24 years, and engagement in higher education and employment. RESULTS: At age 35, 19.2% of participants reported CMD; a quarter of these participants experienced CMD during both adolescence and young adulthood. In total, 49% of those with CMD during both adolescence and young adulthood went on to report CMD at age 35 years. Preventing adolescent CMD reduced the population prevalence at age 35 years by 3.9%. Intervening on all four young adult processes among those with adolescent CMD reduced this prevalence by 1.6%. CONCLUSIONS: In this Australian cohort, a large proportion of adolescent CMD resolved by adulthood, and by age 35 years, the largest proportion of CMD emerged among individuals without prior CMD. Time-limited, early intervention in those with earlier adolescent disorder is unlikely to substantially reduce the prevalence of CMD in midlife.

    Mediation effects that emulate a target randomised trial: Simulation-based evaluation of ill-defined interventions on multiple mediators

    Many epidemiological questions concern potential interventions to alter the pathways presumed to mediate an association. For example, we consider a study that investigates the benefit of interventions in young adulthood for ameliorating the poorer mid-life psychosocial outcomes of adolescent self-harmers relative to their healthy peers. Two methodological challenges arise. First, mediation methods have hitherto mostly focused on the elusive task of discovering pathways, rather than on the evaluation of mediator interventions. Second, the complexity of such questions is invariably such that there are no well-defined mediator interventions (i.e. actual treatments, programs, etc.) for which data exist on the relevant populations, outcomes and time-spans of interest. Instead, researchers must rely on exposure (non-intervention) data, that is, on mediator measures such as depression symptoms for which the actual interventions that one might implement to alter them are not well defined. We propose a novel framework that addresses these challenges by defining mediation effects that map to a target trial of hypothetical interventions targeting multiple mediators for which we simulate the effects. Specifically, we specify a target trial addressing three policy-relevant questions, regarding the impacts of hypothetical interventions that would shift the mediators' distributions (separately under various interdependence assumptions, jointly or sequentially) to user-specified distributions that can be emulated with the observed data. We then define novel interventional effects that map to this trial, simulating shifts by setting mediators to random draws from those distributions. We show that estimation using a g-computation method is possible under an expanded set of causal assumptions relative to inference with well-defined interventions, which reflects the lower level of evidence that is expected with ill-defined interventions. 
Application to the self-harm example in the Victorian Adolescent Health Cohort Study illustrates the value of our proposal for informing the design and evaluation of actual interventions in the future.

    Multiple imputation for longitudinal data: A tutorial

    Longitudinal studies are frequently used in medical research and involve collecting repeated measures on individuals over time. Observations from the same individual are invariably correlated and thus an analytic approach that accounts for this clustering by individual is required. While almost all research suffers from missing data, this can be particularly problematic in longitudinal studies as participation often becomes harder to maintain over time. Multiple imputation (MI) is widely used to handle missing data in such studies. When using MI, it is important that the imputation model is compatible with the proposed analysis model. In a longitudinal analysis, this implies that the clustering considered in the analysis model should be reflected in the imputation process. Several MI approaches have been proposed to impute incomplete longitudinal data, such as treating repeated measurements of the same variable as distinct variables or using generalized linear mixed imputation models. However, the uptake of these methods has been limited, as they require additional data manipulation and use of advanced imputation procedures. In this tutorial, we review the available MI approaches that can be used for handling incomplete longitudinal data, including where individuals are clustered within higher-level clusters. We illustrate implementation with replicable R and Stata code using a case study from the Childhood to Adolescence Transition Study. Comment: 35 pages, 3 figures

    Handling missing data when estimating causal effects with targeted maximum likelihood estimation

    Targeted maximum likelihood estimation (TMLE) is increasingly used for doubly robust causal inference, but how missing data should be handled when using TMLE with data-adaptive approaches is unclear. Based on data (1992-1998) from the Victorian Adolescent Health Cohort Study, we conducted a simulation study to evaluate 8 missing-data methods in this context: complete-case analysis, extended TMLE incorporating an outcome-missingness model, the missing covariate missing indicator method, and 5 multiple imputation (MI) approaches using parametric or machine-learning models. We considered 6 scenarios that varied in terms of exposure/outcome generation models (presence of confounder-confounder interactions) and missingness mechanisms (whether outcome influenced missingness in other variables and presence of interaction/nonlinear terms in missingness models). Complete-case analysis and extended TMLE had small biases when outcome did not influence missingness in other variables. Parametric MI without interactions had large bias when exposure/outcome generation models included interactions. Parametric MI including interactions performed best in bias and variance reduction across all settings, except when missingness models included a nonlinear term. When choosing a method for handling missing data in the context of TMLE, researchers must consider the missingness mechanism and, for MI, compatibility with the analysis method. In many settings, a parametric MI approach that incorporates interactions and nonlinearities is expected to perform well.

    Gaps in the usage and reporting of multiple imputation for incomplete data: findings from a scoping review of observational studies addressing causal questions

    Background: Missing data are common in observational studies and often occur in several of the variables required when estimating a causal effect, i.e. the exposure, outcome and/or variables used to control for confounding. Analyses involving multiple incomplete variables are not as straightforward as analyses with a single incomplete variable. For example, in the context of multivariable missingness, the standard missing data assumptions (“missing completely at random”, “missing at random” [MAR], “missing not at random”) are difficult to interpret and assess. It is not clear how the complexities that arise due to multivariable missingness are being addressed in practice. The aim of this study was to review how missing data are managed and reported in observational studies that use multiple imputation (MI) for causal effect estimation, with a particular focus on missing data summaries, missing data assumptions, primary and sensitivity analyses, and MI implementation. Methods: We searched five top general epidemiology journals for observational studies that aimed to answer a causal research question and used MI, published between January 2019 and December 2021. Article screening and data extraction were performed systematically. Results: Of the 130 studies included in this review, 108 (83%) derived an analysis sample by excluding individuals with missing data in specific variables (e.g., outcome) and 114 (88%) had multivariable missingness within the analysis sample. Forty-four (34%) studies provided a statement about missing data assumptions, 35 of which stated the MAR assumption, but only 11/44 (25%) studies provided a justification for these assumptions. The number of imputations, MI method and MI software were generally well-reported (71%, 75% and 88% of studies, respectively), while aspects of the imputation model specification were not clear for more than half of the studies. 
A secondary analysis that used a different approach to handle the missing data was conducted in 69/130 (53%) studies. Of these 69 studies, 68 (99%) lacked a clear justification for the secondary analysis. Conclusion: Effort is needed to clarify the rationale for and improve the reporting of MI for estimation of causal effects from observational data. We encourage greater transparency in making and reporting analytical decisions related to missing data.