8 research outputs found

    Variable selection in multivariate multiple regression

    Introduction: In many practical situations, we are interested in the effect of covariates on correlated multiple responses. In this paper, we focus on estimation and variable selection in multi-response multiple regression models. Correlation among the response variables must be modeled for valid inference. Method: We used an extension of the generalized estimating equation (GEE) methodology to simultaneously analyze binary, count, and continuous outcomes with nonlinear functions. Variable selection plays an important role in modeling correlated responses because of the large number of model parameters that must be estimated. We propose a penalized-likelihood approach based on the extended GEEs for simultaneous parameter estimation and variable selection. Results and conclusions: We conducted a series of Monte Carlo simulations to investigate the performance of our method, considering different sample sizes and numbers of response variables. The results showed that our method works well compared to treating the responses as uncorrelated. We recommend using an unstructured correlation model with the Bayesian information criterion (BIC) to select the tuning parameters. We demonstrated our method using data from a concrete slump test.

    Ensemble-based Classification Models for Predicting Post-Operative Mortality Risk in Coronary Artery Disease

    Introduction: There has been an increased demand for more accurate prediction tools to aid clinical decision-making regarding disease diagnosis and prognosis for coronary artery disease (CAD) patients. Patients undergoing CABG surgery are older and a larger number have had previous heart surgery. Consequently, mortality after CABG is expected to increase despite procedural advances. Objectives and Approach: This study aims to compare the predictive performance of random forest (RF) and logistic regression (LR) classifiers for predicting 30-day and 1-year post-operative mortality risk in CAD patients who underwent CABG. Data were obtained by linking the Alberta Provincial Project for Outcome Assessment in Coronary Heart Disease (APPROACH) registry, a prospective longitudinal dataset of patients undergoing cardiac catheterization in Alberta, Canada, to a vital statistics database. All patients who underwent first-time isolated CABG between January 1, 2007 and December 31, 2012 were included in the analysis. Area under the receiver operating characteristic curve (AUC) was used to compare the predictive performance of the LR and RF classifiers. Results: Among the 4,908 eligible subjects who underwent isolated CABG during the study period, the 30-day and 1-year post-CABG mortality estimates were 1.59% and 3.85%, respectively. Descriptive analysis revealed that age, sex, hypertension, dialysis, cerebrovascular disease, chronic obstructive pulmonary disease, and chronic heart failure were associated with 30-day and 1-year mortality. The accuracies of the LR and RF classifiers in predicting 30-day mortality were 74.1% and 99.7%, respectively. For 1-year post-CABG mortality, the corresponding accuracies were 74% and 97.4%. Conclusion/Implications: This study shows that the RF classifier achieves better predictive accuracy than LR in predicting post-operative mortality risk in CAD patients.
Machine learning models are potentially useful for developing clinical prediction models that can be used to aid the monitoring of post-discharge outcomes in the management of cardiovascular diseases.
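
The abstract above names two performance metrics, AUC and accuracy. As a minimal sketch of how the two are computed, and of how they can differ when the event rate is low, the following Python snippet builds a toy outcome vector with 4,908 subjects and 78 deaths. The count of 78 is an assumption (it is roughly 1.59% of 4,908, not a figure reported in the abstract), and the "trivial" classifier is hypothetical; it is not one of the study's models.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy outcome vector: 78 deaths is an assumed count, not a reported one.
n_total = 4908
n_deaths = 78
y_true = [1] * n_deaths + [0] * (n_total - n_deaths)

# Hypothetical classifier that predicts "survives" for everyone.
trivial_pred = [0] * n_total
trivial_scores = [0.0] * n_total

trivial_accuracy = accuracy(y_true, trivial_pred)
trivial_auc = auc(y_true, trivial_scores)
print(f"accuracy = {trivial_accuracy:.4f}, AUC = {trivial_auc:.2f}")
```

Under these assumptions the trivial classifier is right for every survivor, so its accuracy is high even though its scores carry no ranking information; this is one reason the two metrics are usually reported side by side for rare outcomes.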

    Variable selection in multivariate multiple regression

    Multivariate analysis is a common statistical tool for assessing covariate effects when only one response, or multiple response variables of the same type, are collected in experimental studies. However, with mixed continuous and discrete outcomes, traditional modeling approaches are no longer appropriate. The common approach to inference is to model each outcome separately, ignoring the potential correlation among the responses. However, a statistical analysis that incorporates this association may yield improved precision. Coffey and Gennings (2007a) proposed an extension of the generalized estimating equations (GEE) methodology to simultaneously analyze binary, count, and continuous outcomes with nonlinear functions. Variable selection plays a pivotal role in modeling correlated responses because of the large number of covariates involved. Thus, a parsimonious model is always desirable to enhance model predictability and interpretation. To perform parameter estimation and variable selection simultaneously in the presence of mixed discrete and continuous outcomes, we propose a penalized approach based on the extended generalized estimating equations. This approach requires specifying only the first two marginal moments and a working correlation structure. An advantageous feature of the penalized GEE is that the consistency of the model holds even if the working correlation is misspecified. However, it is important to use an appropriate working correlation structure in small samples, since it improves the statistical efficiency of the regression parameter estimates. We develop a computational algorithm for estimating the parameters using the local quadratic approximation (LQA) algorithm proposed by Fan and Li (2001). For tuning parameter selection, we explore the performance of the unweighted Bayesian information criterion (BIC) and generalized cross-validation (GCV) for the least absolute shrinkage and selection operator (LASSO) and smoothly clipped absolute deviation (SCAD) penalties.
We discuss the asymptotic properties of the penalized GEE estimator when the number of subjects n goes to infinity. Our simulation studies reveal that when correlated mixed outcomes are available, estimates of regression parameters are unbiased regardless of the choice of correlation structure. However, estimates obtained from the unstructured working correlation (UWC) have reduced standard errors. SCAD with the BIC tuning criterion works well in selecting important variables. Our approach is applied to a concrete slump test data set.
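
For readers unfamiliar with the SCAD penalty and the LQA step mentioned in this abstract, here is a minimal Python sketch of the standard SCAD penalty, its derivative, and the per-coefficient weight used in a local quadratic approximation. The function names, the default a = 3.7 (the value commonly suggested by Fan and Li), and the small `eps` guard are illustrative assumptions; this is not the authors' implementation, and it omits the GEE estimating equations themselves.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lam(|theta|): linear near zero, quadratic in the
    middle, and constant beyond a * lam."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def scad_derivative(theta, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |theta|."""
    t = abs(theta)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)


def lqa_weight(theta0, lam, a=3.7, eps=1e-8):
    """Weight p'_lam(|theta0|) / |theta0| of the quadratic term that
    locally replaces the penalty around the current estimate theta0.
    eps is an assumed guard against division by zero."""
    return scad_derivative(theta0, lam, a) / max(abs(theta0), eps)


# Small coefficients are penalized like LASSO; large ones are not shrunk.
for coef in (0.2, 1.5, 5.0):
    print(coef, scad_penalty(coef, lam=1.0), lqa_weight(coef, lam=1.0))
```

The weight drops to zero once a coefficient exceeds a * lam, so large effects are left unshrunk, whereas the LASSO penalty shrinks every coefficient by the same amount.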

    Classification Models for Multivariate Non-normal Repeated Measures Data

    Multivariate repeated measures data, in which multiple outcomes are repeatedly measured at two or more occasions, are commonly collected in several disciplines (e.g., medicine, ecology, environmental sciences), where investigators seek to discriminate between population groups or make predictions based on changes in multiple correlated outcomes over time. Repeated measures discriminant analysis methods have been developed and applied to address these research questions. These classification models, which have been mostly developed based on growth curve models, covariance pattern models, and mixed-effects models, are advantageous in that they can account for complex correlation structures in multivariate repeated measures data (e.g., within-outcome and between-outcome correlations) to improve their predictive accuracy. However, they largely rely on the assumption of multivariate normality, which is rarely satisfied in multivariate repeated measures data. To our knowledge, there has been limited investigation of the behavior of these existing models in multivariate non-normal repeated measures data. The overarching goal of this research was to develop robust repeated measures discriminant analysis classifiers for multivariate non-normal repeated measures data. Specifically, we developed repeated measures discriminant analysis classifiers based on maximum trimmed likelihood estimators (MTLE) and generalized estimating equations (GEE) estimators and examined their accuracy in comparison to classifiers based on maximum likelihood estimation (MLE) using Monte Carlo methods. The simulation conditions examined included population distribution, sample size, covariance structure (between-outcome and within-outcome), covariance heterogeneity, number of repeated occasions, and number of outcome variables.
The Monte Carlo study results indicated that the proposed methods increased overall mean classification accuracy by 2% to 15% in multivariate non-normal repeated measures data compared to repeated measures discriminant analysis based on MLE under most scenarios. Data from two cohort studies were used to illustrate the implementation of the proposed repeated measures discriminant analysis methods. The outcomes of this research include novel multivariate classifiers for predicting group membership in multivariate normal and non-normal repeated measures data. This research contributes to the advancement of statistical science on methods for analyzing multivariate repeated measures data.

    Multivariate Trajectories Across Multiple Domains of Health-Related Quality of Life in Children with New-Onset Epilepsy

    The diagnosis of epilepsy in children is known to impact the trajectory of their health-related quality of life (HRQOL) over time. However, there is limited knowledge about variations in longitudinal trajectories across multiple domains of HRQOL. This study aims to characterize the heterogeneity in HRQOL trajectories across multiple HRQOL domains and to evaluate predictors of differences among the identified trajectory groups in children with new-onset epilepsy. Data were obtained from the Health Related Quality of Life in Children with Epilepsy Study (HERQULES), a prospective multi-center study of 373 children with new-onset epilepsy who were followed up over 2 years. Child HRQOL and family factors were reported by parents, and clinical characteristics were reported by neurologists. Group-based multi-trajectory modeling was adopted to characterize longitudinal trajectories of HRQOL as measured by the individual domains of cognitive, emotional, physical, and social functioning in the 55-item Quality of Life in Childhood Epilepsy Questionnaire (QOLCE-55). Multinomial logistic regression was used to assess potential factors that explain differences among the identified latent trajectory groups. Three distinct HRQOL trajectory subgroups were identified in children with new-onset epilepsy based on HRQOL scores: High (44.7%), Intermediate (37.0%), and Low (18.3%). While most trajectory groups exhibited increasing scores over time on physical and social domains, both flat and declining trajectories were noted on emotional and cognitive domains. Less severe epilepsy, an absence of cognitive and behavioral problems, lower parental depression scores, better family functioning, and fewer family demands were associated with a High or Intermediate HRQOL trajectory. The course of HRQOL over time in children with new-onset epilepsy appears to follow one of three different trajectories.
Addressing the clinical and psychosocial determinants identified for each pattern can help clinicians provide more targeted care to these children and their families.

    Adaptation of the Wound Healing Questionnaire universal-reporter outcome measure for use in global surgery trials (TALON-1 study): mixed-methods study and Rasch analysis

    Background: The Bluebelle Wound Healing Questionnaire (WHQ) is a universal-reporter outcome measure developed in the UK for remote detection of surgical-site infection after abdominal surgery. This study aimed to explore cross-cultural equivalence, acceptability, and content validity of the WHQ for use across low- and middle-income countries, and to make recommendations for its adaptation. Methods: This was a mixed-methods study within a trial (SWAT) embedded in an international randomized trial, conducted according to best practice guidelines, and co-produced with community and patient partners (TALON-1). Structured interviews and focus groups were used to gather data regarding cross-cultural, cross-contextual equivalence of the individual items and scale, and conduct a translatability assessment. Translation was completed into five languages in accordance with Mapi recommendations. Next, data from a prospective cohort (SWAT) were interpreted using Rasch analysis to explore scaling and measurement properties of the WHQ. Finally, qualitative and quantitative data were triangulated using a modified, exploratory, instrumental design model. Results: In the qualitative phase, 10 structured interviews and six focus groups took place with a total of 47 investigators across six countries. Themes related to comprehension, response mapping, retrieval, and judgement were identified with rich cross-cultural insights. In the quantitative phase, an exploratory Rasch model was fitted to data from 537 patients (369 excluding extremes). Owing to the number of extreme (floor) values, the overall level of power was low. The single WHQ scale satisfied tests of unidimensionality, indicating validity of the ordinal total WHQ score. There was significant overall model misfit of five items (5, 9, 14, 15, 16) and local dependency in 11 item pairs. The person separation index was estimated as 0.48, suggesting weak discrimination between classes, whereas Cronbach's α was high at 0.86.
Triangulation of qualitative data with the Rasch analysis supported recommendations for cross-cultural adaptation of the WHQ items 1 (redness), 3 (clear fluid), 7 (deep wound opening), 10 (pain), 11 (fever), 15 (antibiotics), 16 (debridement), 18 (drainage), and 19 (reoperation). Changes to three item response categories (1, not at all; 2, a little; 3, a lot) were adopted for symptom items 1 to 10, and two categories (0, no; 1, yes) for item 11 (fever). Conclusion: This study made recommendations for cross-cultural adaptation of the WHQ for use in global surgical research and practice, using co-produced mixed-methods data from three continents. Translations are now available for implementation into remote wound assessment pathways.
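
The abstract above reports Cronbach's α as a reliability summary. As a minimal sketch of how that coefficient is usually computed from item-level responses, the Python snippet below applies the standard formula α = k/(k-1) × (1 - Σ item variances / variance of total scores). The toy response matrix and the function name are hypothetical; they are not WHQ data and not the study's analysis code.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item
    scores. Uses sample variances (n - 1 denominator) throughout."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 5 respondents answering 3 items scored 1 to 3.
toy_responses = [
    [1, 1, 2],
    [2, 2, 2],
    [3, 3, 3],
    [1, 2, 1],
    [3, 2, 3],
]
toy_alpha = cronbach_alpha(toy_responses)
print(f"alpha = {toy_alpha:.3f}")
```

Items that move together inflate the variance of the total score relative to the sum of item variances, which pushes α toward 1; unrelated items push it toward 0.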