An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++
Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm that overfits the training data, produces overoptimistic estimated error rates, and makes subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's replicate sample set, which reduces the dataset size and incurs a loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), a forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data, with two caveats: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code, stand-alone compiled command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, and a user manual (Supplementary File S2) are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
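The sampling idea at the heart of RF++ (drawing whole subjects rather than individual replicate rows into each tree's bootstrap sample, then aggregating replicate-level votes into a single subject-level call) can be sketched roughly as follows. This is a minimal Python illustration, not the RF++ C++ implementation; the function names, the use of scikit-learn decision trees, and the assumption of non-negative integer class labels are ours.

```python
# Minimal sketch of subject-level bootstrapping for cluster-correlated data.
# NOT the RF++ C++ implementation; it only illustrates the sampling idea
# described in the abstract. X, y, subject_ids are assumed to be numpy arrays.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def subject_level_bootstrap_forest(X, y, subject_ids, n_trees=100, random_state=0):
    """Grow a forest in which each tree's bootstrap sample draws whole subjects
    (clusters of replicates) with replacement instead of individual rows."""
    rng = np.random.default_rng(random_state)
    subjects = np.unique(subject_ids)
    trees = []
    for _ in range(n_trees):
        sampled = rng.choice(subjects, size=len(subjects), replace=True)
        # Every replicate row of a sampled subject enters the in-bag set;
        # subjects drawn twice contribute their rows twice.
        rows = np.concatenate([np.flatnonzero(subject_ids == s) for s in sampled])
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31 - 1)))
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees

def predict_subject_level(trees, X, subject_ids):
    """Post-processing step: majority vote over all trees and all replicates
    of a subject yields one subject-level class label."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_rows)
    row_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return {s: int(np.bincount(row_pred[subject_ids == s]).argmax())
            for s in np.unique(subject_ids)}
```

A subject-level-averaging (SLA) baseline would instead collapse each subject's replicate rows to their mean before fitting, which is exactly the loss of information the abstract describes.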
Higher education delays and shortens cognitive impairment. A multistate life table analysis of the US Health and Retirement Study
Improved health may extend or shorten the duration of cognitive impairment by postponing incidence or death. We assess the duration of cognitive impairment in the US Health and Retirement Study (1992–2004) by self-reported BMI, smoking and level of education in men and women and in three ethnic groups. We define multistate life tables by the transition rates to cognitive impairment, recovery and death, and estimate Cox proportional hazard ratios for the studied determinants; 95% confidence intervals are obtained by bootstrapping. 55-year-old white men and women expect to live 25.4 and 30.0 years, of which 1.7 [95% confidence interval 1.5; 1.9] and 2.7 [2.4; 2.9] years, respectively, with cognitive impairment. Black men and women both live 3.7 [2.9; 4.5] years longer with cognitive impairment than whites; Hispanic men and women live 3.2 [1.9; 4.6] and 5.8 [4.2; 7.5] years longer. BMI makes no difference. Smoking decreases the duration of cognitive impairment by 0.8 [0.4; 1.3] years through higher mortality. Highly educated men and women live longer, but spend 1.6 [1.1; 2.2] and 1.9 [1.6; 2.6] fewer years with cognitive impairment than men and women with low education. The effect of education is more pronounced among ethnic minorities. Higher life expectancy generally goes together with a longer period of cognitive impairment, but not at higher levels of education, which extends life in good cognitive health yet shortens the period of cognitive impairment. The increased duration of cognitive impairment in minority ethnic groups needs further study, in Europe as well.
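A bare-bones version of the multistate life table computation, accumulating expected years spent cognitively healthy versus impaired from age-specific transition probabilities, might look like the sketch below. The three-state structure (healthy, impaired, dead) follows the abstract, but the transition probabilities are invented placeholders rather than the HRS estimates, and the function names are ours.

```python
# Illustrative multistate life table sketch (not the authors' code or estimates).
import numpy as np

def p_transition(age):
    """Illustrative age-dependent annual transition matrix (rows sum to 1).
    States: 0 = cognitively healthy, 1 = cognitively impaired, 2 = dead."""
    incidence = min(0.002 * 1.09 ** (age - 55), 0.5)   # healthy -> impaired
    recovery  = 0.15                                   # impaired -> healthy
    mort_h    = min(0.005 * 1.09 ** (age - 55), 0.6)   # healthy -> dead
    mort_i    = min(2.0 * mort_h, 0.8)                 # impaired -> dead (excess mortality)
    return np.array([
        [1 - incidence - mort_h, incidence,             mort_h],
        [recovery,               1 - recovery - mort_i, mort_i],
        [0.0,                    0.0,                   1.0],
    ])

def expected_years(p_transition, start_age=55, end_age=110):
    """Accumulate expected person-years in each living state for a cohort that is
    100% cognitively healthy at start_age (mid-interval approximation)."""
    occupancy = np.array([1.0, 0.0, 0.0])          # distribution over the 3 states
    years_healthy = years_impaired = 0.0
    for age in range(start_age, end_age):
        years_healthy  += 0.5 * occupancy[0]
        years_impaired += 0.5 * occupancy[1]
        occupancy = occupancy @ p_transition(age)  # one-year Markov step
        years_healthy  += 0.5 * occupancy[0]
        years_impaired += 0.5 * occupancy[1]
    return years_healthy, years_impaired

healthy, impaired = expected_years(p_transition)
print(f"Expected years cognitively healthy: {healthy:.1f}, impaired: {impaired:.1f}")
```

In the study itself the transition rates come from Cox models for the studied determinants and the confidence intervals from bootstrapping the whole estimation; here the transition matrix is simply assumed.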
Resolving paradoxes involving surrogate end points
We define a surrogate end point as a measure or indicator of a biological process that is obtained sooner, at less cost or less invasively than a true end point of health outcome, and that is used to draw conclusions about the effect of an intervention on the true end point. Prentice presented criteria for valid hypothesis testing of a surrogate end point that replaces a true end point. For using the surrogate end point to estimate the predicted effect of intervention on the true end point, Day and Duffy assumed the Prentice criterion and arrived at two paradoxical results: the predicted intervention effect estimated by using a surrogate can be more precise than the usual estimate of the intervention effect by using the true end point, and the variance is greatest when the surrogate end point perfectly predicts the true end point. Begg and Leung formulated similar paradoxes and concluded that they indicate a flawed conceptual strategy arising from the Prentice criterion. We resolve the paradoxes as follows. Day and Duffy compared a surrogate-based estimate of the effect of intervention on the true end point with an estimate of the effect of intervention on the true end point that uses the true end point itself. Their paradox arose because the former estimate assumes the Prentice criterion whereas the latter does not; if both or neither of these estimates assume the Prentice criterion, there is no paradox. The paradoxes of Begg and Leung, although similar to those of Day and Duffy, arise from ignoring the variability of the parameter estimates irrespective of the Prentice criterion, and they disappear when that variability is included. Our resolution of the paradoxes provides a firm foundation for future meta-analytic extensions of the approach of Day and Duffy. Copyright 2005 Royal Statistical Society.
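The role that ignoring estimation variability plays in the second pair of paradoxes can be illustrated with a generic delta-method calculation; the notation below is ours, not that of Day and Duffy or Begg and Leung, and the independence assumption is purely for illustration.

```latex
% Illustration only: generic delta-method variance accounting (our notation).
% Suppose the surrogate-based prediction of the true-end-point effect is built
% from the estimated effect on the surrogate, \hat\Delta_S, and an estimated
% surrogate-to-true association, \hat\gamma:
\[
  \hat\Delta_T \;=\; \hat\gamma\,\hat\Delta_S .
\]
% Treating \gamma as if it were known understates the uncertainty,
\[
  \operatorname{Var}\bigl(\hat\Delta_T\bigr) \;\approx\; \gamma^{2}\operatorname{Var}\bigl(\hat\Delta_S\bigr),
\]
% whereas a first-order delta-method expansion with independent estimates gives
\[
  \operatorname{Var}\bigl(\hat\Delta_T\bigr) \;\approx\;
  \gamma^{2}\operatorname{Var}\bigl(\hat\Delta_S\bigr)
  \;+\; \Delta_S^{2}\,\operatorname{Var}\bigl(\hat\gamma\bigr),
\]
% so an apparent precision advantage over the direct true-end-point estimate
% can disappear once the variability of \hat\gamma is included.
```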
Sex differences in the prevalence of mobility disability in old age: The dynamics of incidence, recovery, and mortality
Objectives. This study examined sex differences in the prevalence of mobility disability in older adults according to the influences of three components of prevalence: disability incidence, recovery from disability, and mortality. Methods. Participants in a population-based study of older adults from three communities in the United States (N = 10,263) were followed for up to 7 years. Life table methods were used to estimate the influence of each of the three components of disability prevalence in women and men. Sex differences in the probabilities of transition states were measured by relative risks derived from a single model using a Markov chain approach. Results. The proportion of disabled women increased from 22% at age 70 years to 81% at age 90 years; in men, the comparable figures were 15% and 57%. Incidence had the greatest impact on the sex differences in disability prevalence until ages 90 and older, when recovery rates had a greater impact on differences in prevalence. Mortality differences between men and women had only a modest impact on sex differences in disability prevalence. These findings initially seemed to contradict the striking sex differences observed in the relative risks for mortality in men compared with women. Subsequent graphical analyses showed that incidence, rather than recovery or mortality, largely accounted for sex differences in disability prevalence in old age. Conclusion. Disability incidence, recovery from disability, and mortality dynamically influence the sex differences in the prevalence of mobility disability; however, incidence has the greatest impact overall on the higher prevalence of disability in women compared with men.
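The Markov-chain device described in the Methods (expressing sex differences in transition-state probabilities as relative risks) can be sketched with invented counts; the state labels, counts, and code below are illustrative only and bear no relation to the study's data.

```python
# Illustrative sketch of transition-state relative risks (not the study's code or data).
import numpy as np

# States: 0 = non-disabled (ND), 1 = mobility-disabled (D), 2 = dead.
def transition_matrix(counts):
    """Row-normalize observed transition counts into a Markov transition matrix."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical person-interval transition counts over one follow-up wave, by sex.
counts_women = [[800, 150, 50],    # from ND: stay ND, become disabled, die
                [ 60, 200, 40],    # from D:  recover, stay disabled, die
                [  0,   0, 100]]   # dead is absorbing
counts_men   = [[850,  90, 60],
                [ 70, 140, 90],
                [  0,   0, 100]]

P_w = transition_matrix(counts_women)
P_m = transition_matrix(counts_men)

# Relative risks (women vs. men) for the two transitions that drive prevalence.
rr_incidence = P_w[0, 1] / P_m[0, 1]   # ND -> D
rr_recovery  = P_w[1, 0] / P_m[1, 0]   # D  -> ND
print(f"RR incidence (women/men): {rr_incidence:.2f}")
print(f"RR recovery  (women/men): {rr_recovery:.2f}")
```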