54 research outputs found

    An experimental study of the intrinsic stability of random forest variable importance measures

    BACKGROUND: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention on traditional stability under data perturbations or parameter variations, few studies include influences coming from the intrinsic randomness in generating VIMs, i.e. bagging, randomization and permutation. To address these influences, in this paper we introduce a new concept of intrinsic stability of VIMs, which is defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations and parameter variations. Two widely used VIMs, i.e., Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability. RESULTS: The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional and small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (size of sample) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both negative monotonic correlation and negative linear correlation with the intrinsic stability, while OOB accuracy has monotonic correlations with intrinsic stability. This indicates that high-dimensional, small-sample and high complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability. CONCLUSION: First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs not only comes from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users would be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high complexity datasets.
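The idea of intrinsic stability described above can be illustrated with a minimal sketch. It assumes stability is scored as the mean pairwise Spearman correlation between importance vectors from repeated runs on unchanged data; the abstract does not state the exact metric, so this choice is an assumption. The function `noisy_importance` is a hypothetical stand-in for one random forest VIM run, and all names below are illustrative.

```python
import random
from itertools import combinations


def ranks(values):
    """Return average ranks (1 = smallest); tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def intrinsic_stability(importance_runs):
    """Mean pairwise Spearman correlation over repeated importance vectors."""
    pairs = list(combinations(importance_runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


def noisy_importance(true_scores, noise, rng):
    """Hypothetical stand-in for one VIM run: true scores plus run-to-run noise."""
    return [s + rng.gauss(0.0, noise) for s in true_scores]


rng = random.Random(0)
true_scores = [0.1 * i for i in range(20)]
stable_runs = [noisy_importance(true_scores, 0.01, rng) for _ in range(10)]
unstable_runs = [noisy_importance(true_scores, 1.0, rng) for _ in range(10)]
print(intrinsic_stability(stable_runs), intrinsic_stability(unstable_runs))
```

In practice each importance vector would come from refitting the forest with a different random seed on the same data and parameters; the toy noise model only mimics that run-to-run variation.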

    Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.

    BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, is performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data
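The nested cross-validation scheme mentioned in the METHODS can be sketched as index bookkeeping. This is a minimal illustration, not the authors' implementation: the function names, fold counts and seed are assumptions. The point it shows is that tuning and feature selection (inner loop) only ever see the outer training indices.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle `indices` and yield k (train, test) pairs of index lists."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nested_cv_splits(n_samples, outer_k, inner_k, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    Inner splits are drawn only from the outer training indices, so the
    outer test fold stays untouched by tuning and feature selection.
    """
    rng = random.Random(seed)
    for outer_train, outer_test in k_fold_indices(range(n_samples), outer_k, rng):
        inner_splits = list(k_fold_indices(outer_train, inner_k, rng))
        yield outer_train, outer_test, inner_splits


for outer_train, outer_test, inner_splits in nested_cv_splits(20, 5, 3):
    for inner_train, inner_val in inner_splits:
        # No outer test sample may leak into the inner loop.
        assert not set(outer_test) & set(inner_train + inner_val)
```

Note that this scheme alone does not address the concern raised in the abstract: if batch and group are confounded, every fold shares the same confounding, so the estimate can still differ from performance on independent data.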

    Evaluation of multiple variate selection methods from a biological perspective: a nutrigenomics case study

    Genomics-based technologies produce large amounts of data. To interpret the results and identify the most important variates related to phenotypes of interest, various multivariate regression and variate selection methods are used. Although inspected for statistical performance, the relevance of multivariate models in interpreting biological data sets often remains elusive. We compare various multivariate regression and variate selection methods applied to a nutrigenomics data set in terms of performance, utility and biological interpretability. The studied data set comprised hepatic transcriptome (10,072 predictor variates) and plasma protein concentrations [2 dependent variates: Leptin (LEP) and Tissue inhibitor of metalloproteinase 1 (TIMP-1)] collected during a high-fat diet study in ApoE3Leiden mice. The multivariate regression methods used were: partial least squares “PLS”; a genetic algorithm-based multiple linear regression, “GA-MLR”; two least-angle shrinkage methods, “LASSO” and “ELASTIC NET”; and a variant of PLS that uses covariance-based variate selection, “CovProc.” Two methods of ranking the genes for Gene Set Enrichment Analysis (GSEA) were also investigated: either by their correlation with the protein data or by the stability of the PLS regression coefficients. The regression methods performed similarly, with CovProc and GA performing the best and worst, respectively (R-squared values based on “double cross-validation” predictions of 0.762 and 0.451 for LEP; and 0.701 and 0.482 for TIMP-1). CovProc, LASSO and ELASTIC NET all produced parsimonious regression models and consistently identified small subsets of variates, with high commonality between the methods. Comparison of the gene ranking approaches found a high degree of agreement, with PLS-based ranking finding fewer significant gene sets. 
We recommend the use of CovProc for variate selection, in tandem with univariate methods, and the use of correlation-based ranking for GSEA-like pathway analysis methods.

    Prodigious submarine landslides during the inception and early growth of volcanic islands

    Volcanic island inception applies large stresses as the ocean crust domes in response to magma ascent and is loaded by eruption of lavas. There is currently limited information on when volcanic islands are initiated on the seafloor, and no information regarding the seafloor instabilities island inception may cause. The deep sea Madeira Abyssal Plain contains a 43 million year history of turbidites, many of which originate from mass movements in the Canary Islands. Here, we investigate the composition and timing of a distinctive group of turbidites that we suggest represent a new and unique record of large-volume submarine landslides triggered during the inception, submarine shield growth, and final subaerial emergence of the Canary Islands. These slides are predominantly multi-stage, yet they are among the largest mass movements on the Earth’s surface, up to three or more times larger than subaerial Canary Islands flank collapses. Thus, whilst these deposits provide invaluable information on ocean island geodynamics, they also represent a significant, and as yet unaccounted for, marine geohazard.

    Heading Down the Wrong Pathway: on the Influence of Correlation within Gene Sets

    BACKGROUND: Analysis of microarray experiments often involves testing for the overrepresentation of pre-defined sets of genes among lists of genes deemed individually significant. Most popular gene set testing methods assume the independence of genes within each set, an assumption that is seriously violated, as extensive correlation between genes is a well-documented phenomenon. RESULTS: We conducted a meta-analysis of over 200 datasets from the Gene Expression Omnibus in order to demonstrate the practical impact of strong gene correlation patterns that are highly consistent across experiments. We show that a common independence assumption-based gene set testing procedure produces very high false positive rates when applied to data sets for which treatment groups have been randomized, and that gene sets with high internal correlation are more likely to be declared significant. A reanalysis of the same datasets using an array resampling approach properly controls false positive rates, leading to more parsimonious and high-confidence gene set findings, which should facilitate pathway-based interpretation of the microarray data. CONCLUSIONS: These findings call into question many of the gene set testing results in the literature and argue strongly for the adoption of resampling based gene set testing criteria in the peer reviewed biomedical literature.

    Age-period-cohort analysis for trends in body mass index in Ireland

    Background: Obesity is a growing problem worldwide and can often result in a variety of negative health outcomes. In this study we aim to apply partial least squares (PLS) methodology to estimate the separate effects of age, period and cohort on the trends in obesity as measured by body mass index (BMI). Methods: Using PLS we will obtain gender specific linear effects of age, period and cohort on obesity. We also explore and model nonlinear relationships of BMI with age, period and cohort. We analysed the results from 7,796 men and 10,220 women collected through the SLAN (Surveys of Lifestyle, Attitudes and Nutrition) in Ireland in the years 1998, 2002 and 2007. Results: PLS analysis revealed a positive period effect over the years. Additionally, men born later tended to have lower BMI (-0.026 kg·m⁻² yr⁻¹, 95% CI: -0.030 to -0.024) and older men had in general higher BMI (0.029 kg·m⁻² yr⁻¹, 95% CI: 0.026 to 0.033). Similarly for women, those born later had lower BMI (-0.025 kg·m⁻² yr⁻¹, 95% CI: -0.029 to -0.022) and older women in general had higher BMI (0.029 kg·m⁻² yr⁻¹, 95% CI: 0.025 to 0.033). Nonlinear analyses revealed that BMI has a substantial curvilinear relationship with age, though less so with birth cohort. Conclusion: We notice a generally positive age and period effect but a slightly negative cohort effect. Knowing this, we have a better understanding of the different risk groups, which allows for effective public intervention measures to be designed and targeted for these specific population subgroups.

    A New Approach to Age-Period-Cohort Analysis Using Partial Least Squares Regression: The Trend in Blood Pressure in the Glasgow Alumni Cohort

    Due to a problem of identification, how to estimate the distinct effects of age, time period and cohort has been a controversial issue in the analysis of trends in health outcomes in epidemiology. In this study, we propose a novel approach, partial least squares (PLS) analysis, to separate the effects of age, period, and cohort. Our example for illustration is taken from the Glasgow Alumni cohort. A total of 15,322 students (11,755 men and 3,567 women) received medical screening at Glasgow University between 1948 and 1968. The aim is to investigate the secular trends in blood pressure between 1925 and 1950 while taking into account the year of examination and age at examination. We excluded students born before 1925 or aged over 25 years at examination and those with missing values in confounders from the analyses, resulting in 12,546 and 12,516 students for analysis of systolic and diastolic blood pressure, respectively. PLS analysis shows that both systolic and diastolic blood pressure increased with students' age, and students born later had on average lower blood pressure (SBP: −0.17 mmHg per year [95% confidence interval: −0.19 to −0.15] for men and −0.25 [−0.28 to −0.22] for women; DBP: −0.14 [−0.15 to −0.13] for men; −0.09 [−0.11 to −0.07] for women). PLS also shows a decreasing trend in blood pressure over the examination period. As identification is not a problem for PLS, it provides a flexible modelling strategy for age-period-cohort analysis. More emphasis is then required to clarify the substantive and conceptual issues surrounding the definitions and interpretations of age, period and cohort effects.
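The identification problem referred to above can be written out in a short derivation. This is a sketch under an assumed linear model: the symbols $A$ (age), $P$ (period), $C$ (birth cohort) and the coefficients $\beta$ are illustrative notation, not taken from the abstract.

```latex
\begin{align}
C &= P - A, \\
y &= \beta_0 + \beta_A A + \beta_P P + \beta_C C + \varepsilon \\
  &= \beta_0 + (\beta_A + \delta) A + (\beta_P - \delta) P + (\beta_C + \delta) C + \varepsilon
  \quad \text{for any } \delta,
\end{align}
```

because $\delta A - \delta P + \delta C = \delta (A - P + C) = 0$. Every value of $\delta$ gives the same fitted values, so ordinary least squares cannot single out one set of linear age, period and cohort effects without an extra constraint.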

    Genome size evolution at the speciation level: The cryptic species complex Brachionus plicatilis (Rotifera)

    BACKGROUND: Studies on genome size variation in animals are rarely done at lower taxonomic levels, e.g., slightly above/below the species level. Yet, such variation might provide important clues on the tempo and mode of genome size evolution. In this study we used the flow-cytometry method to study the evolution of genome size in the rotifer Brachionus plicatilis, a cryptic species complex consisting of at least 14 closely related species. RESULTS: We found an unexpectedly high variation in this species complex, with genome sizes ranging approximately seven-fold (haploid '1C' genome sizes: 0.056-0.416 pg). Most of this variation (67%) could be ascribed to the major clades of the species complex, i.e. clades that are well separated according to most species definitions. However, we also found substantial variation (32%) at lower taxonomic levels - within and among genealogical species - and, interestingly, among species pairs that are not completely reproductively isolated. In one genealogical species, called B. 'Austria', we found greatly enlarged genome sizes that could roughly be approximated as multiples of the genomes of its closest relatives, which suggests that whole-genome duplications have occurred early during separation of this lineage. Overall, genome size was significantly correlated to egg size and body size, even though the latter became non-significant after controlling for phylogenetic non-independence. CONCLUSIONS: Our study suggests that substantial genome size variation can build up early during speciation, potentially even among isolated populations. An alternative, but not mutually exclusive, interpretation might be that reproductive isolation tends to build up unusually slowly in this species complex.

    Genetic population structure of Anopheles gambiae in Equatorial Guinea

    BACKGROUND: Patterns of genetic structure among mosquito vector populations in islands have received particular attention as these are considered potentially suitable sites for experimental trials on transgenic-based malaria control strategies. In this study, levels of genetic differentiation have been estimated between populations of Anopheles gambiae s.s. from the islands of Bioko and Annobón, and from continental Equatorial Guinea (EG) and Gabon. METHODS: Genotyping of 11 microsatellite loci located in chromosome 3 was performed in three island samples (two in Bioko and one in Annobón) and three mainland samples (two in EG and one in Gabon). Four samples belonged to the M molecular form and two to the S-form. Microsatellite data was used to estimate genetic diversity parameters, perform demographic equilibrium tests and analyse population differentiation. RESULTS: High levels of genetic differentiation were found between the more geographically remote island of Annobón and the continent, contrasting with the shallow differentiation between Bioko island, closest to mainland, and continental localities. In Bioko, differentiation between M and S forms was higher than that observed between island and mainland samples of the same molecular form. CONCLUSION: The observed patterns of population structure seem to be governed by the presence of both physical (the ocean) and biological (the M-S form discontinuity) barriers to gene flow. The significant degree of genetic isolation between M and S forms detected by microsatellite loci located outside the "genomic islands" of speciation identified in A. gambiae s.s. further supports the hypothesis of on-going incipient speciation within this species. The implications of these findings regarding vector control strategies are discussed