3,432 research outputs found

    Empirical study of bagging predictors on medical data

    This study investigates the performance of bagging when learning from imbalanced medical data. It is important for data miners to achieve highly accurate prediction models, and this is especially true for imbalanced medical applications. In these situations, practitioners are more interested in the minority class than in the majority class; however, it is hard for a traditional supervised learning algorithm to achieve a highly accurate prediction on the minority class, even though it might achieve good results according to the most commonly used evaluation metric, Accuracy. Bagging is a simple yet effective ensemble method that has been applied to many real-world applications. However, some questions have not been well answered, e.g., whether bagging outperforms single learners on medical data-sets; which learners are the best predictors for each medical data-set; and what the best achievable predictive performance is for each medical data-set when sampling techniques are applied. We perform an extensive empirical study of 12 learning algorithms on 8 medical data-sets, using four evaluation metrics: True Positive Rate (TPR), True Negative Rate (TNR), the Geometric Mean (G-mean) of the accuracy rates of the majority and minority classes, and Accuracy. In addition, the statistical analyses performed instil confidence in the validity of the conclusions of this research. © 2011, Australian Computer Society, Inc.
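
    The metrics named above follow directly from the confusion matrix, with G-mean defined as the square root of TPR times TNR. The sketch below is a minimal illustration of how such a comparison can be run, assuming scikit-learn and a synthetic imbalanced data-set rather than the study's 12 learners and 8 medical data-sets.

```python
# Minimal sketch (not the study's setup): score a bagging predictor on an
# imbalanced binary task with TPR, TNR, G-mean and Accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced medical data-set (about 5% minority class).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# The default base learner is a decision tree, a typical unstable learner for bagging.
model = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
tpr = tp / (tp + fn)          # accuracy on the minority (positive) class
tnr = tn / (tn + fp)          # accuracy on the majority (negative) class
g_mean = np.sqrt(tpr * tnr)   # geometric mean of the two class accuracies
print(f"TPR={tpr:.3f}  TNR={tnr:.3f}  G-mean={g_mean:.3f}  "
      f"Accuracy={accuracy_score(y_te, y_hat):.3f}")
```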

    An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics in the past few years. High-dimensional problems are common not only in genetics, but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable through both main effects and interactions. The aim of this work is to introduce the principles of standard recursive partitioning methods as well as recent methodological improvements, to illustrate their use for low- and high-dimensional data exploration, and also to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
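
    The article's own illustrations rely on freely available R packages; the sketch below is a rough Python substitute of my own (scikit-learn rather than the R implementations), showing how a random forest's permutation-based variable importance can be inspected for exploratory purposes.

```python
# Rough Python analogue (the article itself uses R): fit a random forest and
# inspect permutation-based variable importance measures.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# Permutation importance: the drop in held-out accuracy when a predictor is
# shuffled, capturing its contribution through main effects and interactions.
imp = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
ranked = sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1])
for name, score in ranked[:5]:
    print(f"{name}: {score:.4f}")
```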

    Narrowing the Gap: Random Forests In Theory and In Practice

    Despite widespread interest and practical use, the theoretical properties of random forests are still not well understood. In this paper we contribute to this understanding in two ways. We present a new theoretically tractable variant of random regression forests and prove that our algorithm is consistent. We also provide an empirical evaluation, comparing our algorithm and other theoretically tractable random forest models to the random forest algorithm used in practice. Our experiments provide insight into the relative importance of different simplifications that theoreticians have made to obtain tractable models for analysis.
    Comment: Under review by the International Conference on Machine Learning (ICML) 201
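
    The paper's own simplified variant is not reproduced here; as a loose stand-in, the sketch below contrasts the practical random forest with extremely randomized trees, a more randomized relative of the kind theoreticians find easier to analyse, on a synthetic regression task.

```python
# Hedged illustration (not the paper's exact variant): compare a standard
# random regression forest with a more randomized, analysis-friendly relative
# (extremely randomized trees, which choose split points at random).
from sklearn.datasets import make_friedman1
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

practical = RandomForestRegressor(n_estimators=200, random_state=0)
simplified = ExtraTreesRegressor(n_estimators=200, random_state=0)

for name, model in [("random forest", practical), ("extra trees", simplified)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: cross-validated MSE = {-scores.mean():.3f}")
```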

    Ensemble predictions : empirical studies on learners' performance and sample distributions

    University of Technology, Sydney. Faculty of Engineering and Information Technology.
    Imbalanced data problems are among the most challenging in Data Mining and Machine Learning research. This dissertation investigates the performance of ensemble learning systems on different types of data environments, and proposes novel ensemble learning approaches for solving imbalanced data problems. Bagging is one of the most effective ensemble methods for classification tasks. Despite the popularity of bagging in many real-world applications, it has a major drawback on extremely imbalanced data. Much research has addressed the problems of imbalanced data by using over-sampling and/or under-sampling methods to generate an equally balanced training set to improve the performance of the prediction models. However, it is unclear which training ratio is best, and under which conditions bagging is outperformed by other sampling schemes on extremely imbalanced data. Previous research has mainly been concerned with studying unstable learners as the key to ensuring the performance gain of a bagging predictor, with many key factors remaining unclear. Some questions have not been well answered: (1) What are the key factors for bagging predictors to achieve the best predictive performance for applications? and (2) What is the impact of varying the levels of class distribution on bagging predictors in different data environments? There is a lack of empirical investigation of these issues in the literature. The main contributions of this dissertation are as follows:
    1. This dissertation proposes novel approaches: uneven balanced bagging to boost the performance of the prediction model for solving imbalanced problems, and hybrid-sampling to enhance bagging for solving highly imbalanced time series classification problems.
    2. This dissertation asserts that robustness and stability are two key factors for building a high-performance bagging predictor. It also derives a new method, a two-dimensional robustness and stability decomposition, to rank the base learners into different categories so that the performance of bagging predictors can be compared across learning algorithms. The experimental results demonstrate that bagging is influenced by the combination of robustness and instability, and indicate that robustness is important for bagging to achieve a highly accurate prediction model.
    3. This dissertation investigates the sensitivity of bagging predictors. We demonstrate that bagged MLP (multi-layer perceptron) and NB (naive Bayes) predictors are insensitive to different levels of imbalanced class distribution.
    4. This dissertation investigates the impact of varying levels of class distribution on bagging predictors with different learning algorithms across a range of data environments, to allow data mining practitioners to choose the best learners and understand what to expect when using bagging predictors.
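
    The dissertation's uneven balanced bagging and hybrid-sampling algorithms are not spelled out in the abstract; the sketch below only illustrates the general idea of combining sampling with bagging, using an under-sampling scheme of my own choosing in which every bootstrap round keeps all minority examples and a controlled share of the majority class.

```python
# Generic sketch of sampling inside bagging (not the dissertation's exact
# "uneven balanced bagging" or hybrid-sampling algorithms): each bootstrap
# round under-samples the majority class before fitting a base learner.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def balanced_bagging_fit(X, y, base=DecisionTreeClassifier(), n_estimators=25,
                         majority_ratio=1.0, rng=np.random.default_rng(0)):
    """Train base learners on re-balanced bootstrap samples; return the ensemble."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    ensemble = []
    for _ in range(n_estimators):
        # Keep every minority example, draw a controlled share of the majority;
        # majority_ratio=1.0 gives a 1:1 class balance in each round.
        maj_sample = rng.choice(majority, size=int(len(minority) * majority_ratio),
                                replace=True)
        idx = np.concatenate([minority, maj_sample])
        ensemble.append(clone(base).fit(X[idx], y[idx]))
    return ensemble

def balanced_bagging_predict(ensemble, X):
    votes = np.mean([m.predict(X) for m in ensemble], axis=0)
    return (votes >= 0.5).astype(int)   # majority vote over the base learners
```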

    A Strategy analysis for genetic association studies with known inbreeding

    Background: Association studies aim to identify the genetic variants related to a specific disease through statistical multiple hypothesis testing or segregation analysis in pedigrees. Such studies have been very successful for Mendelian monogenic disorders, but less successful in identifying genetic variants related to complex diseases, whose onset depends on interactions between different genes and the environment. Current technology allows more than a million markers to be genotyped, and this number has been increasing rapidly in recent years with imputation based on template sets and whole-genome sequencing. This type of data introduces a great amount of noise into the statistical analysis and usually requires a large number of samples. Current methods seldom take into account gene-gene and gene-environment interactions, which are fundamental, especially in complex diseases. In this paper we propose a non-parametric additive model, which accounts for interactions of unknown order, to detect the genetic variants related to disease. Although this is not new to the current literature, we show that in an isolated population, where the most closely related subjects also share most of their genetic code, the use of additive models can be improved if the available genealogical tree is taken into account. Specifically, we form a sample of cases and controls with the highest inbreeding by means of the Hungarian method, and estimate the set of genes/environmental variables associated with the disease by means of Random Forest.
    Results: Evidence from statistical theory, simulations and two applications shows that the procedure eliminates stratification between cases and controls and has enough precision to identify the genetic variants responsible for a disease. The procedure has been used successfully for beta-thalassemia, a well-known Mendelian disease, and for common asthma, where we identified candidate genes underlying asthma susceptibility. Some of these candidate genes have also been reported as related to common asthma in the current literature.
    Conclusions: The data analysis approach, based on selecting the most related cases and controls along with the Random Forest model, is a powerful tool for detecting genetic variants associated with a disease in isolated populations. Moreover, this method also provides a prediction model that is accurate in estimating unknown disease status and can be used generally to build test kits for a wide class of Mendelian diseases.
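
    As a hedged reconstruction (not the authors' code), the sketch below shows the two-stage idea on toy data: the Hungarian method, via scipy's linear_sum_assignment, pairs each case with the most related available control according to a kinship matrix, and a Random Forest then ranks markers by importance. The kinship matrix, genotype coding and variable names here are all assumptions for illustration.

```python
# Hedged sketch of the two-stage idea (a reconstruction, not the paper's code):
# (1) pair each case with the most related control via the Hungarian method,
# (2) rank genetic/environmental variables with Random Forest importance.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.ensemble import RandomForestClassifier

def match_cases_controls(kinship, case_idx, control_idx):
    """Pick one control per case, maximising total kinship (relatedness)."""
    # linear_sum_assignment minimises cost, so negate kinship to maximise it.
    cost = -kinship[np.ix_(case_idx, control_idx)]
    rows, cols = linear_sum_assignment(cost)
    return case_idx[rows], control_idx[cols]

# Toy inputs: a symmetric kinship matrix (in practice from the genealogical
# tree), SNP genotypes coded 0/1/2, and case/control status.
rng = np.random.default_rng(0)
n = 60
kinship = rng.random((n, n)); kinship = (kinship + kinship.T) / 2
X = rng.integers(0, 3, size=(n, 100))
status = rng.integers(0, 2, size=n)          # 1 = case, 0 = control

cases, controls = match_cases_controls(kinship,
                                        np.flatnonzero(status == 1),
                                        np.flatnonzero(status == 0))
keep = np.concatenate([cases, controls])
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X[keep], status[keep])
top = np.argsort(forest.feature_importances_)[::-1][:10]  # candidate markers
print("Top-ranked markers:", top)
```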