298 research outputs found

    Gene set based ensemble methods for cancer classification

    Diagnosis of cancer very often depends on conclusions drawn after both clinical and microscopic examinations of tissues to study the manifestation of the disease in order to place tumors in known categories. One factor which determines the categorization of cancer is the tissue from which the tumor originates. Information gathered from clinical exams may be partial or not completely predictive of a specific category of cancer. Further complicating the problem of categorizing various tumors is that the histological classification of the cancer tissue and description of its course of development may be atypical. Gene expression data gleaned from micro-array analysis provides tremendous promise for more accurate cancer diagnosis. One hurdle in the classification of tumors based on gene expression data is that the data space is ultra-dimensional with relatively few points; that is, there are a small number of examples with a large number of genes. A second hurdle is expression bias caused by the correlation of genes. Analysis of subsets of genes, known as gene set analysis, provides a mechanism by which groups of differentially expressed genes can be identified. We propose an ensemble of classifiers whose base classifiers are ℓ1-regularized logistic regression models with restriction of the feature space to biologically relevant genes. Some researchers have already explored the use of ensemble classifiers to classify cancer, but the effect of the underlying base classifiers in conjunction with biologically-derived gene sets on cancer classification has not been explored.
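The proposal in this abstract (one ℓ1-regularized logistic regression per gene set, combined into an ensemble) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the proximal-gradient solver, the majority vote, the hyperparameters, the toy data, and the three "gene sets" are all made up for the example.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    """Proximal operator of the l1 norm: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, l1=0.05, lr=0.1, epochs=300):
    """l1-regularized logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * row[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b

def fit_gene_set_ensemble(X, y, gene_sets):
    """Train one base model per gene set, each seeing only that set's columns."""
    models = []
    for columns in gene_sets:
        X_sub = [[row[j] for j in columns] for row in X]
        models.append((columns, fit_l1_logistic(X_sub, y)))
    return models

def predict(models, row):
    """Majority vote over the base models (ties go to class 1)."""
    votes = 0
    for columns, (w, b) in models:
        z = b + sum(wj * row[j] for wj, j in zip(w, columns))
        votes += 1 if z > 0 else 0
    return 1 if 2 * votes >= len(models) else 0

# Toy data (made up): 40 samples, 6 "genes"; genes 0 and 2 carry the class signal.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = []
for label in y:
    signal = 1.0 if label == 1 else -1.0
    row = [rng.gauss(0.0, 1.0) for _ in range(6)]
    row[0] = signal + rng.gauss(0.0, 0.2)
    row[2] = signal + rng.gauss(0.0, 0.2)
    X.append(row)

gene_sets = [[0, 1], [2, 3], [4, 5]]  # hypothetical gene sets (column indices)
models = fit_gene_set_ensemble(X, y, gene_sets)
accuracy = sum(predict(models, row) == label for row, label in zip(X, y)) / len(y)
```

In this toy setup two of the three gene sets are informative, so the vote is carried by them even though the third model sees only noise. Accuracy here is measured on the training samples and says nothing about generalization.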

    A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

    The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance. Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013 International Conference on Data Minin
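As a rough illustration of ensemble selection, one of the methods compared in this abstract, the sketch below uses the commonly described greedy forward procedure: repeatedly add (with replacement) whichever base classifier most improves the validation accuracy of the averaged ensemble. The abstract does not specify the paper's exact variant, so the procedure, the classifier names, and the validation scores are assumptions made for the example.

```python
def accuracy(avg_scores, labels):
    """Fraction of examples where thresholding the score at 0.5 matches the label."""
    hits = sum((s >= 0.5) == bool(l) for s, l in zip(avg_scores, labels))
    return hits / len(labels)

def ensemble_selection(val_scores, labels, rounds):
    """Greedy forward selection with replacement.

    val_scores maps a classifier name to its predicted probabilities on a
    held-out validation set. Returns the selected names in order.
    """
    chosen = []
    totals = [0.0] * len(labels)  # running sum of the selected models' scores
    for _ in range(rounds):
        best_name, best_acc = None, -1.0
        for name in sorted(val_scores):  # sorted for a deterministic tie-break
            candidate = [(t + s) / (len(chosen) + 1)
                         for t, s in zip(totals, val_scores[name])]
            acc = accuracy(candidate, labels)
            if acc > best_acc:
                best_name, best_acc = name, acc
        chosen.append(best_name)
        totals = [t + s for t, s in zip(totals, val_scores[best_name])]
    return chosen

# Made-up validation predictions from three hypothetical base classifiers.
labels = [1, 0, 1, 0, 1, 0]
val_scores = {
    "A": [0.9, 0.1, 0.8, 0.2, 0.4, 0.6],  # wrong on the last two examples
    "B": [0.6, 0.4, 0.3, 0.7, 0.9, 0.1],  # wrong on the middle two examples
    "C": [0.4, 0.6, 0.6, 0.4, 0.6, 0.6],  # wrong on three examples
}
chosen = ensemble_selection(val_scores, labels, rounds=2)
averaged = [sum(val_scores[name][i] for name in chosen) / len(chosen)
            for i in range(len(labels))]
final_accuracy = accuracy(averaged, labels)
```

Here A and B each get four of six examples right on their own, but their errors do not overlap, so the averaged pair classifies all six correctly. That is the diversity versus performance balance the abstract refers to, in miniature.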

    Generalized weighting for bagged ensembles

    Ensemble learning is a popular classification method where many individual simple learners contribute to a final prediction. Constructing an ensemble of learners has been shown to consistently improve prediction accuracy over a single learner. The most common types of ensembles are bootstrap aggregated (bagged), boosted, and stacked. Each is different, yet all share the same foundation of combining multiple learners. In this dissertation, we focus our attention on bagged ensembles; namely, we propose a generalization by way of model weighting. The new method is motivated by the potential instability of averaging predictions of trees that may be of highly variable quality. To alleviate this, we replace the usual arithmetic average with a Cesàro average for weighted trees in the random forest. We provide both a theoretical analysis that gives exact conditions under which we would expect this weighted ensemble approach to do well, and a numerical analysis that shows the new approach is competitive with other bagged ensembles when training a classification model on numerous realistic data sets. Going a step further, we generalize our weights to allow simultaneous control over bias and variance. In particular, we introduce a regularization term that controls the variance reduction for bagged ensembles. The result is a new tunable weighted bagged ensemble framework that provides a very flexible method for classification. Using this methodology, we explore the impact tunable weighting has on the votes of each learner in an ensemble. To aid in the applicability of this body of work, the author discusses an R package that allows users to apply the proposed weighting scheme to arbitrary bagged ensembles. The package, titled wbensembleR, provides tools for constructing tunable bagged ensembles in the form of weights.
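To make the idea of tunable weighting concrete, here is a minimal sketch of a weighted vote with a single tuning parameter. It is not the dissertation's Cesàro-based scheme and does not reflect the wbensembleR API, neither of which is described in enough detail in the abstract. The weighting rule (quality raised to a power `gamma`), the quality scores, and the votes are all hypothetical.

```python
def tunable_weights(qualities, gamma):
    """Normalized weights proportional to quality ** gamma.

    gamma = 0 gives equal weights (plain bagging); larger gamma shifts
    weight toward the learners with higher estimated quality.
    """
    raw = [q ** gamma for q in qualities]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_vote(votes, weights):
    """Return class 1 if the learners voting 1 hold at least half the weight."""
    weight_for_one = sum(w for v, w in zip(votes, weights) if v == 1)
    return 1 if weight_for_one >= 0.5 else 0

# Hypothetical ensemble: one strong learner outvoted by two weak ones.
qualities = [0.9, 0.55, 0.55]  # e.g. out-of-bag accuracies (made up)
votes = [1, 0, 0]

plain = weighted_vote(votes, tunable_weights(qualities, gamma=0))
tuned = weighted_vote(votes, tunable_weights(qualities, gamma=8))
```

With equal weights the two weak learners win the vote; with `gamma=8` the strong learner holds most of the weight and the prediction flips. This is one simple way to see how a tuning parameter can change the influence of each learner's vote.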

    Advances and applications in Ensemble Learning


    Brain Tumor Synthetic Segmentation in 3D Multimodal MRI Scans

    The magnetic resonance (MR) analysis of brain tumors is widely used for diagnosis and examination of tumor subregions. The overlapping area among the intensity distribution of healthy, enhancing, non-enhancing, and edema regions makes the automatic segmentation a challenging task. Here, we show that a convolutional neural network trained on high-contrast images can transform the intensity distribution of brain lesions in its internal subregions. Specifically, a generative adversarial network (GAN) is extended to synthesize high-contrast images. A comparison of these synthetic images and real images of brain tumor tissue in MR scans showed significant segmentation improvement and decreased the number of real channels for segmentation. The synthetic images are used as a substitute for real channels and can bypass real modalities in the multimodal brain tumor segmentation framework. Segmentation results on BraTS 2019 dataset demonstrate that our proposed approach can efficiently segment the tumor areas. In the end, we predict patient survival time based on volumetric features of the tumor subregions as well as the age of each case through several regression models

    Small margin ensembles can be robust to class-label noise

    This is the author’s version of a work that was accepted for publication in Neurocomputing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Neurocomputing, VOL 160 (2015) DOI 10.1016/j.neucom.2014.12.086. Subsampling is used to generate bagging ensembles that are accurate and robust to class-label noise. The effect of using smaller bootstrap samples to train the base learners is to make the ensemble more diverse. As a result, the classification margins tend to decrease. In spite of having small margins, these ensembles can be robust to class-label noise. The validity of these observations is illustrated in a wide range of synthetic and real-world classification tasks. In the problems investigated, subsampling significantly outperforms standard bagging for different amounts of class-label noise. By contrast, the effectiveness of subsampling in random forest is problem dependent. In these types of ensembles the best overall accuracy is obtained when the random trees are built on bootstrap samples of the same size as the original training data. Nevertheless, subsampling becomes more effective as the amount of class-label noise increases. The authors acknowledge financial support from Spanish Plan Nacional I+D+i Grant TIN2013-42351-P and from Comunidad de Madrid Grant S2013/ICE-2845 CASI-CAM-CM.
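The mechanism described in this abstract, bagging with bootstrap samples smaller than the training set, can be sketched as follows. This is a toy illustration only: the decision-stump base learner, the one-dimensional synthetic data, the 20% label-flip rate, and the sampling ratio are assumptions for the example and are not taken from the paper's experiments.

```python
import random

def fit_stump(points):
    """Pick the threshold with the fewest training errors.

    The stump predicts class 1 when x > threshold; candidate thresholds
    are the x values present in the sample.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def subsampled_bagging(points, n_learners, ratio, rng):
    """Train each stump on a bootstrap sample of size ratio * len(points)."""
    size = max(1, int(ratio * len(points)))
    return [fit_stump([rng.choice(points) for _ in range(size)])
            for _ in range(n_learners)]

def predict(thresholds, x):
    """Majority vote of the stumps."""
    votes = sum(x > t for t in thresholds)
    return 1 if 2 * votes > len(thresholds) else 0

# Synthetic data: the true class is 1 when x > 0.5; each label is then
# flipped with probability 0.2 to simulate class-label noise.
rng = random.Random(1)
points = []
for _ in range(200):
    x = rng.random()
    label = 1 if x > 0.5 else 0
    if rng.random() < 0.2:
        label = 1 - label
    points.append((x, label))

# ratio=1.0 would be standard bagging; ratio=0.2 uses samples of 40 points.
thresholds = subsampled_bagging(points, n_learners=25, ratio=0.2, rng=rng)
```

Each stump sees a different small sample, so the learned thresholds vary more than they would with full-size bootstrap samples. That variation is the increased diversity the abstract mentions. Whether it improves accuracy under noise is an empirical question that this sketch does not settle.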

    Building well-performing classifier ensembles: model and decision level combination.

    There is a continuing drive for better, more robust generalisation performance from classification systems, and prediction systems in general. Ensemble methods, or the combining of multiple classifiers, have become an accepted and successful tool for doing this, though the reasons for success are not always entirely understood. In this thesis, we review the multiple classifier literature and consider the properties an ensemble of classifiers - or collection of subsets - should have in order to be combined successfully. We find that the framework of Stochastic Discrimination provides a well-defined account of these properties, which are shown to be strongly encouraged in a number of the most popular/successful methods in the literature via differing algorithmic devices. This uncovers some interesting and basic links between these methods, and aids understanding of their success and operation in terms of a kernel induced on the training data, with form particularly well suited to classification. One property that is desirable in both the SD framework and in a regression context, the ambiguity decomposition of the error, is de-correlation of individuals. This motivates the introduction of the Negative Correlation Learning method, in which neural networks are trained in parallel in a way designed to encourage de-correlation of the individual networks. The training is controlled by a parameter λ governing the extent to which correlations are penalised. Theoretical analysis of the dynamics of training results in an exact expression for the interval in which we can choose λ while ensuring stability of the training, and a value λ∗ for which the training has some interesting optimality properties. These values depend only on the size N of the ensemble. Decision level combination methods often result in a difficult to interpret model, and NCL is no exception. However in some applications, there is a need for understandable decisions and interpretable models. 
    In response to this, we depart from the standard decision level combination paradigm to introduce a number of model level combination methods. As decision trees are one of the most interpretable model structures used in classification, we chose to combine structure from multiple individual trees to build a single combined model. We show that extremely compact, well performing models can be built in this way. In particular, a generalisation of bottom-up pruning to a multiple-tree context produces good results in this regard. Finally, we develop a classification system for a real-world churn prediction problem, illustrating some of the concepts introduced in the thesis, and a number of more practical considerations which are of importance when developing a prediction system for a specific problem.
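The ambiguity decomposition mentioned in this abstract can be checked numerically. For a uniformly weighted ensemble with outputs f_i, mean output f_bar, and target d, the squared error of the ensemble equals the average individual squared error minus the average squared deviation of the members from f_bar (the ambiguity). The sketch below also computes the penalty term as it is usually written in the Negative Correlation Learning literature, p_i = (f_i - f_bar) * sum over j != i of (f_j - f_bar). That the thesis uses exactly this form is an assumption, and the outputs and target are made-up numbers.

```python
def ambiguity_decomposition(outputs, target):
    """Return (ensemble error, average member error, ambiguity)."""
    n = len(outputs)
    mean_output = sum(outputs) / n
    ensemble_error = (mean_output - target) ** 2
    average_error = sum((f - target) ** 2 for f in outputs) / n
    ambiguity = sum((f - mean_output) ** 2 for f in outputs) / n
    return ensemble_error, average_error, ambiguity

def ncl_penalties(outputs):
    """Per-member penalty p_i = (f_i - f_bar) * sum_{j != i} (f_j - f_bar)."""
    mean_output = sum(outputs) / len(outputs)
    deviations = [f - mean_output for f in outputs]
    total = sum(deviations)  # zero up to rounding error
    return [dev * (total - dev) for dev in deviations]

# Made-up outputs of three ensemble members for one input, and its target.
outputs = [0.2, 0.5, 0.9]
target = 0.6

ensemble_error, average_error, ambiguity = ambiguity_decomposition(outputs, target)
penalties = ncl_penalties(outputs)
```

Because the deviations from the mean sum to zero, each penalty reduces to minus the squared deviation of that member. Adding λ times the penalty to a member's squared error therefore rewards spreading the members' outputs around the ensemble mean, which is how the parameter λ controls the extent to which correlations are penalised.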