
    Popular Ensemble Methods: An Empirical Study

    An ensemble consists of a set of individually trained classifiers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble. Bagging (Breiman, 1996c) and Boosting (Freund and Schapire, 1996; Schapire, 1990) are two relatively new but popular methods for producing ensembles. In this paper we evaluate these methods on 23 data sets using both neural networks and decision trees as our classification algorithms. Our results clearly indicate a number of conclusions. First, while Bagging is almost always more accurate than a single classifier, it is sometimes much less accurate than Boosting. On the other hand, Boosting can create ensembles that are less accurate than a single classifier, especially when using neural networks. Analysis indicates that the performance of the Boosting methods is dependent on the characteristics of the data set being examined. In fact, further results show that Boosting ensembles may overfit noisy data sets, thus decreasing their performance. Finally, consistent with previous studies, our work suggests that most of the gain in an ensemble's performance comes in the first few classifiers combined; however, relatively large gains can be seen up to 25 classifiers when Boosting decision trees.
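    A minimal sketch of the Bagging/Boosting comparison described above, using scikit-learn rather than the paper's original implementations; the synthetic dataset, the 25-member ensembles, and the depth-1 base learner for Boosting are illustrative assumptions, not the paper's benchmark setup.

```python
# Illustrative sketch (not the paper's exact setup): compare a single decision
# tree against Bagging and Boosting ensembles of 25 trees each.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a little label noise (flip_y), echoing the paper's point
# about Boosting's sensitivity to noisy data sets.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.05, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging (25 trees)": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
    "boosting (25 stumps)": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=25, random_state=0),
}

for name, clf in models.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()   # mean 5-fold accuracy
    print(f"{name}: {acc:.3f}")
```

    Varying n_estimators in the sketch is one way to probe the paper's observation that most of the gain arrives within the first few ensemble members.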

    Low-Default Portfolio/One-Class Classification: A Literature Review

    Consider a bank which wishes to decide whether a credit applicant will obtain credit or not. The bank has to assess whether the applicant will be able to redeem the credit. This is done by estimating the probability that the applicant will default prior to the maturity of the credit. To estimate this probability of default, it is first necessary to identify criteria which separate the good from the bad creditors, such as loan amount, age, or factors concerning the income of the applicant. The question then arises of how a bank identifies a sufficient number of selective criteria that possess the necessary discriminatory power. As a solution, many traditional binary classification methods have been proposed, with varying degrees of success. However, a particular problem with credit scoring is that defaults are only observed for a small subsample of applicants: an imbalance exists between the ratio of non-defaulters to defaulters. This has an adverse effect on the aforementioned binary classification methods. Recently, one-class classification approaches have been proposed to address the imbalance problem. The purpose of this literature review is threefold: (i) present the reader with an overview of credit scoring; (ii) review existing binary classification approaches; and (iii) introduce and examine one-class classification approaches.
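    A minimal sketch of the one-class idea discussed in the review, assuming a one-class SVM as the classifier and synthetic applicant features: the model is fit only on the majority (non-defaulting) class and treats points that fall outside the learned region as likely defaulters. The feature dimensions, class proportions and nu/gamma settings are illustrative assumptions.

```python
# Hedged sketch: one-class classification for a low-default portfolio.
# A OneClassSVM is trained only on non-defaulters; at scoring time, points
# outside the learned region are flagged as potential defaulters.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
non_defaulters = rng.normal(loc=0.0, scale=1.0, size=(950, 5))   # majority class
defaulters     = rng.normal(loc=3.0, scale=1.5, size=(50, 5))    # rare minority

occ = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
occ.fit(non_defaulters)                      # fit on the "good" applicants only

preds = occ.predict(defaulters)              # +1 = inlier (good), -1 = outlier (likely default)
print("flagged as potential defaulters:", int(np.sum(preds == -1)), "of", len(defaulters))
```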

    Principal Boundary on Riemannian Manifolds

    We consider the classification problem and focus on nonlinear methods for classification on manifolds. For multivariate datasets lying on an embedded nonlinear Riemannian manifold within a higher-dimensional ambient space, we aim to acquire a classification boundary between the labelled classes, using the intrinsic metric on the manifold. Motivated by finding an optimal boundary between the two classes, we introduce a novel approach, the principal boundary. From the perspective of classification, the principal boundary is defined as an optimal curve that moves in between the principal flows traced out from the two classes of data and, at any point on the boundary, maximizes the margin between the two classes. We estimate the boundary, together with its direction, supervised by the two principal flows. We show that the principal boundary yields the usual decision boundary found by the support vector machine, in the sense that locally the two boundaries coincide. Some optimality and convergence properties of the random principal boundary and its population counterpart are also shown. We illustrate how to find, use and interpret the principal boundary with an application to real data.
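    A minimal sketch of only the SVM side of the comparison made above: two classes sampled near concentric circles (a simple embedded manifold) and an RBF-kernel SVM whose decision boundary is the object the principal boundary is shown to coincide with locally. The data and kernel choice are assumptions, and the paper's principal-flow construction itself is not implemented here.

```python
# Hedged sketch of the SVM baseline: classes on two concentric circles and the
# RBF-SVM decision boundary that the principal boundary locally matches.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
inner = np.c_[np.cos(theta), np.sin(theta)] * 1.0 + rng.normal(0, 0.05, (200, 2))
outer = np.c_[np.cos(theta), np.sin(theta)] * 2.0 + rng.normal(0, 0.05, (200, 2))

X = np.vstack([inner, outer])
y = np.r_[np.zeros(200), np.ones(200)]

svm = SVC(kernel="rbf", C=10.0).fit(X, y)
# Points where the decision function is near zero trace out the SVM boundary,
# which here lies roughly at the mid-radius between the two circles.
print(svm.decision_function([[1.5, 0.0], [0.0, 1.5]]))
```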

    Comparing random forest and support vector machines for breast cancer classification

    There are more than 100 types of cancer around the world, with differing symptoms and with its appearance in a person difficult to predict because of its random and sudden onset. The appearance of cancer is generally marked by the growth of some abnormal cells. Someone might be diagnosed early and quickly treated, but the cancerous cells often hide in the body of the victim and reappear, only to kill the sufferer. One of the most common cancers is breast cancer. According to the Ministry of Health, in 2018 breast cancer affected 42 out of every 100,000 people in Indonesia, with approximately 17 deaths, and the Ministry recorded a yearly increase in cancer patients. There is therefore a clear need to be able to determine those affected by this disease. This study applied Boruta feature selection to determine the most important features for building a machine learning model. Random Forest (RF) and Support Vector Machines (SVM) were the machine learning models used, with highest accuracies of 90% and 95% respectively. From the results obtained, the SVM is a better model than the random forest in terms of accuracy.
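    A minimal sketch of the Boruta-then-classify pipeline described above, assuming the open-source boruta package (BorutaPy) as the Boruta implementation and scikit-learn's bundled breast cancer data as a stand-in for the study's dataset; the hyperparameters are illustrative and the code does not reproduce the reported 90%/95% accuracies.

```python
# Hedged sketch: Boruta feature selection followed by a Random Forest vs SVM
# comparison on breast cancer data.
from boruta import BorutaPy
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Boruta wraps a tree ensemble and keeps features that beat their shadow copies.
rf_for_boruta = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(rf_for_boruta, n_estimators="auto", random_state=0)
boruta.fit(X, y)
X_sel = X[:, boruta.support_]          # keep only the confirmed features

rf = RandomForestClassifier(n_estimators=200, random_state=0)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

for name, clf in [("random forest", rf), ("svm", svm)]:
    print(name, cross_val_score(clf, X_sel, y, cv=5).mean().round(3))
```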

    Manifold Matching for High-Dimensional Pattern Recognition


    A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

    Background: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology, with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in developing the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain.
    Results: In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underline the importance of sound research design in benchmarking and comparison of bioinformatics algorithms.
    Conclusion: We found that, both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines, both when no gene selection is performed and when several popular gene selection methods are used.
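    A minimal sketch of the evaluation-protocol point made above (not the paper's 22 datasets): gene selection is placed inside the cross-validation pipeline so it is re-fit on each training fold, avoiding the selection bias the paper identifies in earlier comparisons. The synthetic high-dimensional data and the choice of k=50 genes are assumptions.

```python
# Hedged sketch: unbiased comparison of SVM and Random Forest with feature
# (gene) selection nested inside cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small-sample, high-dimensional data standing in for microarray expression values.
X, y = make_classification(n_samples=100, n_features=2000, n_informative=30, random_state=0)

svm = make_pipeline(SelectKBest(f_classif, k=50), StandardScaler(), SVC(kernel="linear"))
rf  = make_pipeline(SelectKBest(f_classif, k=50), RandomForestClassifier(n_estimators=500, random_state=0))

for name, model in [("SVM", svm), ("Random Forest", rf)]:
    # Selection happens per training fold, so test folds stay untouched.
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```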

    A big data MapReduce framework for fault diagnosis in cloud-based manufacturing

    This research develops a MapReduce framework for automatic pattern recognition based on fault diagnosis, addressing the data imbalance problem in cloud-based manufacturing (CBM). Fault diagnosis in a CBM system significantly reduces product testing cost and enhances manufacturing quality. One of the major challenges facing big data analytics in cloud-based manufacturing is the handling of datasets that are highly imbalanced in nature, which leads to poor classification results when standard machine learning techniques are applied to them. The framework proposed in this research uses a hybrid approach to deal with big datasets for smarter decisions. Furthermore, we compare the performance of a radial basis function based Support Vector Machine classifier with standard techniques. Our findings suggest that the most important task in cloud-based manufacturing is to predict the effect of data errors on quality, which arises from highly imbalanced, unstructured datasets. The proposed framework is an original contribution to the body of literature, in which the MapReduce framework is used for fault detection by managing the data imbalance problem appropriately and relating it to the firm’s profit function. The experimental results are validated using a case study of steel plate manufacturing fault diagnosis, with crucial performance metrics such as accuracy, specificity and sensitivity. A comparative study shows that the methods used in the proposed framework outperform the traditional ones.
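    A minimal sketch of the classifier side only, assuming scikit-learn's SVC as the radial basis function SVM and a synthetic imbalanced dataset in place of the steel-plate data; the MapReduce layer and the profit-function linkage are not reproduced. Class weighting stands in for the framework's imbalance handling, and the reported metrics match those named in the abstract (accuracy, sensitivity, specificity).

```python
# Hedged sketch: RBF-kernel SVM with class weighting on an imbalanced fault
# dataset (~5% faults), scored with accuracy, sensitivity and specificity.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("accuracy:   ", round(accuracy_score(y_te, pred), 3))
print("sensitivity:", round(tp / (tp + fn), 3))   # recall on the faulty class
print("specificity:", round(tn / (tn + fp), 3))
```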