5 research outputs found

    Predicting sample size required for classification performance

    Get PDF
    Abstract
    Background: Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target.
    Methods: We designed and implemented a method that fits an inverse power law model to points of a given learning curve created using a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance and confidence interval for larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated using clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness-of-fit measures. As a control, we used an un-weighted fitting method.
    Results: A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean absolute and root mean squared errors below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05).
    Conclusions: This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning.
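
    As a rough illustration of the Methods paragraph above, the sketch below fits an inverse power law to a handful of learning-curve points with nonlinear weighted least squares and extrapolates to a larger sample size. The data, the exact parameterization acc(n) = a - b*n^(-c), and the inverse-variance weighting are illustrative assumptions, not the paper's published setup.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical learning-curve points: training-set sizes, observed accuracies,
# and a standard deviation per point (e.g. estimated from cross-validation folds).
sizes = np.array([50, 100, 200, 400, 800], dtype=float)
acc = np.array([0.71, 0.78, 0.83, 0.86, 0.88])
acc_sd = np.array([0.040, 0.030, 0.022, 0.015, 0.010])

def inv_power_law(n, a, b, c):
    """Assumed learning-curve model: accuracy = a - b * n**(-c)."""
    return a - b * n ** (-c)

# Weighted nonlinear least squares: curve_fit treats `sigma` as per-point
# uncertainty, so noisier early points have less influence on the fit.
params, cov = curve_fit(inv_power_law, sizes, acc, p0=[0.9, 1.0, 0.5],
                        sigma=acc_sd, absolute_sigma=True, maxfev=10000)
a, b, c = params

# Predict performance at a larger, not-yet-annotated sample size, with a
# rough 95% confidence band propagated from the parameter covariance.
n_new = 3000.0
pred = inv_power_law(n_new, a, b, c)
grad = np.array([1.0, -n_new ** (-c), b * n_new ** (-c) * np.log(n_new)])
pred_sd = np.sqrt(grad @ cov @ grad)
print(f"predicted accuracy at n={n_new:.0f}: {pred:.3f} +/- {1.96 * pred_sd:.3f}")
```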

    Improving the predictive power of AdaBoost: a case study in classifying borrowers

    No full text
    Boosting is one of the recent major developments in classification methods. The technique works by creating different versions of a classifier using an adaptive resampling procedure and then combining these classifiers using weighted voting. In this paper, several modifications of the original version of boosting, the AdaBoost algorithm introduced by Y. Freund and R.E. Schapire in 1996, will be explained. These will be shown to substantially improve the predictive power of the original version. In the first modification, the weighted error estimation in AdaBoost is replaced by unweighted error estimation, which is designed to reduce the impact of observations that carry large weights. In the second modification, only a selection of base classifiers, i.e. those that contribute significantly to the predictive power of the boosting model, will be included in the final model. In addition to these two modifications, we also utilise different classification techniques as base classifiers in order to produce the final boosting model. Applying these proposed modifications to three data sets from the banking industry provides results which indicate a significant and substantial improvement in predictive power over the original AdaBoost algorithm.
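
    The first modification described above (unweighted error estimation inside boosting) can be sketched roughly as follows; the base learner, number of rounds, and exact update rule are assumptions for illustration and may differ from the published algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_unweighted_error(X, y, n_rounds=50):
    """Toy AdaBoost variant: train stumps on weighted data, but estimate each
    stump's error WITHOUT the sample weights (the first modification above).
    Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # sample weights for fitting
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # weighted fitting, as in AdaBoost
        pred = stump.predict(X)
        err = np.mean(pred != y)                  # UNWEIGHTED error estimate
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # voting weight of this stump
        w *= np.exp(-alpha * y * pred)            # usual multiplicative reweighting
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def boost_predict(learners, alphas, X):
    """Weighted-vote prediction over all base classifiers."""
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(votes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    learners, alphas = boost_unweighted_error(X, y, n_rounds=20)
    print("training accuracy:", np.mean(boost_predict(learners, alphas, X) == y))
```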

    Predicting the relationship between the size of training sample and the predictive power of classifiers

    No full text
    The main objective of this paper is to investigate the relationship between the size of the training sample and the predictive power of well-known classification techniques. We first display this relationship using the results of some empirical studies and then propose a general mathematical model that can explain it. Next, we validated this model on several real data sets and found that it provides a good fit to the data. The model also allows a more objective determination of the optimum training sample size, in contrast to current sample size selection approaches, which tend to be ad hoc or subjective.
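
    The abstract does not state the functional form of the proposed model, but assuming a power-law learning curve such as acc(n) = a - b*n^(-c), the training sample size needed to reach a target performance can be read off by inverting the fitted curve, as in this hypothetical sketch.

```python
import numpy as np

def required_sample_size(target_acc, a, b, c):
    """Invert an assumed power-law learning curve acc(n) = a - b * n**(-c)
    to estimate the training-set size needed to reach `target_acc`.
    Returns None if the target exceeds the model's asymptote `a`."""
    if target_acc >= a:
        return None  # unreachable under this model
    return (b / (a - target_acc)) ** (1.0 / c)

# Example with made-up fitted parameters (a is the asymptotic accuracy).
a, b, c = 0.92, 1.3, 0.55
for target in (0.85, 0.88, 0.90):
    n = required_sample_size(target, a, b, c)
    print(f"target {target:.2f} -> ~{n:,.0f} training samples")
```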