Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction
For large, real-world inductive learning problems, the number of training
examples often must be limited due to the costs associated with procuring,
preparing, and storing the training examples and/or the computational costs
associated with learning from them. In such circumstances, one question of
practical importance is: if only n training examples can be selected, in what
proportion should the classes be represented? In this article we help to answer
this question by analyzing, for a fixed training-set size, the relationship
between the class distribution of the training data and the performance of
classification trees induced from these data. We study twenty-six data sets
and, for each, determine the best class distribution for learning. The
naturally occurring class distribution is shown to generally perform well when
classifier performance is evaluated using undifferentiated error rate (0/1
loss). However, when the area under the ROC curve is used to evaluate
classifier performance, a balanced distribution is shown to perform well. Since
neither of these choices for class distribution always generates the
best-performing classifier, we introduce a budget-sensitive progressive
sampling algorithm for selecting training examples based on the class
associated with each example. An empirical analysis of this algorithm shows
that the class distribution of the resulting training set yields classifiers
with good (nearly-optimal) classification performance.
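The fixed-size question posed in this abstract (given a budget of n examples, what class proportion to use) can be illustrated with a small sketch. This is not the budget-sensitive progressive sampling algorithm from the article, whose details are not given here. It only shows how one might draw a training set of size n at a chosen class proportion; the function name `sample_with_class_ratio` and its parameters are hypothetical.

```python
import random

def sample_with_class_ratio(examples, labels, n, minority_fraction,
                            minority_label=1, seed=0):
    """Draw n training examples so that roughly `minority_fraction` of them
    carry `minority_label`. Illustrative helper, not the article's algorithm."""
    rng = random.Random(seed)
    # Split example indices by class.
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_min = round(n * minority_fraction)
    n_maj = n - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough examples of one class for this ratio")
    # Sample without replacement within each class, then shuffle the result.
    chosen = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]

# Usage: from a pool that is 20% minority, draw 20 examples with a balanced split.
pool_labels = [1] * 20 + [0] * 80
pool_examples = list(range(100))
train_x, train_y = sample_with_class_ratio(pool_examples, pool_labels,
                                           n=20, minority_fraction=0.5)
```

Repeating this for several values of `minority_fraction` and comparing error rate and area under the ROC curve on a held-out set with the natural distribution would mirror the kind of comparison the abstract describes.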
Forgetting Exceptions is Harmful in Language Learning
We show that in language learning, contrary to received wisdom, keeping
exceptional training instances in memory can be beneficial for generalization
accuracy. We investigate this phenomenon empirically on a selection of
benchmark natural language processing tasks: grapheme-to-phoneme conversion,
part-of-speech tagging, prepositional-phrase attachment, and base noun phrase
chunking. In a first series of experiments we combine memory-based learning
with training set editing techniques, in which instances are edited based on
their typicality and class prediction strength. Results show that editing
exceptional instances (with low typicality or low class prediction strength)
tends to harm generalization accuracy. In a second series of experiments we
compare memory-based learning and decision-tree learning methods on the same
selection of tasks, and find that decision-tree learning often performs worse
than memory-based learning. Moreover, the decrease in performance can be linked
to the degree of abstraction from exceptions (i.e., pruning or eagerness). We
provide explanations for both results in terms of the properties of the natural
language processing tasks and the learning algorithms.
Comment: 31 pages, 7 figures, 10 tables. Uses 11pt, fullname, a4wide tex
styles. Pre-print version of article to appear in Machine Learning 11:1-3,
Special Issue on Natural Language Learning. Figures on page 22 slightly
compressed to avoid page overload.
Machine Learning from Imbalanced Data Sets 101
Invited paper for the AAAI'2000 Workshop on Imbalanced Data Sets.
For research to progress most effectively, we first should
establish common ground regarding just what is the problem that
imbalanced data sets present to machine learning systems. Why
and when should imbalanced data sets be problematic? When is
the problem simply an artifact of easily rectified design choices? I
will try to pick the low-hanging fruit and share them with the rest
of the workshop participants. Specifically, I would like to
discuss what the problem is not. I hope this will lead to a
profitable discussion of what the problem indeed is, and how it
might be addressed most effectively.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
Prediction in Financial Markets: The Case for Small Disjuncts
Predictive models in regression and classification problems typically
have a single model that covers most, if not all, cases in the data. At
the opposite end of the spectrum is a collection of models each of which
covers a very small subset of the decision space. These are referred to
as “small disjuncts.” The tradeoffs between the two types of
models have been well documented. Single models, especially linear ones,
are easy to interpret and explain. In contrast, small disjuncts do not
provide as clean or as simple an interpretation of the data, and have
been shown by several researchers to be responsible for a
disproportionately large number of errors when applied to out-of-sample
data. This research provides a counterpoint, demonstrating that
“simple” small disjuncts provide a credible model for
financial market prediction, a problem with a high degree of noise. A
related novel contribution of this paper is a simple method for
measuring the “yield” of a learning system, which is the
percentage of in-sample performance that the learned model can be
expected to realize on out-of-sample data. Curiously, such a measure is
missing from the literature on regression learning algorithms.
NYU Stern School of Business
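The "yield" described in this abstract can be sketched as a simple ratio. The abstract does not give the exact formula, so the version below is one plausible reading: out-of-sample performance expressed as a percentage of in-sample performance. The function name `yield_percent` and the choice of performance metric are assumptions.

```python
def yield_percent(in_sample_perf, out_of_sample_perf):
    """Percentage of in-sample performance realized out of sample.

    Assumed reading of the abstract's "yield"; the paper's own
    definition may differ in detail.
    """
    if in_sample_perf == 0:
        raise ValueError("in-sample performance must be non-zero")
    return 100.0 * out_of_sample_perf / in_sample_perf

# Usage: a model scoring 8.0 in sample and 6.0 out of sample
# (in whatever performance unit is being tracked).
realized = yield_percent(8.0, 6.0)
```

A value near 100 would suggest the model generalizes about as well as it fits; a much lower value would suggest that much of the in-sample performance came from fitting noise.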
Rule-based Machine Learning Methods for Functional Prediction
We describe a machine learning method for predicting the value of a
real-valued function, given the values of multiple input variables. The method
induces solutions from samples in the form of ordered disjunctive normal form
(DNF) decision rules. A central objective of the method and representation is
the induction of compact, easily interpretable solutions. This rule-based
decision model can be extended to search efficiently for similar cases prior to
approximating function values. Experimental results on real-world data
demonstrate that the new techniques are competitive with existing machine
learning and statistical methods and can sometimes yield superior regression
performance.
Comment: See http://www.jair.org/ for any accompanying files
A priori synthetic sampling for increasing classification sensitivity in imbalanced data sets
Building accurate classifiers for predicting group membership is difficult when data are skewed or imbalanced, which is typical of real-world data sets. As a result, the classifier tends to be biased towards the over-represented group. This is known as the class imbalance problem, and the bias it induces in the classifier is most pronounced when the imbalance is high. Class-imbalanced data usually also suffer from intrinsic data properties beyond imbalance alone. The problem is intensified at the larger levels of imbalance most commonly found in observational studies. Extreme cases of class imbalance are found in many domains, including fraud detection, cancer mammography, and post-term births. These rare events are usually the most costly or carry the highest level of risk, and are therefore of most interest. To combat class imbalance, the machine learning community has relied upon embedded, data preprocessing, and ensemble learning approaches. Exploratory research has linked several factors that perpetuate the issue of misclassification in class-imbalanced data. However, the relationship between the learner and imbalanced data remains poorly understood across the competing approaches. Current data preprocessing approaches are appealing because they divide the problem space in two, which allows for simpler models. However, most of these approaches have little theoretical basis, although in some cases there is empirical evidence supporting the improvement. The main goal of this research is to introduce newly proposed a priori based re-sampling methods that improve concept learning within class-imbalanced data. The results in this work highlight the robustness of these techniques' performance on publicly available data sets from different domains containing various levels of imbalance. The theoretical and empirical reasons are explored and discussed.
Training and assessing classification rules with unbalanced data
The problem of modeling binary responses by using cross-sectional data has been addressed
with a number of satisfying solutions that draw on both parametric and nonparametric
methods. However, there are many real situations where one of the two responses (usually
the more interesting one for the analysis) is rare. It has been widely reported that this class
imbalance heavily compromises the process of learning, because the model tends to focus on
the prevalent class and to ignore the rare events. However, a skewed class distribution
affects not only the estimation of the classification model but also the evaluation
of its accuracy, because the scarcity of data leads to poor estimates of the
model’s accuracy.
In this work, the effects of class imbalance on model training and model assessment are
discussed. Moreover, a unified and systematic framework for dealing with both problems is proposed, based on a smoothed bootstrap re-sampling technique.
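The general idea of a smoothed bootstrap for imbalanced data can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework proposed in the paper: it assumes each class is picked with equal probability, a Gaussian kernel, and a single fixed `bandwidth` for all features. The function name `smoothed_bootstrap` is hypothetical.

```python
import random

def smoothed_bootstrap(X, y, n_new, bandwidth=0.1, seed=0):
    """Generate n_new synthetic rows by resampling observed rows and
    perturbing them with Gaussian noise (assumed kernel and bandwidth)."""
    rng = random.Random(seed)
    # Group the observed rows by class label.
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    classes = sorted(by_class)
    new_X, new_y = [], []
    for _ in range(n_new):
        label = rng.choice(classes)           # each class equally likely
        centre = rng.choice(by_class[label])  # plain bootstrap draw
        # Smoothing step: jitter every feature around the drawn row.
        new_X.append([v + rng.gauss(0.0, bandwidth) for v in centre])
        new_y.append(label)
    return new_X, new_y

# Usage: nine majority rows and one minority row become a roughly balanced sample.
X = [[0.0, 0.0]] * 9 + [[5.0, 5.0]]
y = [0] * 9 + [1]
new_X, new_y = smoothed_bootstrap(X, y, n_new=200)
```

Because new points are drawn around observed rows rather than copied, the rare class is represented by a neighbourhood of points instead of exact duplicates. In practice the bandwidth would usually be chosen per feature and per class rather than fixed.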