14,625 research outputs found
Supervised Machine Learning Under Test-Time Resource Constraints: A Trade-off Between Accuracy and Cost
The past decade has witnessed how the field of machine learning has established itself as a necessary component in several multi-billion-dollar industries. The real-world industrial setting introduces an interesting new problem to machine learning research: computational resources must be budgeted and cost must be strictly accounted for during test-time. A typical problem is that if an application consumes x additional units of cost during test-time, but will improve accuracy by y percent, should the additional x resources be allocated? The core of this problem is a trade-off between accuracy and cost. In this thesis, we examine components of test-time cost, and develop different strategies to manage this trade-off.
We first investigate test-time cost and discover that it typically consists of two parts: feature extraction cost and classifier evaluation cost. The former reflects the computational effort of transforming data instances into feature vectors, and can be highly variable when features are heterogeneous. The latter reflects the effort of evaluating a classifier, which can be substantial, in particular for nonparametric algorithms. We then propose three strategies to explicitly trade off accuracy against the two components of test-time cost during classifier training.
To budget the feature extraction cost, we first introduce two algorithms: GreedyMiser and Anytime Representation Learning (AFR). GreedyMiser employs a strategy that incorporates the extraction cost information during classifier training to explicitly minimize the test-time cost. AFR extends GreedyMiser to learn a cost-sensitive feature representation rather than a classifier, and turns traditional Support Vector Machines (SVM) into test-time cost-sensitive anytime classifiers. GreedyMiser and AFR are evaluated on two real-world data sets from two different application domains, and both achieve record performance.
We then introduce Cost-Sensitive Tree of Classifiers (CSTC) and Cost-Sensitive Cascade of Classifiers (CSCC), which share a common strategy that trades off accuracy against the amortized test-time cost. CSTC introduces a tree structure and directs test inputs along different tree-traversal paths, each optimized for a specific sub-partition of the input space and extracting a different, specialized subset of features. CSCC extends CSTC and builds a linear cascade, instead of a tree, to cope with class-imbalanced binary classification tasks. Since both CSTC and CSCC extract different features for different inputs, the amortized test-time cost is greatly reduced while high accuracy is maintained. Both approaches outperform the current state of the art on real-world data sets.
To trade off accuracy against the high classifier evaluation cost of nonparametric classifiers, we propose a model-compression strategy and develop Compressed Vector Machines (CVM). CVM focuses on nonparametric kernel Support Vector Machines (SVM), whose test-time evaluation cost is typically substantial when they are learned from large training sets. CVM is a post-processing algorithm that compresses the learned SVM model by reducing the number of support vectors and optimizing those that remain. On several benchmark data sets, CVM maintains high test accuracy while reducing the test-time evaluation cost by several orders of magnitude.
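The accuracy/cost trade-off at the heart of the thesis can be sketched in a few lines: a toy greedy feature selector (not GreedyMiser itself) that only adds a feature when its accuracy gain outweighs a price per unit of hypothetical extraction cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points, 4 features; only features 0 and 2 are informative.
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 4))
X[:, 0] += 2.0 * y                       # cheap, informative
X[:, 2] += 1.5 * y                       # expensive, informative
costs = np.array([1.0, 1.0, 8.0, 1.0])   # hypothetical extraction costs

def accuracy(feats):
    """Nearest-centroid training accuracy using only the selected features."""
    if not feats:
        return 0.5
    Xs = X[:, feats]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xs - c1, axis=1)
            < np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return (pred == y).mean()

lam = 0.02                               # price of one cost unit, in accuracy
selected, remaining = [], list(range(4))
while remaining:
    gains = [accuracy(selected + [f]) - accuracy(selected) - lam * costs[f]
             for f in remaining]
    best = int(np.argmax(gains))
    if gains[best] <= 0:
        break                            # no feature is worth its cost any more
    selected.append(remaining.pop(best))

print("selected features:", selected)
print("total cost:", costs[selected].sum(), "accuracy:", accuracy(selected))
```

With these made-up costs, the expensive informative feature is typically skipped because its marginal accuracy gain does not cover its price, which is exactly the budgeting behaviour the thesis formalizes.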
Efficient Learning by Directed Acyclic Graph For Resource Constrained Prediction
We study the problem of reducing test-time acquisition costs in
classification systems. Our goal is to learn decision rules that adaptively
select sensors for each example as necessary to make a confident prediction. We
model our system as a directed acyclic graph (DAG) where internal nodes
correspond to sensor subsets and decision functions at each node choose whether
to acquire a new sensor or classify using the available measurements. This
problem can be naturally posed as an empirical risk minimization over training
data. Rather than jointly optimizing such a highly coupled and non-convex
problem over all decision nodes, we propose an efficient algorithm motivated by
dynamic programming. We learn node policies in the DAG by reducing the global
objective to a series of cost sensitive learning problems. Our approach is
computationally efficient and has proven guarantees of convergence to the
optimal system for a fixed architecture. In addition, we present an extension
that maps other budgeted learning problems with a large number of sensors onto
our DAG architecture, and we demonstrate empirical performance exceeding
state-of-the-art algorithms on data composed of both few and many sensors.
Comment: To appear in NIPS 201
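The adaptive idea here, paying for another sensor only while the prediction is not yet confident, can be illustrated with a hand-rolled stopping rule (a hypothetical fixed policy, not the learned DAG from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: 3 "sensors", each a noisy reading of a binary label,
# with later sensors being less noisy.
n_sensors, cost_per_sensor = 3, 1.0
y = rng.integers(0, 2, 500)
readings = y[:, None] + rng.normal(scale=[1.5, 1.0, 0.5], size=(500, n_sensors))

def classify_adaptively(x, margin=0.35):
    """Acquire sensors one by one; stop once the running mean of the
    readings is far enough from the 0/1 decision boundary at 0.5.
    A hand-rolled stopping rule, not the paper's learned node policies."""
    acquired = []
    for s in range(n_sensors):
        acquired.append(x[s])              # pay for one more sensor
        score = np.mean(acquired)          # crude evidence for class 1
        if abs(score - 0.5) > margin:      # confident enough: stop early
            break
    return int(score > 0.5), len(acquired) * cost_per_sensor

preds, spend = zip(*(classify_adaptively(r) for r in readings))
print("accuracy     :", np.mean(np.array(preds) == y))
print("average cost :", np.mean(spend), "of max", n_sensors * cost_per_sensor)
```

Most examples stop after one or two sensors, so the average acquisition cost falls well below the full-sensing budget; the paper's contribution is learning such decision functions jointly rather than hand-tuning them.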
Sensor Selection by Linear Programming
We learn sensor trees from training data to minimize sensor acquisition costs
during test time. Our system adaptively selects sensors at each stage if
necessary to make a confident classification. We pose the problem as empirical
risk minimization over the choice of trees and node decision rules. We
decompose the problem, which is known to be intractable, into combinatorial
(tree structures) and continuous parts (node decision rules) and propose to
solve them separately. Using training data we greedily solve for the
combinatorial tree structures and for the continuous part, which is a
non-convex multilinear objective function, we derive convex surrogate loss
functions that are piecewise linear. The resulting problem can be cast as a
linear program and has the advantage of guaranteed convergence, global
optimality, repeatability, and computational efficiency. We show that our
proposed approach outperforms the state of the art on a number of benchmark
datasets.
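The core convexification trick, replacing a non-convex objective with a piecewise-linear surrogate so the problem becomes a linear program, can be sketched for a single linear decision rule (a generic hinge-loss LP, not the paper's tree objective; `scipy` is assumed available):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)

# Toy separable 2-D data with labels in {-1, +1}.
n, d = 40, 2
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

# Hinge loss is a piecewise-linear convex surrogate for 0/1 loss, so
# minimising it over a linear rule is a linear program:
#   min (1/n) sum(s)   s.t.   s_i >= 1 - y_i (w.x_i + b),   s_i >= 0
# Decision variables are [w (d entries), b, s (n entries)].
c = np.concatenate([np.zeros(d + 1), np.ones(n) / n])
A = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
ub = -np.ones(n)
bounds = [(None, None)] * (d + 1) + [(0, None)] * n
res = linprog(c, A_ub=A, b_ub=ub, bounds=bounds)

w, b = res.x[:d], res.x[d]
acc = (np.sign(X @ w + b) == y).mean()
print("LP solved:", res.success, "training accuracy:", acc)
```

Because the surrogate is linear in the decision variables, the solver returns a global optimum deterministically, which is the repeatability and global-optimality advantage the abstract claims.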
Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction
For large, real-world inductive learning problems, the number of training
examples often must be limited due to the costs associated with procuring,
preparing, and storing the training examples and/or the computational costs
associated with learning from them. In such circumstances, one question of
practical importance is: if only n training examples can be selected, in what
proportion should the classes be represented? In this article we help to answer
this question by analyzing, for a fixed training-set size, the relationship
between the class distribution of the training data and the performance of
classification trees induced from these data. We study twenty-six data sets
and, for each, determine the best class distribution for learning. The
naturally occurring class distribution is shown to generally perform well when
classifier performance is evaluated using undifferentiated error rate (0/1
loss). However, when the area under the ROC curve is used to evaluate
classifier performance, a balanced distribution is shown to perform well. Since
neither of these choices for class distribution always generates the
best-performing classifier, we introduce a budget-sensitive progressive
sampling algorithm for selecting training examples based on the class
associated with each example. An empirical analysis of this algorithm shows
that the class distribution of the resulting training set yields classifiers
with good (nearly optimal) classification performance.
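The budget-sensitive progressive-sampling idea can be sketched as follows: grow the training set in batches, each time drawing the batch from whichever class most improves a held-out score (a toy 1-D stand-in, not the article's algorithm):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy pools of 1-D examples per class and a held-out validation set.
pool = {0: rng.normal(0.0, 1.0, 2000), 1: rng.normal(1.5, 1.0, 2000)}
val_x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])
val_y = np.repeat([0, 1], 500)

def val_accuracy(train):
    """Midpoint-threshold classifier fit on the current training set."""
    thr = (train[0].mean() + train[1].mean()) / 2.0
    return ((val_x > thr).astype(int) == val_y).mean()

budget, batch = 200, 20
train = {c: pool[c][:10].copy() for c in (0, 1)}   # small seed sample
used = {c: 10 for c in (0, 1)}

while used[0] + used[1] + batch <= budget:
    # Try adding one batch from each class; keep whichever helps more.
    scores = {}
    for c in (0, 1):
        trial = dict(train)
        trial[c] = pool[c][: used[c] + batch]
        scores[c] = val_accuracy(trial)
    c = max(scores, key=scores.get)
    used[c] += batch
    train[c] = pool[c][: used[c]]

print("final class counts:", used, "val accuracy:", val_accuracy(train))
```

The final class proportions need not match the natural distribution; they are whatever the validation signal rewarded within the budget, which is the article's point.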
Classification of Imbalanced Data with a Geometric Digraph Family
We use a geometric digraph family called class cover catch digraphs (CCCDs)
to tackle the class imbalance problem in statistical classification. CCCDs
provide graph theoretic solutions to the class cover problem and have been
employed in classification. We assess the classification performance of CCCD
classifiers by extensive Monte Carlo simulations, comparing them with other
classifiers commonly used in the literature. In particular, we show that CCCD
classifiers perform relatively well when one class is more frequent than the
other in a two-class setting, an example of the class imbalance problem. We
also point out the relationship between the class imbalance and class overlap
problems, and their influence on the performance of CCCD classifiers, other
classification methods, and some state-of-the-art algorithms that are robust to
class imbalance by construction. Experiments on both simulated and
real data sets indicate that CCCD classifiers are robust to the class imbalance
problem. CCCDs substantially undersample the majority class while preserving
information about the discarded points during the undersampling process. Many
state-of-the-art methods retain this information by means of ensemble
classifiers, whereas CCCDs yield a single classifier with the same property,
making them both appealing and fast.
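The class cover idea behind CCCDs can be sketched as a greedy set cover with "pure" balls: each candidate ball is centred on a target-class point, with radius bounded by the distance to the nearest point of the other class (a simplified illustration, not a full CCCD classifier):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 2-D data: a compact minority class inside a spread-out majority class.
minority = rng.normal(0.0, 0.5, size=(30, 2))
majority = rng.normal(0.0, 2.0, size=(300, 2))

def greedy_class_cover(target, other):
    """Greedy class cover: each ball is centred on a target point with
    radius up to its nearest other-class point (so balls stay pure);
    repeatedly pick the ball covering the most uncovered target points."""
    d_other = np.linalg.norm(target[:, None] - other[None, :], axis=2)
    radius = d_other.min(axis=1)              # purity constraint per ball
    d_self = np.linalg.norm(target[:, None] - target[None, :], axis=2)
    covers = d_self < radius[:, None]         # covers[i, j]: ball i covers j
    uncovered = np.ones(len(target), dtype=bool)
    centers = []
    while uncovered.any():
        i = int((covers & uncovered).sum(axis=1).argmax())
        centers.append(i)
        uncovered &= ~covers[i]
    return centers

centers = greedy_class_cover(minority, majority)
print(f"{len(centers)} balls cover all {len(minority)} minority points")
```

The selected centres are the "kept" points; all other same-class points are summarized by the balls that cover them, which mirrors the information-preserving undersampling described above.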
Benchmark of structured machine learning methods for microbial identification from mass-spectrometry data
Microbial identification is a central issue in microbiology, in particular in
the fields of infectious diseases diagnosis and industrial quality control. The
concept of species is tightly linked to the concept of biological and clinical
classification where the proximity between species is generally measured in
terms of evolutionary distances and/or clinical phenotypes. Surprisingly, the
information provided by this well-known hierarchical structure is rarely used
by machine learning-based automatic microbial identification systems.
Structured machine learning methods were recently proposed to take into
account the structure embedded in a hierarchy and use it as additional a
priori information, and could therefore improve microbial
identification systems. We test and compare several state-of-the-art machine
learning methods for microbial identification on a new Matrix-Assisted Laser
Desorption/Ionization Time-of-Flight mass spectrometry (MALDI-TOF MS) dataset.
We include in the benchmark both standard methods and structured methods that
leverage knowledge of the underlying hierarchical structure in the learning process. Our
results show that although some methods perform better than others, structured
methods do not consistently perform better than their "flat" counterparts. We
postulate that this is partly due to the fact that standard methods already
reach a high level of accuracy in this context, and that they mainly confuse
species that are close to each other in the tree, a case where using the known
hierarchy is not helpful.
An empirical evaluation of imbalanced data strategies from a practitioner's point of view
This research tested the following well known strategies to deal with binary
imbalanced data on 82 different real life data sets (sampled to imbalance rates
of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline
(just the base classifier). As base classifiers we used SVM with RBF kernel,
random forests, and gradient boosting machines, and we measured the quality of
the resulting classifiers using six different metrics (area under the ROC curve,
accuracy, F-measure, G-mean, Matthews correlation coefficient, and balanced
accuracy). The best strategy strongly depends on the metric used to measure the
quality of the classifier. For AUC and accuracy, class weight and the baseline
perform better; for F-measure and MCC, SMOTE performs better; and for G-mean
and balanced accuracy, Underbagging performs better.
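Of the strategies compared, SMOTE has the simplest core: interpolate each sampled minority point toward one of its k nearest minority neighbours. A minimal sketch (basic SMOTE only, without the boundary-aware variants):

```python
import numpy as np

rng = np.random.default_rng(4)

def smote(X_min, n_new, k=5):
    """Generate synthetic minority samples by interpolating each picked
    point toward one of its k nearest minority neighbours (basic SMOTE)."""
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]     # k-NN within the class
    synth = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()                        # position on the segment
        synth[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synth

X_min = rng.normal(5.0, 1.0, size=(20, 2))        # toy minority class
X_new = smote(X_min, n_new=80)
print("synthetic samples:", X_new.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the minority region rather than simply duplicating existing examples.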
CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification
Class-imbalance classification is a challenging research problem in data
mining and machine learning, as most real-life datasets are imbalanced in
nature. Existing learning algorithms maximise classification accuracy by
correctly classifying the majority class, but misclassify the minority class.
However, in real-life applications the minority class instances represent the
concept of greater interest. Recently, several techniques based on sampling
methods (under-sampling the majority class and over-sampling the minority
class), cost-sensitive learning methods, and ensemble learning have been used
in the literature for classifying imbalanced datasets. In this paper, we
introduce a new clustering-based under-sampling approach combined with boosting
(AdaBoost), called CUSBoost, for effective imbalanced classification. The
proposed algorithm provides an alternative to the RUSBoost (random
under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling
with AdaBoost) algorithms. We evaluated the performance of the CUSBoost
algorithm against state-of-the-art ensemble-learning methods such as AdaBoost,
RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with
various imbalance ratios. The experimental results show that CUSBoost is a
promising and effective approach for dealing with highly imbalanced datasets.
Comment: CSITSS-201
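The cluster-based under-sampling step, which is what distinguishes CUSBoost from random under-sampling, can be sketched as follows (k-means is hand-rolled for self-containment; the AdaBoost wrapper is omitted):

```python
import numpy as np

rng = np.random.default_rng(5)

def cluster_undersample(X_maj, n_keep, k=5, iters=20):
    """Cluster the majority class with a small k-means, then draw an equal
    share of samples from each cluster, so the kept subset spans all regions
    of the majority class (the idea behind cluster-based under-sampling)."""
    centers = X_maj[rng.choice(len(X_maj), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(X_maj[:, None] - centers[None, :], axis=2), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X_maj[labels == c].mean(axis=0)
    labels = np.argmin(
        np.linalg.norm(X_maj[:, None] - centers[None, :], axis=2), axis=1)
    keep, per_cluster = [], n_keep // k
    for c in range(k):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        keep.extend(rng.choice(members, take, replace=False))
    return np.array(keep)

X_maj = rng.normal(size=(500, 2))                 # toy majority class
idx = cluster_undersample(X_maj, n_keep=50)
print("kept", len(idx), "of", len(X_maj), "majority examples")
```

Sampling per cluster, rather than uniformly at random, keeps representatives from every dense region of the majority class, which is why it can lose less information than plain random under-sampling.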
Soft Methodology for Cost-and-error Sensitive Classification
Many real-world data mining applications involve different costs for different
types of classification errors, and thus call for cost-sensitive classification
algorithms. Existing algorithms for cost-sensitive classification are
successful in minimizing the cost, but can incur a high error rate
as the trade-off. The high error rate holds back the practical use of those
algorithms. In this paper, we propose a novel cost-sensitive classification
methodology that takes both the cost and the error rate into account. The
methodology, called soft cost-sensitive classification, is established from a
multicriteria optimization problem of the cost and the error rate, and can be
viewed as regularizing cost-sensitive classification with the error rate. The
simple methodology allows immediate improvements of existing cost-sensitive
classification algorithms. Experiments on benchmark and real-world data
sets show that our proposed methodology indeed achieves lower test error rates
and similar (sometimes lower) test costs than existing cost-sensitive
classification algorithms. We also demonstrate that the methodology can be
extended to consider a weighted error rate instead of the original error
rate. This extension is useful for tackling unbalanced classification problems.
Comment: A shorter version appeared in KDD '1
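At prediction time, the soft trade-off amounts to blending two cost matrices. A decision-rule sketch (the cost matrix and posterior are invented for illustration; the paper's actual contribution is on the training side):

```python
import numpy as np

# Example cost matrix: C[y, yhat] = cost of predicting yhat when truth is y.
# Here missing class 1 (e.g. a disease) is ten times worse than a false alarm.
C = np.array([[0.0, 1.0],
              [10.0, 0.0]])
E = 1.0 - np.eye(2)          # plain 0/1 error as a "cost matrix"

def soft_cost_sensitive_predict(p, alpha=0.5):
    """Pick the class minimising a blend of expected cost and expected
    error: a decision-rule illustration of regularising cost-sensitive
    classification with the error rate."""
    blended = (1 - alpha) * (p @ C) + alpha * (p @ E)
    return int(np.argmin(blended))

p = np.array([0.8, 0.2])     # posterior: fairly sure it is class 0
print("cost-only  :", soft_cost_sensitive_predict(p, alpha=0.0))   # 1
print("error-only :", soft_cost_sensitive_predict(p, alpha=1.0))   # 0
```

With alpha = 0 the rule minimises expected cost alone (predicting the rare expensive class even at 20% posterior), and with alpha = 1 it reduces to ordinary error minimisation; intermediate alpha interpolates between the two behaviours.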
The Impact of Automated Parameter Optimization on Defect Prediction Models
Defect prediction models---classifiers that identify defect-prone software
modules---have configurable parameters that control their characteristics
(e.g., the number of trees in a random forest). Recent studies show that these
classifiers underperform when default settings are used. In this paper, we
study the impact of automated parameter optimization on defect prediction
models. Through a case study of 18 datasets, we find that automated parameter
optimization: (1) improves AUC performance by up to 40 percentage points; (2)
yields classifiers that are at least as stable as those trained using default
settings; (3) substantially shifts the importance ranking of variables, with as
few as 28% of the top-ranked variables in optimized classifiers also being
top-ranked in non-optimized classifiers; (4) yields optimized settings for 17
of the 20 most sensitive parameters that transfer among datasets without a
statistically significant drop in performance; and (5) adds less than 30
minutes of additional computation to 12 of the 26 studied classification
techniques. While widely-used classification techniques like random forest and
support vector machines are not optimization-sensitive, traditionally
overlooked techniques like C5.0 and neural networks can actually outperform
widely-used techniques after optimization is applied. This highlights the
importance of exploring the parameter space when using parameter-sensitive
classification techniques.
Comment: 32 pages, accepted at IEEE Transactions on Software Engineering
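Automated parameter optimization of the kind studied here can be as simple as random search with a holdout score. A toy sketch tuning the bandwidth of a kernel-density classifier, where the classifier and search range are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy 1-D, two-class data and a random 70/30 train/holdout split.
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(2, 1, 300)])
y = np.repeat([0, 1], 300)
tr = rng.random(600) < 0.7

def kde_predict(x, X0, X1, h):
    """Classify by comparing Gaussian kernel-density scores per class."""
    k0 = np.exp(-((x[:, None] - X0[None, :]) / h) ** 2).mean(axis=1)
    k1 = np.exp(-((x[:, None] - X1[None, :]) / h) ** 2).mean(axis=1)
    return (k1 > k0).astype(int)

best_h, best_acc = None, -1.0
for _ in range(30):                      # random search: 30 trials
    h = 10 ** rng.uniform(-2, 1)         # log-uniform bandwidth in [0.01, 10]
    pred = kde_predict(X[~tr], X[tr & (y == 0)], X[tr & (y == 1)], h)
    acc = (pred == y[~tr]).mean()
    if acc > best_acc:
        best_h, best_acc = h, acc

print(f"best bandwidth {best_h:.3f}, holdout accuracy {best_acc:.3f}")
```

Even this naive search illustrates the paper's point: a parameter-sensitive classifier's performance varies widely across the search range, so evaluating a modest number of candidate settings is cheap insurance against bad defaults.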