The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial
In this tutorial paper, we first define mean squared error, variance,
covariance, and bias of both random variables and classification/predictor
models. Then, we formulate the true and generalization errors of the model for
both training and validation/test instances where we make use of the Stein's
Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and
generalization using the obtained true and generalization errors. We introduce
cross validation and two well-known examples which are $K$-fold and
leave-one-out cross validations. We briefly introduce generalized cross
validation and then move on to regularization where we use the SURE again. We
work on both $\ell_2$ and $\ell_1$ norm regularizations. Then, we show that
bootstrap aggregating (bagging) reduces the variance of estimation. Boosting,
specifically AdaBoost, is introduced and it is explained as both an additive
model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper
bound on the generalization error of boosting is also provided to show why
boosting resists overfitting. As examples of regularization, the theory
of ridge and lasso regressions, weight decay, noise injection to input/weights,
and early stopping are explained. Random forest, dropout, histogram of oriented
gradients, and single shot multi-box detector are explained as examples of
bagging in machine learning and computer vision. Finally, boosting tree and SVM
models are mentioned as examples of boosting. Comment: 23 pages, 9 figures
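The cross validation and ridge regularization summarized in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the one-parameter model without intercept, the synthetic data, the candidate penalties, and the helper names `k_fold_indices`, `ridge_1d`, and `cv_mse` are all assumptions made for illustration.

```python
import random

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x (no intercept): w = <x, y> / (<x, x> + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_mse(xs, ys, lam, k):
    """Mean squared error on held-out folds, averaged over the k folds."""
    n = len(xs)
    total = 0.0
    for fold in k_fold_indices(n, k):
        held_out = set(fold)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = ridge_1d(train_x, train_y, lam)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in fold) / len(fold)
    return total / k

# Synthetic data (assumption): true slope 2.0 with Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]

# Pick the penalty with the lowest cross-validated error.
scores = {lam: cv_mse(xs, ys, lam, k=5) for lam in (0.0, 0.1, 1.0, 10.0, 100.0)}
best_lam = min(scores, key=scores.get)
```

Setting `k` equal to the number of samples turns the same function into leave-one-out cross validation.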
Tackling Uncertainties and Errors in the Satellite Monitoring of Forest Cover Change
This study aims at improving the reliability of automatic forest change detection. Forest change detection is of vital importance for understanding global land cover as well as the carbon cycle. Remote sensing and machine learning have been widely adopted for such studies with increasing degrees of success. However, contemporary global studies still suffer from lower-than-satisfactory accuracies and robustness problems whose causes were largely unknown.
Global geographical observations are complex, as a result of the hidden interweaving geographical processes. Is it possible that some geographical complexities were not expected in contemporary machine learning? Could they cause uncertainties and errors when contemporary machine learning theories are applied for remote sensing?
This dissertation adopts the philosophy of error elimination. We start by explaining the mathematical origins of possible geographic uncertainties and errors in chapter two. Uncertainties are unavoidable but might be mitigated. Errors are hidden but might be found and corrected. Then in chapter three, experiments are specifically designed to assess whether or not the contemporary machine learning theories can handle these geographic uncertainties and errors. In chapter four, we identify an unreported systemic error source: the proportion distribution of classes in the training set. A subsequent Bayesian Optimal solution is designed to combine Support Vector Machine and Maximum Likelihood. Finally, in chapter five, we demonstrate how this type of error is widespread not just in classification algorithms, but also embedded in the conceptual definition of geographic classes before the classification. In chapter six, the sources of errors and uncertainties and their solutions are summarized, with theoretical implications for future studies.
The most important finding is that how we design a classification largely predetermines what we eventually get out of it. This applies to many popular contemporary classifiers, including various types of neural nets, decision trees, and support vector machines. This is a cause of the so-called overfitting problem in contemporary machine learning. Therefore, we propose that the emphasis of classification work be shifted to the planning stage before the actual classification. Geography should not just be the analysis of collected observations, but also the planning of observation collection. This is where geography, machine learning, and survey statistics meet.
Combining Prior Knowledge and Data: Beyond the Bayesian Framework
For many tasks such as text categorization and control of robotic systems, state-of-the-art learning systems can produce results comparable in accuracy to those of human subjects. However, the amount of training data needed for such systems can be prohibitively large for many practical problems. A text categorization system, for example, may need to see many text postings manually tagged with their subjects before it learns to predict the subject of the next posting with high accuracy. A reinforcement learning (RL) system learning how to drive a car needs a lot of experimentation with the actual car before acquiring the optimal policy. An optimizing compiler targeting a certain platform has to construct, compile, and execute many versions of the same code with different optimization parameters to determine which optimizations work best. Such extensive sampling can be time-consuming, expensive (in terms of both the human expertise needed to label data and the wear and tear on the robotic equipment used for exploration in the case of RL), and sometimes dangerous (e.g., an RL agent driving the car off the cliff to see if it survives the crash). The goal of this work is to reduce the amount of training data an agent needs in order to learn how to perform a task successfully. This is done by providing the system with prior knowledge about its domain. The knowledge is used to bias the agent towards useful solutions and limit the amount of training needed.
We explore this task in three contexts: classification (determining the subject of a newsgroup posting), control (learning to perform tasks such as driving a car up the mountain in simulation), and optimization (optimizing performance of linear algebra operations on different hardware platforms). For the text categorization problem, we introduce a novel algorithm which efficiently integrates prior knowledge into large margin classification. We show that prior knowledge simplifies the problem by reducing the size of the hypothesis space. We also provide formal convergence guarantees for our algorithm. For reinforcement learning, we introduce a novel framework for defining planning problems in terms of qualitative statements about the world (e.g., "the faster the car is going, the more likely it is to reach the top of the mountain"). We present an algorithm based on policy iteration for solving such qualitative problems and prove its convergence. We also present an alternative framework which allows the user to specify prior knowledge quantitatively in the form of a Markov Decision Process (MDP). This prior is used to focus exploration on those regions of the world in which the optimal policy is most sensitive to perturbations in transition probabilities and rewards. Finally, in the compiler optimization problem, the prior is based on an analytic model which determines good optimization parameters for a given platform. This model defines a Bayesian prior which, combined with empirical samples (obtained by measuring the performance of optimized code segments), determines the maximum-a-posteriori estimate of the optimization parameters.
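How a model-based prior and empirical samples combine into a maximum-a-posteriori estimate can be shown with a small Python sketch. The abstract does not describe the actual prior or likelihood, so everything here is an assumption: a single scalar parameter, a Gaussian prior centred on the analytic model's suggestion, Gaussian measurement noise with known variance, and the hypothetical function name `gaussian_map`.

```python
def gaussian_map(prior_mean, prior_var, samples, noise_var):
    """MAP estimate of a scalar parameter under a Gaussian prior and Gaussian noise.

    Assumed model (not from the source): theta ~ N(prior_mean, prior_var) and
    each sample ~ N(theta, noise_var). The posterior is Gaussian, so its mode
    is the precision-weighted average of the prior mean and the samples.
    """
    n = len(samples)
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + sum(samples) / noise_var) / precision

# With no measurements the estimate is the prior mean.
no_data = gaussian_map(8.0, 4.0, [], 4.0)

# Four measurements of 12.0 pull the estimate most of the way to the data.
with_data = gaussian_map(8.0, 4.0, [12.0, 12.0, 12.0, 12.0], 4.0)
```

Under these assumptions the prior behaves like extra pseudo-observations, which is one way to see why a good analytic model reduces the number of measurements needed.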
Dynamic Bayesian Combination of Multiple Imperfect Classifiers
Classifier combination methods need to make best use of the outputs of
multiple, imperfect classifiers to enable higher accuracy classifications. In
many situations, such as when human decisions need to be combined, the base
decisions can vary enormously in reliability. A Bayesian approach to such
uncertain combination allows us to infer the differences in performance between
individuals and to incorporate any available prior knowledge about their
abilities when training data is sparse. In this paper we explore Bayesian
classifier combination, using the computationally efficient framework of
variational Bayesian inference. We apply the approach to real data from a large
citizen science project, Galaxy Zoo Supernovae, and show that our method far
outperforms other established approaches to imperfect decision combination. We
go on to analyse the putative community structure of the decision makers, based
on their inferred decision making strategies, and show that natural groupings
are formed. Finally we present a dynamic Bayesian classifier combination
approach and investigate the changes in base classifier performance over time. Comment: 35 pages, 12 figures
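The idea of weighting base decisions by their reliability can be illustrated with a much simpler model than the one in this paper. The Python sketch below is a plain naive Bayes combination with known per-classifier confusion matrices. It does not implement variational inference or the dynamic extension, and the function name `combine` and the example numbers are assumptions.

```python
import math

def combine(prior, confusions, votes):
    """Posterior over the true class given votes from independent base classifiers.

    prior[j]            : P(true class = j)
    confusions[k][j][c] : P(classifier k outputs c | true class = j), assumed > 0
    votes[k]            : label output by classifier k, or None if it abstained
    """
    log_post = [math.log(p) for p in prior]
    for conf, vote in zip(confusions, votes):
        if vote is None:
            continue  # a missing decision carries no evidence
        for j in range(len(prior)):
            log_post[j] += math.log(conf[j][vote])
    # Normalise in a numerically stable way.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]

reliable = [[0.9, 0.1], [0.1, 0.9]]  # right 90% of the time
noisy = [[0.6, 0.4], [0.4, 0.6]]     # right 60% of the time

# The two classifiers disagree; the reliable one dominates the posterior.
posterior = combine([0.5, 0.5], [reliable, noisy], [1, 0])
```

In this sketch the confusion matrices are given. The approach described in the abstract instead infers each decision maker's reliability from the data.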
Population structure-learned classifier for high-dimension low-sample-size class-imbalanced problem
Classification of high-dimension low-sample-size (HDLSS) data is a
challenging problem, and class-imbalanced data are common in most
application fields. We term this setting Imbalanced HDLSS (IHDLSS). Recent
theoretical results reveal that the classification criterion and tolerance
similarity are crucial to HDLSS, which emphasizes the maximization of
within-class variance on the premise of class separability. Based on this idea,
a novel linear binary classifier, termed Population Structure-learned
Classifier (PSC), is proposed. The proposed PSC can obtain better
generalization performance on IHDLSS by maximizing the sum of inter-class
scatter matrix and intra-class scatter matrix on the premise of class
separability and assigning different intercept values to majority and minority
classes. The salient features of the proposed approach are: (1) It works well
on IHDLSS; (2) The inverse of a high-dimensional matrix can be computed in a
low-dimensional space; (3) It is self-adaptive in determining the intercept term
for each class; (4) It has the same computational complexity as the SVM. A
series of evaluations are conducted on one simulated data set and eight
real-world benchmark data sets on IHDLSS on gene analysis. Experimental results
demonstrate that the PSC is superior to the state-of-the-art methods in IHDLSS. Comment: 41 pages, 10 figures, 10 tables
Minimum Density Hyperplanes
Associating distinct groups of objects (clusters) with contiguous regions of
high probability density (high-density clusters), is central to many
statistical and machine learning approaches to the classification of unlabelled
data. We propose a novel hyperplane classifier for clustering and
semi-supervised classification which is motivated by this objective. The
proposed minimum density hyperplane minimises the integral of the empirical
probability density function along it, thereby avoiding intersection with high
density clusters. We show that the minimum density and the maximum margin
hyperplanes are asymptotically equivalent, thus linking this approach to
maximum margin clustering and semi-supervised support vector classifiers. We
propose a projection pursuit formulation of the associated optimisation problem
which allows us to find minimum density hyperplanes efficiently in practice,
and evaluate its performance on a range of benchmark datasets. The proposed
approach is found to be very competitive with state of the art methods for
clustering and semi-supervised classification.
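The quantity being minimised can be made concrete with a short Python sketch. For an isotropic Gaussian kernel density estimate, the integral of the density over the hyperplane {x : v.x = b} with unit normal v equals the one-dimensional kernel estimate of the projected data evaluated at b. The sketch fixes v and searches only over the offset b. The bandwidth, the restriction of b to within `alpha` standard deviations of the projected mean, the grid search, and all function names are assumptions, and the projection pursuit over directions is omitted.

```python
import math
import random

def project(v, x):
    """Inner product of the unit normal v with the point x."""
    return sum(vi * xi for vi, xi in zip(v, x))

def hyperplane_density(points, v, b, h):
    """Integral of an isotropic Gaussian KDE (bandwidth h) over {x : v.x = b}."""
    scale = 1.0 / (len(points) * h * math.sqrt(2.0 * math.pi))
    return scale * sum(
        math.exp(-0.5 * ((b - project(v, x)) / h) ** 2) for x in points
    )

def best_offset(points, v, h, alpha=1.0, steps=200):
    """Grid search for the offset b with the lowest density on the hyperplane.

    b is restricted to mean +/- alpha * std of the projections (an assumed
    constraint) so the search cannot escape into the empty tails.
    """
    proj = [project(v, x) for x in points]
    mean = sum(proj) / len(proj)
    std = math.sqrt(sum((p - mean) ** 2 for p in proj) / len(proj))
    lo, hi = mean - alpha * std, mean + alpha * std
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda b: hyperplane_density(points, v, b, h))

# Two synthetic clusters (assumption) centred at x = -2 and x = +2.
rng = random.Random(0)
points = [(rng.gauss(-2.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(50)]
points += [(rng.gauss(2.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(50)]

v = (1.0, 0.0)
b_star = best_offset(points, v, h=0.3)
```

The selected offset falls in the sparse gap between the two clusters, which is the behaviour the abstract describes as avoiding intersection with high-density clusters.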
kLog: A Language for Logical and Relational Learning with Kernels
We introduce kLog, a novel approach to statistical relational learning.
Unlike standard approaches, kLog does not represent a probability distribution
directly. It is rather a language to perform kernel-based learning on
expressive logical and relational representations. kLog allows users to specify
learning problems declaratively. It builds on simple but powerful concepts:
learning from interpretations, entity/relationship data modeling, logic
programming, and deductive databases. Access by the kernel to the rich
representation is mediated by a technique we call graphicalization: the
relational representation is first transformed into a graph --- in particular,
a grounded entity/relationship diagram. Subsequently, a choice of graph kernel
defines the feature space. kLog supports mixed numerical and symbolic data, as
well as background knowledge in the form of Prolog or Datalog programs as in
inductive logic programming systems. The kLog framework can be applied to
tackle the same range of tasks that has made statistical relational learning so
popular, including classification, regression, multitask learning, and
collective classification. We also report empirical comparisons, showing
that kLog can be either more accurate, or much faster at the same level of
accuracy, than Tilde and Alchemy. kLog is GPLv3 licensed and is available at
http://klog.dinfo.unifi.it along with tutorials.