PAC-Bayesian Domain Adaptation Bounds for Multiclass Learners
Multiclass neural networks are a common tool in modern unsupervised domain
adaptation, yet an appropriate theoretical description for their non-uniform
sample complexity is lacking in the adaptation literature. To fill this gap, we
propose the first PAC-Bayesian adaptation bounds for multiclass learners. We
facilitate practical use of our bounds by also proposing the first
approximation techniques for the multiclass distribution divergences we
consider. For divergences dependent on a Gibbs predictor, we propose additional
PAC-Bayesian adaptation bounds which remove the need for inefficient
Monte-Carlo estimation. Empirically, we test the efficacy of our proposed
approximation techniques as well as some novel design-concepts which we include
in our bounds. Finally, we apply our bounds to analyze a common adaptation
algorithm that uses neural networks.
Le Cam meets LeCun: Deficiency and Generic Feature Learning
"Deep Learning" methods attempt to learn generic features in an unsupervised
fashion from a large unlabelled data set. These generic features should perform
as well as the best hand-crafted features for any learning problem that makes
use of this data. We provide a definition of generic features, characterize
when it is possible to learn them and provide methods closely related to the
autoencoder and deep belief network of deep learning. In order to do so we use
the notion of deficiency and illustrate its value in studying certain general
learning problems. Comment: 25 pages, 2 figures
Composite multiclass losses
We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a "proper composite loss", which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We subsume results on "classification calibration" by relating it to properness. We determine the stationarity condition, Bregman representation, order-sensitivity, and quasi-convexity of multiclass proper losses. We then characterise the existence and uniqueness of the composite representation for multiclass losses. We show how the composite representation is related to other core properties of a loss: mixability, admissibility and (strong) convexity of multiclass losses, which we characterise in terms of the Hessian of the Bayes risk. We show that the simple integral representation for binary proper losses cannot be extended to multiclass losses, but offer concrete guidance regarding how to design different loss functions. The conclusion drawn from these results is that the proper composite representation is a natural and convenient tool for the design of multiclass loss functions.
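As an illustration of the composition described in this abstract, the following is a minimal sketch, not taken from the paper: it assumes the log loss as the proper loss and softmax as the inverse link, both standard textbook choices. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Inverse link: map real-valued scores to a probability vector."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(label, probs):
    """Proper loss: negative log-probability assigned to the true label."""
    return -math.log(probs[label])

def composite_loss(label, scores):
    """Proper composite loss: the proper loss applied after the inverse link."""
    return log_loss(label, softmax(scores))

def expected_loss(true_probs, predicted_probs):
    """Expected log loss when labels are drawn from true_probs."""
    return sum(p * log_loss(y, predicted_probs)
               for y, p in enumerate(true_probs))
```

Properness means that the expected loss is minimised by predicting the true class probabilities, so `expected_loss(p, p)` should not exceed `expected_loss(p, q)` for any other prediction `q`.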
Information Processing Equalities and the Information-Risk Bridge
We introduce two new classes of measures of information for statistical
experiments which generalise and subsume $\phi$-divergences, integral
probability metrics, $\mathfrak{N}$-distances (MMD), and $(f,\Gamma)$
divergences between two or more distributions. This enables us to derive a
simple geometrical relationship between measures of information and the Bayes
risk of a statistical decision problem, thus extending the variational
$\phi$-divergence representation to multiple distributions in an entirely
symmetric manner. The new families of divergence are closed under the action of
Markov operators which yields an information processing equality which is a
refinement and generalisation of the classical data processing inequality. This
equality gives insight into the significance of the choice of the hypothesis
class in classical risk minimization. Comment: 48 pages; corrected some typos and added a few additional
explanations
Cost-sensitive classification based on Bregman divergences
The main objective of this PhD thesis is the identification, characterization and
study of new loss functions to address so-called cost-sensitive classification. Many
decision problems are intrinsically cost-sensitive. However, the dominating preference
for cost-insensitive methods in the machine learning literature is a natural consequence
of the fact that true costs in real applications are difficult to evaluate.
Since, in general, uncovering the correct class of the data is less costly than any
decision error, designing low error decision systems is a reasonable (but suboptimal)
approach. For instance, consider the classification of credit applicants as either good customers (who will pay back the credit) or bad customers (who will fail to pay off part of the credit). The cost of classifying one risky borrower as good could be much higher than the cost of classifying a potentially good customer as bad.
Our proposal relies on Bayes decision theory where the goal is to assign instances
to the class with minimum expected cost. The decision is made involving both costs and posterior probabilities of the classes. Obtaining calibrated probability
estimates at the classifier output requires a suitable learning machine, a large enough
representative data set as well as an adequate loss function to be minimized during
learning. The design of the loss function can be aided by the costs: classical decision
theory shows that cost matrices define class boundaries determined by posterior class
probability estimates. Strictly speaking, in order to make optimal decisions, accurate
probability estimates are only required near the decision boundaries. It is key to
point out that the choice of the loss function becomes especially relevant when
the prior knowledge about the problem is limited or the available training examples
are somehow unsuitable. In those cases, different loss functions lead to dramatically
different posterior probability estimates. We focus our study on the set of Bregman
divergences. These divergences offer a rich family of proper losses that has recently
become very popular in the machine learning community [Nock and Nielsen, 2009,
Reid and Williamson, 2009a].
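The minimum expected cost rule described above can be sketched as follows. This is a minimal illustration, not the thesis's method: the cost values are hypothetical, the function names are invented for the example, and the convention `cost_matrix[j][k]` (cost of deciding class `j` when the true class is `k`) is an assumption.

```python
def expected_costs(posteriors, cost_matrix):
    """Expected cost of each decision under the given class posteriors.

    cost_matrix[j][k] is the (assumed) cost of deciding class j
    when the true class is k.
    """
    return [sum(c * p for c, p in zip(row, posteriors))
            for row in cost_matrix]

def bayes_decision(posteriors, cost_matrix):
    """Index of the decision with minimum expected cost."""
    costs = expected_costs(posteriors, cost_matrix)
    return min(range(len(costs)), key=costs.__getitem__)

# Hypothetical credit example: class 0 = good customer, class 1 = bad customer.
# Accepting a bad customer is assumed to cost five times as much as
# rejecting a good one.
credit_costs = [
    [0.0, 5.0],  # decide "good": free if good, cost 5 if actually bad
    [1.0, 0.0],  # decide "bad": cost 1 if actually good, free if bad
]
```

With these illustrative costs, an applicant with posterior `[0.8, 0.2]` is classified as bad even though "good" is the more probable class, because the expected cost of accepting (1.0) exceeds that of rejecting (0.8). The two expected costs are equal when the posterior probability of "bad" is 1/6, which is why accurate probability estimates matter most near that boundary.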
The first part of the Thesis deals with the development of a novel parametric family of multiclass Bregman divergences which captures the information in the cost
matrix, so that the loss function is adapted to each specific problem. Multiclass cost-sensitive learning is one of the main challenges in cost-sensitive learning and, through this parametric family, we provide a natural framework for moving beyond
binary tasks. Following this idea, two lines are explored:
Cost-sensitive supervised classification: We derive several asymptotic results.
The first analysis guarantees that the proposed Bregman divergence has maximum sensitivity to changes at probability vectors near the decision regions. Further analysis shows that the optimization of this Bregman divergence becomes equivalent to minimizing the overall cost regret in non-separable problems, and to maximizing a margin in separable problems.
Cost-sensitive semi-supervised classification: When labeled data is
scarce but unlabeled data is widely available, semi-supervised learning is a
useful tool to make the most of the unlabeled data. We discuss an optimization
problem relying on the minimization of our parametric family of Bregman divergences, using both labeled and unlabeled data, based on what is called the Entropy Minimization principle. We propose the first multiclass cost-sensitive semi-supervised algorithm, under the assumption that inter-class separation is stronger than intra-class separation.
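For readers unfamiliar with Bregman divergences, the general definition can be sketched in a few lines. This is a textbook illustration only: the thesis's cost-sensitive parametric family is not specified in this abstract and is not reproduced here. The generator used below (negative Shannon entropy) is an assumption chosen because it yields the familiar Kullback-Leibler divergence on the probability simplex.

```python
import math

def bregman_divergence(phi, grad_phi, p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner

def neg_entropy(p):
    """Convex generator: negative Shannon entropy (entries must be > 0)."""
    return sum(pi * math.log(pi) for pi in p)

def grad_neg_entropy(p):
    """Gradient of the negative entropy generator."""
    return [math.log(pi) + 1.0 for pi in p]

def kl_divergence(p, q):
    """Kullback-Leibler divergence between strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

For two probability vectors with strictly positive entries, `bregman_divergence(neg_entropy, grad_neg_entropy, p, q)` agrees with `kl_divergence(p, q)` up to floating-point error; a different convex generator gives a different member of the family.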
The second part of the Thesis deals with the transformation of this parametric family of Bregman divergences into a sequence of Bregman divergences. Work along this line can be further divided into two additional areas:
Foundations of sequences of Bregman divergences: We generalize some
previous results about the design and characterization of Bregman divergences
that are suitable for learning and their relationship with convexity. In addition,
we aim to broaden the subset of Bregman divergences that are interesting for
cost-sensitive learning. Under very general conditions, we find sequences of (cost-sensitive) Bregman divergences, whose minimization provides minimum (cost-sensitive) risk for non-separable problems and some type of maximum margin classifiers in separable cases.
Learning with example-dependent costs: A strong assumption is widespread through most cost-sensitive learning algorithms: misclassification costs are the same for all examples. In many cases this statement is not true.
We claim that using the example-dependent costs directly is more natural and will lead to more accurate classifiers. For these reasons, we consider the extension of cost-sensitive sequences of Bregman losses to example-dependent cost scenarios to generate finely tuned posterior probability estimates.