The Aggregating Algorithm and Regression
Our main interest is in the problem of making predictions in the online mode of learning where at every step in time a signal arrives and a prediction needs to be made before the corresponding outcome arrives. Loss is suffered if the prediction and outcome do not match perfectly. In the prediction with expert advice framework, this protocol is augmented by a pool of experts that produce their predictions before we have to make ours. The Aggregating Algorithm (AA) is a technique that optimally merges these experts so that the resulting strategy suffers a cumulative loss that is almost as good as that of the best expert in the pool.
The AA was applied to the problem of regression, where outcomes are continuous real numbers, to get the AA for Regression (AAR) and its kernel version, KAAR. On typical datasets, KAAR’s empirical performance is not as good as that of Kernel Ridge Regression (KRR), which is a popular regression method. KAAR performs better than KRR only when the data is corrupted with lots of noise or contains severe outliers. To alleviate this, we introduce methods that are a hybrid between KRR and KAAR. Empirical experiments suggest that, in general, these new methods perform as well as or better than both KRR and KAAR.
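As a concrete point of reference for the comparison above, here is a minimal sketch of the Kernel Ridge Regression baseline, assuming an RBF kernel; the regularization constant a, the kernel width and the function names are illustrative choices, not taken from the dissertation. KAAR and the hybrid methods are, broadly, variations on this kind of regularized kernel solution.

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def krr_fit(X, y, a=1.0, sigma=1.0):
    # Dual coefficients alpha solve (K + a I) alpha = y.
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + a * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    # The prediction is a kernel expansion over the training points.
    return rbf_kernel(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = krr_fit(X, y, a=0.5)
print(krr_predict(X, alpha, X[:5]))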
In the second part of this dissertation we deal with a more difficult problem, in which we allow the dependence of outcomes on signals to change with time. To handle this we propose two new methods: WeCKAAR and KAARCh. WeCKAAR is a simple modification of one of our methods from the first part of the dissertation to include decaying weights. KAARCh is an application of the AA to the case where the experts are all the predictors that can change with time. We show that KAARCh suffers a cumulative loss that is almost as good as that of any expert that does not change very rapidly. Empirical results on data with changing dependencies demonstrate that WeCKAAR and KAARCh perform well in practice and are considerably better than Kernel Ridge Regression.
Competing with Gaussian linear experts
We study the problem of online regression. We prove a theoretical bound on
the square loss of Ridge Regression. We do not make any assumptions about input
vectors or outcomes. We also show that Bayesian Ridge Regression can be thought
of as an online algorithm competing with all the Gaussian linear experts.
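A short sketch of the protocol this abstract refers to: Ridge Regression refitted in closed form after every outcome, with the prediction always made before the outcome is revealed. The regularization constant a and the synthetic data are illustrative; the theoretical bound itself is not reproduced here.

import numpy as np

def online_ridge(signals, outcomes, a=1.0):
    # Maintain A = a*I + sum of x x^T and b = sum of y*x over the steps seen so far.
    d = signals.shape[1]
    A = a * np.eye(d)
    b = np.zeros(d)
    preds = []
    for x, y in zip(signals, outcomes):
        w = np.linalg.solve(A, b)   # current ridge weights
        preds.append(w @ x)         # predict before the outcome arrives
        A += np.outer(x, x)         # then fold in the revealed outcome
        b += y * x
    return np.array(preds)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
p = online_ridge(X, y)
print("cumulative square loss:", ((p - y) ** 2).sum())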
Applying Winnow to Context-Sensitive Spelling Correction
Multiplicative weight-updating algorithms such as Winnow have been studied
extensively in the COLT literature, but only recently have people started to
use them in applications. In this paper, we apply a Winnow-based algorithm to a
task in natural language: context-sensitive spelling correction. This is the
task of fixing spelling errors that happen to result in valid words, such as
substituting {\it to\/} for {\it too}, {\it casual\/} for {\it causal}, and so
on. Previous approaches to this problem have been statistics-based; we compare
Winnow to one of the more successful such approaches, which uses Bayesian
classifiers. We find that: (1)~When the standard (heavily-pruned) set of
features is used to describe problem instances, Winnow performs comparably to
the Bayesian method; (2)~When the full (unpruned) set of features is used,
Winnow is able to exploit the new features and convincingly outperform Bayes;
and (3)~When a test set is encountered that is dissimilar to the training set,
Winnow is better than Bayes at adapting to the unfamiliar test set, using a
strategy we will present for combining learning on the training set with
unsupervised learning on the (noisy) test set.
Comment: 9 pages
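For readers unfamiliar with Winnow, the following is a sketch of the standard promotion/demotion variant on binary features; the paper's actual feature set, parameter settings and test-time adaptation strategy are not reproduced here.

import numpy as np

def winnow_train(X, y, alpha=2.0, epochs=5):
    # X: (m, n) 0/1 feature matrix (e.g. indicator features for context words);
    # y: 0/1 labels. Mistake-driven multiplicative updates on the active features.
    n = X.shape[1]
    w = np.ones(n)
    theta = float(n)                      # conventional threshold
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ x >= theta else 0
            if pred == 0 and label == 1:
                w[x == 1] *= alpha        # promotion
            elif pred == 1 and label == 0:
                w[x == 1] /= alpha        # demotion
    return w

rng = np.random.default_rng(2)
X = (rng.random((300, 40)) < 0.3).astype(int)
y = (X[:, :3].sum(axis=1) >= 2).astype(int)   # a sparse target concept
w = winnow_train(X, y)
print("heaviest features:", np.argsort(w)[-3:])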
Sequential anomaly detection in the presence of noise and limited feedback
This paper describes a methodology for detecting anomalies from sequentially
observed and potentially noisy data. The proposed approach consists of two main
elements: (1) {\em filtering}, or assigning a belief or likelihood to each
successive measurement based upon our ability to predict it from previous noisy
observations, and (2) {\em hedging}, or flagging potential anomalies by
comparing the current belief against a time-varying and data-adaptive
threshold. The threshold is adjusted based on the available feedback from an
end user. Our algorithms, which combine universal prediction with recent work
on online convex programming, do not require computing posterior distributions
given all current observations and involve simple primal-dual parameter
updates. At the heart of the proposed approach lie exponential-family models
which can be used in a wide variety of contexts and applications, and which
yield methods that achieve sublinear per-round regret against both static and
slowly varying product distributions with marginals drawn from the same
exponential family. Moreover, the regret against static distributions coincides
with the minimax value of the corresponding online strongly convex game. We
also prove bounds on the number of mistakes made during the hedging step
relative to the best offline choice of the threshold with access to all
estimated beliefs and feedback signals. We validate the theory on synthetic
data drawn from a time-varying distribution over binary vectors of high
dimensionality, as well as on the Enron email dataset.
Comment: 19 pages, 12 PDF figures; final version to be published in IEEE Transactions on Information Theory.
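A highly simplified paraphrase of the filtering-plus-hedging idea is sketched below: a running Bernoulli product model (one member of the exponential family mentioned above) scores each binary observation, and the threshold on the resulting surprise is nudged whenever feedback says a flag was wrong. The update rules and step sizes are placeholders, not the paper's actual primal-dual scheme.

import numpy as np

def detect(stream, feedback, eta_model=0.05, eta_thresh=0.5):
    # stream: (T, d) array of 0/1 observations; feedback: dict mapping a time
    # index to True (anomaly) or False (normal) for the steps the end user labels.
    d = stream.shape[1]
    p = np.full(d, 0.5)          # running per-coordinate Bernoulli means (filtering)
    tau = d * np.log(2.0)        # threshold on the negative log-likelihood (hedging)
    flags = []
    for t, x in enumerate(stream):
        q = np.clip(p, 1e-3, 1 - 1e-3)
        surprise = -(x * np.log(q) + (1 - x) * np.log(1 - q)).sum()
        flagged = surprise > tau
        flags.append(flagged)
        if t in feedback:
            if flagged and not feedback[t]:
                tau += eta_thresh        # false alarm: raise the bar
            elif not flagged and feedback[t]:
                tau -= eta_thresh        # missed anomaly: lower the bar
        p = (1 - eta_model) * p + eta_model * x   # track the slowly varying distribution
    return flags

rng = np.random.default_rng(3)
normal_traffic = (rng.random((500, 20)) < 0.1).astype(int)
print(sum(detect(normal_traffic, feedback={})))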
Beyond Word N-Grams
We describe, analyze, and evaluate experimentally a new probabilistic model
for word-sequence prediction in natural language based on prediction suffix
trees (PSTs). By using efficient data structures, we extend the notion of PST
to unbounded vocabularies. We also show how to use a Bayesian approach based on
recursive priors over all possible PSTs to efficiently maintain tree mixtures.
These mixtures have provably and practically better performance than almost any
single model. We evaluate the model on several corpora. The low perplexity
achieved by relatively small PST mixture models suggests that they may be an
advantageous alternative, both theoretically and practically, to the widely
used n-gram models.
Comment: 15 pages, one PostScript figure, uses psfig.sty and fullname.sty. Revised version of a paper in the Proceedings of the Third Workshop on Very Large Corpora, MIT, 1995.
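As a rough illustration of the suffix-context idea (without the Bayesian mixture over all PSTs or the unbounded-vocabulary data structures of the paper), the following toy model keeps counts for every context up to a fixed depth and predicts from the longest context it has actually seen.

from collections import Counter, defaultdict

class SuffixContextModel:
    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.counts = defaultdict(Counter)   # context tuple -> counts of next words

    def train(self, words):
        for i, w in enumerate(words):
            for d in range(self.max_depth + 1):
                if i - d < 0:
                    break
                self.counts[tuple(words[i - d:i])][w] += 1

    def predict(self, history):
        # Back off from the longest suffix of the history that has been observed.
        for d in range(min(self.max_depth, len(history)), -1, -1):
            ctx = tuple(history[len(history) - d:])
            if self.counts[ctx]:
                return self.counts[ctx].most_common(1)[0][0]
        return None

model = SuffixContextModel(max_depth=2)
model.train("the cat sat on the mat and the cat ate".split())
print(model.predict("the cat".split()))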
Improved Generalization Bounds for Robust Learning
We consider a model of robust learning in an adversarial environment. The
learner gets uncorrupted training data with access to possible corruptions that
may be applied by the adversary during testing. The learner's goal is to build
a robust classifier that would be tested on future adversarial examples. We use
a zero-sum game between the learner and the adversary as our game theoretic
framework. The adversary is limited to possible corruptions for each input.
Our model is closely related to the adversarial examples model of Schmidt et
al. (2018); Madry et al. (2017).
Our main results consist of generalization bounds for binary and multi-class
classification, as well as for the real-valued case (regression). For
the binary classification setting, we both tighten the generalization bound of
Feige, Mansour, and Schapire (2015), and also are able to handle an infinite
hypothesis class. The sample complexity bound is improved as well. Additionally, we
extend the algorithm and generalization bound from the binary to the multiclass
and real-valued cases. Along the way, we obtain results on fat-shattering
dimension and Rademacher complexity of $k$-fold maxima over function classes;
these may be of independent interest.
For binary classification, the algorithm of Feige et al. (2015) uses a regret
minimization algorithm and an ERM oracle as a blackbox; we adapt it for the
multi-class and regression settings. The algorithm provides us with
near-optimal policies for the players on a given training sample.
Comment: Appearing at the 30th International Conference on Algorithmic Learning Theory (ALT 2019).
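The regret-minimization-plus-ERM-oracle loop mentioned above can be sketched roughly as follows; the erm_oracle callback, the per-example corruption lists and the multiplicative-weights update are placeholders standing in for the actual procedure of Feige et al. (2015) and its extensions, not a faithful reimplementation.

import numpy as np

def robust_learn(corruptions, labels, erm_oracle, rounds=50, eta=0.5):
    # corruptions[i] is the list of k allowed perturbations of example i;
    # erm_oracle(samples, labels) returns a classifier h with h(x) in {0, 1}.
    rng = np.random.default_rng(0)
    m, k = len(corruptions), len(corruptions[0])
    W = np.ones((m, k))                      # adversary's weights over corruptions
    hypotheses = []
    for _ in range(rounds):
        P = W / W.sum(axis=1, keepdims=True)
        # The adversary plays a corruption of each example from its current mixed strategy.
        picks = [corruptions[i][rng.choice(k, p=P[i])] for i in range(m)]
        h = erm_oracle(picks, labels)        # the learner best-responds via the oracle
        hypotheses.append(h)
        for i in range(m):                   # reward corruptions the learner got wrong
            for j in range(k):
                if h(corruptions[i][j]) != labels[i]:
                    W[i, j] *= np.exp(eta)
    return hypotheses

def majority_vote(hypotheses, x):
    # The robust classifier aggregates the per-round hypotheses.
    return int(sum(h(x) for h in hypotheses) * 2 >= len(hypotheses))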