Batch-Incremental Learning for Mining Data Streams
The data stream model for data mining places harsh restrictions on a learning algorithm. First, a model must be induced incrementally. Second, processing time for instances must keep up with their speed of arrival. Third, a model may only use a constant amount of memory, and must be ready for prediction at any point in time. We attempt to overcome these restrictions by presenting a data stream classification algorithm where the data is split into a stream of disjoint batches. Single batches of data can be processed one after the other by any standard non-incremental learning algorithm. Our approach uses ensembles of decision trees. These tree ensembles are iteratively merged into a single interpretable model of constant maximal size. Using benchmark datasets, the algorithm is evaluated for accuracy against state-of-the-art algorithms that make use of the entire dataset.
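The batch-incremental idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the stream is cut into disjoint batches, a non-incremental learner is trained on each batch, and only the most recent `max_models` models are kept and combined by majority vote. The paper instead merges tree ensembles into a single model of bounded size; the eviction rule, the one-feature `train_stump` learner, and all names below are assumptions made for illustration.

```python
from collections import Counter, deque
from itertools import islice


def batches(stream, batch_size):
    """Cut an iterable of (x, y) pairs into disjoint lists of up to batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def train_stump(batch):
    """Hypothetical non-incremental learner: a one-feature threshold classifier.

    Tries every observed x value as a threshold, predicts the majority label
    on each side, and keeps the threshold with the fewest training errors.
    """
    default = Counter(y for _, y in batch).most_common(1)[0][0]
    best = None
    for t in sorted({x for x, _ in batch}):
        left = [y for x, y in batch if x <= t]
        right = [y for x, y in batch if x > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else default
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    _, t, l_lab, r_lab = best
    return lambda x: l_lab if x <= t else r_lab


class BatchIncrementalEnsemble:
    """Keeps at most max_models batch-trained models (assumed eviction: oldest first)."""

    def __init__(self, max_models=5):
        self.models = deque(maxlen=max_models)  # memory stays bounded

    def learn_batch(self, batch):
        self.models.append(train_stump(batch))

    def predict(self, x):
        # Majority vote; requires at least one batch to have been learned.
        votes = Counter(m(x) for m in self.models)
        return votes.most_common(1)[0][0]


# Synthetic stream (assumption): label is 1 exactly when x >= 5.
stream = [(i % 10, int(i % 10 >= 5)) for i in range(200)]
ensemble = BatchIncrementalEnsemble(max_models=3)
for batch in batches(stream, batch_size=20):
    ensemble.learn_batch(batch)
```

Bounding the number of stored models is one simple way to respect the constant-memory restriction; it trades the interpretability of a single merged model for a much simpler update step.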
Private Incremental Regression
Data is continuously generated by modern data sources, and a recent challenge
in machine learning has been to develop techniques that perform well in an
incremental (streaming) setting. In this paper, we investigate the problem of
private machine learning, where as common in practice, the data is not given at
once, but rather arrives incrementally over time.
We introduce the problems of private incremental ERM and private incremental
regression where the general goal is to always maintain a good empirical risk
minimizer for the history observed under differential privacy. Our first
contribution is a generic transformation of private batch ERM mechanisms into
private incremental ERM mechanisms, based on a simple idea of invoking the
private batch ERM procedure at some regular time intervals. We take this
construction as a baseline for comparison. We then provide two mechanisms for
the private incremental regression problem. Our first mechanism is based on
privately constructing a noisy incremental gradient function, which is then
used in a modified projected gradient procedure at every timestep. This
mechanism has an excess empirical risk of approximately √d, where d is the
dimensionality of the data. While from the results of [Bassily et al. 2014]
this bound is tight in the worst-case, we show that certain geometric
properties of the input and constraint set can be used to derive significantly
better results for certain interesting regression problems. Comment: To appear in PODS 201
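The baseline transformation described in this abstract, re-running a batch procedure at regular time intervals, can be sketched as below. This is only a structural sketch: `batch_erm` stands in for a private batch ERM mechanism and is an assumption, the `mean_estimator` placeholder used in the usage lines is not private, and the privacy accounting across repeated invocations is omitted entirely.

```python
class PeriodicBatchWrapper:
    """Re-run a batch procedure on the full history every `interval` steps.

    Between invocations the most recent estimate is reused unchanged.
    """

    def __init__(self, batch_erm, interval, initial_estimate=None):
        self.batch_erm = batch_erm        # hypothetical batch mechanism
        self.interval = interval          # number of arrivals between re-runs
        self.history = []                 # all points observed so far
        self.estimate = initial_estimate  # output published at each timestep

    def observe(self, point):
        self.history.append(point)
        if len(self.history) % self.interval == 0:
            self.estimate = self.batch_erm(self.history)
        return self.estimate


def mean_estimator(history):
    """Non-private placeholder for the batch procedure (assumption)."""
    return sum(history) / len(history)


wrapper = PeriodicBatchWrapper(mean_estimator, interval=3)
outputs = [wrapper.observe(x) for x in [1.0, 2.0, 3.0, 4.0, 5.0, 8.0]]
```

Here the estimate is refreshed only at the third and sixth arrivals, so the output at intermediate timesteps is stale by design; the choice of interval trades accuracy between refreshes against the number of batch invocations.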
A Note on Batch and Incremental Learnability
According to Gold's criterion of identification in the limit, a learner, presented with data about a concept, is allowed to make a finite number of incorrect hypotheses before converging to a correct hypothesis. If, on the other hand, the learner is allowed to make only one conjecture which has to be correct, the resulting criterion of success is known as finite identification. Identification in the limit may be viewed as an idealized model for incremental learning whereas finite identification may be viewed as an idealized model for batch learning. The present paper establishes the surprising fact that the collections of recursively enumerable languages that can be finitely identified (batch learned in the ideal case) from both positive and negative data can also be identified in the limit (incrementally learned in the ideal case) from only positive data. It is often difficult to extract insights about practical learning systems from abstract theorems in inductive inference. However, this result may be seen as carrying a moral for the design of learning systems, as it yields, in the ideal case of no inaccuracies, an algorithm for converting batch systems that learn from both positive and negative data into incremental systems that learn from only positive data without any loss in learning power. This is achieved by the incremental system simulating the batch system in incremental fashion and using the heuristic of “localized closed-world assumption” to generate negative data.
Incremental Learning of Nonparametric Bayesian Mixture Models
Clustering is a fundamental task in many vision applications.
To date, most clustering algorithms work in a
batch setting and training examples must be gathered in a
large group before learning can begin. Here we explore
incremental clustering, in which data can arrive continuously.
We present a novel incremental model-based clustering
algorithm based on nonparametric Bayesian methods,
which we call Memory Bounded Variational Dirichlet
Process (MB-VDP). The number of clusters is determined
flexibly by the data and the approach can be used to automatically
discover object categories. The computational resources
required to produce model updates are bounded
and do not grow with the amount of data processed. The
technique is well suited to very large datasets, and we show
that our approach outperforms existing online alternatives
for learning nonparametric Bayesian mixture models.