
    Batch-Incremental Learning for Mining Data Streams

    The data stream model for data mining places harsh restrictions on a learning algorithm. First, a model must be induced incrementally. Second, processing time for instances must keep up with their speed of arrival. Third, a model may only use a constant amount of memory, and must be ready for prediction at any point in time. We attempt to overcome these restrictions by presenting a data stream classification algorithm where the data is split into a stream of disjoint batches. Single batches of data can be processed one after the other by any standard non-incremental learning algorithm. Our approach uses ensembles of decision trees. These tree ensembles are iteratively merged into a single interpretable model of constant maximal size. Using benchmark datasets, the algorithm is evaluated for accuracy against state-of-the-art algorithms that make use of the entire dataset.
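    As a rough illustration of the batch-incremental idea (not the paper's tree-merging procedure), the sketch below buffers the stream into disjoint batches, fits a standard non-incremental decision tree on each completed batch, and keeps a size-bounded ensemble that votes at prediction time. The class name, batch size, depth limit, and majority vote are illustrative assumptions.

```python
# Minimal sketch of batch-incremental classification: each disjoint batch is
# fitted by a standard batch learner (a scikit-learn decision tree), and a
# bounded ensemble of such trees votes on predictions.
from collections import deque
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BatchIncrementalClassifier:
    def __init__(self, batch_size=500, max_trees=10):
        self.batch_size = batch_size          # size of each disjoint batch
        self.trees = deque(maxlen=max_trees)  # constant memory: oldest tree is dropped
        self._X, self._y = [], []             # buffer for the current batch

    def partial_fit(self, x, y):
        """Buffer one instance; train a new tree whenever a batch completes."""
        self._X.append(x)
        self._y.append(y)
        if len(self._X) >= self.batch_size:
            tree = DecisionTreeClassifier(max_depth=8)
            tree.fit(np.asarray(self._X), np.asarray(self._y))
            self.trees.append(tree)
            self._X, self._y = [], []

    def predict(self, X):
        """Majority vote over the current ensemble; assumes non-negative integer labels."""
        if not self.trees:
            return np.zeros(len(X), dtype=int)
        votes = np.stack([t.predict(X) for t in self.trees])
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

    Dropping the oldest tree keeps memory constant; the paper instead merges the ensemble into a single interpretable model, which is not reproduced here.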

    Private Incremental Regression

    Data is continuously generated by modern data sources, and a recent challenge in machine learning has been to develop techniques that perform well in an incremental (streaming) setting. In this paper, we investigate the problem of private machine learning, where, as is common in practice, the data is not given at once, but rather arrives incrementally over time. We introduce the problems of private incremental ERM and private incremental regression, where the general goal is to always maintain a good empirical risk minimizer for the history observed under differential privacy. Our first contribution is a generic transformation of private batch ERM mechanisms into private incremental ERM mechanisms, based on a simple idea of invoking the private batch ERM procedure at some regular time intervals. We take this construction as a baseline for comparison. We then provide two mechanisms for the private incremental regression problem. Our first mechanism is based on privately constructing a noisy incremental gradient function, which is then used in a modified projected gradient procedure at every timestep. This mechanism has an excess empirical risk of $\approx\sqrt{d}$, where $d$ is the dimensionality of the data. While the results of [Bassily et al. 2014] show this bound is tight in the worst case, we show that certain geometric properties of the input and constraint set can be used to derive significantly better results for certain interesting regression problems. Comment: To appear in PODS 201
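    For intuition only, here is a minimal sketch in the flavour of the noisy-projected-gradient mechanism: each arriving point contributes a squared-loss gradient that is perturbed with Gaussian noise before a projected step onto an l2 ball. The noise scale, step size, and radius are placeholder values, and the careful aggregation the paper uses to construct the private incremental gradient function is not reproduced.

```python
# Sketch of a noisy projected-gradient update for streaming least-squares regression.
import numpy as np

def project_l2_ball(theta, radius):
    """Project theta back onto the l2 ball of the given radius (the constraint set)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def private_incremental_regression(stream, d, radius=1.0, lr=0.05,
                                   noise_scale=0.5, seed=0):
    """stream yields (x, y) pairs with x of dimension d; yields an estimate per step."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for x, y in stream:
        grad = (x @ theta - y) * x                                  # squared-loss gradient
        noisy_grad = grad + rng.normal(scale=noise_scale, size=d)   # placeholder noise
        theta = project_l2_ball(theta - lr * noisy_grad, radius)    # projected step
        yield theta.copy()
```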

    A Note on Batch and Incremental Learnability

    According to Gold's criterion of identification in the limit, a learner, presented with data about a concept, is allowed to make a finite number of incorrect hypotheses before converging to a correct hypothesis. If, on the other hand, the learner is allowed to make only one conjecture, which has to be correct, the resulting criterion of success is known as finite identification. Identification in the limit may be viewed as an idealized model for incremental learning, whereas finite identification may be viewed as an idealized model for batch learning. The present paper establishes the surprising fact that the collections of recursively enumerable languages that can be finitely identified (batch learned in the ideal case) from both positive and negative data can also be identified in the limit (incrementally learned in the ideal case) from only positive data. It is often difficult to extract insights about practical learning systems from abstract theorems in inductive inference. However, this result may be seen as carrying a moral for the design of learning systems, as it yields, in the ideal case of no inaccuracies, an algorithm for converting batch systems that learn from both positive and negative data into incremental systems that learn from only positive data without any loss in learning power. This is achieved by the incremental system simulating the batch system in incremental fashion and using the heuristic of a “localized closed-world assumption” to generate negative data.
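    A toy sketch of that conversion, under simplifying assumptions (the domain is the natural numbers, and batch_learner is a placeholder callable): an incremental learner that sees only positive examples simulates a batch learner expecting both positive and negative data by treating every element of a local window of the domain that has not appeared as positive as negative.

```python
def incremental_from_batch(batch_learner):
    """Wrap a batch learner (expecting positive and negative data) as an
    incremental learner that is fed one positive example at a time."""
    positives = set()

    def observe(x):
        positives.add(x)
        bound = max(positives)                         # localize to the data seen so far
        negatives = set(range(bound + 1)) - positives  # closed-world within the window
        return batch_learner(frozenset(positives), frozenset(negatives))

    return observe

# Illustrative use with a trivial batch learner that memorizes the positive set:
learn = incremental_from_batch(lambda pos, neg: set(pos))
for example in (3, 1, 7):
    hypothesis = learn(example)
```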

    Incremental Learning of Nonparametric Bayesian Mixture Models

    Clustering is a fundamental task in many vision applications. To date, most clustering algorithms work in a batch setting and training examples must be gathered in a large group before learning can begin. Here we explore incremental clustering, in which data can arrive continuously. We present a novel incremental model-based clustering algorithm based on nonparametric Bayesian methods, which we call Memory Bounded Variational Dirichlet Process (MB-VDP). The number of clusters is determined flexibly by the data, and the approach can be used to automatically discover object categories. The computational requirements for producing model updates are bounded and do not grow with the amount of data processed. The technique is well suited to very large datasets, and we show that our approach outperforms existing online alternatives for learning nonparametric Bayesian mixture models.
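    The sketch below is a rough, memory-bounded illustration in the spirit of incremental Dirichlet-process mixture learning, not the MB-VDP algorithm itself: each chunk is fitted together with a small set of summary points carried over from earlier chunks, so memory stays bounded while a variational Dirichlet process chooses the number of active clusters. The function name, chunk handling, and compression rule are assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def incremental_dp_clustering(chunks, max_components=20, keep_per_chunk=20,
                              weight_floor=1e-3):
    """chunks: iterable of (n_i, d) arrays, each with at least max_components rows."""
    summaries = None          # compressed memory of earlier chunks (cluster means)
    model = None
    for chunk in chunks:
        data = chunk if summaries is None else np.vstack([summaries, chunk])
        model = BayesianGaussianMixture(
            n_components=max_components,
            weight_concentration_prior_type="dirichlet_process",
        ).fit(data)
        # Compress the past: keep only the means of clusters with noticeable weight,
        # so memory stays bounded regardless of how much data has been processed.
        active = model.weights_ > weight_floor
        summaries = model.means_[active][:keep_per_chunk]
    return model
```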

    Batch and incremental learning of decision trees
