
    Improving adaptive bagging methods for evolving data streams

    We propose two improvements to bagging methods for evolving data streams. Two variants of Bagging were recently proposed: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. ASHT Bagging uses trees of different sizes, while ADWIN Bagging uses ADWIN as a change detector to decide when to discard underperforming ensemble members. We improve ADWIN Bagging by using Hoeffding Adaptive Trees, which can adaptively learn from data streams that change over time. To speed up ASHT Bagging's adaptation to change, we add an error-based change detector to each classifier. We evaluate our improvements on synthetic and real-world datasets comprising up to ten million examples.
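
    The core idea, one change detector per ensemble member, can be sketched as below. This is a minimal illustration, not the paper's MOA implementation: `SimpleErrorChangeDetector` is a crude stand-in for ADWIN, the `learn_one`/`predict_one` learner interface is an assumed incremental-learner API, and the Hoeffding Adaptive Trees themselves are not shown.

```python
import math
import random

class SimpleErrorChangeDetector:
    """Crude stand-in for ADWIN: flag change when the recent error rate
    rises well above the long-run error rate."""
    def __init__(self, window=200, threshold=0.15):
        self.window, self.threshold = window, threshold
        self.recent, self.seen, self.errors = [], 0, 0

    def add(self, err):                       # err is 0 (correct) or 1 (mistake)
        self.seen += 1
        self.errors += err
        self.recent.append(err)
        if len(self.recent) > self.window:
            self.recent.pop(0)
        long_run = self.errors / self.seen
        recent = sum(self.recent) / len(self.recent)
        return recent - long_run > self.threshold

class AdaptiveBagging:
    """Online bagging where every ensemble member carries its own change
    detector and is rebuilt when its error distribution shifts."""
    def __init__(self, make_learner, n_members=10, seed=1):
        self.make_learner = make_learner
        self.members = [make_learner() for _ in range(n_members)]
        self.detectors = [SimpleErrorChangeDetector() for _ in range(n_members)]
        self.rng = random.Random(seed)

    def _poisson1(self):
        # Knuth's algorithm for Poisson(lambda = 1), as in online bagging
        L, k, p = math.exp(-1.0), 0, 1.0
        while p > L:
            k += 1
            p *= self.rng.random()
        return k - 1

    def learn_one(self, x, y):
        for i in range(len(self.members)):
            err = int(self.members[i].predict_one(x) != y)   # prequential test
            if self.detectors[i].add(err):                   # change detected:
                self.members[i] = self.make_learner()        # replace the member
                self.detectors[i] = SimpleErrorChangeDetector()
            for _ in range(self._poisson1()):                # Poisson(1) weight
                self.members[i].learn_one(x, y)

    def predict_one(self, x):
        votes = [m.predict_one(x) for m in self.members]
        return max(set(votes), key=votes.count)              # majority vote
```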

    Combining similarity in time and space for training set formation under concept drift

    Concept drift is a challenge in supervised learning from sequential data: the data distribution changes over time. In such cases, classifier accuracy benefits from selective sampling of the training data. We develop a method for training set selection that is particularly relevant when the expected drift is gradual. At each time step the training set is selected based on distance to the target instance, using a distance function that combines similarity in space and similarity in time. The method determines the optimal training set size online at every time step using cross-validation. It is a wrapper approach, so different base classifiers can be plugged in. The proposed method achieves the best accuracy in its peer group on real and artificial drifting data, and its complexity is reasonable for field applications.
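
    A minimal sketch of this selection scheme follows. The linear combination of spatial and temporal distance, the recency weight `alpha`, and the candidate sizes are illustrative assumptions; the paper's exact distance function and size search may differ.

```python
import numpy as np

def spatiotemporal_distances(X, timestamps, x_t, t_now, alpha=0.1):
    """Combined distance: Euclidean distance in feature space plus a
    weighted age penalty (alpha is an assumed recency weight)."""
    d_space = np.linalg.norm(X - x_t, axis=1)
    d_time = np.abs(timestamps - t_now)
    return d_space + alpha * d_time

def select_training_set(X, y, timestamps, x_t, t_now, score_fn,
                        candidate_sizes=(50, 100, 200, 400)):
    """Rank past instances by combined distance to the target instance,
    then pick the neighborhood size with the best cross-validated score.
    score_fn(X_sub, y_sub) -> float is supplied by the caller, so any
    base classifier can be plugged in (wrapper approach)."""
    order = np.argsort(spatiotemporal_distances(X, timestamps, x_t, t_now))
    best_idx, best_score = None, -np.inf
    for k in candidate_sizes:
        idx = order[:k]
        score = score_fn(X[idx], y[idx])
        if score > best_score:
            best_idx, best_score = idx, score
    return X[best_idx], y[best_idx]
```

    Here `score_fn` could be, for instance, `lambda Xs, ys: cross_val_score(GaussianNB(), Xs, ys).mean()` with scikit-learn, so the training set size is tuned to the chosen base classifier.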

    On the Window Size for Classification in Changing Environments

    Classification in changing environments (commonly known as concept drift) requires the classifier to adapt to the changes. One approach is to keep a moving window over the streaming data and constantly update the classifier on it. Here we consider an abrupt-change scenario where one set of class probability distributions is instantly replaced by another. For a fixed 'transition period' around the change, we derive a generic relationship between the size of the moving window and the classification error rate. For the case of two Gaussian classes, where the concept change is a geometric displacement of the whole class configuration in the space, we derive expressions for the error during the transition period and for the optimal window size. A simple window-resize strategy based on the derived relationship is proposed and compared with fixed-size windows on a real benchmark data set (Electricity Market).
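
    The abstract does not reproduce the derived window-size expression, so the sketch below substitutes a simple empirical heuristic for the analytic rule: try several candidate window sizes over the tail of the stream and keep the one that scores best on a recent holdout. The candidate sizes, holdout length, and `GaussianNB` base classifier are all assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB

def best_window_classifier(X, y, sizes=(50, 100, 200, 400),
                           holdout=25, base=GaussianNB()):
    """Train one classifier per candidate window size on the tail of the
    stream seen so far, score each on the most recent `holdout` examples,
    and return the winner.  Requires len(X) > holdout; this heuristic
    stands in for the paper's analytically derived optimal window size."""
    X_val, y_val = X[-holdout:], y[-holdout:]
    best_clf, best_acc = None, -1.0
    for w in sizes:
        X_tr = X[-(w + holdout):-holdout]      # window just before the holdout
        y_tr = y[-(w + holdout):-holdout]
        if len(X_tr) == 0:
            continue
        clf = clone(base).fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)
        if acc > best_acc:
            best_clf, best_acc = clf, acc
    return best_clf
```

    After an abrupt change, small windows win because they exclude stale pre-change examples; once the environment stabilizes, larger windows win again, which is the trade-off the paper analyzes.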

    An ensemble-based computational approach for incremental learning in non-stationary environments related to schema- and scaffolding-based human learning

    The principal dilemma in any learning process, whether human or computer, is adapting to new information, especially when that information conflicts with what was previously learned. Computer models for incremental learning are an emerging topic for classification and prediction on large-scale data streams whose underlying class distributions (definitions) change over time; yet such models often ignore the substantial learning theory developed in the domain of human learning. This shortfall leads to deficiencies in organizing existing knowledge and retaining relevant knowledge over long periods. In this work, we introduce a computer-learning algorithm for incremental knowledge acquisition using an ensemble of classifiers, Learn++.NSE (Non-Stationary Environments), designed for the case where the nature of the knowledge to be learned evolves. Learn++.NSE is a novel approach to evaluating and organizing existing knowledge (classifiers) according to the most recent data environment. Under this architecture, we address the learning problem at both the learner and supervisor end, discussing and implementing three main approaches: knowledge weighting/organization, forgetting prior knowledge, and change/drift detection. The framework is evaluated on a variety of canonical and real-world data streams (weather prediction, electricity price prediction, and spam detection). The study reveals the catastrophic effect of forgetting prior knowledge and supports the organization technique of Learn++.NSE as the most consistent performer across drift scenarios, while also highlighting the difficulty of designing a system that balances maintaining all knowledge against making decisions based only on relevant knowledge, especially in the severe, unpredictable environments often encountered in the real world.
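
    The "knowledge weighting/organization" component can be illustrated as below: each retained classifier's errors on successive batches are averaged with sigmoid weights that emphasize the newest environments, then turned into a log voting weight, so old classifiers are down-weighted rather than discarded. This is a simplified rendering of the Learn++.NSE-style weighting, and the slope/offset parameters `a` and `b` are assumed values, not the paper's.

```python
import numpy as np

def nse_voting_weights(error_history, a=0.5, b=10):
    """error_history[k][j]: error of classifier k on batch j (newest last).
    Returns one voting weight per classifier; higher means more trusted
    in the current environment."""
    weights = []
    for errs in error_history:
        errs = np.clip(np.asarray(errs, dtype=float), 1e-6, 0.5)  # cap at chance
        t = np.arange(len(errs))
        # sigmoid time weights: near 1 for recent batches, near 0 for old ones
        omega = 1.0 / (1.0 + np.exp(-a * (t - (len(errs) - 1 - b))))
        omega /= omega.sum()
        beta = np.sum(omega * (errs / (1.0 - errs)))   # time-averaged error ratio
        weights.append(np.log(1.0 / beta))             # log voting weight >= 0
    return np.array(weights)
```

    The ensemble prediction is then a weighted majority vote; a classifier that performed poorly on recent batches contributes little now but regains influence if its environment recurs, which is what distinguishes organization from outright forgetting.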

    Incremental learning of concept drift from imbalanced data

    Learning from data sampled from a nonstationary distribution is a challenging problem in machine learning, because the joint probability distribution of the data and classes evolves over time. Learners must therefore adapt their knowledge base, including their structure or parameters, to remain strong predictors. This phenomenon of learning from an evolving data source is akin to learning to play a game while the rules of the game change, and is traditionally referred to as learning under concept drift. Climate data, financial data, epidemiological data, and spam detection are examples of applications that give rise to concept drift problems. An additional challenge arises when the classes to be learned are not represented (approximately) equally in the training data, as most machine learning algorithms work well only when the class distributions are balanced. Rare categories are, however, common in real-world applications, leading to skewed or imbalanced datasets. Fraud detection, rare disease diagnosis, and anomaly detection are examples of applications that feature imbalanced datasets, where data from one class are severely underrepresented. Concept drift and class imbalance are traditionally addressed separately in machine learning, yet data streams can exhibit both phenomena. This work introduces Learn++.NIE (nonstationary and imbalanced environments) and Learn++.CDS (concept drift with SMOTE) as two new members of the Learn++ family of incremental learning algorithms that explicitly and simultaneously address both phenomena. The former addresses concept drift and class imbalance through modified bagging-based sampling and by replacing the class-independent error weighting mechanism, which normally favors the majority class, with a set of measures that emphasize good predictive accuracy on all classes. The latter integrates Learn++.NSE, an algorithm for concept drift, with the synthetic oversampling method SMOTE to cope with class imbalance. This work also includes a thorough evaluation of Learn++.CDS and Learn++.NIE on several real and synthetic datasets and on several figures of merit, showing that both algorithms are able to learn in some of the most difficult learning environments.
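
    The SMOTE step that Learn++.CDS builds on can be sketched in a few lines: synthetic minority samples are created by interpolating between a minority point and one of its k nearest minority-class neighbors. This is a minimal sketch of the standard SMOTE procedure only, not of the full Learn++.CDS algorithm; the parameter defaults are assumptions.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE: create n_new synthetic minority samples by
    interpolating between a random minority point and one of its k
    nearest minority-class neighbors.  Requires len(X_min) >= 2."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-neighbors
    neighbors = np.argsort(d, axis=1)[:, :k]    # up to k nearest per point
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = neighbors[i, rng.integers(min(k, n - 1))]
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

    The synthetic points are then added to the minority class before each incremental training batch, balancing the class distribution without simply duplicating rare examples.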