20 research outputs found

    Drift Detection Using Uncertainty Distribution Divergence

    Concept drift is believed to be prevalent in most data gathered from naturally occurring processes and thus warrants research by the machine learning community. There are a myriad of approaches to concept drift handling which have been shown to handle concept drift with varying degrees of success. However, most approaches make the key assumption that the labelled data will be available at no labelling cost shortly after classification, an assumption which is often violated. The high labelling cost in many domains provides a strong motivation to reduce the number of labelled instances required to handle concept drift. Explicit detection approaches that do not require labelled instances to detect concept drift show great promise for achieving this. Our approach, Confidence Distribution Batch Detection (CDBD), provides a signal correlated to changes in concept without using labelled data. We also show how this signal combined with a trigger and a rebuild policy can maintain classifier accuracy while using a limited amount of labelled data.

    Clustering based active learning for evolving data streams

    Data labeling is an expensive and time-consuming task, so choosing which instances to label is becoming increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few address the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step for selecting the most informative instances for labeling. We consider a batch incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. The clustering approach covers the whole data space and avoids oversampling examples from only a few areas. We compare our method with state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
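
    The cluster-then-select step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not say which clustering method or selection criterion is used, so the plain k-means routine, the `uncertainty` scores, and the "most uncertain instance per cluster" rule are all hypothetical choices.

```python
import math


def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [points[i] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members (if it has any).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def select_for_labeling(points, uncertainty, k, per_cluster=1):
    """Pick the most uncertain instances from each cluster (assumed criterion)."""
    assign = kmeans(points, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: uncertainty[i], reverse=True)
        chosen.extend(members[:per_cluster])
    return sorted(chosen)


# One incoming batch with two well-separated regions of the data space.
batch = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
# Hypothetical classifier uncertainty per instance (higher = less confident).
uncertainty = [0.1, 0.2, 0.9, 0.3, 0.8, 0.4]

chosen = select_for_labeling(batch, uncertainty, k=2)
```

    In this toy batch, picking the two most uncertain instances overall would take indices 2 and 4, both from the same region. Selecting per cluster spreads the label requests across both regions, which is the coverage effect the abstract refers to.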

    Network Sampling: From Static to Streaming Graphs

    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms
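
    As an illustration of how a static sampler can be adapted to a graph stream, the sketch below applies standard reservoir sampling to a stream of edges that is seen only once. It is not the graph-induction family proposed in this paper; the function names and the toy path-graph stream are assumptions made for the example.

```python
import random


def reservoir_edge_sample(edge_stream, k, seed=0):
    """Keep a uniform random sample of k edges from a single pass over a stream."""
    rng = random.Random(seed)
    sample = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            # Fill the reservoir with the first k edges.
            sample.append(edge)
        else:
            # Replace a stored edge with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = edge
    return sample


def induced_nodes(edges):
    """Return the set of nodes touched by the sampled edges."""
    return {node for edge in edges for node in edge}


# Toy stream: the edges of a path graph arriving one at a time.
stream = [(i, i + 1) for i in range(1000)]
sample = reservoir_edge_sample(stream, k=50)
nodes = induced_nodes(sample)
```

    The reservoir needs memory proportional to `k` only, regardless of stream length, which is the property that makes this kind of sampler usable in the streaming model.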

    Detecting Concept Drift With Neural Network Model Uncertainty

    Deployed machine learning models are confronted with the problem of changing data over time, a phenomenon also called concept drift. While existing approaches to concept drift detection already show convincing results, they require true labels as a prerequisite for successful drift detection. In many real-world application scenarios, like the ones covered in this work, true labels are scarce and their acquisition is expensive. Therefore, we introduce a new algorithm for drift detection, Uncertainty Drift Detection (UDD), which is able to detect drifts without access to true labels. Our approach is based on the uncertainty estimates provided by a deep neural network in combination with Monte Carlo Dropout. Structural changes over time are detected by applying the ADWIN technique to the uncertainty estimates, and detected drifts trigger a retraining of the prediction model. In contrast to input data-based drift detection, our approach considers the effects of the current input data on the properties of the prediction model rather than detecting change in the input data only (which can lead to unnecessary retraining). We show that UDD outperforms other state-of-the-art strategies on two synthetic as well as ten real-world data sets for both regression and classification tasks.
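
    The detection step, monitoring a stream of uncertainty estimates and flagging a change, can be sketched as below. This is a simplified stand-in and not ADWIN itself: it compares one fixed reference window with one sliding window using a Hoeffding-style threshold, whereas ADWIN adapts its window size. The class name, window size, `delta`, and the synthetic uncertainty stream are assumptions, and the uncertainty values are assumed to lie in [0, 1].

```python
import math
from collections import deque


class UncertaintyDriftDetector:
    """Two-window mean comparison on uncertainty values (simplified, not ADWIN)."""

    def __init__(self, window=50, delta=0.01):
        self.window = window
        self.reference = []
        self.recent = deque(maxlen=window)
        # Hoeffding-style cut threshold for two equally sized windows,
        # valid for values bounded in [0, 1].
        self.threshold = math.sqrt(math.log(4.0 / delta) / window)

    def update(self, uncertainty):
        """Feed one uncertainty value; return True when drift is flagged."""
        if len(self.reference) < self.window:
            self.reference.append(uncertainty)
            return False
        self.recent.append(uncertainty)
        if len(self.recent) < self.window:
            return False
        gap = abs(sum(self.recent) / self.window
                  - sum(self.reference) / self.window)
        if gap > self.threshold:
            # A flagged drift would trigger retraining; restart the reference
            # window from the most recent values.
            self.reference = list(self.recent)
            self.recent.clear()
            return True
        return False


# Synthetic uncertainty stream: low uncertainty, then a jump at step 150.
stream = [0.1] * 150 + [0.6] * 100
detector = UncertaintyDriftDetector()
detections = [t for t, u in enumerate(stream) if detector.update(u)]
```

    No labels are consulted at any point. The detector reacts only to the model's own uncertainty, which is the property the abstract emphasises.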

    Drift Detection using Uncertainty Distribution Divergence

    Data generated from naturally occurring processes tends to be non-stationary. Examples include seasonal and gradual changes in climate data and sudden changes in financial data. In machine learning, the degradation in classifier performance due to such changes in the data is known as concept drift, and there are many approaches to detecting and handling it. Most approaches to detecting concept drift, however, assume that true classes for test examples will be available at no cost shortly after classification, and base the detection of concept drift on measures relying on these labels. The high labelling cost in many domains provides a strong motivation to reduce the number of labelled instances required to detect and handle concept drift. Triggered detection approaches that do not require labelled instances to detect concept drift show great promise for achieving this. In this paper we present Confidence Distribution Batch Detection (CDBD), an approach that provides a signal correlated to changes in concept without using labelled data. This signal, combined with a trigger and a rebuild policy, can maintain classifier accuracy which, in most cases, matches the accuracy achieved using classification-error-based detection techniques while using only a limited amount of labelled data.
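
    The idea of a label-free drift signal computed from batches of classifier confidence scores can be sketched as follows. The abstract does not specify the divergence measure, the binning, or the trigger, so the smoothed histogram, the Kullback-Leibler divergence, and the threshold value used here are assumptions for illustration only.

```python
import math


def histogram(scores, bins=10, smoothing=1.0):
    """Smoothed, normalised histogram of confidence scores in [0, 1]."""
    counts = [smoothing] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of 1.0 falls in the last bin
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_signal(reference_scores, batch_scores, bins=10):
    """Divergence of the current batch's confidence distribution from the reference."""
    reference = histogram(reference_scores, bins)
    current = histogram(batch_scores, bins)
    return kl_divergence(current, reference)


# Hypothetical confidence scores: a reference batch, a similar batch, and a
# batch where the classifier has become much less confident.
reference = [0.85 + 0.001 * i for i in range(100)]
similar = [0.86 + 0.001 * i for i in range(100)]
drifted = [0.50 + 0.002 * i for i in range(100)]

THRESHOLD = 0.5  # hypothetical trigger level
signal_similar = drift_signal(reference, similar)
signal_drifted = drift_signal(reference, drifted)
rebuild_needed = signal_drifted > THRESHOLD
```

    Only when the signal crosses the trigger would labels be requested to rebuild the classifier, which is how such a scheme limits labelling cost.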

    Integrando enfoques de medición y evaluación con minería de datos y procesamiento de flujos (Integrating measurement and evaluation approaches with data mining and stream processing)

    This line of work addresses the problem of classification models applied to continuous, time-varying, semi-structured data streams (as defined in [1]), using the C-INCAMI conceptual framework for measurement and evaluation (Context - Information Need, Concept model, Attribute, Metric and Indicator [2,3]). The research integrates both approaches in order to generate and support a decision model that adapts on the fly and, in turn, contributes to the decision-making process in different contexts. Track: Software Engineering and Databases. Red de Universidades con Carreras en Informática (RedUNCI)

    Knowledge discovery in data streams

    Knowing what to do with the massive amount of data collected has long been an issue for many organizations. While data mining has been touted as the solution, it has failed to deliver the expected impact despite its successes in many areas. One reason is that data mining algorithms were not designed for the real world, i.e., they usually assume a static view of the data and a stable execution environment where resources are abundant. The reality, however, is that data are constantly changing and the execution environment is dynamic. Hence, it becomes difficult for data mining to truly deliver timely and relevant results. Recently, the processing of stream data has received much attention. What is interesting is that the methodology used to design stream-based algorithms may well be the solution to the above problem. In this entry, we discuss this issue and present an overview of recent works.

    Learning Concept Drift Using Adaptive Training Set Formation Strategy

    We live in a dynamic world, where change is a part of everyday life. When there is a shift in data, classification or prediction models need to adapt to the change. In data mining, the phenomenon of change in data distribution over time is known as concept drift. In this research, we propose an adaptive supervised learning methodology with delayed labeling. As part of this methodology, we introduce an adaptive training set formation algorithm called SFDL, which is based on selective training set formation. Our proposed solution is considered the first systematic training set formation approach that takes the delayed labeling problem into account. It can be used with any base classifier without the need to change the implementation or settings of that classifier. We test our implementation using synthetic and real datasets from various domains, which may have different drift types (sudden, gradual, incremental recurrences) with different speeds of change. The experimental results confirm an improvement in classification accuracy compared to an ordinary classifier for all drift types. Our approach increases classification accuracy by 20% on average and by 56% in the best cases of our experiments, and it was not worse than the ordinary classifiers in any case. Finally, we perform a comparison study with four other related methods that deal with changes in user interest over time and handle recurrent drift. The results indicate the effectiveness of the proposed method over the other methods in terms of classification accuracy.