18 research outputs found
Data stream treatment using sliding windows with MapReduce
Knowledge Discovery in Databases (KDD) techniques face limitations when the volume of data to process is very large: a KDD algorithm typically needs several iterations over the complete data set to carry out its work. For continuous data stream processing, it is therefore necessary to store part of the stream in a temporal window.
In this paper, we present a technique that adjusts the size of the temporal window dynamically, based on the frequency of data arrival and the response time of the KDD task. The results show that this technique achieves a large window size, such that each example of the stream is used in more than one iteration of the KDD task.
Facultad de Informática
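The idea of sizing the window from the arrival rate and the task's response time can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the doubling factor, and the adaptation rule are assumptions made here for clarity.

```python
from collections import deque

class DynamicWindow:
    """Sliding window whose capacity adapts to the stream arrival rate
    and the measured processing time of the KDD task.
    Hypothetical sketch; names and the sizing rule are illustrative."""

    def __init__(self, initial_size=1000):
        self.capacity = initial_size
        self.buffer = deque(maxlen=initial_size)

    def append(self, example):
        # Oldest examples are discarded automatically once full.
        self.buffer.append(example)

    def adapt(self, arrival_rate, task_seconds):
        # Size the window to hold roughly twice the number of examples
        # that arrive during one run of the KDD task, so each example
        # can be seen in more than one iteration before it is evicted.
        target = max(1, int(arrival_rate * task_seconds * 2))
        if target != self.capacity:
            self.capacity = target
            self.buffer = deque(self.buffer, maxlen=target)
```

A caller would measure `task_seconds` around each KDD run and the observed `arrival_rate` between runs, then call `adapt` before the next iteration.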
Clustering based active learning for evolving data streams
Data labeling is an expensive and time-consuming task, so choosing which instances to label is increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few exist in the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step that selects the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster its examples and then select the best instances to train the learner. The clustering step covers the whole data space, avoiding oversampling examples from only a few areas. We compare our method against state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
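The cluster-then-select step can be illustrated with a small sketch: cluster the batch, then query one label per cluster (here, the point closest to its centroid, so the queries cover the data space). This is an assumption-laden sketch, not the paper's method: the clustering algorithm (a tiny k-means with farthest-point seeding) and the per-cluster selection criterion are stand-ins chosen here for brevity.

```python
import numpy as np

def select_queries(batch, k=3):
    """Pre-cluster the incoming batch, then pick one query per cluster:
    the instance closest to its centroid. Illustrative sketch only;
    the paper's informativeness criterion may differ."""
    X = np.asarray(batch, dtype=float)
    # Farthest-point seeding keeps the initial centroids spread out.
    centroids = [X[0]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(10):  # a few Lloyd iterations
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    # One representative per cluster: the member nearest its centroid.
    return [int(np.where(labels == j)[0][dists[labels == j, j].argmin()])
            for j in range(k) if np.any(labels == j)]
```

The returned indices are the instances whose labels would be requested from the oracle before incrementally training the learner on them.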
Efficient Active Novel Class Detection for Data Stream Classification
One substantial aspect of data stream classification is the possible appearance of novel, unseen classes, which must be identified in order to avoid confusion with existing classes. Detecting such new classes is omitted by most existing techniques and rarely addressed in the literature. We address this issue and propose an efficient method to identify novel class emergence in a multi-class data stream. The proposed method incrementally maintains a covered feature space of the existing (known) classes. An incoming data point is designated as an "insider" or an "outsider" depending on whether it lies inside or outside the covered space. An insider represents a possible instance of an existing class, while an outsider may be an instance of a novel class. The proposed method iteratively selects those insiders (resp. outsiders) that are more likely to be members of a novel (resp. an existing) class, and eventually distinguishes the actual novel and existing classes accurately. We show how to actively query the labels of the identified novel class instances that are most uncertain. The method also allows us to balance the rapidity of novelty detection against its efficiency. Experiments using real-world data demonstrate the effectiveness of our approach for both novel class detection and classification accuracy.
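A much-simplified version of the insider/outsider test can be sketched by modelling each known class's covered region as a ball around an incrementally updated centroid. This is a toy approximation for illustration only: the paper's covered feature space is richer than one ball per class, and all names here are hypothetical.

```python
import math

class CoveredSpace:
    """Per known class, keeps a centroid and a radius covering the
    examples seen so far; an incoming point is an 'insider' if it falls
    inside some class's ball, otherwise an 'outsider' (potential novel
    class). Simplified sketch, not the paper's actual model."""

    def __init__(self):
        self.classes = {}  # label -> (centroid, radius, count)

    def update(self, point, label):
        if label not in self.classes:
            self.classes[label] = (list(point), 0.0, 1)
            return
        c, r, n = self.classes[label]
        n += 1
        # Incremental mean of the class's examples.
        c = [ci + (pi - ci) / n for ci, pi in zip(c, point)]
        # Grow the radius so the new example stays covered.
        r = max(r, math.dist(c, point))
        self.classes[label] = (c, r, n)

    def is_insider(self, point):
        return any(math.dist(c, point) <= r
                   for c, r, _ in self.classes.values())
```

Outsiders flagged by `is_insider` returning `False` would then be the candidates for active label queries and novel-class confirmation.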
Clustering de un flujo de datos usando MapReduce
Clustering techniques over a data stream are a powerful tool for determining the characteristics that the data arriving on the stream have in common. To obtain good results, it is necessary to store a large part of the stream in a temporal window. In this article we evaluate a technique that manages the size of the temporal window dynamically, using a clustering algorithm implemented on the MapReduce framework.
The results obtained show that this technique achieves a large window, so that each item of the stream is used in more than one iteration of the clustering algorithm, producing similar results regardless of the arrival speed of the stream. The centroids obtained for each data stream are similar to those obtained by clustering the complete data set.
XIII Workshop Bases de Datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI)
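One clustering iteration over the window, phrased in MapReduce style, might look like the sketch below: the map phase emits each example keyed by its nearest centroid, and the reduce phase averages each group into a new centroid. This is a local, single-process illustration under assumed names; an actual deployment would run the two phases on a MapReduce framework.

```python
from collections import defaultdict

def kmeans_iteration(window, centroids):
    """One k-means iteration over the temporal window, expressed as a
    map phase followed by a reduce phase. Illustrative sketch only."""
    # Map phase: emit each example under the key of its nearest centroid.
    groups = defaultdict(list)
    for point in window:
        sq_dists = [sum((p - c) ** 2 for p, c in zip(point, cen))
                    for cen in centroids]
        groups[sq_dists.index(min(sq_dists))].append(point)
    # Reduce phase: average each key's group into a new centroid.
    new_centroids = list(centroids)
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return new_centroids
```

Repeating this iteration while the dynamic window keeps each example resident for several runs is what lets the stream's centroids approach those of clustering the full data set.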
A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams
Unlabelled data appear in many domains and are particularly relevant to
streaming applications, where even though data is abundant, labelled data is
rare. To address the learning problems associated with such data, one can
ignore the unlabelled data and focus only on the labelled data (supervised
learning); use the labelled data and attempt to leverage the unlabelled data
(semi-supervised learning); or assume some labels will be available on request
(active learning). The first approach is the simplest, yet the amount of
labelled data available will limit the predictive performance. The second
relies on finding and exploiting the underlying characteristics of the data
distribution. The third depends on an external agent to provide the required
labels in a timely fashion. This survey pays special attention to methods that
leverage unlabelled data in a semi-supervised setting. We also discuss the
delayed labelling issue, which impacts both fully supervised and
semi-supervised methods. We propose a unified problem setting, discuss the
learning guarantees and existing methods, and explain the differences between
related problem settings. Finally, we review the current benchmarking practices
and propose adaptations to enhance them.
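The delayed-labelling issue the survey discusses amounts to pairing each prediction with its ground truth when the label arrives some time later. A minimal sketch of such bookkeeping, with hypothetical names and a simple oldest-first eviction policy assumed here, could be:

```python
from collections import deque

class DelayedLabelBuffer:
    """Holds predictions until their ground-truth labels arrive later
    on the stream, then releases (prediction, label) pairs for
    evaluation or training. Illustrative sketch only."""

    def __init__(self, max_pending=10000):
        self.pending = {}       # instance_id -> prediction
        self.order = deque()    # arrival order, for eviction
        self.max_pending = max_pending

    def predict_event(self, instance_id, prediction):
        self.pending[instance_id] = prediction
        self.order.append(instance_id)
        # Evict the oldest unresolved predictions when over capacity.
        while len(self.pending) > self.max_pending:
            old = self.order.popleft()
            self.pending.pop(old, None)

    def label_event(self, instance_id, true_label):
        """Return (prediction, true_label) for the matching prediction,
        or None if it was never seen or already evicted."""
        pred = self.pending.pop(instance_id, None)
        return None if pred is None else (pred, true_label)
```

Both fully supervised and semi-supervised stream learners need some such mechanism, since the model has usually moved on by the time a delayed label becomes available.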