18 research outputs found
Data stream treatment using sliding windows with MapReduce
Knowledge Discovery in Databases (KDD) techniques face limitations when the volume of data to process is very large: a KDD algorithm typically needs several iterations over the complete data set to carry out its work. For continuous data stream processing, it is therefore necessary to store part of the stream in a temporal window.
In this paper, we present a technique that adjusts the size of the temporal window dynamically, based on the frequency of data arrival and the response time of the KDD task. The results show that this technique achieves a large window size, such that each example of the stream is used in more than one iteration of the KDD task.
Facultad de Informática
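The idea of sizing the window from the arrival rate and the task's response time can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the doubling factor, and the adaptation rule are assumptions made here for clarity.

```python
from collections import deque

class DynamicWindow:
    """Sliding window whose capacity adapts to the stream arrival rate
    and the measured processing time of the KDD task.
    Hypothetical sketch; names and the sizing rule are illustrative."""

    def __init__(self, initial_size=1000):
        self.capacity = initial_size
        self.buffer = deque(maxlen=initial_size)

    def append(self, example):
        # Oldest examples are discarded automatically once full.
        self.buffer.append(example)

    def adapt(self, arrival_rate, task_seconds):
        # Size the window to hold roughly twice the number of examples
        # that arrive during one run of the KDD task, so each example
        # can be seen in more than one iteration before it is evicted.
        target = max(1, int(arrival_rate * task_seconds * 2))
        if target != self.capacity:
            self.capacity = target
            self.buffer = deque(self.buffer, maxlen=target)
```

A caller would measure `task_seconds` around each KDD run and the observed `arrival_rate` between runs, then call `adapt` before the next iteration.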
Clustering based active learning for evolving data streams
Data labeling is an expensive and time-consuming task, so choosing which instances to label is increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few exist in the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step that selects the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster its examples and then select the best instances to train the learner. The clustering step covers the whole data space, avoiding oversampling examples from only a few areas. We compare our method against state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
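The cluster-then-select step can be illustrated with a small sketch: cluster the batch, then query one label per cluster (here, the point closest to its centroid, so the queries cover the data space). This is an assumption-laden sketch, not the paper's method: the clustering algorithm (a tiny k-means with farthest-point seeding) and the per-cluster selection criterion are stand-ins chosen here for brevity.

```python
import numpy as np

def select_queries(batch, k=3):
    """Pre-cluster the incoming batch, then pick one query per cluster:
    the instance closest to its centroid. Illustrative sketch only;
    the paper's informativeness criterion may differ."""
    X = np.asarray(batch, dtype=float)
    # Farthest-point seeding keeps the initial centroids spread out.
    centroids = [X[0]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(10):  # a few Lloyd iterations
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    # One representative per cluster: the member nearest its centroid.
    return [int(np.where(labels == j)[0][dists[labels == j, j].argmin()])
            for j in range(k) if np.any(labels == j)]
```

The returned indices are the instances whose labels would be requested from the oracle before incrementally training the learner on them.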
Efficient Active Novel Class Detection for Data Stream Classification
One substantial aspect of data stream classification is the possible appearance of novel, unseen classes, which must be identified in order to avoid confusion with existing classes. Detecting such new classes is omitted by most existing techniques and rarely addressed in the literature. We address this issue and propose an efficient method to identify novel class emergence in a multi-class data stream. The proposed method incrementally maintains a covered feature space of the existing (known) classes. An incoming data point is designated as an "insider" or an "outsider" depending on whether it lies inside or outside the covered space. An insider represents a possible instance of an existing class, while an outsider may be an instance of a novel class. The proposed method iteratively selects those insiders (resp. outsiders) that are more likely to be members of a novel (resp. an existing) class, and eventually distinguishes the actual novel and existing classes accurately. We show how to actively query the labels of the identified novel class instances that are most uncertain. The method also allows us to balance the rapidity of novelty detection against its efficiency. Experiments using real-world data demonstrate the effectiveness of our approach for both novel class detection and classification accuracy.
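A much-simplified version of the insider/outsider test can be sketched by modelling each known class's covered region as a ball around an incrementally updated centroid. This is a toy approximation for illustration only: the paper's covered feature space is richer than one ball per class, and all names here are hypothetical.

```python
import math

class CoveredSpace:
    """Per known class, keeps a centroid and a radius covering the
    examples seen so far; an incoming point is an 'insider' if it falls
    inside some class's ball, otherwise an 'outsider' (potential novel
    class). Simplified sketch, not the paper's actual model."""

    def __init__(self):
        self.classes = {}  # label -> (centroid, radius, count)

    def update(self, point, label):
        if label not in self.classes:
            self.classes[label] = (list(point), 0.0, 1)
            return
        c, r, n = self.classes[label]
        n += 1
        # Incremental mean of the class's examples.
        c = [ci + (pi - ci) / n for ci, pi in zip(c, point)]
        # Grow the radius so the new example stays covered.
        r = max(r, math.dist(c, point))
        self.classes[label] = (c, r, n)

    def is_insider(self, point):
        return any(math.dist(c, point) <= r
                   for c, r, _ in self.classes.values())
```

Outsiders flagged by `is_insider` returning `False` would then be the candidates for active label queries and novel-class confirmation.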
Clustering de un flujo de datos usando MapReduce
Clustering techniques over a data stream are a powerful tool for determining the characteristics that the data arriving on the stream have in common. To obtain good results, it is necessary to store a large part of the stream in a temporal window. In this article we evaluate a technique that manages the size of the temporal window dynamically, using a clustering algorithm implemented on the MapReduce framework.
The results obtained show that this technique achieves a large window, so that each item of the stream is used in more than one iteration of the clustering algorithm, producing similar results regardless of the arrival speed of the stream. The centroids obtained for each data stream are similar to those obtained by clustering the complete data set.
XIII Workshop Bases de Datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI)
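One clustering iteration over the window, phrased in MapReduce style, might look like the sketch below: the map phase emits each example keyed by its nearest centroid, and the reduce phase averages each group into a new centroid. This is a local, single-process illustration under assumed names; an actual deployment would run the two phases on a MapReduce framework.

```python
from collections import defaultdict

def kmeans_iteration(window, centroids):
    """One k-means iteration over the temporal window, expressed as a
    map phase followed by a reduce phase. Illustrative sketch only."""
    # Map phase: emit each example under the key of its nearest centroid.
    groups = defaultdict(list)
    for point in window:
        sq_dists = [sum((p - c) ** 2 for p, c in zip(point, cen))
                    for cen in centroids]
        groups[sq_dists.index(min(sq_dists))].append(point)
    # Reduce phase: average each key's group into a new centroid.
    new_centroids = list(centroids)
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return new_centroids
```

Repeating this iteration while the dynamic window keeps each example resident for several runs is what lets the stream's centroids approach those of clustering the full data set.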
A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams
Unlabelled data appear in many domains and are particularly relevant to
streaming applications, where even though data is abundant, labelled data is
rare. To address the learning problems associated with such data, one can
ignore the unlabelled data and focus only on the labelled data (supervised
learning); use the labelled data and attempt to leverage the unlabelled data
(semi-supervised learning); or assume some labels will be available on request
(active learning). The first approach is the simplest, yet the amount of
labelled data available will limit the predictive performance. The second
relies on finding and exploiting the underlying characteristics of the data
distribution. The third depends on an external agent to provide the required
labels in a timely fashion. This survey pays special attention to methods that
leverage unlabelled data in a semi-supervised setting. We also discuss the
delayed labelling issue, which impacts both fully supervised and
semi-supervised methods. We propose a unified problem setting, discuss the
learning guarantees and existing methods, and explain the differences between
related problem settings. Finally, we review the current benchmarking practices
and propose adaptations to enhance them.
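The delayed-labelling issue the survey discusses amounts to pairing each prediction with its ground truth when the label arrives some time later. A minimal sketch of such bookkeeping, with hypothetical names and a simple oldest-first eviction policy assumed here, could be:

```python
from collections import deque

class DelayedLabelBuffer:
    """Holds predictions until their ground-truth labels arrive later
    on the stream, then releases (prediction, label) pairs for
    evaluation or training. Illustrative sketch only."""

    def __init__(self, max_pending=10000):
        self.pending = {}       # instance_id -> prediction
        self.order = deque()    # arrival order, for eviction
        self.max_pending = max_pending

    def predict_event(self, instance_id, prediction):
        self.pending[instance_id] = prediction
        self.order.append(instance_id)
        # Evict the oldest unresolved predictions when over capacity.
        while len(self.pending) > self.max_pending:
            old = self.order.popleft()
            self.pending.pop(old, None)

    def label_event(self, instance_id, true_label):
        """Return (prediction, true_label) for the matching prediction,
        or None if it was never seen or already evicted."""
        pred = self.pending.pop(instance_id, None)
        return None if pred is None else (pred, true_label)
```

Both fully supervised and semi-supervised stream learners need some such mechanism, since the model has usually moved on by the time a delayed label becomes available.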