17 research outputs found

SOTXTSTREAM: Density-based self-organizing clustering of text streams

Authors: Bryant Avory C.; Cios Krzysztof J.
Publication venue: VCU Scholars Compass
Publication date: 01/01/2017

A streaming data clustering algorithm is presented that builds upon the density-based self-organizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets.
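The idea of a local, nearest neighbor-based density determination can be illustrated with a minimal sketch. This is not the SOSTREAM or SOTXTSTREAM implementation; the function names, the parameter `k`, and the rule "absorb a point if it lies within the winner's k-th nearest neighbor distance" are assumptions chosen for illustration only.

```python
import math

def kth_neighbor_radius(centroids, i, k):
    """Distance from centroid i to its k-th nearest other centroid (assumed local density proxy)."""
    dists = sorted(
        math.dist(centroids[i], c) for j, c in enumerate(centroids) if j != i
    )
    return dists[k - 1]

def assign(point, centroids, k=2):
    """Return the index of the micro-cluster that absorbs `point`, or None if a new one is needed."""
    if len(centroids) <= k:
        return None  # too few micro-clusters to estimate a local radius
    # The "winner" is the micro-cluster whose centroid is closest to the point.
    winner = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    # The acceptance radius is local: it depends only on the winner's neighborhood.
    if math.dist(point, centroids[winner]) <= kth_neighbor_radius(centroids, winner, k):
        return winner
    return None

# A dense group near the origin and a sparse group near (10, 10).
centroids = [(0, 0), (1, 0), (0, 1), (10, 10), (14, 10), (10, 14)]
dense_hit = assign((0.5, 0.5), centroids)
sparse_hit = assign((12, 12), centroids)
dense_miss = assign((2.5, 0.0), centroids)
```

Because the radius is computed per micro-cluster, a point about 2.8 units from a sparse centroid is absorbed, while a point 1.5 units from a dense centroid is not; a single global threshold could not treat both groups this way.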

Crossref

Directory of Open Access Journals

VCU Scholars Compass

Data stream treatment using sliding windows with MapReduce

Authors: Basgall María José; Hasperué Waldo; Naiouf Marcelo
Publication date: 01/11/2016

Knowledge Discovery in Databases (KDD) techniques present limitations when the volume of data to process is very large. Any KDD algorithm needs to make several iterations over the complete dataset in order to carry out its work. For continuous data stream processing, part of the stream must be stored in a temporal window. In this paper, we present a technique that adjusts the size of the temporal window dynamically, based on the frequency of data arrival and the response time of the KDD task. The results show that this technique reaches a large window size, so that each example of the stream is used in more than one iteration of the KDD task. Facultad de Informátic
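One way to picture a window whose size follows the arrival frequency and the task's response time is sketched below. The abstract does not give the sizing rule, so the formula (items arriving during one task run, multiplied by a hypothetical `reuse` factor) and all names here are assumptions, not the authors' technique.

```python
import math
from collections import deque

def next_window_size(arrival_rate, response_time, reuse=2):
    """Assumed rule: hold `reuse` times the items that arrive during one task run.

    arrival_rate  -- items per second observed on the stream
    response_time -- seconds the last KDD task took to finish
    reuse         -- hypothetical factor: how many task runs each item should see
    """
    arrived_during_task = arrival_rate * response_time
    return max(1, math.ceil(reuse * arrived_during_task))

class SlidingWindow:
    """Fixed-capacity window that drops the oldest items and can be resized."""

    def __init__(self, size):
        self.items = deque(maxlen=size)

    def push(self, item):
        self.items.append(item)  # the oldest item is evicted once the window is full

    def resize(self, size):
        # Rebuilding the deque keeps the most recent `size` items.
        self.items = deque(self.items, maxlen=size)

window = SlidingWindow(3)
for x in range(5):
    window.push(x)
window.resize(next_window_size(arrival_rate=100, response_time=0.05, reuse=1))
```

With `reuse` greater than 1, the window spans more items than arrive during one task run, which is one simple way to make each item take part in more than one iteration.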

Data stream treatment using sliding windows with MapReduce

Authors: Basgall María José; Hasperué Waldo; Naiouf Marcelo
Publication date: 05/12/2016

Exploratory analysis of textual data streams

Authors: A. Ferrara; S. Castano; S. Montanelli
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017

In this paper, we address exploratory analysis of textual data streams and we propose a bootstrapping process based on a combination of keyword similarity and clustering techniques to: (i) classify documents into fine-grained similarity clusters, based on keyword commonalities; (ii) aggregate similar clusters into larger document collections sharing a richer, more user-prominent keyword set that we call a topic; (iii) assimilate newly extracted topics of the current bootstrapping cycle with existing topics resulting from previous bootstrapping cycles, by linking similar topics of different time periods, if any, to highlight topic trends and evolution. An analysis framework is also defined, enabling topic-based exploration of the underlying textual data stream according to a thematic perspective and a temporal perspective. The bootstrapping process is evaluated on a real data stream of about 330,000 newspaper articles about politics published by the New York Times from Jan 1st 1900 to Dec 31st 2015.
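Step (i), grouping documents by keyword commonalities, can be sketched as follows. The abstract does not name the similarity measure or the clustering procedure, so Jaccard similarity, the greedy assignment, the `threshold` value, and the sample documents are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets: shared keywords over all keywords."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_keywords(docs, threshold=0.5):
    """Greedily place each document in its most similar cluster, or open a new one."""
    clusters = []  # each cluster: {"keywords": set, "docs": [doc ids]}
    for doc_id, keywords in docs.items():
        keywords = set(keywords)
        best = max(
            clusters, key=lambda c: jaccard(keywords, c["keywords"]), default=None
        )
        if best is not None and jaccard(keywords, best["keywords"]) >= threshold:
            best["docs"].append(doc_id)
            best["keywords"] |= keywords  # the cluster's keyword set grows with its members
        else:
            clusters.append({"keywords": keywords, "docs": [doc_id]})
    return clusters

# Hypothetical documents, each reduced to a keyword list.
docs = {
    "d1": ["election", "senate", "vote"],
    "d2": ["election", "senate", "vote", "poll"],
    "d3": ["budget", "tax"],
}
clusters = cluster_by_keywords(docs)
groups = [c["docs"] for c in clusters]
```

Here `d1` and `d2` share three of four keywords and fall into one cluster, while `d3` shares none and starts its own.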

AIR Universita degli studi di Milano

Clustering de un flujo de datos usando MapReduce [Clustering a data stream using MapReduce]

Authors: Basgall María José; Estrebou César Armando; Hasperué Waldo; Naiouf Marcelo
Publication date: 01/10/2016

Clustering techniques for data streams are a powerful tool for determining the characteristics that the data arriving from the stream have in common. To obtain good results, a large part of the stream must be stored in a temporal window. In this article we measure a technique that manages the size of the temporal window dynamically, using a clustering algorithm implemented in the MapReduce framework. The results show that this technique reaches a large window size, so that each item of the stream is used in more than one iteration of the clustering algorithm, which yields similar results regardless of the speed of the stream data. The centroids obtained from each data stream are similar to those obtained by clustering the complete dataset. XIII Workshop Bases de datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI
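A centroid-based clustering step expressed in map and reduce phases might look like the sketch below. The abstract does not name the clustering algorithm, so the use of a k-means style update is an assumption, and the code runs in plain Python rather than on an actual MapReduce framework; all function names are hypothetical.

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point in the window."""
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        yield idx, p

def reduce_phase(pairs, centroids):
    """Reduce: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = list(centroids)  # centroids that received no points stay unchanged
    for idx, pts in groups.items():
        updated[idx] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return updated

def kmeans_step(points, centroids):
    """One iteration over the current window contents."""
    return reduce_phase(map_phase(points, centroids), centroids)

window = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = kmeans_step(window, [(1, 1), (9, 9)])
```

Running `kmeans_step` repeatedly on successive window contents is one way each item could contribute to more than one iteration while it remains in the window.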

Servicio de Difusión de la Creación Intelectual

Clustering de un flujo de datos usando MapReduce [Clustering a data stream using MapReduce]

Authors: Basgall María José; Estrebou César Armando; Hasperué Waldo; Naiouf Marcelo
Publication date: 16/11/2016

Clustering de un flujo de datos usando MapReduce [Clustering a data stream using MapReduce]

Authors: Basgall María José; Estrebou César Armando; Hasperué Waldo; Naiouf Marcelo
Publication date: 01/10/2016