43 research outputs found
Genetic graph-based in clustering applied to static and streaming data analysis
Unpublished thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: December 2014.
Unsupervised Learning techniques have been widely used in Data Mining over the
last few years. These techniques try to identify patterns in a dataset blindly. Clustering
is one of the most promising fields in Unsupervised Learning. It consists of
grouping the data by similarity. This field has generated several research works
which have tried to deal with different problems related to the pattern extraction
and data grouping processes. One of the most innovative clustering methodologies
is shape-based or continuity-based clustering, which tries to group data according to
the form they define in the space.
This dissertation focuses on how to apply Genetic Algorithms to continuity-based
clustering problems. Genetic Algorithms have traditionally been used in optimization
problems. They are characterized by an encoding, which represents the solution
space; a population, or set of chromosomes, which are the potential solutions; and some
genetic operations, which are used to evolve the solutions in order to find the best
chromosome or solution. The main idea is to take advantage of their potential, generating
new algorithms which can improve the performance of classical clustering
algorithms, and apply them to static and streaming data.
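The three ingredients named above (encoding, population, genetic operations) can be sketched as follows. This is a minimal illustration, not the dissertation's algorithm: the bit-string encoding, the truncation selection, the parameter values and the toy fitness (counting ones) are all assumptions chosen for brevity.

```python
import random

def evolve(fitness, n_genes, pop_size=30, generations=60, p_mut=0.02, seed=0):
    """Minimal generational GA over bit-string chromosomes (illustrative only)."""
    rng = random.Random(seed)
    # Encoding: a chromosome is a list of bits; the population is a list of chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population unchanged (elitism).
        elite = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_genes)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        population = elite + children
    return max(population, key=fitness)

# Toy fitness (assumption): maximize the number of ones in the chromosome.
best = evolve(fitness=sum, n_genes=20)
print(best, sum(best))
```

In a clustering setting the chromosome would instead encode a candidate partition or graph cut, and the fitness would score its quality.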
In order to design these algorithms, this dissertation has been based on the Spectral
Clustering algorithm. This algorithm studies the spectrum of a Similarity Graph
in order to define the clusters. The clusters defined by Spectral Clustering usually
respect the data continuity. Using this idea as a starting point, different graph-based
genetic algorithms have been designed to deal with the continuity-based clustering
problem. The different algorithms developed have been divided into three generations:
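As a rough illustration of how the spectrum of a Similarity Graph can define clusters, the sketch below builds a dense Gaussian similarity matrix and splits the data in two by the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the graph Laplacian L = D - W). It is a simplified two-way cut under stated assumptions: the Gaussian kernel, the `sigma` value, the power-iteration solver and the sample points are illustrative choices, and the usual multi-cluster variant (several eigenvectors followed by K-means) is not shown.

```python
import math

def similarity_graph(points, sigma=1.0):
    """Dense Gaussian similarity matrix W with a zero diagonal (assumed kernel)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2 * sigma ** 2))
    return w

def fiedler_vector(w, iterations=2000):
    """Approximate the eigenvector of the second-smallest eigenvalue of L = D - W."""
    n = len(w)
    degree = [sum(row) for row in w]
    shift = 2 * max(degree)  # upper bound on the eigenvalues of L (Gershgorin)
    v = [math.sin(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        # Power iteration on (shift * I - L): u = shift * v - D v + W v
        u = [(shift - degree[i]) * v[i] + sum(w[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u))
        v = [x / norm for x in u]
    return v

def two_way_cut(points, sigma=1.0):
    """Label each point 0 or 1 according to the sign of its Fiedler component."""
    v = fiedler_vector(similarity_graph(points, sigma))
    return [0 if x < 0 else 1 for x in v]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = two_way_cut(points)
print(labels)
```

Note that the similarity matrix here is a full n x n structure, which is the memory cost that becomes a concern for large datasets.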
The first generation is based on genetic graph-based clustering algorithms. In
this generation we combined graph-based clustering and genetic algorithms to
generate a graph topology among the data, in order to find the best way to cut
the graph. This cutting process is used to discriminate the final clusters. The
main idea is to use hybrid algorithms which combine different metrics extracted
from graph theory. In order to evaluate the performance on real-world problems,
these algorithms have also been applied to text summarization.
The second generation is based on multi-objective genetic graph-based clustering
algorithms. This generation introduces the Pareto Front generated by the
different fitness functions used in the genetic search. The Pareto Front is used
to study the solution space and provides more robust and accurate solutions.
During this generation we also used co-evolutionary algorithms to include the
number of clusters in the search space.
Finally, the last generation is focused on large and streaming data analysis.
During this generation the previous algorithms have been adapted to deal with
large data, combining different methodologies such as online clustering and
MapReduce. The main idea is to study their performance compared with other
algorithms.
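The Pareto Front mentioned for the second generation is the set of candidate solutions that no other candidate beats on every fitness function at once. A minimal sketch, assuming two objectives that are both maximized and using made-up score pairs purely for illustration:

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Return the non-dominated objective vectors, preserving input order."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]

# Hypothetical (fitness_1, fitness_2) pairs for five candidate clusterings.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = pareto_front(candidates)
print(front)  # [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9)]
```

A multi-objective genetic search keeps such a front instead of a single best chromosome, so the final solution can be picked from several trade-offs.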
The dissertation also includes a description of other graph-based bio-inspired algorithms,
in this case Ant Colony Optimization Clustering algorithms, which have
been designed during the dissertation, in order to extend the range of study to other
bio-inspired areas.
Finally, with the purpose of evaluating the algorithms of the different generations,
we have compared them with relevant and well-known clustering algorithms using
synthetic and real-world datasets extracted from the literature and the UCI Machine
Learning Repository.
A genetic approach to the graph and spectral clustering problem
Master's thesis in Computer and Telecommunications Engineering
A multi-objective genetic graph-based clustering algorithm with memory optimization
H. D. Menéndez, D. F. Barrero, and D. Camacho, "A multi-objective genetic graph-based clustering algorithm with memory optimization", in 2013 IEEE Congress on Evolutionary Computation (CEC), 2013, pp. 3174-3181.
Clustering is one of the most versatile tools for data analysis. Over the last few years, clustering that seeks the continuity of data (as opposed to classical centroid-based approaches) has attracted increasing research interest. It is a challenging problem of remarkable practical interest. The most popular continuity clustering method is the Spectral Clustering algorithm, which is based on graph cut: it initially generates a Similarity Graph using a distance measure and then uses its Graph Spectrum to find the best cut. Memory consumption is a serious limitation of that algorithm: the Similarity Graph representation usually requires a very large matrix with a high memory cost. This work proposes a new algorithm, based on a previous implementation named Genetic Graph-based Clustering (GGC), that improves memory usage while maintaining the quality of the solution. The new algorithm, called Multi-Objective Genetic Graph-based Clustering (MOGGC), uses an evolutionary approach, introducing a Multi-Objective Genetic Algorithm to manage a reduced version of the Similarity Graph. The experimental validation shows that MOGGC increases memory efficiency, maintaining and improving the GGC results on the synthetic and real datasets used in the experiments. An experimental comparison with several classical clustering methods (EM, SC and K-means) has been included to show the efficiency of the proposed algorithm.
This work has been partly supported by the Spanish Ministry of Science and Education under project TIN2010-19872.
Adaptive K-means algorithm for overlapped graph clustering
Electronic version of an article published as International Journal of Neural Systems 2, 5, 2012, DOI: 10.1142/S0129065712500189 © 2012 copyright World Scientific Publishing Company.
The graph clustering problem has become highly relevant due to the growing interest of several research communities in social networks and their possible applications. Overlapped graph clustering algorithms try to find subsets of nodes that can belong to different clusters. In social network-based applications it is quite usual for a node of the network to belong to different groups, or communities, in the graph. Therefore, algorithms that try to discover or analyze the behavior of these networks need to handle this feature, detecting and identifying the overlapped nodes. This paper presents a soft clustering approach based on a genetic algorithm, where a new encoding is designed to achieve two main goals: first, the automatic adaptation of the number of communities that can be detected, and second, the definition of several fitness functions that guide the search process using measures extracted from graph theory. Finally, our approach has been experimentally tested using the Eurovision contest dataset, a well-known social-based data network, to show how overlapped communities can be found using our method.
This work has been partly supported by the Spanish Ministry of Science and Education under project TIN2010-19872 and the grant BES-2011-049875 from the same Ministry.
Hashing fuzzing: introducing input diversity to improve crash detection
The utility of a test set of program inputs is strongly influenced by its diversity and its size. Syntax coverage has become a standard proxy for diversity. Although more sophisticated measures exist, such as proximity of a sample to a uniform distribution, methods to use them tend to be type dependent. We use r-wise hash functions to create a novel, semantics-preserving testability transformation for C programs that we call HashFuzz. Use of HashFuzz improves the diversity of test sets produced by instrumentation-based fuzzers. We evaluate the effect of the HashFuzz transformation on eight programs from the Google Fuzzer Test Suite using four state-of-the-art fuzzers that have been widely used in previous research. We demonstrate pronounced improvements in the performance of the test sets for the transformed programs across all the fuzzers that we used. These include strong improvements in diversity in every case, maintenance or small improvement in branch coverage (up to 4.8% improvement in the best case), and significant improvement in unique crash detection numbers (increases of between 28% and 97% compared to test sets for untransformed programs).
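For readers unfamiliar with r-wise hash functions, the sketch below shows the textbook construction of an r-wise independent family: a random polynomial of degree r - 1 evaluated modulo a prime. It is only an illustration of the concept, written in Python for brevity. It is not the HashFuzz transformation, which operates on C programs, and the prime, the bucket reduction and the function name `make_rwise_hash` are assumptions made for this example.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the field size (assumed choice)

def make_rwise_hash(r, buckets, seed=None):
    """Draw one function from a polynomial hash family with r random coefficients."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(PRIME) for _ in range(r)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner's rule: evaluates a degree r-1 polynomial mod PRIME
            acc = (acc * x + c) % PRIME
        return acc % buckets  # final reduction to a bucket index

    return h

h = make_rwise_hash(r=3, buckets=16, seed=1)
print([h(x) for x in range(8)])
```

Before the final reduction, the values of such a polynomial on any r distinct inputs below the prime are independent and uniform over the field; reducing to a small number of buckets keeps this only approximately.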