An approach to validity indices for clustering techniques in Big Data

García Gutiérrez, Jorge; Luna Romera, José María; Martínez Ballesteros, María del Mar; Riquelme Santos, José Cristóbal

Publication date: 1 January 2018
Publisher: Springer Science and Business Media LLC

Abstract

Clustering analysis is one of the most widely used Machine Learning techniques for discovering groups among data objects. Some clustering methods require the number of clusters into which the data will be partitioned. Several cluster validity indices exist that help approximate the optimal number of clusters for a dataset. However, such indices are not suitable for Big Data because of their size limitations and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low computational time. Our indices are redefinitions of traditional indices that simplify the intra-cluster distance calculation. Two types of tests have been carried out over 28 synthetic datasets to analyze the performance of the proposed indices. First, we test the indices on small and medium-sized datasets to verify that they are similarly effective to the traditional ones. Subsequently, tests on datasets of up to 11 million records and 20 features have been executed to check their efficiency. The results show that, using the Apache Spark framework, both indices can handle Big Data in very low computational time with an effectiveness similar to that of the traditional indices.

Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
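To make the idea of simplifying the intra-cluster distance calculation concrete, here is a minimal Python sketch of one well-known simplification: a silhouette-style score in which every point is compared against cluster centroids instead of against every other point. This is an illustrative assumption, not the definition of the two indices proposed in the paper, which the abstract does not spell out. The function names `centroids_by_cluster` and `centroid_silhouette` and the toy data are hypothetical.

```python
from collections import defaultdict
from math import dist


def centroids_by_cluster(points, labels):
    """Return {label: centroid}, where each centroid is the per-feature mean."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in groups.items()
    }


def centroid_silhouette(points, labels):
    """Mean of (b - a) / max(a, b), with a and b measured against centroids.

    Assumption: a is the distance to the point's own centroid and b is the
    distance to the nearest other centroid. This costs O(n * k) distance
    evaluations instead of the O(n^2) of a pairwise silhouette.
    """
    centers = centroids_by_cluster(points, labels)
    if len(centers) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        a = dist(point, centers[label])  # cohesion: distance to own centroid
        b = min(dist(point, c) for other, c in centers.items() if other != label)
        total += 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return total / len(points)


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = centroid_silhouette(points, [0, 0, 0, 1, 1, 1])  # labels match the groups
bad = centroid_silhouette(points, [0, 1, 0, 1, 0, 1])   # labels mix the groups
```

In this sketch a partition that matches the two groups scores higher than one that mixes them, which is the behaviour a validity index needs in order to compare candidate numbers of clusters. Because each point only touches the k centroids, the per-point work is independent of the dataset size, which is what makes this kind of simplification attractive for distributed frameworks such as Apache Spark.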

Available Versions

idUS. Depósito de Investigación Universidad de Sevilla

oai:idus.us.es:11441/132065

Last time updated on 19/05/2022