Clustering analysis is one of the most widely used
machine learning techniques for discovering groups among data
objects. Some clustering methods require the number of clusters into which the data is to be partitioned. There exist
several cluster validity indices that help us to approximate
the optimal number of clusters of the dataset. However, such
indices are not suitable for Big Data because of their size
limitations and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low
computational time. Our indices are based on redefinitions
of traditional indices by simplifying the intra-cluster distance
calculation. Two types of tests have been carried out over 28
synthetic datasets to analyze the performance of the proposed
indices. First, we test the indices on small and medium-sized
datasets to verify that they are about as effective as
the traditional ones. Subsequently, tests on datasets of up
to 11 million records and 20 features have been executed to
check their efficiency. The results show that both indices, using the Apache
Spark framework, can handle Big Data in very low computational time with
effectiveness similar to that of the traditional indices.
Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
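
To illustrate why simplifying the intra-cluster distance calculation can reduce cost, the following minimal sketch contrasts a pairwise intra-cluster distance with a centroid-based one. The abstract does not state the exact redefinition used, so the centroid-based form, the function names, and the toy data are assumptions made here for illustration only, not the paper's definitions.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]


def pairwise_intra(points):
    """Mean Euclidean distance over all point pairs.

    Needs n*(n-1)/2 distance evaluations, which grows quadratically
    with the cluster size.
    """
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def centroid_intra(points):
    """Mean Euclidean distance from each point to the cluster centroid.

    Assumed simplification (hypothetical, not taken from the paper):
    needs only n distance evaluations, so it grows linearly.
    """
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Toy cluster: the four corners of a square with side length 2.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

print(pairwise_intra(cluster))
print(centroid_intra(cluster))
```

The two quantities differ in value, but both grow as the cluster spreads out. The centroid-based form also reduces to per-cluster sums and counts, which is the kind of aggregation that distributes naturally in a framework such as Apache Spark.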