101 research outputs found
An Approach to Silhouette and Dunn Clustering Indices Applied to Big Data in Spark
K-Means and Bisecting K-Means clustering algorithms need the optimal number of clusters into which the dataset may be divided. The Spark implementations of these algorithms include a method to calculate this number. Unfortunately, this measure lacks precision because it only takes into account a sum of intra-cluster distances, which can mislead the results. Moreover, this measure has not been well contrasted in previous research on clustering indices. Therefore, we introduce a new Spark implementation of the Silhouette and Dunn indices, which have been validated in previous works. The results obtained show the potential of Silhouette and Dunn to deal with Big Data.
Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
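The classical definition behind one of these indices can be stated compactly. The following pure-Python sketch shows the textbook Dunn index for small in-memory clusters; it is an illustration of the definition only, not the distributed Spark implementation the abstract describes:

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dunn_index(clusters):
    """Dunn index: smallest inter-cluster distance divided by the largest
    intra-cluster diameter. `clusters` is a list of lists of points.
    Higher values indicate better-separated, more compact clusters."""
    # Largest within-cluster diameter (max pairwise distance inside a cluster).
    diameter = max(
        euclid(p, q)
        for c in clusters for i, p in enumerate(c) for q in c[i + 1:]
    )
    # Smallest distance between points belonging to two different clusters.
    separation = min(
        euclid(p, q)
        for i, a in enumerate(clusters) for b in clusters[i + 1:]
        for p in a for q in b
    )
    return separation / diameter
```

Both the diameter and the separation require pairwise distances, which is exactly the quadratic cost that motivates the distributed reformulations discussed in these papers.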
An approach to validity indices for clustering techniques in Big Data
Clustering analysis is one of the most widely used Machine Learning techniques to discover groups among data objects. Some clustering methods require the number of clusters into which the data is going to be partitioned. There exist several cluster validity indices that help us to approximate the optimal number of clusters of a dataset. However, such indices are not suitable for Big Data because of their size limitations and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low computational time. Our indices are based on redefinitions of traditional indices, simplifying the intra-cluster distance calculation. Two types of tests have been carried out over 28 synthetic datasets to analyse the performance of the proposed indices. First, we test the indices with small and medium-sized datasets to verify that they have an effectiveness similar to that of the traditional ones. Subsequently, tests on datasets of up to 11 million records and 20 features have been executed to check their efficiency. The results show that, using the Apache Spark framework, both indices can handle Big Data in very low computational time with an effectiveness similar to that of the traditional indices.
Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
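One common way to simplify the intra-cluster distance calculation, as the abstract describes, is to replace the O(n²) pairwise distances of the classical silhouette with a single distance to each cluster centroid. The sketch below illustrates that general idea in pure Python; the paper's exact redefinition may differ:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def simplified_silhouette(clusters):
    """Centroid-based silhouette: cohesion is the distance to the own
    centroid, separation the distance to the nearest other centroid.
    Range (-1, 1]; higher is better. Linear in the number of points."""
    cents = [centroid(c) for c in clusters]
    scores = []
    for i, c in enumerate(clusters):
        for p in c:
            a = dist(p, cents[i])  # cohesion: distance to own centroid
            b = min(dist(p, cents[j])  # separation: nearest other centroid
                    for j in range(len(cents)) if j != i)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Because each point is compared against k centroids instead of all n points, the computation parallelises naturally over data partitions, which is what makes this family of redefinitions attractive for Spark.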
New internal and external validation indices for clustering in Big Data
This thesis, presented as a compendium of research articles, analyses
the concept of clustering validation indices and provides new measures of
goodness for datasets that could be considered Big Data. In addition, these
measures have been applied in real projects and their future application is
proposed for the improvement of clustering algorithms.
Clustering is one of the most popular unsupervised machine learning
techniques. This technique allows us to group data into clusters so that the
instances that belong to the same cluster have characteristics or attributes
with similar values, and are dissimilar to those that belong to the other
clusters. The similarity of the data is normally given by the proximity in
space, which is measured using a distance function. In the literature, there
are so-called clustering validation indices, which can be defined as measures
for the quantification of the quality of a clustering result. These indices are
divided into two types: internal validation indices, which measure the quality
of clustering based on the attributes with which the clusters have been built;
and external validation indices, which are those that quantify the quality of
clustering from attributes that have not intervened in the construction of
the clusters, and that are normally of nominal type or labels.
In this doctoral thesis, two internal validation indices are proposed for
clustering based on other indices existing in the literature, which enable
large amounts of data to be handled, and provide the results in a reasonable
time. The proposed indices have been tested with synthetic datasets and
compared with other indices in the literature. The conclusions of this work
indicate that these indices offer very promising results in comparison with
their competitors.
On the other hand, a new external clustering validation index based on
the chi-squared statistical test has been designed. This index enables the
quality of the clustering to be measured by basing the result on how the
clusters have been distributed with respect to a given label in the distribution.
The results of this index show a significant improvement compared to
other external indices in the literature when used with datasets of different
dimensions and characteristics.
In addition, these proposed indices have been applied in three projects with real data whose corresponding publications are included in this doctoral
thesis. For the first project, a methodology has been developed to analyse
the electrical consumption of buildings in a smart city. For this study, an
optimal clustering analysis has been carried out by applying the aforementioned
internal indices. In the second project, both internal and external
indices have been applied in order to perform a comparative analysis of the
Spanish labour market in two different economic periods. This analysis was
carried out using data from the Ministry of Labour, Migration, and Social
Security, and the results could be taken into account to help decision-making
for the improvement of employment policies. In the third project, data from
the customers of an electric company has been employed to characterise the
different types of existing consumers. In this study, consumption patterns
have been analysed so that electricity companies can offer new rates to consumers.
Conclusions show that consumers could adapt their usage to these
rates and hence the generation of energy could be optimised by eliminating
the consumption peaks that currently exist.
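The external index described above can be grounded in the Pearson chi-squared statistic computed over the cluster-versus-label contingency table. The stdlib-only sketch below illustrates that general approach; the normalisation and decision rule of the thesis's proposed index are not reproduced here:

```python
from collections import Counter

def chi_squared_statistic(cluster_ids, labels):
    """Pearson chi-squared over the cluster x label contingency table.
    Larger values mean cluster membership and the external (nominal)
    label are more strongly associated, i.e. a better clustering with
    respect to that label."""
    n = len(labels)
    joint = Counter(zip(cluster_ids, labels))  # observed cell counts
    row = Counter(cluster_ids)                 # cluster sizes
    col = Counter(labels)                      # label frequencies
    chi2 = 0.0
    for c in row:
        for lab in col:
            expected = row[c] * col[lab] / n   # under independence
            observed = joint.get((c, lab), 0)
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```

A perfectly aligned clustering reaches the maximum n·(min(rows, cols) − 1), while a clustering independent of the label yields a statistic near zero.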
Big Data Analytics for Discovering Electricity Consumption Patterns in Smart Cities
New technologies such as sensor networks have been incorporated into the management
of buildings for organizations and cities. Sensor networks have led to an exponential increase in the
volume of data available in recent years, which can be used to extract consumption patterns for the
purposes of energy and monetary savings. For this reason, new approaches and strategies are needed
to analyze information in big data environments. This paper proposes a methodology to extract
electric energy consumption patterns in big data time series, so that very valuable conclusions can
be made for managers and governments. The methodology is based on the study of four clustering
validity indices in their parallelized versions along with the application of a clustering technique.
In particular, this work uses a voting system to choose an optimal number of clusters from the results
of the indices, as well as the application of the distributed version of the k-means algorithm included
in Apache Spark’s Machine Learning Library. The results, using electricity consumption for the
years 2011–2017 for eight buildings of a public university, are presented and discussed. In addition,
the performance of the proposed methodology is evaluated using synthetic big data, which can
represent thousands of buildings in a smart city. Finally, policies derived from the patterns discovered
are proposed to optimize energy usage across the university campus.
Ministerio de Economía y Competitividad TIN2014-55894-C2-R Ministerio de Economía y Competitividad TIN2017-88209-C2-R Junta de Andalucía P12-TIC-172
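A voting system of the kind the abstract mentions can be sketched in a few lines. The function name and the tie-breaking rule below are illustrative assumptions, not the paper's specification:

```python
from collections import Counter

def vote_optimal_k(index_scores):
    """Pick the number of clusters k by majority vote among validity indices.
    `index_scores` maps index name -> {k: score}; every index is assumed to
    be 'higher is better' (invert cost-type indices such as Davies-Bouldin
    beforehand). Ties are broken toward the smaller k."""
    votes = Counter()
    for scores in index_scores.values():
        # Each index votes for the k where it scores best
        # (the smallest such k if several tie).
        best_k = max(sorted(scores), key=lambda k: scores[k])
        votes[best_k] += 1
    return min(votes, key=lambda k: (-votes[k], k))
```

Aggregating several indices this way hedges against any single index's bias, which matters because indices such as Silhouette and Dunn can disagree on the same partitioning.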
Analysis of the evolution of the Spanish labour market through unsupervised learning
Unemployment in Spain is one of the biggest concerns of its inhabitants. Its unemployment rate is the second highest in the European Union: in the second quarter of 2018 it stood at 15.2%, some 3.4 million unemployed. Construction is one of the activity sectors that has suffered the most from the economic crisis. In addition, the economic crisis affected the labour market in different ways in terms of occupation level and location. The aim of this paper is to discover how the labour market is organised, taking into account the jobs that workers obtained during two periods: 2011-2013, which corresponds to the economic crisis, and 2014-2016, a period of economic recovery. The data used are official records of the Spanish administration corresponding to 1.9 and 2.4 million job placements, respectively. The labour market was analysed by applying unsupervised machine learning techniques to obtain clear and structured information on the employment generation process and the underlying labour mobility. We have applied two clustering methods with two different technologies, and the results indicate that there were movements in the Spanish labour market that have changed the physiognomy of some jobs. The analysis reveals the changes in the labour market: the crisis forces greater geographical mobility and favours the subsequent emergence of new job sources. Nevertheless, some clusters remain stable despite the crisis. We may conclude that we have achieved a characterisation of some important groups of workers in Spain. The methodology used, being supported by Big Data techniques, would serve to analyse any alternative job market.
Ministerio de Economía y Competitividad TIN2014-55894-C2-R y TIN2017-88209-C2-2-R, CO2017-8678
Electricity clustering framework for automatic classification of customer loads
Clustering in energy markets is a topic of high significance for expert and intelligent systems. The main impact of this paper is the proposal of a new clustering framework for the automatic classification of electricity customers' loads. An automatic selection of the clustering classification algorithm is also highlighted. Finally, new customers can be assigned to a predefined set of clusters in the classification phase. The computation time of the proposed framework is less than that of previous classification techniques, which enables the processing of a complete electric company sample in a matter of minutes on a personal computer. The high accuracy of the predicted classification results verifies the performance of the clustering technique. This classification phase is of significant assistance in interpreting the results, and the simplicity of the clustering phase is sufficient to demonstrate the quality of the complete mining framework.
Ministerio de Economía y Competitividad TEC2013-40767-R Ministerio de Economía y Competitividad IDI- 2015004
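The classification phase described above — assigning a new customer to one of the predefined clusters — reduces, in its simplest form, to a nearest-centroid lookup. This sketch is an illustrative assumption about that step, not the paper's exact classifier:

```python
import math

def assign_to_cluster(load_profile, centroids):
    """Classification phase (simplified): assign a new customer's load
    profile to the nearest predefined cluster centroid by Euclidean
    distance, and return that cluster's index."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)), key=lambda i: d(load_profile, centroids[i]))
```

Because this step only compares one profile against k centroids, it is constant-time per customer, which is consistent with the abstract's claim of processing a complete company sample in minutes.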
Indexes to Find the Optimal Number of Clusters in a Hierarchical Clustering
Clustering analysis is one of the most commonly used techniques for uncovering patterns in data mining. Most clustering methods require establishing the number of clusters beforehand. However, due to the size of the data currently used, predicting that value is in most cases a computationally expensive task. In this article, we present a clustering technique that avoids this requirement by using hierarchical clustering. There are many examples of this procedure in the literature, most of them focusing on the divisive (descending) subtype, whereas in this article we cover the agglomerative (ascending) subtype. Although more expensive in computational and temporal cost, it allows us to obtain very valuable information regarding the membership of elements to clusters and their groupings, that is to say, their dendrogram. Finally, several datasets of varying dimensionality have been used. For each of them, we provide the calculations of internal validation indices to test the developed algorithm, studying which of them provides better results to obtain the best possible clustering.
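The agglomerative (bottom-up) procedure the abstract covers can be sketched minimally: start from singleton clusters and repeatedly merge the closest pair. The single-linkage criterion below is one illustrative choice; the article's actual linkage may differ:

```python
import math

def single_linkage(points, k):
    """Agglomerative (ascending) clustering: begin with one cluster per
    point and merge the two closest clusters until only k remain.
    Single linkage: distance between clusters = their closest pair."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest closest-pair distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: min(d(p, q)
                               for p in clusters[ij[0]]
                               for q in clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters
```

Recording each merge (instead of discarding it) yields the dendrogram the abstract refers to, so every intermediate k can later be scored with internal validation indices to select the best cut.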
Development of pattern-based models for the prediction of time series in Big Data environments
Doctoral Programme in Biotechnology, Engineering and Chemical Technology. Research line: Engineering, Data Science and Bioinformatics. Programme code: DBI. Line code: 111. This doctoral thesis is presented as a compendium of publications, comprising several scientific contributions to international conferences and to journals with a high impact factor in the Journal of Citation Reports (JCR). During five years of part-time research, work was directed at the study, analysis, and prediction of large sets of time series, mainly of an energy-related nature. To this end, the latest technological trends in distributed computing were followed, developing the experimentation entirely in Scala, the native language of the Apache Spark framework, and running the experimental tests in real environments such as Amazon Web Services and Open Telekom Cloud.
The first phase of the doctoral thesis focuses on the development and application of a methodology for efficiently analysing datasets containing electricity consumption time series, generated by the network of smart electricity meters installed at Universidad Pablo de Olavide. The proposed methodology focuses mainly on the correct application, in distributed environments, of the K-means clustering algorithm to large datasets, making it possible to segment sets of observations into distinct groups with similar characteristics. This task is performed using a parallelised version of the algorithm called K-means++, included in the Machine Learning Library of Apache Spark. To choose the optimal number of clusters, a strategy is adopted in which several cluster validation indices are evaluated, namely the Within Set Sum of Squared Error, Davies-Bouldin, Dunn, and Silhouette, all of them developed for application in distributed environments.
The results of this experimentation were presented at the 13th International Conference on Distributed Computing and Artificial Intelligence. Subsequently, the experimentation and the methodology were extended, resulting in an article published in the journal Energies, indexed in the JCR with category Q3.
The second part of the work consists of the design of a methodology and the development of an algorithm capable of effectively forecasting time series in Big Data environments. To this end, the well-known Pattern Sequence-based Forecasting (PSF) algorithm was analysed, with two main objectives: on the one hand, its adaptation for application in scalable, distributed environments and, on the other, the improvement of the predictions it produces, focusing on the efficient exploitation of large datasets. In this regard, an algorithm called bigPSF was developed in Scala and integrated into a complete methodology designed to forecast the energy consumption of a Smart City. Finally, a variant of the bigPSF algorithm called MV-bigPSF, capable of forecasting multivariate time series, was developed.
This experimentation has resulted in two scientific articles published in the journals Information Sciences (for the article on the bigPSF algorithm) and Applied Energy (on the study of its multivariate version), both with a JCR impact factor in category Q1. Universidad Pablo de Olavide de Sevilla. Escuela de Doctorado
Predicting the Readiness of Indonesia Manufacturing Companies toward Industry 4.0: A Machine Learning Approach
This research discusses Indonesia's readiness to implement Industry 4.0. We classified the readiness for Industry 4.0 of the Indonesian manufacturing companies listed on the Indonesia Stock Exchange, based on their 2018 annual reports. We considered 38 variables from those reports and reduced them to 11 variables using principal component analysis. Applying clustering analysis to the reduced dataset, we found three clusters representing the levels of readiness for implementing Industry 4.0. Finally, we used a decision tree to analyse the classification rules. As the main finding of this study, the total book value of machinery is the variable that defines the readiness of a company for Industry 4.0: the larger this value, the more ready a company is to compete in Industry 4.0. The other measures, i.e., total cost of revenue over total revenue, direct labour cost, total revenue per employee, and transportation cost over total revenue, also help to determine whether a manufacturing company is ready to transform into Industry 4.0.