
    New internal and external validation indices for clustering in Big Data

    This thesis, presented as a compendium of research articles, analyses the concept of clustering validation indices and provides new measures of goodness for datasets that could be considered Big Data due to their volume. In addition, these measures have been applied in real projects, and their future application to the improvement of clustering algorithms is proposed. Clustering is one of the most popular unsupervised machine learning techniques. It allows us to group data into clusters so that the instances that belong to the same cluster have characteristics or attributes with similar values, and are dissimilar to those that belong to the other clusters. The similarity of the data is normally given by proximity in space, measured using a distance function. In the literature there are so-called clustering validation indices, which can be defined as measures that quantify the quality of a clustering result. These indices are divided into two types: internal validation indices, which measure the quality of the clustering based on the attributes with which the clusters have been built; and external validation indices, which quantify the quality of the clustering from attributes that were not involved in the construction of the clusters and are normally nominal attributes or labels. In this doctoral thesis, two internal validation indices for clustering are proposed, based on other indices existing in the literature, which enable large amounts of data to be handled and provide results in a reasonable time. The proposed indices have been tested on synthetic datasets and compared with other indices from the literature. The conclusions of this work indicate that these indices offer very promising results in comparison with their competitors. In addition, a new external clustering validation index based on the chi-squared statistical test has been designed. This index measures the quality of the clustering based on how the clusters are distributed with respect to a given label. The results of this index show a significant improvement compared to other external indices in the literature when used with datasets of different dimensions and characteristics. Furthermore, these proposed indices have been applied in three projects with real data whose corresponding publications are included in this doctoral thesis. For the first project, a methodology has been developed to analyse the electrical consumption of buildings in a smart city; to this end, an optimal clustering analysis has been carried out by applying the aforementioned internal indices. In the second project, both internal and external indices have been applied in order to perform a comparative analysis of the Spanish labour market in two different economic periods. This analysis was carried out using data from the Ministry of Labour, Migration and Social Security, and the results could be taken into account to support decision-making for the improvement of employment policies. In the third project, data from the customers of an electric company have been used to characterise the existing types of consumers. In this study, consumption patterns have been analysed so that electricity companies can offer new rates to consumers. The conclusions show that consumers could adapt their usage to these rates and, hence, energy generation could be optimised by eliminating the consumption peaks that currently exist.
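    The internal/external distinction described in this abstract can be illustrated with a minimal sketch: an internal index scores a clustering using only the attributes the clusters were built from, while an external index compares the clustering against a label that played no part in building it. The snippet below uses scikit-learn's Silhouette and Adjusted Rand indices purely as generic examples; these are not the indices proposed in the thesis.

```python
# Minimal sketch of internal vs. external clustering validation.
# Generic scikit-learn indices for illustration; NOT the indices proposed in the thesis.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, labels = make_blobs(n_samples=500, centers=3, random_state=0)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal index: uses only the attributes the clusters were built from.
print("Silhouette (internal):", silhouette_score(X, pred))

# External index: uses a label that played no part in building the clusters.
print("Adjusted Rand (external):", adjusted_rand_score(labels, pred))
```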

    Analysis of the evolution of the Spanish labour market through unsupervised learning

    Unemployment in Spain is one of the biggest concerns of its inhabitants. The Spanish unemployment rate is the second highest in the European Union: in the second quarter of 2018 it stood at 15.2%, some 3.4 million unemployed. Construction is one of the activity sectors that have suffered the most from the economic crisis. In addition, the economic crisis affected the labour market in different ways in terms of occupation level and location. The aim of this paper is to discover how the labour market is organised, taking into account the jobs that workers obtained during two periods: 2011-2013, which corresponds to the economic crisis, and 2014-2016, which was a period of economic recovery. The data used are official records of the Spanish administration corresponding to 1.9 and 2.4 million job placements, respectively. The labour market was analysed by applying unsupervised machine learning techniques to obtain clear and structured information on the employment generation process and the underlying labour mobility. We applied two clustering methods with two different technologies, and the results indicate that there were movements in the Spanish labour market which have changed the physiognomy of some of the jobs. The analysis reveals the changes in the labour market: the crisis forced greater geographical mobility and favoured the subsequent emergence of new job sources. Nevertheless, some clusters remain stable despite the crisis. We may conclude that we have achieved a characterisation of some important groups of workers in Spain. The methodology used, supported by Big Data techniques, could serve to analyse any other job market.
    Ministerio de Economía y Competitividad TIN2014-55894-C2-R y TIN2017-88209-C2-2-R, CO2017-8678

    Video competition: how can I know whether a clustering result is good enough?

    This educational video introduces clustering, one of the most widely used data analysis techniques. Through simple examples, it explains what clustering is and how the different solutions it produces are analysed using different validation indices.

    External clustering validity index based on chi-squared statistical test

    Clustering is one of the most commonly used techniques in data mining. Its main goal is to group objects into clusters so that each group contains objects that are more similar to each other than to objects in other clusters. The evaluation of a clustering solution is carried out through the application of validity indices. These indices measure the quality of the solution and can be classified either as internal indices, which calculate the quality of the solution from the data of the clusters themselves, or as external indices, which measure the quality by means of external information such as the class. Generally, indices from the literature determine their optimal result through graphical representation, which can be interpreted imprecisely. The aim of this paper is to present a new external validity index based on the chi-squared statistical test, named Chi Index, which presents accurate results that require no further interpretation. Chi Index was analyzed using the clustering results of 3 clustering methods on 47 public datasets. Results indicate a better hit rate and a lower error percentage compared with 15 external validity indices from the literature.
    Ministerio de Economía y Competitividad TIN2014-55894-C2-R
    Ministerio de Economía y Competitividad TIN2017-88209-C2-2-
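    The general mechanism behind a chi-squared-based external index can be sketched in a few lines: cross-tabulate the cluster assignments against the external class labels and compute the chi-squared statistic of the resulting contingency table. The exact Chi Index formulation is the one defined in the paper; the snippet below only illustrates the underlying statistic using scipy.

```python
# Sketch of the chi-squared statistic over a cluster/class contingency table.
# Illustrates the underlying test only; this is not the exact Chi Index from the paper.
import numpy as np
from scipy.stats import chi2_contingency

clusters = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])  # clustering result
classes = np.array([0, 0, 1, 1, 0, 2, 2, 2, 1])   # external label, unused during clustering

# Contingency table: rows are clusters, columns are classes.
table = np.zeros((clusters.max() + 1, classes.max() + 1), dtype=int)
for c, k in zip(clusters, classes):
    table[c, k] += 1

chi2, p_value, dof, expected = chi2_contingency(table)
print("chi2 =", chi2, "p-value =", p_value)
```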

    An approach to validity indices for clustering techniques in Big Data

    Clustering analysis is one of the most used Machine Learning techniques to discover groups among data objects. Some clustering methods require the number of clusters into which the data is going to be partitioned. There exist several cluster validity indices that help us to approximate the optimal number of clusters of the dataset. However, such indices are not suitable for dealing with Big Data due to their size limitations and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low computational time. Our indices are based on redefinitions of traditional indices obtained by simplifying the intra-cluster distance calculation. Two types of tests have been carried out over 28 synthetic datasets to analyze the performance of the proposed indices. First, we test the indices with small and medium size datasets to verify that our indices have an effectiveness similar to the traditional ones. Subsequently, tests on datasets of up to 11 million records and 20 features have been executed to check their efficiency. The results show that, using the Apache Spark framework, both indices can handle Big Data in very low computational time with an effectiveness similar to that of the traditional indices.
    Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
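    The simplification mentioned above, redefining the intra-cluster distance so that it does not require all pairwise comparisons, can be illustrated as follows. This is only a generic sketch of the idea (the helper name simplified_intra_cluster is hypothetical); the actual indices and their Apache Spark implementations are the ones defined in the paper.

```python
# Sketch: simplified intra-cluster distance via centroids instead of all pairwise distances.
# Hypothetical helper; the paper's indices and their Spark versions differ in detail.
import numpy as np

def simplified_intra_cluster(X, labels):
    """Mean distance of each point to its cluster centroid
    (linear in the cluster size instead of quadratic)."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += np.linalg.norm(members - centroid, axis=1).sum()
    return total / len(X)

X = np.random.rand(10_000, 20)                  # synthetic data: 10,000 points, 20 features
labels = np.random.randint(0, 5, size=10_000)   # a (random) 5-cluster assignment
print(simplified_intra_cluster(X, labels))
```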

    An Approach to Silhouette and Dunn Clustering Indices Applied to Big Data in Spark

    The K-Means and Bisecting K-Means clustering algorithms need the optimal number of clusters into which the dataset may be divided. The Spark implementations of these algorithms include a method that is used to calculate this number. Unfortunately, this measurement lacks precision because it only takes into account a sum of intra-cluster distances, which can mislead the results. Moreover, this measurement has not been well contrasted in previous research on clustering indices. Therefore, we introduce a new Spark implementation of the Silhouette and Dunn indices, which have been tested in previous works. The results obtained show the potential of Silhouette and Dunn to deal with Big Data.
    Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
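    For orientation, recent Apache Spark releases ship a distributed Silhouette evaluator, and a minimal PySpark sketch of its use is shown below. The paper introduces its own Spark implementations of the Silhouette and Dunn indices, which are separate from this built-in evaluator; the input path and parameters here are assumptions.

```python
# Minimal PySpark sketch: KMeans plus Spark's built-in distributed Silhouette evaluator.
# The paper's own Silhouette/Dunn implementations are separate from this built-in one.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("silhouette-sketch").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical input file

# Assemble the numeric columns into a single feature vector and cluster.
data = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)
model = KMeans(k=4, featuresCol="features").fit(data)
predictions = model.transform(data)

silhouette = ClusteringEvaluator(featuresCol="features").evaluate(predictions)
print("Silhouette:", silhouette)
```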

    Temporal convolutional networks applied to energy-related time series forecasting

    Modern energy systems collect high volumes of data that can provide valuable information about energy consumption. Electric companies can now use historical data to make informed decisions on energy production by forecasting the expected demand. Many deep learning models have been proposed to deal with these types of time series forecasting problems. Deep neural networks, such as recurrent or convolutional, can automatically capture complex patterns in time series data and provide accurate predictions. In particular, Temporal Convolutional Networks (TCN) are a specialised architecture that has advantages over recurrent networks for forecasting tasks. TCNs are able to extract long-term patterns using dilated causal convolutions and residual blocks, and can also be more efficient in terms of computation time. In this work, we propose a TCN-based deep learning model to improve the predictive performance in energy demand forecasting. Two energy-related time series with data from Spain have been studied: the national electric demand and the power demand at charging stations for electric vehicles. An extensive experimental study has been conducted, involving more than 1900 models with different architectures and parametrisations. The TCN proposal outperforms the forecasting accuracy of Long Short-Term Memory (LSTM) recurrent networks, which are considered the state-of-the-art in the field.
    Ministerio de Economía y Competitividad TIN2017-88209-C2-2-R
    Junta de Andalucía US-1263341
    Junta de Andalucía P18-RT-277
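    The dilated causal convolutions that give a TCN its long receptive field can be sketched with Keras as follows. This is a minimal, generic TCN-style stack with assumed hyper-parameters and without residual blocks; it is not the exact architecture evaluated in the paper.

```python
# Minimal sketch of a TCN-style stack: causal Conv1D layers with exponentially growing dilation.
# Hyper-parameters (filters, kernel size, history and horizon lengths) are assumptions.
import tensorflow as tf

history_length, n_features, horizon = 168, 1, 24   # e.g. one week of hourly demand, univariate

inputs = tf.keras.Input(shape=(history_length, n_features))
x = inputs
for dilation in (1, 2, 4, 8):                      # receptive field grows with each layer
    x = tf.keras.layers.Conv1D(filters=32, kernel_size=3, padding="causal",
                               dilation_rate=dilation, activation="relu")(x)
outputs = tf.keras.layers.Dense(horizon)(x[:, -1, :])   # forecast the next 24 values

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
```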

    Evaluation of the transformer architecture for univariate time series forecasting

    The attention-based Transformer architecture is earning increasing popularity for many machine learning tasks. In this study, we aim to explore the suitability of Transformers for time series forecasting, which is a crucial problem in different domains. We perform an extensive experimental study of the Transformer with different architecture and hyper-parameter configurations over 12 datasets with more than 50,000 time series. The forecasting accuracy and computational efficiency of Transformers are compared with state-of-the-art deep learning networks such as LSTM and CNN. The obtained results demonstrate that Transformers can outperform traditional recurrent or convolutional models due to their capacity to capture long-term dependencies, obtaining the most accurate forecasts in five out of twelve datasets. However, Transformers are generally more difficult to parametrize and show higher variability of results. In terms of efficiency, Transformer models proved to be less competitive in inference time and similar to the LSTM in training time.
    Ministerio de Ciencia, Innovación y Universidades TIN2017-88209-C2
    Junta de Andalucía US-1263341
    Junta de Andalucía P18-RT-277
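    A single Transformer encoder block of the kind evaluated in such studies can be sketched with Keras' MultiHeadAttention layer as follows. Layer sizes, the forecasting head, and the omission of positional encodings are simplifying assumptions; the paper explores many architecture and hyper-parameter configurations.

```python
# Minimal sketch of one Transformer encoder block for univariate forecasting.
# Sizes and the output head are assumptions; a full model would also add positional encodings.
import tensorflow as tf

history_length, d_model, horizon = 168, 64, 24

inputs = tf.keras.Input(shape=(history_length, 1))
x = tf.keras.layers.Dense(d_model)(inputs)            # project scalar inputs to the model dimension

attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)    # self-attention with residual connection
ff = tf.keras.layers.Dense(d_model, activation="relu")(x)
x = tf.keras.layers.LayerNormalization()(x + tf.keras.layers.Dense(d_model)(ff))

outputs = tf.keras.layers.Dense(horizon)(tf.keras.layers.GlobalAveragePooling1D()(x))
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
```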

    Statistically Representative Metrology of Nanoparticles via Unsupervised Machine Learning of TEM Images

    The morphology of nanoparticles governs their properties for a range of important applications. Thus, the ability to statistically correlate this key particle performance parameter is paramount in achieving accurate control of nanoparticle properties. Among several effective techniques for morphological characterization of nanoparticles, transmission electron microscopy (TEM) can provide a direct, accurate characterization of the details of nanoparticle structures and morphology at atomic resolution. However, manually analyzing a large number of TEM images is laborious. In this work, we demonstrate an efficient, robust and highly automated unsupervised machine learning method for the metrology of nanoparticle systems based on TEM images. Our method not only achieves statistically significant analysis, but is also robust against variable image quality, imaging modalities, and particle dispersions. The ability to efficiently gain statistically significant particle metrology is critical in advancing precise particle synthesis and accurate property control.
    Australian Research Council (ARC) IC210100056
    Ministerio de Economía y Competitividad TIN2014-55894-C2-R
    Ministerio de Economía y Competitividad TIN2017-88209-C2-2-

    Big Data Analytics for Discovering Electricity Consumption Patterns in Smart Cities

    New technologies such as sensor networks have been incorporated into the management of buildings for organizations and cities. Sensor networks have led to an exponential increase in the volume of data available in recent years, which can be used to extract consumption patterns for the purposes of energy and monetary savings. For this reason, new approaches and strategies are needed to analyze information in big data environments. This paper proposes a methodology to extract electric energy consumption patterns in big data time series, so that valuable conclusions can be drawn for managers and governments. The methodology is based on the study of four clustering validity indices in their parallelized versions, along with the application of a clustering technique. In particular, this work uses a voting system to choose an optimal number of clusters from the results of the indices, as well as the distributed version of the k-means algorithm included in Apache Spark’s Machine Learning Library. The results, using electricity consumption for the years 2011–2017 for eight buildings of a public university, are presented and discussed. In addition, the performance of the proposed methodology is evaluated using synthetic big data, which can represent thousands of buildings in a smart city. Finally, policies derived from the patterns discovered are proposed to optimize energy usage across the university campus.
    Ministerio de Economía y Competitividad TIN2014-55894-C2-R
    Ministerio de Economía y Competitividad TIN2017-88209-C2-R
    Junta de Andalucía P12-TIC-172
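    The voting step of the methodology can be illustrated in a few lines: run clustering for a range of candidate numbers of clusters, score each result with several validity indices, and let each index vote for its preferred value. The sketch below uses generic scikit-learn indices and random data for brevity; the paper applies four specific indices in their parallelised versions together with the distributed k-means of Apache Spark's MLlib.

```python
# Sketch of the voting scheme for choosing k: each validity index votes for its best k.
# Generic scikit-learn indices and random data stand in for the paper's parallelised setup.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X = np.random.rand(1000, 24)            # e.g. daily load profiles with 24 hourly readings
scores = {"silhouette": {}, "calinski_harabasz": {}, "davies_bouldin": {}}

for k in range(2, 9):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores["silhouette"][k] = silhouette_score(X, pred)
    scores["calinski_harabasz"][k] = calinski_harabasz_score(X, pred)
    scores["davies_bouldin"][k] = -davies_bouldin_score(X, pred)   # lower is better, so negate

votes = Counter(max(s, key=s.get) for s in scores.values())        # each index votes for one k
best_k, _ = votes.most_common(1)[0]
print("votes:", dict(votes), "-> chosen k:", best_k)
```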