86,986 research outputs found
An approach to validity indices for clustering techniques in Big Data
Clustering analysis is one of the most used
Machine Learning techniques to discover groups among data
objects. Some clustering methods require the number of clusters into which the data is going to be partitioned. There exist
several cluster validity indices that help us to approximate
the optimal number of clusters of the dataset. However, such
indices are not suitable to deal with Big Data due to their size
limitations and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low
computational time. Our indices are based on redefinitions
of traditional indices by simplifying the intra-cluster distance
calculation. Two types of tests have been carried out over 28
synthetic datasets to analyze the performance of the proposed
indices. First, we test the indices with small and medium size
datasets to verify that our indices have a similar effectiveness
to the traditional ones. Subsequently, tests on datasets of up
to 11 million records and 20 features have been executed to
check their efficiency. The results show that, using the Apache Spark
framework, both indices can handle Big Data in very low computational
time with an effectiveness similar to that of the traditional indices.
Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
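The abstract above does not say how the intra-cluster distance is simplified. As a rough illustration only, the sketch below assumes one plausible simplification: a Dunn-like score in which cluster spread is measured as the largest point-to-centroid distance (linear in cluster size) instead of the pairwise cluster diameter (quadratic in cluster size). The function names centroid and simplified_dunn are hypothetical and are not taken from the paper.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def simplified_dunn(clusters):
    """Dunn-like score: min centroid separation / max centroid-based spread.

    Assumption (not from the paper): spread is the largest distance from a
    point to its own cluster centroid, standing in for the pairwise diameter.
    Requires at least two non-empty clusters. Higher scores are better.
    """
    centers = [centroid(c) for c in clusters]
    spread = max(
        math.dist(p, ctr) for c, ctr in zip(clusters, centers) for p in c
    )
    separation = min(
        math.dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return separation / spread if spread > 0 else math.inf


# Two tight, well-separated groups versus a partition that mixes them.
good = simplified_dunn([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])
bad = simplified_dunn([[(0, 0), (10, 0)], [(0, 2), (10, 2)]])
```

Evaluating such a score for each candidate number of clusters and keeping the best-scoring one is the usual way a validity index is used to approximate the optimal number of clusters.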
23-bit Metaknowledge Template Towards Big Data Knowledge Discovery and Management
The global influence of Big Data is not only growing but seemingly endless.
The trend is leaning towards knowledge that is attained easily and quickly from
massive pools of Big Data. Today we are living in the technological world that
Dr. Usama Fayyad and his distinguished research fellows predicted nearly two
decades ago in their introductory explanations of Knowledge Discovery in
Databases (KDD). Indeed, they were precise in their outlook on Big Data
analytics. In fact, the continually improving interoperability of
machine learning, statistics, database building and querying has given rise to
this increasingly popular science: Data Mining and Knowledge Discovery. The
next generation computational theories are geared towards helping to extract
insightful knowledge from even larger volumes of data at higher rates of speed.
As the trend increases in popularity, a highly adaptive solution
for knowledge discovery will become necessary. In this research paper, we
introduce the investigation and development of 23 bit-questions for a
Metaknowledge template for Big Data processing and clustering purposes. This
research aims to demonstrate the construction of this methodology and to prove
its validity and the benefits it brings to Knowledge Discovery
from Big Data.
Comment: IEEE Data Science and Advanced Analytics (DSAA'2014)
Optimization of Columnar NoSQL Data Warehouse Model with Clarans Clustering Algorithm
To meet the needs of business leaders, decision-makers have resorted to integrating external sources (such as Linked Open Data) into the decision-making system in order to enrich their existing data warehouses with new concepts that bring added value to their organizations, enhance their productivity and retain their customers. However, the traditional data warehouse environment is not suitable for supporting external Big Data. To deal with this new challenge, several studies are oriented towards the direct conversion of a classical relational data warehouse to a columnar NoSQL data warehouse, whereas the existing advanced works based on clustering algorithms are very limited and have several shortcomings. In this context, our paper proposes a new solution that builds an optimized columnar data warehouse based on the CLARANS clustering algorithm, which has proven its effectiveness in generating optimal column families. Experimental results support the validity of our system through a detailed comparative study between the existing advanced approaches and our proposed optimized method.
Big Data Analytics for Discovering Electricity Consumption Patterns in Smart Cities
New technologies such as sensor networks have been incorporated into the management
of buildings for organizations and cities. Sensor networks have led to an exponential increase in the
volume of data available in recent years, which can be used to extract consumption patterns for the
purposes of energy and monetary savings. For this reason, new approaches and strategies are needed
to analyze information in big data environments. This paper proposes a methodology to extract
electric energy consumption patterns in big data time series, so that valuable conclusions can
be drawn for managers and governments. The methodology is based on the study of four clustering
validity indices in their parallelized versions along with the application of a clustering technique.
In particular, this work uses a voting system to choose an optimal number of clusters from the results
of the indices, as well as the application of the distributed version of the k-means algorithm included
in Apache Spark’s Machine Learning Library. The results, using electricity consumption for the
years 2011–2017 for eight buildings of a public university, are presented and discussed. In addition,
the performance of the proposed methodology is evaluated using synthetic big data, which can
represent thousands of buildings in a smart city. Finally, policies derived from the patterns discovered
are proposed to optimize energy usage across the university campus.
Ministerio de Economía y Competitividad TIN2014-55894-C2-R; Ministerio de Economía y Competitividad TIN2017-88209-C2-R; Junta de Andalucía P12-TIC-172
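The voting step described in the abstract above can be pictured with a minimal sketch. It assumes that each validity index proposes one candidate number of clusters and that a plain majority vote decides. The index names and the tie-breaking rule (smallest k wins) are hypothetical, since the abstract specifies neither.

```python
from collections import Counter


def vote_for_k(suggestions):
    """Return the k proposed by the most indices; ties go to the smallest k.

    `suggestions` maps an index name to the number of clusters it proposes.
    The tie-breaking rule is an assumption made for this sketch.
    """
    counts = Counter(suggestions.values())
    best = max(counts.values())
    return min(k for k, c in counts.items() if c == best)


# Hypothetical proposals from four validity indices.
suggested = {"index_a": 4, "index_b": 5, "index_c": 4, "index_d": 6}
chosen_k = vote_for_k(suggested)
```

The chosen value would then be passed as the number of clusters to the k-means run.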
Clustering Methods for Electricity Consumers: An Empirical Study in Hvaler-Norway
The development of the Smart Grid in Norway in particular, and in Europe and the US in general,
will shortly lead to the availability of massive amounts of fine-grained
spatio-temporal consumption data from domestic households. This enables the
application of data mining techniques to traditional problems in power systems.
Clustering customers into appropriate groups is extremely useful for operators
or retailers to address each group differently through dedicated tariffs or
customer-tailored services. Currently, the task is done based on demographic
data collected through questionnaires, which is error-prone. In this paper, we
used three different clustering techniques (together with their variants) to
automatically segment electricity consumers based on their consumption
patterns. We also proposed a good way to extract consumption patterns for each
consumer. The grouping results were assessed using four common internal
validity indexes. We found that the combination of Self Organizing Map (SOM)
and k-means algorithms produces the most insightful and useful grouping. We also
discovered that grouping quality cannot be measured effectively by automatic
indicators, which goes against common suggestions in the literature.
Comment: 12 pages, 3 figures
Reinforcement machine learning for predictive analytics in smart cities
The digitization of our lives causes a shift in data production as well as in the required data management. Numerous nodes are capable of producing huge volumes of data in our everyday activities. Sensors, personal smart devices and the Internet of Things (IoT) paradigm lead to a vast infrastructure that covers all aspects of activity in modern societies. In most cases, the critical issue for public authorities (usually local ones, such as municipalities) is the efficient management of data to support novel services. The reason is that analytics provided on top of the collected data could help in the delivery of new applications that will facilitate citizens’ lives. However, the provision of analytics demands intelligent techniques for the underlying data management. The best-known technique is to separate huge volumes of data into a number of partitions and manage them in parallel to limit the time required for the delivery of analytics. Afterwards, analytics requests in the form of queries can be executed to derive the necessary knowledge for supporting intelligent applications. In this paper, we define the concept of a Query Controller (QC) that receives queries for analytics and assigns each of them to a processor placed in front of each data partition. We discuss an intelligent process for query assignments that adopts Machine Learning (ML). We adopt two learning schemes, i.e., Reinforcement Learning (RL) and clustering. We report on the comparison of the two schemes and elaborate on their combination. Our aim is to provide an efficient framework to support the decision making of the QC, which should swiftly select the appropriate processor for each query. We provide mathematical formulations for the discussed problem and present simulation results. Through a comprehensive experimental evaluation, we reveal the advantages of the proposed models and describe the outcomes while comparing them with a deterministic framework.
Analysis of the evolution of the Spanish labour market through unsupervised learning
Unemployment in Spain is one of the biggest concerns of its inhabitants. Its unemployment rate is the second highest in the European Union: in the second quarter of 2018 it stood at 15.2%, some 3.4 million unemployed. Construction is one of the activity sectors that suffered the most from the economic crisis. In addition, the economic crisis affected the labour market in different ways in terms of occupation level or location. The aim of this paper is to discover how the labour market is organised, taking into account the jobs that workers obtained during two periods: 2011-2013, which corresponds to the economic crisis period, and 2014-2016, which was a period of economic recovery. The data used are official records of the Spanish administration corresponding to 1.9 and 2.4 million job placements, respectively. The labour market was analysed by applying unsupervised machine learning techniques to obtain clear and structured information on the employment generation process and the underlying labour mobility. We have applied two clustering methods with two different technologies, and the results indicate that there were some movements in the Spanish labour market which have changed the physiognomy of some of the jobs. The analysis reveals the changes in the labour market: the crisis forced greater geographical mobility and favoured the subsequent emergence of new job sources. Nevertheless, some clusters remained stable despite the crisis. We may conclude that we have achieved a characterisation of some important groups of workers in Spain. The methodology used, being supported by Big Data techniques, could serve to analyse any other job market.
Ministerio de Economía y Competitividad TIN2014-55894-C2-R and TIN2017-88209-C2-2-R, CO2017-8678
Typical Phone Use Habits: Intense Use Does Not Predict Negative Well-Being
Not all smartphone owners use their device in the same way. In this work, we
uncover broad, latent patterns of mobile phone use behavior. We conducted a
study where, via a dedicated logging app, we collected daily mobile phone
activity data from a sample of 340 participants for a period of four weeks.
Through an unsupervised learning approach and a methodologically rigorous
analysis, we reveal five generic phone use profiles which describe at least 10%
of the participants each: limited use, business use, power use, and
personality- & externally induced problematic use. We provide evidence that
intense mobile phone use alone does not predict negative well-being. Instead,
our approach automatically revealed two groups with tendencies for lower
well-being, which are characterized by nightly phone use sessions.
Comment: 10 pages, 6 figures, conference paper