ETL and analysis of IoT data using OpenTSDB, Kafka, and Spark
Master's thesis in Computer science.
The Internet of Things (IoT) is becoming increasingly prevalent in today's society. Innovations in storage and processing methodologies enable large amounts of data to be processed in a scalable manner and insights to be generated in near real time. Data from IoT are typically time-series data, but they may also have a strong spatial correlation. In addition, many industries that produce time-series data still store it in relational databases that are poorly suited to it.
Many open-source time-series databases exist today with inspiring features in terms of storage, analytic representation, and visualization. Finding an efficient method to migrate data into a time-series database is the first objective of the thesis.
In recent decades, machine learning has become one of the backbones of data innovation. With the constantly expanding amounts of information available, there is good reason to expect that smart data analysis will become more pervasive as an essential element of innovative progress. Methods for modeling time-series data in machine learning, and for migrating time-series data from a database to a big data machine learning framework such as Apache Spark, are explored in this thesis.
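The migration step described in the abstract above can be illustrated with a minimal sketch. It reshapes rows from a relational table into the JSON data-point layout accepted by OpenTSDB's HTTP `/api/put` endpoint (metric, timestamp, value, tags). The column names (`recorded_at`, `sensor_id`, `site`, `temperature`) and the metric name are hypothetical, and the sketch only builds the payload; it is not the thesis's pipeline, and it does not involve Kafka or Spark.

```python
from datetime import datetime, timezone


def row_to_datapoint(row, metric, value_column, tag_columns):
    """Convert one relational row (a dict) into an OpenTSDB-style data point.

    Assumption: the row stores a naive ISO-8601 timestamp in a hypothetical
    'recorded_at' column, which is interpreted here as UTC.
    """
    ts = datetime.fromisoformat(row["recorded_at"]).replace(tzinfo=timezone.utc)
    return {
        "metric": metric,
        "timestamp": int(ts.timestamp()),  # Unix seconds
        "value": float(row[value_column]),
        # The remaining identifying columns become tags
        "tags": {col: str(row[col]) for col in tag_columns},
    }


# Hypothetical rows, as they might come back from a relational query
rows = [
    {"recorded_at": "2020-01-01T00:00:00", "sensor_id": "s1",
     "site": "plant_a", "temperature": 21.5},
    {"recorded_at": "2020-01-01T00:01:00", "sensor_id": "s1",
     "site": "plant_a", "temperature": 21.7},
]

points = [
    row_to_datapoint(r, "iot.temperature", "temperature", ["sensor_id", "site"])
    for r in rows
]
```

The resulting list could be serialized with `json.dumps(points)` and posted in batches; batching and error handling are left out of this sketch.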
BDWatchdog: real-time monitoring and profiling of Big Data applications and frameworks
This is a post-peer-review, pre-copyedit version of an article published in Future Generation Computer Systems. The final authenticated version is available online at: https://doi.org/10.1016/j.future.2017.12.068
[Abstract] Current Big Data applications are characterized by a heavy use of system resources (e.g., CPU, disk) generally distributed across a cluster. To effectively improve their performance there is a critical need for an accurate analysis of both Big Data workloads and frameworks. This means fully understanding how the system resources are being used in order to identify potential bottlenecks, from resource to code bottlenecks. This paper presents BDWatchdog, a novel framework that allows real-time and scalable analysis of Big Data applications by combining time series for resource monitoring and flame graphs for code profiling, focusing on the processes that make up the workload rather than the underlying instances on which they are executed. This shift from traditional system-based monitoring to a process-based analysis is relevant for new paradigms such as software containers or serverless computing, where the focus is put on applications and not on instances. BDWatchdog has been evaluated on a Big Data cloud-based service deployed at the CESGA supercomputing center. The experimental results show that a process-based analysis allows for a more effective visualization and overall improves the understanding of Big Data workloads. BDWatchdog is publicly available at http://bdwatchdog.dec.udc.es.
Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P
Ministerio de Educación; FPU15/0338
Real-time resource scaling platform for Big Data workloads on serverless environments
Final accepted version of: https://doi.org/10.1016/j.future.2019.11.037
This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article, Enes, J., Expósito, R. R., & Touriño, J. (2020), 'Real-time resource scaling platform for Big Data workloads on serverless environments', has been accepted for publication in Future Generation Computer Systems, 105, 361–379. The Version of Record is available online at: https://doi.org/10.1016/j.future.2019.11.037.
The serverless execution paradigm is becoming an increasingly popular option when workloads are to be deployed in an abstracted way, more specifically, without specifying any infrastructure requirements. Currently, such workloads typically consist of small programs or even a series of single functions used as event triggers or to process a data stream. Other applications that may also fit a serverless scenario are stateless services that may need to scale seamlessly in terms of resources, such as a web server. Although several commercial serverless services are available (e.g., Amazon Lambda), their use cases are mostly limited to the execution of functions or scripts that can be adapted to predefined templates or specifications. However, current research efforts suggest that the serverless paradigm should evolve beyond single functions and support more flexible infrastructure units such as operating-system-level virtualization in the form of containers. In this paper we present a novel platform to automatically scale container resources in real time, while they are running, and without any need for reboots. This platform is evaluated using Big Data workloads, both batch and streaming, as representative examples of applications that could initially be regarded as unsuitable for the serverless paradigm considering the currently available services.
The results show how our serverless platform can improve the CPU utilization by up to 77% with an execution time overhead of only 6%, while remaining scalable when using a 32-container cluster.
This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and FEDER funds of the European Union (project TIN2016-75845-P, AEI/FEDER/EU), the FPU Program of the Ministry of Education, Spain (grant FPU15/03381) and by Xunta de Galicia, Spain (Centro Singular de Investigación de Galicia accreditation 2016–2019, ref. ED431G/01). We also gratefully acknowledge CESGA for providing access to the Big Data infrastructure, and also sincerely thank Dr. Javier López Cacheiro for his technical support to perform some of the experiments. Other experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations.
Xunta de Galicia; ED431G/0
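The idea of resizing a running container's resources to raise CPU utilization can be illustrated with a small threshold rule. This is a sketch under stated assumptions, not the platform's actual scaling policy: the function name, the thresholds (`low`, `high`, `target`) and the limit bounds are all hypothetical, and the sketch only computes a new limit without applying it to any container runtime.

```python
def rescale_cpu_limit(usage, limit, low=0.5, high=0.9, target=0.75,
                      min_limit=0.5, max_limit=8.0):
    """Return a new CPU limit (in cores) for a running container.

    Assumption: utilization is defined as usage / limit. If it falls
    outside the [low, high] band, the limit is resized so that the
    current usage would sit at the 'target' utilization.
    """
    utilization = usage / limit
    if low <= utilization <= high:
        return limit  # inside the band: leave the container untouched
    new_limit = usage / target
    # Clamp to the allowed range so a container is never starved or unbounded
    return max(min_limit, min(max_limit, new_limit))


# An under-used container (1 core used out of 4 allocated) is scaled down
scaled_down = rescale_cpu_limit(usage=1.0, limit=4.0)
# A nearly saturated container (3.9 of 4 cores) is scaled up
scaled_up = rescale_cpu_limit(usage=3.9, limit=4.0)
```

Scaling down the first container raises its utilization from 25% to the 75% target, which is the kind of effect the utilization figure in the abstract refers to; the numbers here are illustrative only.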
Time series database in Industrial IoT and its testing tool
Abstract. Data gathering is at the core of the Industrial Internet of Things. The data is time and event based, so time series data is a key concept in the Industrial Internet of Things, and a dedicated time series database is required to process and store the data. Developing a solution and choosing the right time series database for it can be difficult. Inefficient comparison of time series databases can lead to wrong choices and consequently to delays and financial losses. This thesis improves the tools for comparing time series databases in the context of the Industrial Internet of Things. In addition, the thesis identifies the functional and non-functional requirements of a time series database in the Industrial Internet of Things and designs and implements a performance test bench. A practical example of how time series databases can be compared using the identified requirements and the developed test bench is also provided. The example is used to examine how selected time series databases fulfill these requirements.
Eight functional requirements and eight non-functional requirements were identified. Functional requirements included, e.g., aggregation support, information models, and hierarchical configurations. Non-functional requirements included, e.g., scalability, performance, and lifecycle. The developed test bench took the Industrial Internet of Things point of view by testing the database in three scenarios: write-heavy, read-heavy, and concurrent write and read operations. In the practical example, ABB’s cpmPlus History, InfluxDB, and TimescaleDB were evaluated.
Both the requirement evaluation and the performance testing showed that cpmPlus History performed best, InfluxDB second best, and TimescaleDB worst. cpmPlus History showed extensive support for the requirements and the best performance in all performance test cases. InfluxDB showed high performance for data writing, while TimescaleDB showed better performance for data reading.
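The three test scenarios named in the abstract above (write heavy, read heavy, and concurrent write and read) can be sketched as a tiny timing harness. Everything here is an assumption for illustration: `InMemoryStore` is a hypothetical stand-in for a real time series database client, and the harness is not the thesis's test bench.

```python
import threading
import time


class InMemoryStore:
    """Hypothetical stand-in for a time series database client."""

    def __init__(self):
        self._points = []
        self._lock = threading.Lock()

    def write(self, timestamp, value):
        with self._lock:
            self._points.append((timestamp, value))

    def read_range(self, start, end):
        with self._lock:
            return [p for p in self._points if start <= p[0] < end]


def run_writes(store, n):
    for i in range(n):
        store.write(i, float(i))


def run_reads(store, n, window=10):
    for i in range(n):
        store.read_range(i, i + window)


def timed(*workers):
    """Run each (function, args) pair in its own thread; return elapsed seconds."""
    threads = [threading.Thread(target=fn, args=args) for fn, args in workers]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


def run_scenarios(n=1000):
    store = InMemoryStore()
    results = {
        "write_heavy": timed((run_writes, (store, n))),
        "read_heavy": timed((run_reads, (store, n))),
        # Writers and readers hit the same store at the same time
        "concurrent": timed((run_writes, (store, n)), (run_reads, (store, n))),
    }
    return results, store
```

Calling `run_scenarios()` returns the elapsed time per scenario. To compare real databases, the store class would be replaced by thin wrappers around each database's client, keeping the workload functions unchanged.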
Time Series Management Systems: A Survey
The collection of time series data increases as more monitoring and
automation are being deployed. These deployments range in scale from an
Internet of things (IoT) device located in a household to enormous distributed
Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity.
To store and analyze these vast amounts of data, specialized Time Series
Management Systems (TSMSs) have been developed to overcome the limitations of
general purpose Database Management Systems (DBMSs) for time series
management. In this paper, we present a thorough analysis and classification of
TSMSs developed through academic or industrial research and documented through
publications. Our classification is organized into categories based on the
architectures observed during our analysis. In addition, we provide an overview
of each system with a focus on the motivational use case that drove the
development of the system, the functionality for storage and querying of time
series a system implements, the components the system is composed of, and the
capabilities of each system with regard to Stream Processing and Approximate
Query Processing (AQP). Last, we provide a summary of research directions
proposed by other researchers in the field and present our vision for a next
generation TSMS.
Comment: 20 Pages, 15 Figures, 2 Tables, Accepted for publication in IEEE TKD
Análise colaborativa de grandes conjuntos de séries temporais
The recent expansion of metrification on a daily basis has led to the production
of massive quantities of data, and in many cases, these collected metrics
are only useful for knowledge building when seen as a full sequence of
data ordered by time, which constitutes a time series. To find and interpret
meaningful behavioral patterns in time series, a multitude of analysis software
tools have been developed. Many of the existing solutions use annotations
to enable the curation of a knowledge base that is shared between a group
of researchers over a network. However, these tools also lack appropriate
mechanisms to handle a high number of concurrent requests and to properly
store massive data sets and ontologies, as well as suitable representations
for annotated data that are visually interpretable by humans and explorable by
automated systems. The goal of the work presented in this dissertation is to
iterate on existing time series analysis software and build a platform for the
collaborative analysis of massive time series data sets, leveraging state-of-the-art technologies for querying, storing and displaying time series and annotations.
A theoretical and domain-agnostic model was proposed to enable
the implementation of a distributed, extensible, secure and high-performance
architecture that handles various annotation proposals simultaneously and
avoids any data loss from overlapping contributions or unsanctioned changes.
Analysts can share annotation projects with peers, restricting a set of collaborators
to a smaller scope of analysis and to a limited catalog of annotation
semantics. Annotations can express meaning not only over a segment of time,
but also over a subset of the series that coexist in the same segment. A novel
visual encoding for annotations is proposed, where annotations are rendered
as arcs traced only over the affected series’ curves in order to reduce visual
clutter. Moreover, the implementation of a full-stack prototype with a reactive
web interface was described, directly following the proposed architectural and
visualization model while applied to the HVAC domain. The performance of
the prototype under different architectural approaches was benchmarked, and
the interface was tested for usability. Overall, the work described in this dissertation
contributes a more versatile, intuitive and scalable time series
annotation platform that streamlines the knowledge-discovery workflow.
Master's in Informatics Engineering
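The annotation model described in the abstract above, where an annotation carries meaning over a segment of time and over a subset of the series in that segment, can be sketched as a small data structure. The class, field names, and the HVAC series identifiers are hypothetical and chosen for illustration; this is not the dissertation's actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """An annotation scoped to a time segment and a subset of series."""
    label: str
    start: float          # inclusive start of the time segment
    end: float            # exclusive end of the time segment
    series: frozenset     # identifiers of the series the annotation applies to

    def covers(self, series_id, timestamp):
        # True only when both the series and the instant fall inside the scope
        return series_id in self.series and self.start <= timestamp < self.end


def annotations_at(annotations, series_id, timestamp):
    """Return the labels of all annotations that apply to one point."""
    return [a.label for a in annotations if a.covers(series_id, timestamp)]


# Hypothetical HVAC example: three series share the same time axis
annotations = [
    Annotation("compressor fault", 100.0, 200.0,
               frozenset({"supply_temp", "return_temp"})),
    Annotation("maintenance window", 150.0, 300.0,
               frozenset({"supply_temp", "return_temp", "fan_speed"})),
]

labels = annotations_at(annotations, "supply_temp", 160.0)
```

Because each annotation names its affected series explicitly, a renderer could draw it only over those series' curves, which matches the visual encoding the abstract describes.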