
    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization. Comment: SIGMOD'1
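The rule-based plan rewriting that optimizers like Calcite perform can be illustrated with a toy example. This is not Calcite's actual Java API; the plan node classes and the single predicate-pushdown rule below are invented for illustration only:

```python
# Toy rule-based plan rewriting: push a Filter below a Join when the
# filtered column comes from only one join input (predicate pushdown).
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple

@dataclass(frozen=True)
class Filter:
    predicate_column: str
    child: object

@dataclass(frozen=True)
class Join:
    left: object
    right: object

def columns_of(node):
    """Collect the output columns of a plan subtree."""
    if isinstance(node, Scan):
        return set(node.columns)
    if isinstance(node, Filter):
        return columns_of(node.child)
    return columns_of(node.left) | columns_of(node.right)

def push_filter_below_join(node):
    """One rewrite rule: Filter(Join(l, r)) -> Join(Filter(l), r) when legal."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        if node.predicate_column in columns_of(join.left):
            return Join(Filter(node.predicate_column, join.left), join.right)
        if node.predicate_column in columns_of(join.right):
            return Join(join.left, Filter(node.predicate_column, join.right))
    return node  # rule does not apply; plan unchanged

plan = Filter("age", Join(Scan("users", ("id", "age")),
                          Scan("orders", ("id", "total"))))
optimized = push_filter_below_join(plan)
print(optimized)
```

A real optimizer such as Calcite applies hundreds of such rules, driven by cost estimates, until the plan reaches a fixed point or a budget is exhausted.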

    Open Source Platforms for Big Data Analytics

    The concept of Big Data has had a great impact in the field of technology, particularly in the management and analysis of huge volumes of information. Organizations now look to Big Data as an opportunity to manage and explore their data as fully as they can, with the objective of supporting decisions within their different operational areas. It is therefore necessary to analyse several concepts around Big Data and Big Data Analytics, including definitions, features, advantages and challenges. Business Intelligence (BI) tools, together with the generation of knowledge, are fundamental concepts for decision-making and the transformation of information. By investigating today's Big Data platforms, current industrial practices and related trends in the research world, it is possible to understand the impact of Big Data Analytics on small organizations. This research proposes solutions for micro, small and medium enterprises (SMEs), which have a great impact on the Portuguese economy since they represent approximately 90% of the companies in Portugal. Open-source platforms for Big Data Analytics offer a great opportunity for innovation in SMEs. This work presents a comparative analysis of those platforms' features and functionalities, and the steps taken for a deeper comparative analysis.
After the comparative analysis, we present an evaluation and selection of Big Data Analytics (BDA) platforms using and adapting the Qualification and Selection of Open Source software (QSOS) method. This evaluation resulted in the selection of two platforms for the empirical experiments. The same testbed and dataset were used on both open-source BDA platforms. In the comparison of the two platforms, HPCC Systems Platform proved more efficient and reliable than Hortonworks Data Platform. Portuguese SMEs in particular should consider BDA platforms as an opportunity to obtain competitive advantage, improve their processes and, consequently, define an IT and business strategy. Finally, this is a research work on Big Data; it is hoped that it will serve as an invitation and motivation for further research.
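At its core, a QSOS-style evaluation reduces to scoring each platform against weighted criteria and ranking the totals. A minimal sketch follows; the criteria, weights, and scores are invented for illustration, and the real QSOS grids are far more detailed:

```python
# Simplified QSOS-style weighted scoring: each criterion is scored 0-2,
# multiplied by a user-chosen weight, and summed per platform.
weights = {"maturity": 3, "adoption": 2, "documentation": 1, "performance": 3}

scores = {
    "HPCC Systems": {"maturity": 2, "adoption": 1, "documentation": 2, "performance": 2},
    "Hortonworks":  {"maturity": 2, "adoption": 2, "documentation": 1, "performance": 1},
}

def weighted_total(platform_scores):
    """Sum of score * weight over all criteria."""
    return sum(weights[criterion] * score
               for criterion, score in platform_scores.items())

totals = {name: weighted_total(s) for name, s in scores.items()}
best = max(totals, key=totals.get)
print(totals, best)
```

The adaptation described in the abstract amounts to choosing the criteria and weights that matter for SMEs before running such a comparison.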

    Benchmarking Distributed Stream Data Processing Systems

    The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear lack of detailed analyses of the systems' performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate in detail the performance of three widely used SDPSs: Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test from the driver, in order to correctly represent the open-world model of typical stream processing deployments and therefore measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use cases of each system. Comment: Published at ICDE 201
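One reasonable reading of a latency definition for stateful (windowed) operators can be sketched as follows: the latency of a window's result is measured from the newest event the result depends on to the moment the result is emitted. The window size, timestamps, and emit time below are invented for illustration:

```python
# Toy event-time tumbling windows and a latency measure for their results.
from collections import defaultdict

WINDOW = 10  # tumbling window size, in abstract time units

# Assign each event timestamp to the start of its tumbling window.
windows = defaultdict(list)
for event_time in [1, 4, 9, 12, 15, 23]:
    windows[(event_time // WINDOW) * WINDOW].append(event_time)

def window_latency(events_in_window, emit_time):
    """Latency of a window result relative to the newest contributing event."""
    return emit_time - max(events_in_window)

# Suppose the [0, 10) window's aggregate is emitted at time 11:
lat = window_latency(windows[0], emit_time=11)
print(sorted(windows), lat)
```

Measuring latency this way, rather than per individual record, is what makes the definition meaningful for operators whose output depends on many buffered events.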

    Design and evaluation of a cloud native data analysis pipeline for cyber physical production systems

    Since the birth of the World Wide Web in 1991, the rate of data growth has kept increasing, reaching record levels in the last couple of years. Big companies tackled this data growth with expensive and enormous data centres built to process the data and extract value from it. Between social media, the Internet of Things (IoT), new business processes, monitoring and multimedia, the capacity of those data centres started to become a problem and required continuous and expensive expansion. Thus, Big Data was something that only a few could access. This changed fast when Amazon launched Amazon Web Services (AWS) around 15 years ago, giving rise to the public cloud. At that time the capabilities were still new and limited, but 10 years later the cloud was a whole new business that changed the Big Data landscape forever. This not only commoditised computing power but came with a pricing model that gave medium and small players access to it. In consequence, new problems arose regarding the nature of these distributed systems and the software architectures required for proper data processing. The present work analyses the typical types of Big Data workloads and proposes an architecture for a cloud native data analysis pipeline. Lastly, it provides a chapter on tools and services that can be used in the architecture, taking advantage of their open-source nature and of cloud pricing models.
    Fil: Ferrer Daub, Facundo Javier. Universidad Católica de Córdoba. Instituto de Ciencias de la Administración; Argentina

    Data Systems Fault Coping for Real-time Big Data Analytics Required Architectural Crucibles

    This paper analyzes the properties and characteristics of unknown and unexpected faults introduced into information systems while processing Big Data in real time. The authors hypothesize that there are new classes of faults and new requirements for fault handling, and propose an analytic model and architectural framework to assess and manage the faults, mitigate the risks of correlating or integrating otherwise uncorrelated Big Data, and ensure the source pedigree, quality, set integrity, freshness, and validity of the data being consumed. We argue that new architectures, methods, and tools for handling and analyzing real-time Big Data systems must address and mitigate faults arising from real-time streaming processes, while ensuring that variables such as synchronization, redundancy, and latency are accounted for. This paper concludes that, with improved designs, real-time Big Data systems may continuously deliver the value and benefits of streaming Big Data.
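One concrete instance of the kind of check the paper calls for is validating the freshness of streaming records before they are correlated with other sources. A minimal sketch, with a hypothetical record shape and threshold:

```python
# Freshness validation for streaming records: reject records that are
# too old to be safely correlated, or that claim a future timestamp.
import time

MAX_AGE_SECONDS = 60.0  # hypothetical freshness bound

def is_fresh(record, now=None):
    """True if the record's timestamp is within the freshness bound."""
    now = time.time() if now is None else now
    age = now - record["timestamp"]
    return 0 <= age <= MAX_AGE_SECONDS

now = 1_000_000.0
fresh = {"timestamp": now - 5.0, "value": 42}
stale = {"timestamp": now - 300.0, "value": 7}
print(is_fresh(fresh, now), is_fresh(stale, now))
```

In a real pipeline, rejected records would be routed to a dead-letter path rather than silently dropped, so the fault itself remains observable.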

    Integration of sensor data and management of smart environments

    Master's dissertation in Computer and Telematics Engineering. In a world of constant technological development and accelerated population growth, an increased use of energy resources is being observed. With buildings responsible for a large share of this energy consumption, many research efforts pursue the goal of creating energy-efficient buildings and smart spaces. This dissertation aims, in a first stage, to present a review of the current solutions combining Building Automation Systems (BAS) and the Internet of Things (IoT). Then, a solution for building automation is presented, based on IoT principles and exploiting the advantages of Complex Event Processing (CEP) systems, to provide tighter integration of the multiple subsystems existing in a building. This solution was validated through an implementation based on standard lightweight protocols designed for IoT, high-performance real-time platforms, and complex methods for the analysis of large streams of data. The implementation is also applied to a real-world scenario, and will be used as the standard solution for management and automation of an existing building.
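The Complex Event Processing idea underlying such a building-automation solution can be sketched with a single toy rule that fires when a pattern holds over a sliding window of sensor events. The event shape, threshold, and window length below are hypothetical, not taken from the dissertation:

```python
# Minimal CEP-style rule: fire when N consecutive temperature readings
# all exceed a limit (e.g., to trigger ventilation in a smart building).
from collections import deque

class ThresholdRule:
    """Fires when `count` consecutive readings exceed `limit`."""
    def __init__(self, limit, count):
        self.limit = limit
        self.window = deque(maxlen=count)  # sliding window of readings

    def on_event(self, reading):
        self.window.append(reading)
        return (len(self.window) == self.window.maxlen
                and all(r > self.limit for r in self.window))

rule = ThresholdRule(limit=30.0, count=3)
fired = [rule.on_event(t) for t in [29.0, 31.0, 32.0, 33.0, 34.0]]
print(fired)
```

Production CEP engines generalize this to declarative patterns (sequences, absences, joins across event types) evaluated continuously over many streams.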