16 research outputs found

    Challenging SQL-on-Hadoop performance with Apache Druid

    In Big Data, SQL-on-Hadoop tools usually provide satisfactory performance for processing vast amounts of data, although new emerging tools may be an alternative. This paper evaluates whether Apache Druid, an innovative column-oriented data store suited for online analytical processing workloads, can serve as an alternative to some of the well-known SQL-on-Hadoop technologies, and assesses its potential in this role. In this evaluation, Druid, Hive and Presto are benchmarked with increasing data volumes. The results point to Druid as a strong alternative, achieving better performance than Hive and Presto, and show the potential of integrating Hive and Druid, enhancing the capabilities of both tools. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT - Fundação para a Ciência e Tecnologia within Project UID/CEC/00319/2013 and by European Structural and Investment Funds in the FEDER component, COMPETE 2020 (Funding Reference: POCI-01-0247-FEDER-002814)
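
    The paper benchmarks OLAP-style aggregations over growing data volumes on Druid, Hive and Presto. As a minimal sketch of how such a query can be issued to Druid, the snippet below posts an aggregation to Druid's SQL HTTP endpoint; the broker address, datasource and column names are illustrative assumptions, not the benchmark's actual workload.

```python
# Minimal sketch of an OLAP-style aggregation against Druid's SQL HTTP API.
# The broker URL, datasource name ("sales") and column names are assumptions,
# not taken from the paper's benchmark.
import requests

DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"  # default broker port

query = """
SELECT product_category,
       SUM(revenue) AS total_revenue,
       COUNT(*)     AS num_rows
FROM sales
WHERE __time >= TIMESTAMP '2019-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC
"""

response = requests.post(DRUID_SQL_URL, json={"query": query})
response.raise_for_status()
for row in response.json():
    print(row)
```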

    Intelligent event broker: a complex event processing system in big data contexts

    In Big Data contexts, many batch- and streaming-oriented technologies have emerged to deal with highly valuable sources of events, such as Internet of Things (IoT) platforms, the Web, and several types of databases, among others. The huge amount of heterogeneous data being constantly generated by a world of interconnected things, and the need for (semi-)automated decision-making processes through Complex Event Processing (CEP) and Machine Learning (ML), have raised the need for innovative architectures capable of processing events in a streamlined, scalable, analytical, and integrated way. This paper presents the Intelligent Event Broker, a CEP system built upon flexible and scalable Big Data techniques and technologies, highlighting its system architecture, software packages, and classes. A demonstration case in Bosch's Industry 4.0 context is presented, detailing how the system can be used to manage and improve the quality of the manufacturing process, showing its usefulness for solving real-world event-oriented problems. This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2019 and the Doctoral scholarship PD/BDE/135101/2017. This paper uses icons made by Freepik, from www.flaticon.com
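
    The abstract does not detail the broker's internals, so the snippet below is only a minimal sketch of the general pattern it builds on: a rule evaluated continuously over a stream of events, here with Spark Structured Streaming reading from Kafka. The topic name, event schema and threshold rule are illustrative assumptions, not the Intelligent Event Broker's actual components.

```python
# Minimal sketch of a rule-based event check over a Kafka stream with
# Spark Structured Streaming. Topic name, schema and the threshold rule
# are illustrative assumptions, not the Intelligent Event Broker's rules.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("cep-rule-sketch").getOrCreate()

event_schema = StructType([
    StructField("machine_id", StringType()),
    StructField("temperature", DoubleType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "shopfloor-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Rule: flag machines reporting temperatures above an (assumed) safety limit.
alerts = events.filter(F.col("temperature") > 90.0)

query = (alerts.writeStream
         .format("console")  # a real deployment would feed downstream consumers instead
         .outputMode("append")
         .start())
query.awaitTermination()
```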

    Advancing logistics 4.0 with the implementation of a big data warehouse: a demonstration case for the automotive industry

    The constant advancements in Information Technology have been the main driver of the Big Data concept's success. With it, new concepts such as Industry 4.0 and Logistics 4.0 are arising. Due to the increase in data volume, velocity, and variety, organizations are now looking at their data analytics infrastructures and searching for ways to improve their decision-making capabilities and enhance their results using approaches such as Big Data and Machine Learning. The implementation of a Big Data Warehouse can be the first step to improve an organization's data analysis infrastructure and start retrieving value from the use of Big Data technologies. Moving to Big Data technologies can provide several opportunities for organizations, such as the capability of analyzing an enormous quantity of data from different data sources in an efficient way. However, at the same time, different challenges can arise, including data quality, data management, and lack of knowledge within the organization, among others. In this work, we propose an approach that can be adopted in the logistics department of any organization in order to promote the Logistics 4.0 movement, while highlighting the main challenges and opportunities associated with the development and implementation of a Big Data Warehouse in a real demonstration case at a multinational automotive organization. This work was supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020 and doctoral scholarship grants: PD/BDE/142895/2018 and PD/BDE/142900/2018

    Supply chain hybrid simulation: From Big Data to distributions and approaches comparison

    The uncertainty and variability of Supply Chains pave the way for simulation to be employed to mitigate such risks. Due to the amounts of data generated by the systems used to manage relevant Supply Chain processes, it is widely recognized that Big Data technologies may bring benefits to Supply Chain simulation models. Nevertheless, a simulation model should also consider statistical distributions, which allow it to be used for purposes such as testing risk scenarios or for prediction. However, when Supply Chains are complex and of huge scale, performing distribution fitting may not be feasible, which often results in users focusing on subsets of problems or selecting samples of elements, such as suppliers or materials. This paper proposes a hybrid simulation model that runs using data stored in a Big Data Warehouse, statistical distributions, or a combination of both approaches. The results show that the former approach brings benefits to the simulations and is essential when setting the model to run based on statistical distributions. Furthermore, this paper also compares these approaches, emphasizing the pros and cons of each, as well as their differences in computational requirements, hence establishing a milestone for future research in this domain. This work has been supported by national funds through FCT - Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2019 and by the Doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH)
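
    A minimal sketch of the hybrid idea described above: a simulation step that can draw a supplier lead time either from historical records (as they would come from the Big Data Warehouse) or from a fitted statistical distribution. The data and distribution parameters are made up for illustration.

```python
# Minimal sketch of the hybrid approach: one simulation step samples a supplier
# lead time either from historical records (stand-in for Big Data Warehouse data)
# or from a fitted statistical distribution. All values are made up.
import random

historical_lead_times = [4.2, 5.1, 3.8, 6.0, 4.9]  # days, stand-in for warehouse data
fitted_distribution = {"mean": 4.8, "sigma": 0.7}   # stand-in for a distribution-fitting step

def sample_lead_time(mode: str) -> float:
    """Return one lead-time sample according to the chosen simulation mode."""
    if mode == "historical":
        # Resample directly from observed data (empirical distribution).
        return random.choice(historical_lead_times)
    if mode == "fitted":
        # Draw from the fitted distribution instead, e.g. for risk scenarios.
        return random.gauss(fitted_distribution["mean"], fitted_distribution["sigma"])
    raise ValueError(f"unknown mode: {mode}")

for mode in ("historical", "fitted"):
    samples = [sample_lead_time(mode) for _ in range(3)]
    print(mode, [round(s, 2) for s in samples])
```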

    On the use of simulation as a Big Data semantic validator for supply chain management

    Simulation stands out as an appropriate method for the Supply Chain Management (SCM) field. Nevertheless, to produce accurate simulations of Supply Chains (SCs), several business processes must be considered. Thus, when using real data in these simulation models, Big Data concepts and technologies become necessary, as the involved data sources generate data at increasing volume, velocity and variety, in what is known as a Big Data context. While developing such a solution, several data issues were found, with simulation proving to be more efficient than traditional data profiling techniques in identifying them. Thus, this paper proposes the use of simulation as a semantic validator of the data, proposes a classification for such issues, and quantifies their impact on the volume of data used in the final solution. This paper concludes that, while SC simulations using Big Data concepts and technologies are within the grasp of organizations, their data models still require considerable improvements in order to produce perfect mimics of their SCs. In fact, it was also found that simulation can help in identifying and bypassing some of these issues. This work has been supported by FCT (Fundação para a Ciência e Tecnologia) within the Project Scope: UID/CEC/00319/2019 and by the Doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH)
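
    A minimal sketch of the kind of semantic check a simulation run can perform while replaying supply chain records, which is the idea the paper builds on; the field names and the example rules are assumptions, not the paper's actual issue classification.

```python
# Minimal sketch of semantic validation while replaying supply-chain records:
# each record is checked against domain rules before it drives the simulation.
# Field names and the example rules are assumptions for illustration.
known_materials = {"M-001", "M-002"}

records = [
    {"order_id": 1, "material": "M-001", "quantity": 120, "ship_date": "2020-03-02", "due_date": "2020-03-10"},
    {"order_id": 2, "material": "M-999", "quantity": -5,  "ship_date": "2020-03-12", "due_date": "2020-03-08"},
]

def semantic_issues(record: dict) -> list:
    """Classify the semantic issues found in one record."""
    issues = []
    if record["material"] not in known_materials:
        issues.append("unknown material reference")
    if record["quantity"] <= 0:
        issues.append("non-positive quantity")
    if record["ship_date"] > record["due_date"]:  # ISO dates compare correctly as strings
        issues.append("shipment after due date")
    return issues

for rec in records:
    found = semantic_issues(rec)
    if found:
        print(f"order {rec['order_id']}: skipped ({', '.join(found)})")
    else:
        print(f"order {rec['order_id']}: fed into the simulation")
```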

    Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

    Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Several studies have been conducted to understand ways of optimizing the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is used as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying the efficiency of those strategies in query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Defining the partitions aligned with the attributes that are frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were verified. The same is not verified with the use of bucketing strategies, which show potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely bucketing two tables by their join attribute. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 002814; Funding Reference: POCI-01-0247-FEDER-002814]
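
    A minimal HiveQL sketch of the two data organization strategies evaluated above, issued through a Hive-enabled SparkSession: partitioning by attributes that are frequently used in query filters, and bucketing a table on its join attribute. Table and column names are illustrative assumptions, not the benchmark's schema.

```python
# Minimal sketch of the two data organization strategies discussed above,
# expressed as HiveQL issued through a Hive-enabled SparkSession.
# Table and column names are assumptions, not the benchmark's actual schema.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Partitioning: align partitions with attributes frequently used in query filters.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (
        order_id BIGINT,
        customer_id BIGINT,
        revenue DOUBLE
    )
    PARTITIONED BY (order_year INT, order_month INT)
    STORED AS ORC
""")

# Bucketing: bucket a table on the join attribute so joins can be co-located.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers_bucketed (
        customer_id BIGINT,
        segment STRING
    )
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")
```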

    Apache Kudu: advantages and disadvantages in the analysis of vast amounts of data

    Integrated master's dissertation in Engineering and Management of Information Systems. Over the last few years we have witnessed an exponential increase in the amount of data produced. This increase is mainly due to the widespread use of sensors, as well as the mass use of social networks and mobile devices that continuously collect data of different types and contexts. The processing and analysis of these data by organizations translates into an undeniable competitive advantage in increasingly competitive markets. For this reason, the study and development of new tools for exploring these data has attracted the attention of organizations and of the scientific community, since traditional techniques and technologies have proven unable to deal with data of this nature. In this context, the term Big Data emerged, used to describe data of great volume, with different degrees of complexity, and sometimes unstructured or disorganized. Associated with Big Data, new data repositories with their own logical models have appeared, known as NoSQL databases, which replace databases based on the relational paradigm. These repositories are integrated into Hadoop, a vast ecosystem of tools and technologies for handling this type of data. In this context, this dissertation aims to study one of the many tools associated with the Hadoop project, Kudu. This new tool, with a hybrid architecture, promises to fill the gap between sequential data access tools and random data access tools, thereby simplifying the complex architecture that the combined use of these two types of systems implies. To fulfill the objectives of the dissertation, performance tests were carried out with two different data models, over Kudu and other tools highlighted in the literature, to allow the comparison of results
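
    As a loose illustration of how Kudu is commonly queried alongside Hadoop tools, the sketch below reads a Kudu table from Spark through the kudu-spark connector (which must be on the Spark classpath); the master address, table name and columns are assumptions, not the dissertation's benchmark setup.

```python
# Minimal sketch of reading a Kudu table from Spark via the kudu-spark connector.
# Requires the connector package on the Spark classpath; the master address,
# table name and columns are assumptions, not the dissertation's benchmark setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-read-sketch").getOrCreate()

measurements = (spark.read
                .format("org.apache.kudu.spark.kudu")
                .option("kudu.master", "kudu-master:7051")
                .option("kudu.table", "sensor_measurements")
                .load())

# Kudu supports fast random access by primary key as well as scans like this one.
measurements.filter("value > 100").groupBy("sensor_id").count().show()
```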

    A Big Data perspective on Cyber-Physical Systems for Industry 4.0: modernizing and scaling complex event processing

    Doctoral program in Advanced Engineering Systems for Industry. Nowadays, the whole industry strives to find more productive ways of working, and it is already understood that using the data being produced inside and outside the factories is a way to improve business performance. A set of modern technologies combined with sensor-based communication creates the possibility of acting according to business needs precisely at the moment when the data is being produced and processed. Considering the diversity of processes existing in a factory, all of them producing data, Complex Event Processing (CEP) with the capability to process that amount of data is needed in the daily work of a factory, to process different types of events and find patterns between them. Although the integration of the Big Data and Complex Event Processing topics is already present in the literature, open challenges in this area were identified, hence the contribution presented in this thesis. Thereby, this doctoral thesis proposes a system architecture that integrates the CEP concept with a rule-based approach in the Big Data context: the Intelligent Event Broker (IEB). This architecture proposes the use of adequate Big Data technologies in its several components. At the same time, some of the gaps identified in this area were addressed, complementing Event Processing with the possibility of using Machine Learning models that can be integrated in the rules' verification, and also proposing an innovative monitoring system with an immersive visualization component to monitor the IEB and prevent its uncontrolled growth, since there are always several processes inside a factory that can be integrated in the system. The proposed architecture was validated with a demonstration case using, as an example, Bosch's Active Lot Release system. This demonstration case revealed that it is feasible to implement the proposed architecture and proved the adequate functioning of the IEB system in processing Bosch's business process data and in monitoring its components and the events flowing through them. This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020, the Doctoral scholarship PD/BDE/135101/2017 and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 039479; Funding Reference: POCI-01-0247-FEDER-039479]
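
    A minimal sketch of one idea highlighted above, consulting a Machine Learning model during the verification of a rule; the model, features and thresholds are illustrative assumptions, not the IEB's actual rules or models.

```python
# Minimal sketch of combining a rule with a Machine Learning model during event
# verification. The model, features and thresholds are illustrative assumptions,
# not the Intelligent Event Broker's actual components.
from sklearn.linear_model import LogisticRegression
import numpy as np

# Stand-in training data: [temperature, vibration] -> defect (1) or not (0).
X_train = np.array([[70, 0.2], [72, 0.3], [95, 0.9], [98, 0.8]])
y_train = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

def verify_event(event: dict) -> bool:
    """Rule: trigger an action if a hard limit is exceeded or the model predicts a defect."""
    hard_limit_violated = event["temperature"] > 100
    defect_probability = model.predict_proba([[event["temperature"], event["vibration"]]])[0][1]
    return hard_limit_violated or defect_probability > 0.8

print(verify_event({"temperature": 96, "vibration": 0.85}))
```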

    Evaluating several design patterns and trends in big data warehousing systems

    CAiSE 2018. The Big Data characteristics, namely volume, variety and velocity, currently highlight the severe limitations of traditional Data Warehouses (DWs). Their strict relational model, costly scalability and, sometimes, inefficient performance open the way for emerging techniques and technologies. Recently, the concept of Big Data Warehousing has been gaining traction, aiming to study and propose new ways of dealing with the Big Data challenges in Data Warehousing contexts. The Big Data Warehouse (BDW) can be seen as a flexible, scalable and highly performant system that uses Big Data techniques and technologies to support mixed and complex analytical workloads (e.g., streaming analysis, ad hoc querying, data visualization, data mining, simulations) in several emerging contexts like Smart Cities and Industry 4.0. However, due to the almost embryonic state of this topic, the ambiguity of the constructs and the lack of common approaches still prevail. In this paper, we discuss and evaluate some design patterns and trends in Big Data Warehousing systems, including data modelling techniques (e.g., star schemas, flat tables, nested structures) and some streaming considerations for BDWs (e.g., Hive vs. NoSQL databases), aiming to foster and align future research and to help practitioners in this area. FCT - Fundação para a Ciência e a Tecnologia (UID/CEC/00319/2013)
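
    A minimal sketch contrasting two of the modelling techniques evaluated in the paper: the same analysis expressed over a star schema (fact table joined to a dimension) and over a denormalized flat table. The table and column names are assumptions, and the tables are presumed to already exist in the metastore.

```python
# Minimal sketch contrasting two modelling techniques for a Big Data Warehouse:
# a star-schema query (fact joined to a dimension) versus the same analysis on a
# denormalized flat table. Table and column names are illustrative assumptions,
# and the tables are assumed to already exist in the metastore.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdw-modelling-sketch").enableHiveSupport().getOrCreate()

# Star schema: the query pays a join between fact and dimension at read time.
star_query = """
    SELECT d.region, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_key = d.customer_key
    GROUP BY d.region
"""

# Flat table: dimension attributes are embedded in the fact rows, avoiding the join.
flat_query = """
    SELECT customer_region AS region, SUM(revenue) AS revenue
    FROM flat_sales
    GROUP BY customer_region
"""

for sql in (star_query, flat_query):
    spark.sql(sql).show()
```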

    A Big Data approach to mobility data in public transport

    The need to store, process and analyse data is an increasingly present reality in companies, where business decisions depend heavily on digital platforms. The Data Warehouse concept was introduced to facilitate and improve the process of collecting essential business indicators. The Big Data concept came with the increase in the variety and volume of data for analysis purposes, and technologies were developed to face the challenges it imposes. The digital transformation in the registration of entrances and exits in public transport leads to large volumes of data that can be used for business analysis in this area [1]. This project aims to assemble a set of Big Data technologies and evaluate their storage capacity, the approach for building the ETL processes, and the performance in answering a set of queries as the volume of mobility data grows, referring to boardings on the buses of the public transport company Horários do Funchal. The project includes a literature review on the concepts of Data Warehouse, dimensional models and Big Data, as well as on existing technologies and work related to Big Data manipulation. The application of these technologies in public transport was also subject to a state-of-the-art analysis. The results reveal that some platforms adapt well to an increase in data volume, with good performance both in the execution of ETL processes and in the execution of queries, in comparison to other technologies whose results are impractical in this type of study
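
    A minimal sketch of the kind of ETL step and query the project benchmarks: raw boarding records loaded, enriched with time attributes, written partitioned by day, and then aggregated. Paths, schema and the partitioning choice are assumptions, not the project's actual pipeline.

```python
# Minimal sketch of one ETL step of the kind benchmarked above: raw bus boarding
# records are loaded, enriched with time attributes, and written partitioned by
# day for later querying. Paths, schema and partitioning choice are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mobility-etl-sketch").getOrCreate()

boardings = (spark.read
             .option("header", True)
             .csv("/data/raw/boardings.csv"))  # assumed raw export: line_id, stop_id, timestamp

enriched = (boardings
            .withColumn("event_time", F.to_timestamp("timestamp"))
            .withColumn("event_date", F.to_date("event_time"))
            .withColumn("hour", F.hour("event_time")))

(enriched.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("/data/warehouse/boardings"))

# Example analytical query: boardings per line and hour.
enriched.groupBy("line_id", "hour").count().show()
```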