11 research outputs found

    Join operation for semantic data enrichment of asynchronous time series data

    In this paper, we present a novel framework for enriching time series data in smart cities by supplementing it with information from external sources via semantic data enrichment. Our methodology effectively merges multiple data sources into a uniform time series, while addressing difficulties such as data quality, contextual information, and time lapses. We demonstrate the efficacy of our method through a case study in Barcelona, which permitted the use of advanced analysis methods such as windowed cross-correlation and peak picking. The resulting time series data can be used to determine traffic patterns and has potential uses in other smart city sectors, such as air quality, energy efficiency, and public safety. Interactive dashboards enable stakeholders to visualize and summarize key insights and patterns.
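
    The core join step described above can be sketched with an as-of merge, which aligns each observation of one series with the most recent observation of another within a time tolerance. This is a minimal illustration using hypothetical traffic and air-quality readings, not the paper's actual pipeline:

```python
import pandas as pd

# Two asynchronous series sampled at different, irregular times
# (hypothetical traffic and air-quality sensors).
traffic = pd.DataFrame({
    "ts": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:07", "2021-01-01 00:16"]),
    "vehicles": [120, 135, 128],
})
air = pd.DataFrame({
    "ts": pd.to_datetime(["2021-01-01 00:02", "2021-01-01 00:11", "2021-01-01 00:15"]),
    "no2": [41.0, 44.5, 43.2],
})

# As-of join: attach the most recent air-quality reading to each
# traffic observation, within a 10-minute tolerance. Rows with no
# earlier reading inside the window get NaN.
merged = pd.merge_asof(traffic, air, on="ts",
                       tolerance=pd.Timedelta("10min"), direction="backward")
print(merged)
```

    Once the sources are merged into one uniform frame, analyses such as windowed cross-correlation and peak picking can be applied to the aligned columns.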

    Supply chain hybrid simulation: From Big Data to distributions and approaches comparison

    The uncertainty and variability of Supply Chains pave the way for simulation to be employed to mitigate such risks. Given the amounts of data generated by the systems used to manage relevant Supply Chain processes, it is widely recognized that Big Data technologies may bring benefits to Supply Chain simulation models. Nevertheless, a simulation model should also consider statistical distributions, which allow it to be used for purposes such as testing risk scenarios or prediction. However, when Supply Chains are complex and of huge scale, performing distribution fitting may not be feasible, which often results in users focusing on subsets of problems or selecting samples of elements, such as suppliers or materials. This paper proposes a hybrid simulation model that can run using data stored in a Big Data Warehouse, statistical distributions, or a combination of both. The results show that the data-driven approach brings benefits to the simulations and is essential when setting up the model to run based on statistical distributions. Furthermore, this paper also compares these approaches, emphasizing the pros and cons of each, as well as their differences in computational requirements, thus establishing a milestone for future research in this domain. This work has been supported by national funds through FCT - Fundação para a Ciência e Tecnologia within the Project Scope UID/CEC/00319/2019 and by the doctoral scholarship PDE/BDE/114566/2016, funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH).
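
    The two input modes the paper compares can be sketched as follows: replaying records pulled from the warehouse versus fitting a statistical distribution and sampling from the fitted model. The lead-time data below is synthetic and purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical historical supplier lead times (days), standing in for
# records pulled from a Big Data Warehouse table.
lead_times = rng.lognormal(mean=1.5, sigma=0.4, size=5000)

# Approach 1: data-driven replay -- sample the raw records directly.
replayed = rng.choice(lead_times, size=10)

# Approach 2: distribution fitting -- fit a lognormal and sample from
# the fitted model (usable for risk scenarios and prediction, even
# for elements with little historical data).
shape, loc, scale = stats.lognorm.fit(lead_times, floc=0)
fitted_draws = stats.lognorm.rvs(shape, loc=loc, scale=scale, size=10,
                                 random_state=rng)
print(shape, scale)  # shape ~ sigma of the logs, scale ~ exp(mean of the logs)
```

    The trade-off the paper benchmarks is visible even here: replay needs the full dataset in memory at simulation time, while the fitted distribution compresses it into a few parameters.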

    On the use of simulation as a Big Data semantic validator for supply chain management

    Simulation stands out as an appropriate method for the Supply Chain Management (SCM) field. Nevertheless, to produce accurate simulations of Supply Chains (SCs), several business processes must be considered. Thus, when using real data in these simulation models, Big Data concepts and technologies become necessary, as the involved data sources generate data at increasing volume, velocity, and variety, in what is known as a Big Data context. While developing such a solution, several data issues were found, with simulation proving more efficient than traditional data profiling techniques in identifying them. Thus, this paper proposes the use of simulation as a semantic validator of the data, proposes a classification for such issues, and quantifies their impact on the volume of data used in the final solution. The paper concludes that, while SC simulations using Big Data concepts and technologies are within the grasp of organizations, their data models still require considerable improvements in order to produce faithful mimics of their SCs. In fact, it was also found that simulation can help in identifying and bypassing some of these issues. This work has been supported by FCT (Fundação para a Ciência e Tecnologia) within the Project Scope UID/CEC/00319/2019 and by the doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH).
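
    The idea of simulation as a semantic validator can be illustrated with a toy inventory model: replaying movement records and flagging any event that drives stock negative, an inconsistency that per-column profiling (types, ranges, nulls) would not reveal. The records and the invariant below are hypothetical, not the paper's actual model:

```python
# Toy "semantic validation by simulation": replay movement records
# through a minimal inventory model and flag events that make stock
# negative. Each flagged event is a candidate data issue.
from collections import defaultdict

movements = [
    ("mat-A", +100), ("mat-A", -30), ("mat-A", -90),  # third one over-draws
    ("mat-B", -5),                                    # consumes before any receipt
    ("mat-B", +50),
]

def validate_by_simulation(events):
    stock = defaultdict(int)
    issues = []
    for i, (material, qty) in enumerate(events):
        stock[material] += qty
        if stock[material] < 0:
            issues.append((i, material, stock[material]))
            stock[material] = 0  # bypass the issue and keep simulating
    return issues

issues = validate_by_simulation(movements)
print(issues)  # each tuple: (event index, material, would-be negative stock)
```

    Profiling each column in isolation would pass every record here; only replaying them against the process semantics exposes the two inconsistencies.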

    Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

    Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions, and buckets, stored on top of an unstructured distributed file system like HDFS. Some studies have been conducted to understand ways of optimizing the performance of several storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when using Hive as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying their efficiency in query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Defining the partitions aligned with the attributes that are frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time. In the most intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were observed. The same was not verified with the use of bucketing strategies, which show potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely bucketing two tables by their join attribute. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT - Fundação para a Ciência e Tecnologia within the Project Scope UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 002814; Funding Reference: POCI-01-0247-FEDER-002814].
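
    The effect of aligning partitions with frequently filtered attributes can be sketched with a toy pure-Python model (not actual HiveQL): Hive lays each partition out as its own directory, so a query that filters on the partition key only has to read the files of the matching partition.

```python
# Toy illustration of Hive-style partition pruning: rows are packed
# into "files" of two rows each; partitioning by 'year' lets a query
# filtering on year skip every file in non-matching partitions.

rows = [{"year": y, "amount": float(i)} for y in (2015, 2016, 2017) for i in range(4)]

def to_files(rs, per_file=2):
    return [rs[i:i + per_file] for i in range(0, len(rs), per_file)]

# Unpartitioned layout: one flat list of files; every query reads all of them.
flat_files = to_files(rows)

# Partitioned layout: files grouped under their partition key (like
# HDFS directories .../year=2015/, .../year=2016/, ...).
partitions = {}
for r in rows:
    partitions.setdefault(r["year"], []).append(r)
part_files = {y: to_files(rs) for y, rs in partitions.items()}

def files_scanned(year):
    full = len(flat_files)                   # no pruning: scan everything
    pruned = len(part_files.get(year, []))   # scan only the matching partition
    return full, pruned

full, pruned = files_scanned(2017)
print(full, pruned)
```

    The same logic explains why bucketing helps mainly when two tables are bucketed on their join attribute: matching buckets can then be joined pairwise instead of shuffling whole tables.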

    Big Data em cidades inteligentes: um mapeamento sistemático

    The concept of Smart Cities has gained increasing attention in academic, industrial, and governmental circles. As a city develops over time, components and subsystems such as smart grids, smart water management, smart traffic and transportation systems, smart waste management systems, smart security systems, or e-governance are added. These components ingest and generate large amounts of structured, semi-structured, or unstructured data that can be processed using a variety of algorithms in batches, micro-batches, or in real time, aiming to improve the quality of life of citizens. This secondary research aims to facilitate the identification of gaps in this field, as well as to align the work of researchers with that of others in order to develop stronger research themes. In this study, the formal systematic mapping research methodology is used to provide a comprehensive review of Big Data technologies in smart city deployments.

    IDEAS-1997-2021-Final-Programs

    This document records the final program for each of the 26 meetings of the International Database Engineered Applications Symposium (IDEAS) from 1997 through 2021. These meetings were organized in various locations on three continents. Most of the papers published during these years are available in the digital libraries of the IEEE (1997-2007) or the ACM (2008-2021).

    Apache Kudu: vantagens e desvantagens na análise de vastas quantidades de dados

    Integrated master's dissertation in Engineering and Management of Information Systems (Engenharia e Gestão de Sistemas de Informação). Over the last few years we have witnessed an exponential increase in the amount of data produced. This increase is mainly due to the widespread use of sensors, as well as the mass adoption of social networks and mobile devices that continuously collect data of different types and contexts. The processing and analysis of these data gives organizations an undeniable competitive advantage in increasingly demanding markets. For this reason, the study and development of new tools for exploring these data has attracted the attention of organizations and of the scientific community, since traditional techniques and technologies have proven unable to deal with data of this nature. In this context, the term Big Data emerged to describe data of great volume and varying degrees of complexity, sometimes unstructured or lacking a predefined data model. Associated with Big Data, new data repositories with their own logical models, known as NoSQL databases, have arisen to replace databases based on the relational paradigm. These repositories are integrated into Hadoop, a vast ecosystem of tools and technologies for handling this type of data. Accordingly, this dissertation studies one of the many tools associated with the Hadoop project: Kudu. This new tool, with its hybrid architecture, promises to fill the gap between sequential-access and random-access storage tools, thereby simplifying the complex architecture that using both types of systems implies. To fulfill the objectives of the dissertation, performance tests were carried out with two different data models, over Kudu and other tools highlighted in the literature, to allow the comparison of results.

    Etiquetagem e rastreio de fontes de dados num Big Data Warehouse

    Integrated master's dissertation in Engineering and Management of Information Systems (Engenharia e Gestão de Sistemas de Informação). Advances in Information Technologies lead organizations to search for commercial value and competitive advantage through collecting, storing, processing, and analyzing data. Data Warehouses are a fundamental piece in data storage, facilitating data analysis from different perspectives and allowing the extraction of information that can be used in decision making. The high availability of new data sources, together with the advances made in their collection and storage, leads to the production of an enormous amount of heterogeneous data generated at increasing rates. From this arose the concept of Big Data, associated with the volume, velocity, and variety of data: large volumes of data with different degrees of complexity, often without structure or organization, characteristics that make it impossible to use traditional tools. Thus, the need arises to adopt Big Data Warehouses, which naturally brings other challenges, as it implies the adoption of new technologies as well as new logical models that allow greater flexibility in the management of unstructured and denormalized data. When the volume and heterogeneity of the data increase, since they derive from several sources with very different characteristics, new challenges associated with Big Data emerge, namely Data Governance. Data Governance covers a group of subdomains, such as Data Quality and Metadata Management, which provide a set of processes to support the high complexity inherent in the data. As the volume of data in a Big Data Warehouse grows, so do the business processes, so it becomes necessary to keep additional information about these data, for example, which tables and attributes were stored, when and by whom they were created, and the updates they have undergone. The aim of this dissertation is to propose a governance system for a Big Data Warehouse, in order to expose its content and how it evolves over time. To this end, a graph-based data cataloging system for the Big Data Warehouse is proposed, based on tagging and tracing data sources and storing the collected metadata in a database. In addition to gathering the basic characteristics of the data, it records information about access policies, profiling, similarity, key performance indicators, and business processes.
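
    A minimal sketch of such a graph-based catalog, with tagging and source lineage, might look as follows; the node names, schema, and helper functions are hypothetical illustrations, not the dissertation's actual model:

```python
# Tiny graph-based data catalog: nodes carry metadata and tags,
# edges carry typed relations such as lineage ("feeds").
catalog = {"nodes": {}, "edges": []}

def add_node(node_id, kind, **metadata):
    catalog["nodes"][node_id] = {"kind": kind, **metadata}

def add_edge(src, dst, relation):
    catalog["edges"].append((src, dst, relation))

# Register a source, a warehouse table, and one of its attributes.
add_node("src:traffic_api", "source", tags=["external", "streaming"])
add_node("tbl:trips", "table", created_by="etl_job_7", tags=["curated"])
add_node("attr:trips.speed", "attribute", dtype="double")

add_edge("src:traffic_api", "tbl:trips", "feeds")          # lineage edge
add_edge("tbl:trips", "attr:trips.speed", "has_attribute")

def upstream_sources(node_id):
    """Trace lineage one step back to the sources feeding a node."""
    return [s for s, d, rel in catalog["edges"]
            if d == node_id and rel == "feeds"]

print(upstream_sources("tbl:trips"))
```

    Storing the graph in a database then lets stakeholders ask how the warehouse content evolves: which sources feed a table, who created it, and which tags and policies apply.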

    Abordagem semântica para a integração de dados em Big Data Warehouses

    Integrated master's dissertation in Engineering and Management of Information Systems (Engenharia e Gestão de Sistemas de Informação). Big Data is not a trivial domain, either for research or for development. Currently, the volume of data produced has increased exponentially due to the use of devices such as smartphones, tablets, smart devices, and sensors. This proliferation of data in structured, semi-structured, and unstructured formats was accompanied by the popularity of the Big Data concept, characterized by the volume, velocity, and variety of data that cannot be processed, stored, and analyzed with traditional tools and methods. Organizations in highly competitive environments aim to obtain competitive advantages over their competitors, committing themselves to extracting the highest value from technology in order to improve their decision making. For example, Data Warehouses are central components in data storage; however, these repositories are governed by relational models that prevent them from meeting the demands of Big Data. Consequently, the need arises to adopt new technologies and logical models capable of addressing Big Data challenges, giving rise to Big Data Warehouses, which, built on technologies such as Hadoop or NoSQL databases, ensure greater flexibility and scalability in data manipulation in Big Data contexts. The size of a Big Data Warehouse leads to increased complexity in the domains of Data Governance and Data Quality, due to the large volume of data that is continuously stored. Within the Data Quality domain, Data Profiling addresses some of these challenges by producing metadata about the datasets arriving at the Big Data Warehouse, gaining new importance in the integration between new data sources and the data already stored there. Accordingly, the main objective of this work is to propose, develop, and validate a Data Profiling tool that allows inspecting new data sources, deriving and storing information relevant to their integration into the Big Data Warehouse.
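
    The kind of column-level metadata such a profiling step derives can be sketched as follows; the column names, sample data, and chosen metrics are illustrative assumptions, not the tool's actual output:

```python
# Minimal data profiling sketch: derive per-column metadata (counts,
# nulls, distinct values, inferred type) from an incoming dataset
# before integrating it into a Big Data Warehouse.
def profile(rows):
    columns = {k for row in rows for k in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "count": len(values),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "inferred_type": type(present[0]).__name__ if present else None,
        }
    return report

rows = [
    {"sensor_id": "s1", "reading": 12.5},
    {"sensor_id": "s2", "reading": None},
    {"sensor_id": "s1", "reading": 13.1},
]
report = profile(rows)
print(report["reading"])
```

    Metadata like this can then be matched against the warehouse's existing schema to decide how, and whether, a new source should be integrated.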

    The SusCity big data warehousing approach for smart cities

    Nowadays, the concept of Smart City provides a rich analytical context, highlighting the need to store and process vast amounts of heterogeneous data flowing at different velocities. This data is defined as Big Data, which imposes significant difficulties on traditional data techniques and technologies. Data Warehouses (DWs) have long been recognized as a fundamental enterprise asset, providing fact-based decision support for several organizations. The concept of DW is evolving. Traditionally, Relational Database Management Systems (RDBMSs) are used to store historical data, providing different analytical perspectives regarding several business processes. With the current advancements in Big Data techniques and technologies, the concept of Big Data Warehouse (BDW) emerges to surpass several limitations of traditional DWs. This paper presents a novel approach for designing and implementing BDWs, which has been supporting the SusCity data visualization platform. The BDW is a crucial component of the SusCity research project in the context of Smart Cities, supporting analytical tasks based on data collected in the city of Lisbon. This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT - Fundação para a Ciência e Tecnologia within the Project Scope UID/CEC/00319/2013, and the SusCity project, MITP-TB/CS/0026/2013.