
    Building data warehouses in the era of big data: an approach for scalable and flexible big data warehouses

    During the last few years, the concept of Big Data Warehousing has gained significant attention from the scientific community, highlighting the need to make design changes to the traditional Data Warehouse (DW) due to its limitations, in order to achieve new characteristics relevant in Big Data contexts (e.g., scalability on commodity hardware, real-time performance, and flexible storage). The state-of-the-art in Big Data Warehousing reflects the young age of the concept, as well as ambiguity and the lack of common approaches to build Big Data Warehouses (BDWs). Consequently, an approach to design and implement these complex systems is of major relevance to business analytics researchers and practitioners. This tutorial targets the design and implementation of BDWs, presenting a general approach that researchers and practitioners can follow in their Big Data Warehousing projects, and exploring several demonstration cases focused on system design and data modelling in areas such as smart cities, retail, finance, and manufacturing, among others.
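
    As a concrete illustration of the kind of data modelling such a tutorial explores, the sketch below defines a fully denormalized analytical object for a hypothetical smart-city scenario using PySpark; every table and attribute name is an assumption made for this example, not taken from the tutorial itself.

```python
# Hypothetical denormalized "analytical object" for a smart-city BDW:
# descriptive (dimension-like) and factual (measure-like) attributes
# live together in one flat table instead of a normalized star schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("bdw-modelling-sketch").getOrCreate()

# All names below are illustrative assumptions.
traffic_events = StructType([
    StructField("event_time",    TimestampType(), False),  # temporal attribute
    StructField("sensor_id",     StringType(),    False),  # source descriptor
    StructField("district",      StringType(),    True),   # spatial descriptor
    StructField("road_type",     StringType(),    True),   # denormalized lookup
    StructField("vehicle_count", DoubleType(),    True),   # measure
    StructField("avg_speed_kmh", DoubleType(),    True),   # measure
])

spark.createDataFrame([], schema=traffic_events).printSchema()
```

    Keeping descriptive and factual attributes in one flat structure trades storage for scan-friendly queries on commodity clusters, which is one of the characteristics the abstract associates with BDWs.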

    Enhancing Big Data Warehousing and Analytics for Spatio-Temporal Massive Data

    The increasing amount of data generated by earth observation missions like Copernicus, NASA Earth Data, and climate stations is overwhelming. Every day, terabytes of data are collected from these resources for different environmental applications, and this massive amount of data must be effectively managed and processed to support decision-makers. In this paper, we propose an information system based on a low-latency spatio-temporal data warehouse that aims to improve drought monitoring analytics and to support the decision-making process. The proposed framework consists of four main modules: (1) data collection, (2) data preprocessing, (3) data loading and storage, and (4) visualization and interpretation. The data are heterogeneous and multi-source, collected from remote sensing sensors, biophysical sensors, and climate sensors, which allows drought to be studied along different dimensions. Experiments were carried out on a real case of drought monitoring in China between 2000 and 2020.
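
    A minimal skeleton of the four-module flow described above might look like the following; the function names, the tiny in-memory data, and the use of pandas are assumptions for illustration, not the authors' implementation.

```python
# Skeleton of the four modules described in the abstract: collection,
# preprocessing, loading/storage, and visualization. All function
# bodies are illustrative placeholders, not the authors' code.
import pandas as pd

def collect() -> pd.DataFrame:
    # Module 1: gather heterogeneous observations (remote sensing,
    # biophysical, climate). Here, a tiny in-memory stand-in.
    return pd.DataFrame({
        "date": pd.to_datetime(["2020-06-01", "2020-06-02"]),
        "station": ["S1", "S1"],
        "ndvi": [0.42, 0.40],        # vegetation index from remote sensing
        "rainfall_mm": [1.2, 0.0],   # climate sensor reading
    })

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    # Module 2: clean and harmonize the multi-source records.
    return raw.dropna().sort_values("date")

def load(clean: pd.DataFrame) -> pd.DataFrame:
    # Module 3: organize along the warehouse's spatio-temporal dimensions.
    return clean.set_index(["station", "date"])

def visualize(dw: pd.DataFrame) -> None:
    # Module 4: support interpretation; a real system would drive
    # dashboards or maps rather than printing.
    print(dw)

visualize(load(preprocess(collect())))
```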

    Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

    Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of a distributed file system like HDFS. Some studies have been conducted to understand ways of optimizing the performance of several storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when using Hive as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying their efficiency in query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of adequate partitioning strategies. Defining partitions aligned with the attributes frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time: in the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were verified. The same was not verified with bucketing strategies, which showed potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely bucketing two tables by their join attribute. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT (Fundação para a Ciência e Tecnologia) within the Project Scope UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 002814; Funding Reference: POCI-01-0247-FEDER-002814].
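
    To make the evaluated strategies concrete, the hedged sketch below issues Hive DDL for a partitioned table and a bucketed table over HiveServer2 with PyHive; the schema, host, port and bucket count are assumptions, not the benchmark's configuration.

```python
# Illustrative HiveQL for the two data organization strategies evaluated
# in the paper. Table and column names are hypothetical examples.
from pyhive import hive

cursor = hive.connect(host="localhost", port=10000).cursor()  # assumed endpoint

# Partitioning: align partitions with attributes frequently used in query
# filters, so Hive prunes irrelevant partitions instead of scanning them.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        customer_id BIGINT,
        amount      DOUBLE
    )
    PARTITIONED BY (sale_year INT)
    STORED AS ORC
""")

# A filter on the partition column restricts the scan to one partition.
cursor.execute("SELECT SUM(amount) FROM sales WHERE sale_year = 2020")

# Bucketing: the paper suggests narrower use, mainly bucketing two tables
# on their join attribute so joins can proceed bucket by bucket.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id BIGINT,
        region      STRING
    )
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")
```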

    Agile modelling for Big Data Warehousing systems

    Integrated master's dissertation in Engineering and Management of Information Systems. With the popularization of the Big Data concept, Information Systems started to consider aspects related to infrastructures capable of dealing with the collection, storage, processing and analysis of vast amounts of heterogeneous data, with little or no structure, generated at ever-increasing speeds. These have been the challenges inherent to the transition of Data Modelling from traditional Data Warehouses to Big Data environments. The state-of-the-art reflects that the scientific field of Big Data Warehousing is recent and ambiguous, and presents gaps regarding approaches for the design and implementation of these systems; thus, in recent years, several authors, motivated by the absence of scientific and technical work, have developed studies in the area in order to explore adequate models (representations of logical and technological components, data flows and data structures), methods and instantiations (demonstration cases using prototypes and benchmarks). This dissertation builds on the general proposal of design patterns for Big Data Warehousing systems (M. Y. Santos & Costa, 2019) and subsequently proposes a method, aiming at the semi-automation of that design proposal, comprising seven computational rules that are presented, demonstrated and validated with examples based on real contexts. To present the agile modelling process, a flowchart was created for each rule, making every step explicit. Comparing the results obtained after applying the method with those of a fully manual modelling effort, the proposed work offers a general modelling proposal that works as a suggestion for modelling Big Data Warehouses; the user should then validate and adjust the result, taking into account the context of the case under analysis, the queries to be used and the characteristics of the data.
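
    The dissertation's seven rules are not reproduced here, but a computational modelling rule of this general kind can be sketched as a schema-to-schema transformation; the rule below (collapsing referenced entities into the referencing one) is an illustrative assumption, not one of the proposed rules.

```python
# Illustrative sketch of ONE rule of the kind the dissertation proposes:
# fold a referenced descriptive entity into the object that references it
# (denormalization). The rule logic is an assumption for illustration.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    attributes: list[str]
    references: dict[str, "Entity"] = field(default_factory=dict)

def collapse_references(obj: Entity) -> Entity:
    """Fold every referenced entity's attributes into the referencing one,
    prefixing them to avoid name clashes."""
    merged = list(obj.attributes)
    for ref_name, ref in obj.references.items():
        merged += [f"{ref_name}_{a}" for a in collapse_references(ref).attributes]
    return Entity(obj.name, merged)

product = Entity("product", ["sku", "category"])
sale = Entity("sale", ["quantity", "price"], {"product": product})
print(collapse_references(sale).attributes)
# ['quantity', 'price', 'product_sku', 'product_category']
```

    As the abstract notes, the output of such rules is a suggestion: a user would still validate the collapsed model against the case context, the intended queries and the data characteristics.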

    On the use of simulation as a Big Data semantic validator for supply chain management

    Simulation stands out as an appropriate method for the Supply Chain Management (SCM) field. Nevertheless, to produce accurate simulations of Supply Chains (SCs), several business processes must be considered. Thus, when using real data in these simulation models, Big Data concepts and technologies become necessary, as the involved data sources generate data at increasing volume, velocity and variety, in what is known as a Big Data context. While developing such a solution, several data issues were found, with simulation proving more efficient than traditional data profiling techniques in identifying them. Thus, this paper proposes the use of simulation as a semantic validator of the data, proposes a classification for such issues, and quantifies their impact on the volume of data used in the final solution. The paper concludes that, while SC simulations using Big Data concepts and technologies are within the grasp of organizations, their data models still require considerable improvements in order to produce perfect mimics of their SCs. In fact, it was also found that simulation can help in identifying and bypassing some of these issues. This work has been supported by FCT (Fundação para a Ciência e Tecnologia) within the Project Scope UID/CEC/00319/2019 and by the doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH).
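
    A minimal sketch of the core idea, under assumed data and rules: replaying transactions through a simple stock simulation surfaces semantic issues that per-record profiling would miss, because each record looks valid in isolation.

```python
# Simulation as a semantic validator: each transaction below passes
# record-level profiling (valid SKU, kind, positive quantity), yet
# replaying them reveals an impossible state. Data are illustrative.
transactions = [
    {"sku": "A", "kind": "inbound",  "qty": 10},
    {"sku": "A", "kind": "outbound", "qty": 4},
    {"sku": "A", "kind": "outbound", "qty": 9},   # semantically impossible
]

def simulate(events):
    stock, issues = {}, []
    for i, e in enumerate(events):
        delta = e["qty"] if e["kind"] == "inbound" else -e["qty"]
        stock[e["sku"]] = stock.get(e["sku"], 0) + delta
        if stock[e["sku"]] < 0:      # only visible when events are replayed
            issues.append((i, e, stock[e["sku"]]))
    return stock, issues

final_stock, issues = simulate(transactions)
print(issues)   # the third event drives stock to -3 and is flagged
```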

    The potential of semantic paradigm in warehousing of big data

    Big data has analytical potential that was hard to realize with previously available technologies. After new storage paradigms intended for big data, such as NoSQL databases, emerged, traditional systems were pushed out of focus. Current research focuses on reconciling the two on different levels, or on paradigm replacement. Similarly, the emergence of NoSQL databases has started to push traditional (relational) data warehouses out of research and even practical focus. Data warehousing is known for its strict modelling process, capturing the essence of the business processes. For that reason, mere integration to bridge the NoSQL gap is not enough; it is necessary to deal with this issue at a higher abstraction level, during the modelling phase. NoSQL databases generally lack a clear, unambiguous schema, making the comprehension of their contents difficult and their integration and analysis harder. This motivated involving semantic web technologies to enrich NoSQL database contents with additional meaning and context. This paper reviews the application of semantics in data integration and data warehousing and analyses its potential for integrating NoSQL data and traditional data warehouses, with some focus on document stores. It also proposes future research directions for the big data warehouse modelling phases.
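
    As a small illustration of the kind of semantic enrichment discussed (not a method from the paper), a schemaless record from a document store can be lifted into RDF triples with rdflib; the ex: vocabulary below is an assumed example namespace.

```python
# Illustrative lifting of a schemaless document-store record into RDF,
# giving its fields explicit, queryable semantics. The ex: vocabulary
# is an assumption for this sketch, not one proposed in the paper.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/warehouse/")
doc = {"_id": "o-17", "type": "order", "total": 99.5, "customer": "c-3"}

g = Graph()
subject = URIRef(EX[doc["_id"]])
g.add((subject, RDF.type, EX[doc["type"].capitalize()]))   # typed resource
g.add((subject, EX.total, Literal(doc["total"])))          # typed literal
g.add((subject, EX.customer, URIRef(EX[doc["customer"]]))) # explicit link

print(g.serialize(format="turtle"))
```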

    Challenging SQL-on-Hadoop performance with Apache Druid

    In Big Data, SQL-on-Hadoop tools usually provide satisfactory performance for processing vast amounts of data, although new emerging tools may be an alternative. This paper evaluates whether Apache Druid, an innovative column-oriented data store suited for online analytical processing workloads, is an alternative to some of the well-known SQL-on-Hadoop technologies, and assesses its potential in this role. In this evaluation, Druid, Hive and Presto are benchmarked with increasing data volumes. The results point to Druid as a strong alternative, achieving better performance than Hive and Presto, and show the potential of integrating Hive and Druid, enhancing the capabilities of both tools. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT (Fundação para a Ciência e Tecnologia) within Project UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, COMPETE 2020 (Funding Reference: POCI-01-0247-FEDER-002814).
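
    For context on how Druid serves analytical workloads, the sketch below queries Druid through its SQL HTTP endpoint (/druid/v2/sql); the host, port and datasource name are assumptions for the example.

```python
# Querying Druid through its SQL HTTP API with a plain JSON payload.
# Host, port and the "sales" datasource are assumptions for the sketch.
import requests

DRUID_SQL = "http://localhost:8082/druid/v2/sql"   # default broker port

payload = {
    "query": """
        SELECT __time AS event_time, SUM(amount) AS total
        FROM sales
        WHERE __time >= TIMESTAMP '2020-01-01 00:00:00'
        GROUP BY __time
        ORDER BY total DESC
        LIMIT 10
    """
}

resp = requests.post(DRUID_SQL, json=payload, timeout=30)
resp.raise_for_status()
for row in resp.json():   # Druid returns a JSON array of result rows
    print(row)
```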

    An internet of things enabled framework to monitor the lifecycle of Cordyceps sinensis mushrooms

    Cordyceps sinensis is an edible mushroom found in high quantities in the Himalayan regions and widely used in traditional systems of medicine. It is a non-toxic medicinal mushroom with a broad range of clinical benefits, including in cancer restraint, high blood pressure, diabetes, asthma, depression, fatigue, immune disorders, and many infections of the upper respiratory tract. Its cultivation is limited to the Sikkim region; cultivating it in other regions of the country requires investigation and prediction of the Cordyceps sinensis lifecycle. Existing studies indicate that precision agriculture techniques have been only sparsely explored for predicting and supporting the growth of Cordyceps sinensis mushrooms. In this study, an internet of things (IoT) inspired framework is proposed to predict the lifecycle of Cordyceps sinensis mushrooms and to identify alternate substrates for cultivating them in other parts of the country. According to the findings, an IoT sensor-based system that maintains the ideal moisture level of the mushroom rack is required for the growth of Cordyceps sinensis mushrooms.
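
    An illustrative control loop for the moisture requirement highlighted in the findings might look as follows; the humidity band, sensor stand-in and actuator stand-in are all assumptions, not the paper's framework.

```python
# Sketch of a rack-moisture control loop of the kind an IoT framework
# would run: read humidity, actuate a mister when it drifts below a
# target band. Thresholds and the sensor/actuator APIs are assumptions.
import random
import time

TARGET_LOW, TARGET_HIGH = 70.0, 85.0   # assumed relative-humidity band (%)

def read_rack_humidity() -> float:
    # Stand-in for a real IoT sensor read (e.g., over MQTT or GPIO).
    return random.uniform(60.0, 90.0)

def set_mister(on: bool) -> None:
    # Stand-in for the actuator command a real deployment would send.
    print("mister", "ON" if on else "OFF")

for _ in range(5):                      # a real loop would run continuously
    humidity = read_rack_humidity()
    print(f"rack humidity: {humidity:.1f}%")
    set_mister(humidity < TARGET_LOW)   # humidify only below the band
    if humidity > TARGET_HIGH:
        print("alert: humidity above band; consider venting")
    time.sleep(0.1)                     # shortened for the sketch
```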