
    Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

    Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Several studies have investigated ways of optimizing the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is used as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying their efficiency in terms of query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Defining the partitions aligned with the attributes that are frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time. In the most intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were verified. The same is not verified with the use of bucketing strategies, which show potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely when two tables are bucketed by their join attribute. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 002814; Funding Reference: POCI-01-0247-FEDER-002814]
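
    As a concrete illustration of the strategies evaluated above, the sketch below creates a Hive table partitioned by an attribute that is frequently used in query filters and bucketed by a join attribute. It is a minimal PySpark example with hypothetical table and column names (staging_sales, order_date, customer_id, amount), not the benchmark schema used in the paper.

```python
# Minimal PySpark sketch (hypothetical names, not the paper's benchmark schema).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-bucketing")
         .enableHiveSupport()           # requires a configured Hive metastore
         .getOrCreate())

df = spark.table("staging_sales")       # hypothetical denormalized source table

(df.write
   .mode("overwrite")
   .partitionBy("order_date")           # partition on an attribute often used in filters
   .bucketBy(32, "customer_id")         # bucket on the join attribute
   .sortBy("customer_id")
   .saveAsTable("sales_part_bucket"))

# A query filtering on the partition column only scans the matching partitions.
spark.sql(
    "SELECT SUM(amount) FROM sales_part_bucket WHERE order_date = '2018-01-01'"
).show()
```

    Bucketing both sides of a join by the same attribute and bucket count is the kind of restricted scenario the abstract points to as potentially benefiting from bucketing.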

    Analisis Faktor Optimasi untuk Data Warehouse dengan Data Tabungan pada Bank XYZ (Analysis of Optimization Factors for a Data Warehouse with Savings Data at Bank XYZ)

    The growth of information technology is being leveraged by many companies to improve their business performance, namely by presenting data in an integrated and consistent way. A data warehouse supports a company's analytical processes by providing data integrated from multiple database sources. As the volume and complexity of data grow, companies that store their data on computer systems require more resources every year. Processing such large amounts of data requires an optimized database as the storage medium, so that the company's analytical processes can run quickly and efficiently. In this study, the authors build a data warehouse by creating a baseline table, creating mappings to load the required data, creating optimized tables under 7 different configurations combining partitioning, bucketing, and compression, and then analyzing the performance of those tables using queries that would frequently be used for simple analyses. Performance is analyzed in terms of query runtime and the storage space used by each table. Based on tests with 3 queries, the study concludes that partitioning and bucketing speed up average query runtime by 28%, while data compression reduces the average storage size by a factor of 8 to 30 compared to uncompressed tables, but slows down average query runtime by 77%, i.e., almost by a factor of two
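
    The compression trade-off described above can be sketched as follows: the same (hypothetical) savings table is stored as ORC both uncompressed and ZLIB-compressed, trading disk space for extra CPU time during queries. Table names and the staging source are illustrative only.

```python
# Illustrative PySpark/Hive sketch of the compression trade-off (hypothetical names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Baseline: ORC storage without compression
spark.sql("""
    CREATE TABLE IF NOT EXISTS savings_orc_plain
    STORED AS ORC TBLPROPERTIES ('orc.compress' = 'NONE')
    AS SELECT * FROM staging_savings
""")

# Compressed variant: ZLIB-compressed ORC, much smaller on disk but slower to query
spark.sql("""
    CREATE TABLE IF NOT EXISTS savings_orc_zlib
    STORED AS ORC TBLPROPERTIES ('orc.compress' = 'ZLIB')
    AS SELECT * FROM staging_savings
""")
```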

    Challenging SQL-on-Hadoop performance with Apache Druid

    In Big Data, SQL-on-Hadoop tools usually provide satisfactory performance for processing vast amounts of data, although new emerging tools may be an alternative. This paper evaluates whether Apache Druid, an innovative column-oriented data store suited for online analytical processing workloads, is an alternative to some of the well-known SQL-on-Hadoop technologies, and assesses its potential in this role. In this evaluation, Druid, Hive and Presto are benchmarked with increasing data volumes. The results point to Druid as a strong alternative, achieving better performance than Hive and Presto, and show the potential of integrating Hive and Druid, enhancing the capabilities of both tools. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within Project UID/CEC/00319/2013 and by European Structural and Investment Funds in the FEDER component, COMPETE 2020 (Funding Reference: POCI-01-0247-FEDER-002814)
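
    One way to combine the two tools, as hinted at above, is Hive's Druid storage handler, which materializes a Hive query result as a Druid datasource. The sketch below runs such a statement through PyHive; the host, table and column names are hypothetical, and the exact handler syntax may vary across Hive versions.

```python
# Hedged sketch of Hive-Druid integration via the Druid storage handler
# (hypothetical host/table/column names; syntax depends on the Hive version).
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE sales_druid
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES ('druid.segment.granularity' = 'DAY')
    AS
    SELECT CAST(order_ts AS timestamp) AS `__time`,  -- Druid requires a __time column
           store_id,
           amount
    FROM sales
""")
```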

    Supply chain simulation in a Big Data context: risks and uncertainty analysis

    Due to their complex and dynamic nature, Supply Chains are prone to risks that may occur at any time and place. To tackle this problem, simulation can be used. However, such models should use Big Data technologies, in order to provide the level of data and detail contained in the data sources associated with the business processes. In this regard, this paper considered a real case of an automotive electronics Supply Chain. Hence, the purpose of this paper is to propose a simulation tool which uses real industrial data, provided by a Big Data Warehouse, and to use such a decision-support artifact to test different types of risks. More concretely, risks at the supply and demand ends of the network are analyzed. The presented results also demonstrate the possible benefits that can be achieved by using simulation in the analysis of risks in a Supply Chain. This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2019 and by the Doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH)

    A Design Framework for Efficient Distributed Analytics on Structured Big Data

    Distributed analytics architectures are often composed of two elements: a compute engine and a storage system. Conventional distributed storage systems usually store data in the form of files or key-value pairs. This abstraction simplifies how the data is accessed and reasoned about by an application developer. However, the separation of compute and storage systems makes it difficult to optimize costly disk and network operations. By design, the storage system is isolated from the workload and its performance requirements, such as block co-location and replication. Furthermore, optimizing fine-grained data access requests becomes difficult as the storage layer is hidden away behind such abstractions. Using a clean-slate approach, this thesis proposes a modular distributed analytics system design which is centered around a unified interface for distributed data objects, named the DDO. The interface couples key mechanisms that utilize storage, memory, and compute resources. This coupling makes it ideal for optimizing data access requests across all memory hierarchy levels, with respect to the workload and its performance requirements. In addition to the DDO, a complementary DDO controller implementation controls the logical view of DDOs, their replication, and their distribution across the cluster. A proof-of-concept implementation shows an improvement in mean query time of 3-6x on the TPC-H and TPC-DS benchmarks, and more than an order of magnitude improvement in many cases
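
    Purely as an illustration of the idea of a unified data-object interface (and not the thesis's actual API), a DDO-style abstraction might couple storage, placement and compute concerns along these lines:

```python
# Hypothetical sketch of a DDO-like interface; names and methods are illustrative only.
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable

class DistributedDataObject(ABC):
    @abstractmethod
    def read(self, columns: Iterable[str]) -> Iterable[Any]:
        """Serve fine-grained reads, choosing replicas and memory tiers internally."""

    @abstractmethod
    def cache(self, tier: str = "memory") -> None:
        """Pin the object in a given level of the memory hierarchy."""

    @abstractmethod
    def co_locate_with(self, other: "DistributedDataObject") -> None:
        """Hint that this object's blocks should be placed with another's,
        e.g. so that a join between them stays node-local."""

    @abstractmethod
    def map_partitions(
        self, fn: Callable[[Iterable[Any]], Iterable[Any]]
    ) -> "DistributedDataObject":
        """Run computation next to the data rather than shipping it to the engine."""
```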

    A Business Intelligence Solution, based on a Big Data Architecture, for processing and analyzing the World Bank data

    The rapid growth in data volume and complexity has necessitated the adoption of advanced technologies to extract valuable insights for decision-making. This project aims to address this need by developing a comprehensive framework that combines Big Data processing, analytics, and visualization techniques to enable effective analysis of World Bank data. The problem addressed in this study is the need for a scalable and efficient Business Intelligence solution that can handle the vast amounts of data generated by the World Bank. Therefore, a Big Data architecture is implemented on a real use case for the International Bank for Reconstruction and Development. The findings of this project demonstrate the effectiveness of the proposed solution. Through the integration of Apache Spark and Apache Hive, data is processed using Extract, Transform and Load techniques, allowing for efficient data preparation. The use of Apache Kylin enables the construction of a multidimensional model, facilitating fast and interactive queries on the data. Moreover, data visualization techniques are employed to create intuitive and informative visual representations of the analysed data. The key conclusions drawn from this project highlight the advantages of a Big Data-driven Business Intelligence solution for processing and analysing World Bank data. The implemented framework shows improved scalability, performance, and flexibility compared to traditional approaches. In conclusion, this bachelor thesis presents a Business Intelligence solution based on a Big Data architecture for processing and analysing the World Bank data. The project findings emphasize the importance of scalable and efficient data processing techniques, multidimensional modelling, and data visualization for deriving valuable insights. The application of these techniques contributes to the field by demonstrating the potential of Big Data Business Intelligence solutions in addressing the challenges associated with large-scale data analysis
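
    A minimal sketch of the Spark/Hive ETL step described above might look as follows; the input path and column names are hypothetical, not taken from the project itself.

```python
# Minimal PySpark ETL sketch (hypothetical path and columns).
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("worldbank-etl")
         .enableHiveSupport()
         .getOrCreate())

# Extract: read a raw World Bank indicators export
raw = spark.read.csv("hdfs:///raw/world_bank_indicators.csv",
                     header=True, inferSchema=True)

# Transform: keep the relevant columns, normalize types, drop empty measures
clean = (raw
         .select("country_code", "indicator_code", "year", "value")
         .withColumn("year", F.col("year").cast("int"))
         .dropna(subset=["value"]))

# Load: persist as a Hive table on which Kylin (or another OLAP engine) can build cubes
clean.write.mode("overwrite").saveAsTable("world_bank_indicators")
```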

    On the use of simulation as a Big Data semantic validator for supply chain management

    Simulation stands out as an appropriate method for the Supply Chain Management (SCM) field. Nevertheless, to produce accurate simulations of Supply Chains (SCs), several business processes must be considered. Thus, when using real data in these simulation models, Big Data concepts and technologies become necessary, as the involved data sources generate data at increasing volume, velocity and variety, in what is known as a Big Data context. While developing such a solution, several data issues were found, with simulation proving to be more efficient than traditional data profiling techniques in identifying them. Thus, this paper proposes the use of simulation as a semantic validator of the data, proposes a classification for such issues and quantifies their impact on the volume of data used in the final solution. This paper concludes that, while SC simulations using Big Data concepts and technologies are within the grasp of organizations, their data models still require considerable improvements in order to produce accurate mimics of their SCs. In fact, it was also found that simulation can help in identifying and bypassing some of these issues. This work has been supported by FCT (Fundação para a Ciência e Tecnologia) within the Project Scope: UID/CEC/00319/2019 and by the Doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH)
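
    To make the distinction between profiling and semantic validation concrete, the sketch below shows a check that simple column profiling tends to miss but that a simulation run surfaces immediately, because the process cannot be replayed with inconsistent records. File and column names are hypothetical.

```python
# Illustrative consistency check across SC data sources (hypothetical files/columns).
import pandas as pd

orders = pd.read_csv("orders.csv")          # columns: order_id, part_id, quantity
deliveries = pd.read_csv("deliveries.csv")  # columns: delivery_id, part_id, quantity

issues = []

# Single-column profiling catches out-of-range values...
issues.extend(f"order {o.order_id}: non-positive quantity"
              for o in orders.itertuples() if o.quantity <= 0)

# ...but cross-source inconsistencies only surface when the processes are replayed:
orphan_parts = set(orders["part_id"]) - set(deliveries["part_id"])
issues.extend(f"part {p}: ordered but never delivered" for p in sorted(orphan_parts))

for issue in issues:
    print(issue)
```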

    Are simulation tools ready for big data? Computational experiments with supply chain models developed in Simio

    Peer-review under responsibility of the scientific committee of the International Conference on Industry 4.0 and Smart Manufacturing. The need for, and potential benefits of, the combined use of Simulation and Big Data in Supply Chains (SCs) have been widely recognized. In the context of such a project, simulation experiments of the modelled SC system were conducted in SIMIO. Different circumstances were tested, including running the model based on the stored data, on statistical distributions, and considering risk situations. Thus, this paper aims to evaluate such experiments and the performance of simulation in these contexts. After analyzing the obtained results, it was found that, whilst running the model based on the real data required considerable amounts of computer memory, running the model based on statistical distributions reduced such values, albeit requiring considerably more time to run a single replication. In all the tested experiments, the simulation took considerable time to run and was not smooth, which can reduce the stakeholders' interest in the developed tool, despite its benefits for the decision-making process. For future research, it would be beneficial to test other simulation tools and other strategies and compare those results to the ones provided in this paper. This work has been supported by national funds through FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2019 and by the Doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH)

    Bypassing data issues of a supply chain simulation model in a big data context

    Peer-review under responsibility of the scientific committee of the International Conference on Industry 4.0 and Smart Manufacturing. Supply Chains (SCs) are complex and dynamic networks, where certain events may cause severe problems. To avoid them, simulation can be used, allowing the uncertainty of these systems to be considered. Furthermore, the data that is generated at increasingly high volumes, velocities and varieties by relevant data sources allows the simulation model to capture all the relevant elements. While developing such a solution, due to the inherent use of simulation, several data issues were identified and bypassed, so that the incorporated elements comprise a coherent SC simulation model. Thus, the purpose of this paper is to present the main issues that were faced, and to discuss how they were bypassed, while working on a SC simulation model in a Big Data context and using real industrial data from an automotive electronics SC. This paper highlights the role of simulation in this task, since it worked as a semantic validator of the data. Moreover, this paper also presents the results that can be obtained from the developed model. This work has been supported by national funds through FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2019 and by the Doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH)

    Big Data Analytics for vehicle multisensory anomalies detection

    Autonomous driving is assisted by different sensors, each providing information about certain parameters. What is needed is an integrated perspective of all these parameters to support better decisions. To achieve this goal, a system that can handle the Big Data issues of volume, velocity and variety is needed. This paper aims to design and develop a real-time Big Data Warehouse repository that integrates the data generated by the multiple sensors developed in the context of IVS (In-Vehicle Sensing) systems; the data to be stored in this repository should be merged, which implies its processing, consolidation and preparation for the analytical mechanisms that will be required. This multisensory fusion is important because it allows the integration of different perspectives in terms of sensor data, since they complement each other. Therefore, it can enrich the entire analysis process at the decision-making level, for instance, understanding what is going on inside the cockpit. This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020 and by the European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 039334; Funding Reference: POCI-01-0247-FEDER-039334]
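
    A hedged sketch of such a real-time ingestion path is given below, using Spark Structured Streaming to consolidate per-sensor readings into one aggregated record per vehicle, sensor and minute. The Kafka topic, message schema and sink paths are hypothetical, and the Kafka connector package is assumed to be available.

```python
# Hedged Structured Streaming sketch (hypothetical topic, schema and paths;
# assumes the spark-sql-kafka connector is on the classpath).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("ivs-streaming").getOrCreate()

schema = StructType([
    StructField("vehicle_id", StringType()),
    StructField("sensor", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

readings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "ivs-sensors")
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# Fuse the sensor streams: one aggregated value per vehicle, sensor and minute
fused = (readings
         .withWatermark("event_time", "1 minute")
         .groupBy("vehicle_id", "sensor", F.window("event_time", "1 minute"))
         .agg(F.avg("value").alias("avg_value")))

query = (fused.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "hdfs:///dw/ivs_readings")
         .option("checkpointLocation", "hdfs:///checkpoints/ivs_readings")
         .start())
```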