27 research outputs found

    Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

    Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts. It organizes data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system such as HDFS. Several studies have examined how to optimize the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is the storage technology for a Big Data Warehousing system. This paper therefore evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying their effect on query performance. The results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of adequate partitioning strategies. Defining partitions aligned with the attributes that are frequently used in query conditions/filters can significantly improve the response time of the system. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were observed. The same does not hold for bucketing strategies, which show potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely bucketing two tables by their join attribute. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT - Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 002814; Funding Reference: POCI-01-0247-FEDER-002814]
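The effect of aligning partitions with filter attributes, as described in the abstract above, can be illustrated with a minimal Python sketch. It is not taken from the paper and does not use Hive. The row layout, the `year` column and the helper names (`build_partitions`, `query_full_scan`, `query_pruned`) are assumptions made only for illustration. The sketch counts how many rows each layout has to read to answer the same filter.

```python
from collections import defaultdict

def build_partitions(rows, key):
    """Group rows by the value of `key`, mimicking one storage directory per partition value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)

def query_full_scan(rows, key, value):
    """Unpartitioned layout: every row is read, then filtered."""
    scanned = len(rows)
    result = [r for r in rows if r[key] == value]
    return result, scanned

def query_pruned(partitions, value):
    """Partitioned layout: only the partition matching the filter is read."""
    result = list(partitions.get(value, []))
    return result, len(result)

# Hypothetical rows: four years with the same number of rows each.
rows = [{"year": 2015 + (i % 4), "amount": i} for i in range(1000)]
partitions = build_partitions(rows, "year")

full_result, full_scanned = query_full_scan(rows, "year", 2016)
pruned_result, pruned_scanned = query_pruned(partitions, 2016)
print(full_scanned, pruned_scanned)
```

Both layouts return the same rows, but the partitioned one reads only the matching partition. The benefit disappears when the filter attribute is not the partition key, which is consistent with the abstract's advice to align partitions with frequently filtered attributes.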

    BigSQLTraj: A SQL-extended framework for storing & querying big mobility data

    In recent years, the widespread use of sensors and smart devices has led to an exponential growth in the production of mobility data, which falls into the category of big data. For example, routing applications, traffic flow monitoring, fleet control, and even prediction or hazard avoidance rely on the processing of spatial and spatio-temporal data. These data must be stored and processed appropriately so that they can subsequently become knowledge for organizations. Clearly, this process requires systems and technologies suited to the large volume of input data. In this thesis we used data from vessel movements, specifically data produced by the automatic identification system (AIS). For the purposes of this thesis we developed the BigSQLTraj system: a SQL-based framework for storing and querying big data from moving objects. Big data applications comprise the layers of data management, processing, analytics and visualization, for data from heterogeneous sources, whether historical or streaming. In this thesis we examine the management and processing layers for large historical data. The goal of the system is to enable users to store and efficiently process large geospatial and spatio-temporal data on top of a distributed system, extending or reproducing methods and algorithms from existing systems. The first goal of the work is to select tools that can communicate with each other and present a unified view to external users. 
The innovations offered by the system are methods for partitioning the data across the nodes of the cluster in a way that is evenly balanced yet also similarity-based, and a SQL interface to the distributed system that provides advanced methods for processing the stored data and allows systems that already interact with SQL-based systems to migrate to big data technologies with the fewest possible changes. The first goal of this thesis is the integration of various technologies. The implementation is based on open-source libraries for big data processing: Apache Hadoop, Apache Spark, Apache Hive and Apache Tez. The main functions provided by Apache Hadoop are the distributed file system (Hadoop Distributed File System), where the data is written and read, and the resource manager (Yarn), which monitors the workload of the machines in the cluster and assigns the tasks to be executed. These two tools are the pillar of the integration among the machines of the cluster and among the libraries that run on it. Apache Spark, through the MapReduce programming framework, provides processing of either historical data or data streams and stores the results in Hadoop's distributed file system. Apache Hive then makes it possible to run queries over files residing in Hadoop's distributed file system through the HiveQL language, which is equivalent to traditional SQL, while Apache Spark and Apache Tez act as the execution engine of a HiveQL query and translate the query into a MapReduce job. 
None of the above systems can process geospatial or mobility data in its base version. The additions that were made include: 1) functions for cleaning spatio-temporal points and building trajectories of moving objects from those points, using Apache Spark; 2) spatio-temporal partitioning of the trajectories across the machines of the cluster and creation of indexes, where the indexes comprise the spatio-temporal extent of the partitioned information and an encoding based on three-dimensional local indexes built from the information held by each machine, using Apache Spark and Apache Hadoop; 3) suitable methods that exploit the storage of the previous step for range queries and similarity (kNN) queries. The comparison we carried out concerns the time performance of range queries and kNN queries as a function of how the data is stored, as described above. First, we compared the completion time of these queries for the available storage methods and execution engines as a function of the number of machines running in the distributed system (scalability). We then compared the completion time of these queries for the available storage methods and execution engines as a function of data volume (speed-up), increasing the data volume at each step. 
The results showed that the most efficient way to execute the queries is to use an index over the partitioned information and then an encoding based on local indexes to retrieve the final result, with Apache Spark as the execution engine. In recent decades, the need to perform advanced queries over massively produced data, such as mobility traces, in efficient and scalable ways has become particularly important. This thesis describes BigSQLTraj, a framework that supports efficient storing, partitioning, indexing and querying of spatial and spatio-temporal (i.e. mobility) data over a distributed engine. Every end-to-end big data application consists of four layers: data management, data processing, data analytics and data visualization, for heterogeneous data sources and for batch or streaming data. This thesis focuses on data management and data processing for historical data. The first goal is to find systems that offer ready-to-use integration pipelines, so as to take advantage of the best operation of each tool. For our implementation we chose open-source big data frameworks: Apache Hadoop, Apache Spark, Apache Hive and Apache Tez. Apache Hadoop, and especially its distributed file system (HDFS), gives all the other libraries a common read and write layer, while Hadoop's Resource Manager (Yarn) exploits all the available computing resources. BigSQLTraj extends the functionality of existing spatial and spatio-temporal systems, centralized or distributed, to create two core, independent components. The first component is responsible for storing, spatio-temporally partitioning and indexing the data in a distributed file system, and it is implemented on top of Apache Spark. 
Several spatio-temporal partitioners and a 3D-STRtree index are implemented to support a collection of operators, in addition to existing partitioners and indexing methods inherited from state-of-the-art distributed spatial and spatio-temporal systems. The second component is a distributed SQL engine. It extends the functionality of HiveQL to achieve fast access to, and storage of, this kind of data (i.e. geospatial and mobility data). Our final goal is to optimize Hive's join procedure, which is required for both query types, using the data structures from the first component. We demonstrate the functionality of our approach and conduct an extensive experimental study based on state-of-the-art benchmarks for mobility data. Our benchmark focuses on the total execution time of range queries and kNN queries as a function of the data storage model. First, we compare the time performance of different storage alternatives and execution engines on the entire dataset while varying the number of workers, in order to assess the system's scalability. We then vary the size of the dataset and measure the execution time of the queries. To study the effect of dataset size, we split the original dataset into 5 chunks (20%, 40%, 60%, 80%, 100%). Based on the results, we conclude that the best workflow combines a global index structure for worker metadata with a local index-based encoding that stores the entire trajectories of a partition in a single column, and that the execution time appears to grow linearly

    Challenging SQL-on-Hadoop performance with Apache Druid

    In Big Data, SQL-on-Hadoop tools usually provide satisfactory performance for processing vast amounts of data, although newly emerging tools may be an alternative. This paper evaluates whether Apache Druid, an innovative column-oriented data store suited for online analytical processing workloads, is an alternative to some of the well-known SQL-on-Hadoop technologies, and assesses its potential in this role. In this evaluation, Druid, Hive and Presto are benchmarked with increasing data volumes. The results point to Druid as a strong alternative, achieving better performance than Hive and Presto, and show the potential of integrating Hive and Druid, enhancing the capabilities of both tools. This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT - Fundacao para a Ciencia e Tecnologia within Project UID/CEC/00319/2013 and by European Structural and Investment Funds in the FEDER component, COMPETE 2020 (Funding Reference: POCI-01-0247-FEDER-002814)

    A Design Framework for Efficient Distributed Analytics on Structured Big Data

    Distributed analytics architectures are often comprised of two elements: a compute engine and a storage system. Conventional distributed storage systems usually store data in the form of files or key-value pairs. This abstraction simplifies how the data is accessed and reasoned about by an application developer. However, the separation of compute and storage systems makes it difficult to optimize costly disk and network operations. By design the storage system is isolated from the workload and its performance requirements such as block co-location and replication. Furthermore, optimizing fine-grained data access requests becomes difficult as the storage layer is hidden away behind such abstractions. Using a clean slate approach, this thesis proposes a modular distributed analytics system design which is centered around a unified interface for distributed data objects named the DDO. The interface couples key mechanisms that utilize storage, memory, and compute resources. This coupling makes it ideal to optimize data access requests across all memory hierarchy levels, with respect to the workload and its performance requirements. In addition to the DDO, a complementary DDO controller implementation controls the logical view of DDOs, their replication, and distribution across the cluster. A proof-of-concept implementation shows improvement in mean query time by 3-6x on the TPC-H and TPC-DS benchmarks, and more than an order of magnitude improvement in many cases

    A Business Intelligence Solution, based on a Big Data Architecture, for processing and analyzing the World Bank data

    The rapid growth in data volume and complexity has necessitated the adoption of advanced technologies to extract valuable insights for decision-making. This project aims to address this need by developing a comprehensive framework that combines Big Data processing, analytics, and visualization techniques to enable effective analysis of World Bank data. The problem addressed in this study is the need for a scalable and efficient Business Intelligence solution that can handle the vast amounts of data generated by the World Bank. Therefore, a Big Data architecture is implemented on a real use case for the International Bank for Reconstruction and Development. The findings of this project demonstrate the effectiveness of the proposed solution. Through the integration of Apache Spark and Apache Hive, data is processed using Extract, Transform and Load techniques, allowing for efficient data preparation. The use of Apache Kylin enables the construction of a multidimensional model, facilitating fast and interactive queries on the data. Moreover, data visualization techniques are employed to create intuitive and informative visual representations of the analysed data. The key conclusions drawn from this project highlight the advantages of a Big Data-driven Business Intelligence solution in processing and analysing World Bank data. The implemented framework showcases improved scalability, performance, and flexibility compared to traditional approaches. In conclusion, this bachelor thesis presents a Business Intelligence solution based on a Big Data architecture for processing and analysing the World Bank data. The project findings emphasize the importance of scalable and efficient data processing techniques, multidimensional modelling, and data visualization for deriving valuable insights. 
The application of these techniques contributes to the field by demonstrating the potential of Big Data Business Intelligence solutions in addressing the challenges associated with large-scale data analysis

    A Framework for Spatial Database Explanations

    In the last few years, there has been a tremendous increase in the use of big data. Most of this data is hard to understand because of its size and dimensions. The importance of this problem is underlined by the fact that the Big Data Research and Development Initiative was announced by the United States administration in 2012 to address problems faced by the government. Various states and cities in the US gather spatial data about incidents such as police calls for service. Querying large amounts of data can raise many questions. For example, when we look at arithmetic relationships between queries in heterogeneous data, there are many differences. How can we explain what factors account for these differences? If we define the observation as an arithmetic relationship between queries, this kind of problem can be solved by aggravation or intervention. Aggravation views the value of our observation for different sets of tuples, while intervention looks at the value of the observation after removing sets of tuples. We call the predicates that represent these tuples explanations. Observations by themselves have limited importance. For example, if we observe a large number of taxi trips in a specific area, we might ask: why are there so many trips here? Explanations attempt to answer these kinds of questions. While aggravation and intervention are designed for non-spatial data, we propose a new approach for explaining spatially heterogeneous data. Our approach expands on aggravation and intervention while using spatial partitioning/clustering to improve explanations for spatial data. Our proposed approach was evaluated against a real-world taxi dataset as well as a synthetic disease outbreak dataset. The approach was found to outperform aggravation in precision and recall while outperforming intervention in precision. Dissertation/Thesis; Masters Thesis, Computer Science 201
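The difference between aggravation and intervention, as defined in the abstract above, can be made concrete with a small Python sketch. It is not the thesis's implementation and leaves out the spatial partitioning/clustering step entirely. The observation (a ratio of trip counts between two areas), the `area` and `period` fields, and the function names are hypothetical choices made for illustration.

```python
def observation(rows):
    """Hypothetical observation: ratio of trip counts in area A to area B."""
    a = sum(1 for r in rows if r["area"] == "A")
    b = sum(1 for r in rows if r["area"] == "B")
    return a / b if b else float("inf")

def aggravation(rows, predicate):
    """Value of the observation restricted to the tuples matching the predicate."""
    return observation([r for r in rows if predicate(r)])

def intervention(rows, predicate):
    """Value of the observation after removing the tuples matching the predicate."""
    return observation([r for r in rows if not predicate(r)])

# Synthetic trips: area A has many more night trips than area B.
trips = (
    [{"area": "A", "period": "night"}] * 30
    + [{"area": "A", "period": "day"}] * 10
    + [{"area": "B", "period": "night"}] * 10
    + [{"area": "B", "period": "day"}] * 10
)

def is_night(row):
    """Candidate explanation: the trip happened at night."""
    return row["period"] == "night"

baseline = observation(trips)
aggravated = aggravation(trips, is_night)
intervened = intervention(trips, is_night)
print(baseline, aggravated, intervened)
```

In this toy data the candidate explanation "night trips" raises the ratio when only matching tuples are kept and flattens it when they are removed, which is the kind of signal both techniques look for.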

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution to a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL seems to be the right technology to fulfil its requirements. However, every big data application has variable data characteristics, and thus the corresponding data fits into a different data model. This paper presents a feature and use case analysis and comparison of the four main data models, namely document-oriented, key-value, graph and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings and possible use cases of available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed

    A New Big Data Benchmark for OLAP Cube Design Using Data Pre-Aggregation Techniques

    In recent years, several new technologies have enabled OLAP processing over Big Data sources. Among these technologies, we highlight those that allow data pre-aggregation because of their demonstrated performance in data querying. This is the case of Apache Kylin, a Hadoop-based technology that supports sub-second queries over fact tables with billions of rows combined with ultra-high-cardinality dimensions. However, taking advantage of data pre-aggregation techniques to design analytic models for Big Data OLAP is not a trivial task. It requires very advanced knowledge of the underlying technologies and user querying patterns. A wrong design of the OLAP cube significantly alters several key performance metrics, including: (i) the analytic capabilities of the cube (time and ability to provide an answer to a query), (ii) the size of the OLAP cube, and (iii) the time required to build the OLAP cube. Therefore, in this paper we (i) propose a benchmark to help Big Data OLAP designers choose the most suitable cube design for their goals, (ii) identify and describe the main requirements and trade-offs for effectively designing a Big Data OLAP cube that takes advantage of data pre-aggregation techniques, and (iii) validate our benchmark in a case study. This work has been funded by the ECLIPSE project (RTI2018-094283-B-C32) from the Spanish Ministry of Science, Innovation and Universities
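The trade-off between query time, cube size and build time mentioned in the abstract above can be sketched in a few lines of Python. This is a generic full-cube pre-aggregation, not Apache Kylin's algorithm or API and not the paper's benchmark. The fact rows, the dimension names and the helpers `build_cube` and `query` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(facts, dimensions, measure):
    """Pre-aggregate the sum of `measure` for every subset of `dimensions` (one cuboid each)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(float)
            for row in facts:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube

def query(cube, dims, values):
    """Answer a group-by lookup from the stored cuboid, without touching the fact rows.

    `dims` must list dimensions in the same order used in build_cube.
    """
    return cube[tuple(dims)].get(tuple(values), 0.0)

# Hypothetical fact table with two dimensions and one measure.
facts = [
    {"country": "ES", "year": 2019, "sales": 10.0},
    {"country": "ES", "year": 2020, "sales": 5.0},
    {"country": "PT", "year": 2019, "sales": 7.0},
]
cube = build_cube(facts, ["country", "year"], "sales")

print(len(cube))                         # number of cuboids stored
print(query(cube, ["country"], ["ES"]))  # total sales for ES
```

With n dimensions this naive design stores 2^n cuboids, so each added dimension speeds up more queries but doubles the number of aggregates to build and store. That is the kind of design decision a cube benchmark is meant to inform.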