
    Optimising HYBRIDJOIN to Process Semi-Stream Data in Near-real-time Data Warehousing

    Near-real-time data warehousing plays an essential role in organisational decision making, where the latest data must be fed from various data sources on a near-real-time basis. The stream of sales data coming from these sources needs to be transformed into the data warehouse format using disk-based master data. This transformation is challenging because the disk access rate is slow compared with the fast arrival rate of the stream data. For this purpose, an adaptive semi-stream join algorithm called HYBRIDJOIN (Hybrid Join) has been presented in the literature. The algorithm uses a single buffer to load partitions of the master data, so it has to wait until the next disk partition overwrites the existing partition in the buffer. As loading a disk partition into the buffer is a major component of the algorithm's total processing cost, this leaves its performance sub-optimal. This paper optimises the existing HYBRIDJOIN by introducing a second buffer, which allows the algorithm to load one buffer while the other is under join execution. This reduces the time the algorithm waits for master-data partitions to load and consequently improves its performance significantly.
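    The double-buffering idea above can be sketched as follows: while the join probes one in-memory partition, a background thread prefetches the next partition into a second buffer, overlapping disk I/O with join work. This is an illustrative sketch only; the function names (`load_partition`, `join_partition`) and data shapes are assumptions, not the paper's actual implementation.

    ```python
    import threading

    def load_partition(pid):
        # Stand-in for a slow disk read of one master-data partition.
        return {f"key{pid}-{i}": f"master{pid}-{i}" for i in range(3)}

    def join_partition(stream_tuples, partition):
        # Probe the in-memory partition with buffered stream tuples.
        return [(k, partition[k]) for k in stream_tuples if k in partition]

    def double_buffered_join(stream_batches, partition_ids):
        results = []
        next_buf = {"part": load_partition(partition_ids[0])}
        for i, pid in enumerate(partition_ids):
            current = next_buf["part"]
            loader = None
            if i + 1 < len(partition_ids):
                next_buf = {}
                # Prefetch the next partition while the current one is joined.
                loader = threading.Thread(
                    target=lambda p=partition_ids[i + 1]: next_buf.update(
                        part=load_partition(p)))
                loader.start()
            results.extend(join_partition(stream_batches[i], current))
            if loader:
                loader.join()  # ensure the next partition is ready before use
        return results
    ```

    With a single buffer, the join would stall for the full duration of every partition load; here the load cost is hidden behind the join whenever the join over one partition takes at least as long as loading the next.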

    Elastic Infrastructure for Joining Stream Data

    In this work we aim to improve the performance of business intelligence applications, an important part of which is the Extraction-Transformation-Loading (ETL) process. The vast majority of ETL processes involve very expensive joins between 'fresh' stream data and disk-stored relational data. We base our solution on an existing algorithm, the Semi-Streamed Index Join (SSIJ), which successfully handles ETL-type transactions on a single compute node with very promising performance results. But we live in an era of information explosion: large corporations collect and store terabytes of data every day, so it is necessary to move to a solution that uses multiple compute nodes.
    We developed an elastic distributed architecture whose main concern is the fair distribution of SSIJ's computational load across multiple nodes. We designed algorithms that efficiently direct the flow of the stream to the cluster's nodes in order to make caching as effective as possible. We can also add or remove compute nodes dynamically, depending on the volume and speed of the stream traffic, to keep system performance stable while avoiding wasting valuable resources. In our implementation we used containerized compute nodes, based on Docker, running in a cluster of virtual machines; the Kubernetes platform was used to organize and schedule the Docker containers. Our experiments were conducted on the Google Cloud Platform.
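    One common way to make per-node caching effective in such a distributed join, sketched here as an assumption rather than the thesis's actual routing algorithm, is to hash each stream tuple's join key to a node: all probes for a given key then land on the same node, so that node's master-data cache stays hot for its share of the key space.

    ```python
    import hashlib

    def route(key, num_nodes):
        # Stable hash so the same join key always lands on the same node.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % num_nodes

    def partition_stream(tuples, num_nodes):
        # Split a batch of (key, payload) stream tuples into per-node buckets.
        buckets = {n: [] for n in range(num_nodes)}
        for key, payload in tuples:
            buckets[route(key, num_nodes)].append((key, payload))
        return buckets
    ```

    Scaling the cluster up or down changes `num_nodes` and therefore re-partitions the key space; schemes such as consistent hashing reduce how much cached state moves when that happens.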

    Big data velocity management-from stream to warehouse via high performance memory optimized index join

    Efficient resource optimisation is critical for managing the velocity and volume of real-time streaming data in near-real-time data warehousing and business intelligence. This article presents a memory-optimisation algorithm for rapidly joining streaming data with persistent master data in order to reduce data latency. Typically, during the transformation phase of ETL (Extraction, Transformation, and Loading), a stream of transactional data needs to be joined with master data stored on disk. To implement this process, a semi-stream join operator is commonly used. Most semi-stream join operators cache frequent parts of the master data to improve their performance; this requires careful distribution of the allocated memory among the components of the join operator. This article presents a cache-inequality approach to optimise cache size and memory. To test this approach, we present a novel Memory Optimal Index-based Join (MOIJ) algorithm. MOIJ supports many-to-many joins and adapts to dynamic streaming data. We also present a cost model for MOIJ and compare its performance with existing algorithms both empirically and analytically. We envisage that the enhanced ability to process near-real-time streaming data using minimal memory will reduce big-data processing latency and contribute to the development of high-performance real-time business intelligence systems.
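    The caching idea these operators share can be sketched with a fixed-capacity cache in front of the disk-resident master-data index: frequently probed entries stay in memory and only misses pay the disk cost. This is a generic LRU sketch under assumed names (`MasterDataCache`, `disk_lookup`), not MOIJ's actual cache-inequality policy.

    ```python
    from collections import OrderedDict

    class MasterDataCache:
        def __init__(self, capacity, disk_lookup):
            self.capacity = capacity
            self.disk_lookup = disk_lookup   # fallback to disk-based master data
            self.cache = OrderedDict()       # LRU order: least recent first

        def get(self, key):
            if key in self.cache:
                self.cache.move_to_end(key)  # hit: mark as recently used
                return self.cache[key]
            value = self.disk_lookup(key)    # miss: probe the disk index
            self.cache[key] = value
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            return value
    ```

    The memory-allocation question the article addresses is exactly how large such a cache should be relative to the operator's other components, since every byte given to the cache is taken from buffers elsewhere in the join.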