
    Elastic Infrastructure for Joining Stream Data

    In this work we aim to improve the performance of business intelligence applications, an important part of which is the Extraction-Transformation-Loading (ETL) process. The vast majority of ETL processes involve very expensive joins between 'fresh' stream data and disk-stored relational data. We based our solution on an existing algorithm, the Semi-Streamed Index Join (SSIJ), which successfully handles ETL-style joins on a single compute node with very promising performance results. But we live in the era of information explosion: large corporations collect and store terabytes of data every day, so it is necessary to move to a solution that uses multiple compute nodes.
We developed an elastic distributed architecture whose main concern is the fair distribution of SSIJ's computational load across multiple nodes. We have developed algorithms that efficiently direct the stream to the cluster's nodes in order to make caching as effective as possible. We can also add or remove compute nodes dynamically, depending on the volume and speed of the stream traffic, in order to keep system performance stable while avoiding wasting valuable resources. In the implementation of this work we used containerized compute nodes that operate in a cluster of virtual machines. We based our containerization on Docker and used the Kubernetes platform for the organization and scheduling of the containers. Our experiments were conducted on the Google Cloud Platform.
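The key idea of routing stream tuples so that per-node caching of master data pays off can be sketched as follows. This is a hypothetical illustration, not the paper's implementation; the node names, key format, and use of a plain hash (rather than the consistent hashing a production system would want) are all assumptions.

```python
import hashlib

class StreamRouter:
    """Route stream tuples to worker nodes by join key, so each node
    caches a disjoint slice of the disk-based master data."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # current worker node ids (hypothetical names)

    def route(self, join_key):
        # A stable hash keeps identical keys on the same node across calls,
        # which is what makes each node's master-data cache effective.
        h = int(hashlib.md5(str(join_key).encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def rescale(self, nodes):
        # Elastic scaling: adding/removing nodes remaps a fraction of keys.
        # Consistent hashing would limit the resulting cache churn.
        self.nodes = list(nodes)

router = StreamRouter(["node-0", "node-1", "node-2"])
assignments = {k: router.route(k) for k in ["cust-17", "cust-42"]}
```

Routing by join key rather than round-robin trades perfect load balance for cache locality, which is the trade-off the architecture above manages.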

    Optimising HYBRIDJOIN to Process Semi-Stream Data in Near-real-time Data Warehousing

    Near-real-time data warehousing plays an essential role in decision making in organizations where the latest data must be fed from various data sources on a near-real-time basis. The stream of sales data coming from data sources needs to be transformed to the data warehouse format using disk-based master data. This transformation is a challenging task because of the slow disk access rate compared with the fast stream data. For this purpose, an adaptive semi-stream join algorithm called HYBRIDJOIN (Hybrid Join) has been presented in the literature. The algorithm uses a single buffer to load partitions from the master data, so it has to wait until the next disk partition overwrites the existing partition in the buffer. As loading a disk partition into the buffer is a major component of the algorithm's total processing cost, this leaves its performance sub-optimal. This paper presents an optimisation of the existing HYBRIDJOIN that introduces a second buffer. This enables the algorithm to load one buffer while the other is under join execution, reducing the time the algorithm waits for master-data partitions to load and consequently improving its performance significantly.
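The double-buffering idea above can be sketched in a few lines: a background thread prefetches the next master-data partition while the foreground thread joins the current one, hiding disk latency behind join work. This is a hypothetical Python sketch, not the paper's implementation; the `load`/`join` callables and the toy data are assumptions.

```python
import threading
import queue

def double_buffered_join(load, join, partition_ids):
    """Overlap partition loading with join execution using two buffers:
    the queue (capacity 1) acts as the second buffer being prefetched."""
    q = queue.Queue(maxsize=1)  # at most one prefetched partition in flight

    def prefetch():
        for pid in partition_ids:
            q.put(load(pid))  # blocks while the previous partition is unread
        q.put(None)           # sentinel: no more partitions

    threading.Thread(target=prefetch, daemon=True).start()
    out = []
    while (part := q.get()) is not None:
        out.extend(join(part))  # join runs while the loader fetches the next part
    return out

# Toy usage: partitions are lists of (key, value) master rows.
master = {0: [("a", 1)], 1: [("b", 2)]}
stream_keys = {"a", "b"}
res = double_buffered_join(
    load=lambda pid: master[pid],
    join=lambda rows: [(k, v) for k, v in rows if k in stream_keys],
    partition_ids=[0, 1],
)
```

With a single buffer the load and join phases serialize; with two, total time approaches the maximum of the two phases rather than their sum.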

    A Memory-Optimal Many-To-Many Semi-Stream Join


    Evaluation of Data Enrichment Methods for Distributed Stream Processing Systems

    Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generation from sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. Distributed stream processing (DSP) systems provide a solution to this challenge, offering high horizontal scalability, fault-tolerant execution, and the ability to process data streams from multiple sources in a single DSP job. Often enough, though, data streams need to be enriched with extra information for correct processing, which introduces additional dependencies and potential bottlenecks. In this paper, we present an in-depth evaluation of data enrichment methods for DSP systems and identify the different use cases for stream processing in modern systems. Using a representative DSP system and conducting the evaluation in a realistic cloud environment, we found that outsourcing enrichment data to the DSP system can improve performance for specific use cases. However, the increased resource consumption highlights the need for stream processing solutions specifically designed for the performance-intensive workloads of cloud-based applications.
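The enrichment pattern being evaluated, joining each event with reference data held inside the job, can be sketched minimally as follows. The event and reference schemas here are hypothetical, and real DSP systems (e.g. Flink's keyed state) manage this cache with state backends rather than a plain dict.

```python
def enrich(events, reference, cache=None):
    """Enrich each stream event with reference data, caching lookups
    inside the job so repeated keys avoid external fetches."""
    cache = cache if cache is not None else {}
    for ev in events:
        key = ev["user_id"]
        if key not in cache:                  # cache miss: hit the reference store
            cache[key] = reference.get(key, {})
        yield {**ev, **cache[key]}            # emit the enriched event

ref = {7: {"country": "NZ"}}
out = list(enrich([{"user_id": 7, "clicks": 3}], ref))
```

Holding the reference data in-job trades memory for lookup latency, which matches the paper's finding that the approach helps some use cases at the cost of higher resource consumption.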

    Recurring Query Processing on Big Data

    The advances in hardware, software, and networks have enabled applications from business enterprises and scientific and engineering disciplines to social networks to generate data at a volume, variety, velocity, and veracity not possible before. Innovation in these domains is thus now hindered by their ability to analyze and discover knowledge from the collected data in a timely and scalable fashion. To facilitate such large-scale big data analytics, the MapReduce computing paradigm and its open-source implementation Hadoop are among the most popular and widely used technologies. Hadoop's success as a competitor to traditional parallel database systems lies in its simplicity, ease of use, flexibility, automatic fault tolerance, superior scalability, and cost effectiveness due to its use of inexpensive commodity hardware that can scale to petabytes of data over thousands of machines. Recurring queries, repeatedly executed for long periods of time over rapidly evolving high-volume data, have become a bedrock component in most of these analytic applications. Efficient execution and optimization techniques must be designed to assure the responsiveness and scalability of these recurring queries. In this dissertation, we thoroughly investigate topics in the area of recurring query processing on big data. First, we propose a novel scalable infrastructure called Redoop that treats recurring queries over big evolving data as first-class citizens during query processing. This is in contrast to state-of-the-art MapReduce/Hadoop systems, which face significant challenges with recurring queries, including redundant computations, significant latencies, and huge application development efforts. Redoop offers innovative window-aware optimization techniques for recurring query execution, including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms.
Redoop retains the fault tolerance of MapReduce via automatic cache recovery and task re-execution support as well. Second, we address the crucial need to accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated data sets, e.g., the latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. On top of Redoop, we built Helix, a scalable multi-query sharing engine tailored for recurring workloads in the MapReduce infrastructure. Helix deploys new sliced window-alignment techniques to create sharing opportunities among recurring queries without introducing additional I/O overheads or unnecessary data scans. Furthermore, Helix introduces a cost/benefit model for creating a sharing plan among the recurring queries, and a scheduling strategy for executing them that maximizes SLA satisfaction. Third, recurring analytics queries tend to be expensive, especially when query processing consumes data sets in the hundreds of terabytes or more. Time-sensitive recurring queries, such as fraud detection, often come with tight response-time constraints as query deadlines. Data sampling is a popular technique for computing approximate results with an acceptable error bound while reducing high-demand resource consumption and thus improving query turnaround times. In this dissertation, we propose Faro, the first fast approximate query engine for recurring workloads in the MapReduce infrastructure.
Faro introduces two key innovations: (1) a deadline-aware sampling strategy that builds samples from the original data with reduced sample sizes compared to uniform sampling, and (2) adaptive resource allocation strategies that maximally improve the approximate results while still meeting the response-time requirements specified in recurring queries. In our comprehensive experimental study of each part of this dissertation, we demonstrate the superiority of the proposed strategies over state-of-the-art techniques in scalability, effectiveness, and robustness.
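The deadline-aware sampling idea can be illustrated with a toy sketch: choose the sample size from a per-item time budget, then scale the aggregate back up by the sampling fraction. This is a hypothetical illustration in the spirit of Faro, not its actual strategy; the cost model and the sum aggregate are assumptions.

```python
import random

def deadline_aware_sample_sum(data, per_item_cost, deadline):
    """Estimate sum(data) within a time budget: cap the sample so the
    estimated processing time fits the deadline, then scale the result
    by the inverse sampling fraction (a Horvitz-Thompson-style estimate)."""
    budget = max(1, int(deadline / per_item_cost))  # items we can afford
    if budget >= len(data):
        return sum(data), 1.0                       # exact answer fits the deadline
    sample = random.sample(data, budget)
    fraction = budget / len(data)
    return sum(sample) / fraction, fraction         # unbiased scaled estimate

# When the deadline covers the whole input, the answer is exact.
exact, frac = deadline_aware_sample_sum([1, 2, 3], per_item_cost=1.0, deadline=10.0)
```

Tighter deadlines shrink the sample and widen the error bound, which is exactly the accuracy-for-latency trade the abstract describes.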

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
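The programming model surveyed above reduces to three phases, which the classic word-count example makes concrete. A minimal single-process sketch (the real framework distributes these phases across machines and handles shuffling and fault tolerance itself):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # map: emit a (key, value) pair per word
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # shuffle: group all values by key (the framework does this between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each key's values independently
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "data pipelines"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, docs))))
```

Because map calls are independent and reduce operates per key, both phases parallelize trivially, which is the property the survey's systems all build on.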