69 research outputs found

    Stream Processing Systems Benchmark: StreamBench

    Get PDF
    Batch processing technologies (Such as MapReduce, Hive, Pig) have matured and been widely used in the industry. These systems solved the issue processing big volumes of data successfully. However, first big amount of data need to be collected and stored in a database or file system. That is very time-consuming. Then it takes time to finish batch processing analysis jobs before get any results. While there are many cases that need analysed results from unbounded sequence of data in seconds or sub-seconds. To satisfy the increasing demand of processing such streaming data, several streaming processing systems are implemented and widely adopted, such as Apache Storm, Apache Spark, IBM InfoSphere Streams, and Apache Flink. They all support online stream processing, high scalability, and tasks monitoring. While how to evaluate stream processing systems before choosing one in production development is an open question. In this thesis, we introduce StreamBench, a benchmark framework to facilitate performance comparisons of stream processing systems. A common API component and a core set of workloads are defined. We implement the common API and run benchmarks for three widely used open source stream processing systems: Apache Storm, Flink, and Spark Streaming. A key feature of the StreamBench framework is that it is extensible -- it supports easy definition of new workloads, in addition to making it easy to benchmark new stream processing systems

    A Comparison of Real Time Stream Processing Frameworks

    Get PDF
    The need to process the ever-expanding volumes of information being generated daily in the modern world is driving radical changes in traditional data analysis techniques. As a result of this, a number of open source tools for handling real-time data streams has become available in recent years. Four, in particular, have gained significant traction: Apache Flink, Apache Samza, Apache Spark and Apache Storm. Despite the rising popularity of these frameworks, however, there are few studies that analyse their performance in terms of important metrics, such as throughput and latency. This study aims to correct this, by running several benchmarks against these frameworks

    Building interactive distributed processing applications at a global scale

    Get PDF
    Along with the continuous engagement with technology, many latency-sensitive interactive applications have emerged, e.g., global content sharing in social networks, adaptive lights/temperatures in smart buildings, and online multi-user games. These applications typically process a massive amount of data at a global scale. In this cases, distributing storage and processing is key to handling the large scale. Distribution necessitates handling two main aspects: a) the placement of data/processing and b) the data motion across the distributed locations. However, handling the distribution while meeting latency guarantees at large scale comes with many challenges around hiding heterogeneity and diversity of devices and workload, handling dynamism in the environment, providing continuous availability despite failures, and supporting persistent large state. In this thesis, we show how latency-driven designs for placement and data-motion can be used to build production infrastructures for interactive applications at a global scale, while also being able to address myriad challenges on heterogeneity, dynamism, state, and availability. We demonstrate a latency-driven approach is general and applicable at all layers of the stack: from storage, to processing, down to networking. We designed and built four distinct systems across the spectrum. We have developed Ambry (collaboration with LinkedIn), a geo-distributed storage system for interactive data sharing across the globe. Ambry is LinkedIn's mainstream production system for all its media content running across 4 datacenters and over 500 million users. Ambry minimizes user perceived latency via smart data placement and propagation. Second, we have built two processing systems, a traditional model, Samza, and the avant-garde model, Steel. Samza (collaboration with LinkedIn) is a production stream processing framework used at 15 companies (including LinkedIn, Uber, Netflix, and TripAdvisor), powering >200 pipelines at LinkedIn alone. Samza minimizes the impact of data motion on the end-to-end latency, thus, enabling large persistent state (100s of TB) along with processing. Steel (collaboration with Microsoft) extends processing to the emerging edge. Integrated with Azure, Steel dynamically optimizes placement and data-motion across the entire edge-cloud environment. Finally, we have designed FreeFlow, a high performance networking mechanisms for containers. Using the container placement, FreeFlow opportunistically bypasses networking layers, minimizing data motion and reducing latency (up to 3 orders of magnitude)

    Scalability Benchmarking of Cloud-Native Applications Applied to Event-Driven Microservices

    Get PDF
    Cloud-native applications constitute a recent trend for designing large-scale software systems. This thesis introduces the Theodolite benchmarking method, allowing researchers and practitioners to conduct empirical scalability evaluations of cloud-native applications, their frameworks, configurations, and deployments. The benchmarking method is applied to event-driven microservices, a specific type of cloud-native applications that employ distributed stream processing frameworks to scale with massive data volumes. Extensive experimental evaluations benchmark and compare the scalability of various stream processing frameworks under different configurations and deployments, including different public and private cloud environments. These experiments show that the presented benchmarking method provides statistically sound results in an adequate amount of time. In addition, three case studies demonstrate that the Theodolite benchmarking method can be applied to a wide range of applications beyond stream processing

    DRS: Dynamic Resource Scheduling for Real-Time Analytics over Fast Streams

    Full text link
    In a data stream management system (DSMS), users register continuous queries, and receive result updates as data arrive and expire. We focus on applications with real-time constraints, in which the user must receive each result update within a given period after the update occurs. To handle fast data, the DSMS is commonly placed on top of a cloud infrastructure. Because stream properties such as arrival rates can fluctuate unpredictably, cloud resources must be dynamically provisioned and scheduled accordingly to ensure real-time response. It is quite essential, for the existing systems or future developments, to possess the ability of scheduling resources dynamically according to the current workload, in order to avoid wasting resources, or failing in delivering correct results on time. Motivated by this, we propose DRS, a novel dynamic resource scheduler for cloud-based DSMSs. DRS overcomes three fundamental challenges: (a) how to model the relationship between the provisioned resources and query response time (b) where to best place resources; and (c) how to measure system load with minimal overhead. In particular, DRS includes an accurate performance model based on the theory of \emph{Jackson open queueing networks} and is capable of handling \emph{arbitrary} operator topologies, possibly with loops, splits and joins. Extensive experiments with real data confirm that DRS achieves real-time response with close to optimal resource consumption.Comment: This is the our latest version with certain modificatio

    Weiterentwicklung analytischer Datenbanksysteme

    Get PDF
    This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, which is the core problem in cost-based query optimization.Diese Arbeit trägt zum aktuellen Forschungsstand von analytischen Datenbanksystemen bei. Wir identifizieren und explorieren Erweiterungen um Analysen auf Eventströmen besser zu unterstützen. Wir stellen eine neue Indexstruktur für Polygone vor, die eine effiziente Verarbeitung von Geodaten im Hauptspeicher ermöglicht. Zudem präsentieren wir einen neuen Ansatz für Kardinalitätsschätzungen mittels maschinellen Lernens

    Scalable processing of aggregate functions for data streams in resource-constrained environments

    Get PDF
    The fast evolution of data analytics platforms has resulted in an increasing demand for real-time data stream processing. From Internet of Things applications to the monitoring of telemetry generated in large datacenters, a common demand for currently emerging scenarios is the need to process vast amounts of data with low latencies, generally performing the analysis process as close to the data source as possible. Devices and sensors generate streams of data across a diversity of locations and protocols. That data usually reaches a central platform that is used to store and process the streams. Processing can be done in real time, with transformations and enrichment happening on-the-fly, but it can also happen after data is stored and organized in repositories. In the former case, stream processing technologies are required to operate on the data; in the latter batch analytics and queries are of common use. Stream processing platforms are required to be malleable and absorb spikes generated by fluctuations of data generation rates. Data is usually produced as time series that have to be aggregated using multiple operators, being sliding windows one of the most common abstractions used to process data in real-time. To satisfy the above-mentioned demands, efficient stream processing techniques that aggregate data with minimal computational cost need to be developed. However, data analytics might require to aggregate extensive windows of data. Approximate computing has been a central paradigm for decades in data analytics in order to improve the performance and reduce the needed resources, such as memory, computation time, bandwidth or energy. In exchange for these improvements, the aggregated results suffer from a level of inaccuracy that in some cases can be predicted and constrained. This doctoral thesis aims to demonstrate that it is possible to have constant-time and memory efficient aggregation functions with approximate computing mechanisms for constrained environments. In order to achieve this goal, the work has been structured in three research challenges. First we introduce a runtime to dynamically construct data stream processing topologies based on user-supplied code. These dynamic topologies are built on-the-fly using a data subscription model de¿ned by the applications that consume data. The subscription-based programing model enables multiple users to deploy their own data-processing services. On top of this runtime, we present the Amortized Monoid Tree Aggregator general sliding window aggregation framework, which seamlessly combines the following features: amortized O(1) time complexity and a worst-case of O(log n) between insertions; it provides both a window aggregation mechanism and a window slide policy that are user programmable; the enforcement of the window sliding policy exhibits amortized O(1) computational cost for single evictions and supports bulk evictions with cost O(log n); and it requires a local memory space of O(log n). The framework can compute aggregations over multiple data dimensions, and has been designed to support decoupling computation and data storage through the use of distributed Key-Value Stores to keep window elements and partial aggregations. Specially motivated by edge computing scenarios, we contribute Approximate and Amortized Monoid Tree Aggregator (A2MTA). It is, to our knowledge, the first general purpose sliding window programable framework that combines constant-time aggregations with error bounded approximate computing techniques. A2MTA uses statistical analysis of the stream data in order to perform inaccurate aggregations, providing a critical reduction of needed resources for massive stream data aggregation, and an improvement of performance.La ràpida evolució de les plataformes d'anàlisi de dades ha resultat en un increment de la demanda de processament de fluxos continus de dades en temps real. Des de la internet de les coses fins al monitoratge de telemetria generada en grans servidors, una demanda recurrent per escenaris emergents es la necessitat de processar grans quantitats de dades amb latències molt baixes, generalment fent el processat de les dades tant a prop dels origines com sigui possible. Les dades son generades com a fluxos continus per dispositius que utilitzen una varietat de localitzacions i protocols. Aquests processat de les dades s pot fer en temps real amb les transformacions efectuant-se al vol, i en aquest cas la utilització de plataformes de processat d'streams és necessària. Les plataformes de processat d'streams cal que absorbeixin pics de freqüència de dades. Les dades es generen com a series temporals que s'agreguen fent servir multiples operadors, on les finestres són l'abstracció més habitual. Per a satisfer les baixes latències i maleabilitat requerides, els operadors necesiten tenir un cost computacional mínim, inclús amb extenses finestres de dades per a agregar. La computació aproximada ha sigut durant decades un paradigma rellevant per l'anàlisi de dades on cal millorar el rendiment de diferents algorismes i reduir-ne el temps de computació, la memòria requerida, l'ample de banda o el consum energètic. A canvi d'aquestes millores, els resultats poden patir d'una falta d'exactitud que pot ser estimada i controlada. Aquesta tesi doctoral vol demostrar que es posible tenir funcions d'agregació pel processat d'streams que tinc un cost de temps constant, sigui eficient en termes de memoria i faci ús de computació aproximada. Per aconseguir aquests objectius, aquesta tesi està dividida en tres reptes. Primer presentem un entorn per a la construcció dinàmica de topologies de computació d'streams de dades utilitzant codi d'usuari. Aquestes topologies es construeixen fent servir un model de subscripció a streams, en el que les aplicación consumidores de dades amplien les topologies mentre s'estan executant. Aquest entorn permet multiples entitats ampliant una mateixa topologia. A sobre d'aquest entorn, presentem un framework de propòsit general per a l'agregació de finestres de dades anomenat AMTA (Amortized Monoid Tree Aggregator). Aquest framework combina: temps amortitzat constant per a totes les operacions, amb un cas pitjor logarítmic; programable tant en termes d'agregació com en termes d'expulsió d'elements de la finestra. L'expulsió massiva d'elements de la finestra es considera una operació atòmica, amb un cost amortitzat constant; i requereix espai en memoria local per a O(log n) elements de la finestra. Aquest framework pot computar agregacions sobre multiples dimensions de dades, i ha estat dissenyat per desacoplar la computació de les dades del seu desat, podent tenir els continguts de la finestra distribuits en diferents màquines. Motivats per la computació en l'edge (edge computing), hem contribuit A2MTA (Approximate and Amortized Monoid Tree Aggregator). Des de el nostre coneixement, es el primer framework de propòsit general per a la computació de finestres que combina un cost constant per a totes les seves operacions amb tècniques de computació aproximada amb control de l'error. A2MTA fa us d'anàlisis estadístics per a poder fer agregacions amb error limitat, reduint críticament els recursos necessaris per a la computació de grans quantitats de dades
    corecore