
    PiCo: A Domain-Specific Language for Data Analytics Pipelines

    In the world of Big Data analytics, there is a series of tools aiming to simplify the programming of applications to be executed on clusters. Although each tool claims to provide better programming, data, and execution models, for which only informal (and often confusing) semantics is generally provided, they all share a common underlying model, namely the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects of Big Data analytics tools from a high-level perspective. This analysis can be considered a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low-level with high-level aspects, as often happens in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simpler semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language that sits on top of a stack of layers building a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm, and we show how the analyzed tools fit in each level. Second, we propose a programming environment based on this layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user a unique interface for both stream and batch processing, completely hiding data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared-memory and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world.
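
    As a rough sketch of the pipeline-composition idea (this is not the actual PiCo API; the Pipeline class, its add() method, and the stages below are hypothetical), the following C++ code composes a linear pipeline of operators over a collection of strings:

```cpp
// Minimal sketch of a PiCo-like pipeline-composition idea (not the real PiCo API).
// Stage names and the chaining style are illustrative assumptions only.
#include <cctype>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A stage transforms a batch of strings into a batch of strings.
using Stage = std::function<std::vector<std::string>(const std::vector<std::string>&)>;

// A Pipeline is an ordered composition of stages (a linear DAG, for simplicity).
class Pipeline {
public:
    Pipeline& add(Stage s) { stages_.push_back(std::move(s)); return *this; }
    std::vector<std::string> run(std::vector<std::string> data) const {
        for (const auto& s : stages_) data = s(data);   // apply stages in order
        return data;
    }
private:
    std::vector<Stage> stages_;
};

int main() {
    Pipeline p;
    p.add([](const std::vector<std::string>& in) {      // "map": upper-case each word
            std::vector<std::string> out;
            for (auto w : in) {
                for (auto& c : w) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
                out.push_back(w);
            }
            return out;
         })
     .add([](const std::vector<std::string>& in) {      // "filter": keep short words
            std::vector<std::string> out;
            for (const auto& w : in) if (w.size() <= 5) out.push_back(w);
            return out;
         });

    for (const auto& w : p.run({"stream", "batch", "data", "pipelines"}))
        std::cout << w << "\n";                          // prints BATCH, then DATA
    return 0;
}
```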

    A PROCRUSTEAN APPROACH TO STREAM PROCESSING

    The increasing demand for real-time data processing and the constantly growing data volume have contributed to the rapid evolution of Stream Processing Engines (SPEs), which are designed to continuously process data as it arrives. Low operational cost and timely delivery of results are both objectives of paramount importance for SPEs. Given the volatile and uncharted nature of data streams, achieving the aforementioned goals under fixed resources is a challenge. This calls for adaptable SPEs, which can react to fluctuations in processing demands. In the past, three techniques have been developed for improving an SPE's ability to adapt. These techniques are classified based on whether applications require exact or approximate results: stream partitioning and re-partitioning target exact processing, while load shedding targets approximate processing. Stream partitioning strives to balance load among processors, but previous techniques neglected hidden costs of distributed execution. Load shedding lowers the accuracy of results by dropping part of the input, but previous techniques did not cope with evolving streams. Stream re-partitioning is used to reconfigure execution while processing takes place, but previous techniques did not fully utilize window semantics. In this dissertation, we put stream processing in a procrustean bed, in terms of the manner and the degree to which processing takes place. To this end, we present new approaches for window-based aggregate operators, which are applicable to both exact and approximate stream processing in modern SPEs. Our stream partitioning, re-partitioning, and load shedding solutions offer improvements in performance and accuracy on real-world data by exploiting the semantics of both data and operations. In addition, we present SPEAr, the design of an SPE that accelerates processing by delivering approximate results with accuracy guarantees and avoiding unnecessary load. Finally, we contribute a hybrid technique, ShedPart, which can further improve the load balance and performance of an SPE.
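
    The dissertation's own partitioning and shedding algorithms are not reproduced here; as a generic, hypothetical illustration of how a stream partitioner can adapt to skewed load, the sketch below routes keyed tuples to the less loaded of two hash-chosen candidate workers (the "two choices" scheme):

```cpp
// Generic illustration of key-based stream partitioning with "two choices" load balancing.
// Not the partitioning algorithm proposed in the dissertation; all names are hypothetical.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

class TwoChoicePartitioner {
public:
    explicit TwoChoicePartitioner(std::size_t workers) : load_(workers, 0) {}

    // Route a keyed tuple: hash the key with two different salts and pick the
    // candidate worker that has received fewer tuples so far.
    std::size_t route(const std::string& key) {
        std::size_t a = std::hash<std::string>{}(key) % load_.size();
        std::size_t b = std::hash<std::string>{}(key + "#salt") % load_.size();
        std::size_t target = (load_[a] <= load_[b]) ? a : b;
        ++load_[target];
        return target;
    }

private:
    std::vector<std::uint64_t> load_;   // tuples sent to each worker so far
};

int main() {
    TwoChoicePartitioner part(4);
    // A skewed stream: one hot key plus a few rare ones.
    std::vector<std::string> keys = {"hot", "hot", "hot", "rare1", "hot", "rare2", "hot", "hot"};
    for (const auto& k : keys)
        std::cout << "key " << k << " -> worker " << part.route(k) << "\n";
    return 0;
}
```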

    Streaming the Web: Reasoning over dynamic data.

    In the last few years a new research area, called stream reasoning, has emerged to bridge the gap between reasoning and stream processing. While current reasoning approaches are designed to work mainly on static data, the Web is extremely dynamic: information is frequently changed and updated, and new data is continuously generated from a huge number of sources, often at a high rate. In other words, fresh information is constantly made available in the form of streams of new data and updates. Despite some promising investigations in the area, stream reasoning is still in its infancy, both from the perspective of models and theories and from the perspective of the design and implementation of systems and tools. The aim of this paper is threefold: (i) we identify the requirements coming from different application scenarios and isolate the problems they pose; (ii) we survey existing approaches and proposals in the area of stream reasoning, highlighting their strengths and limitations; (iii) we draw a research agenda to guide future research and development in stream reasoning. In doing so, we also analyze related research fields to extract algorithms, models, techniques, and solutions that could be useful in the area of stream reasoning.

    A Survey on the Evolution of Stream Processing Systems

    Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems.

    Constant-time sliding window framework with reduced memory footprint and efficient bulk evictions

    The fast evolution of data analytics platforms has resulted in an increasing demand for real-time data stream processing. From Internet of Things applications to the monitoring of telemetry generated in large data centers, a common demand for currently emerging scenarios is the need to process vast amounts of data with low latencies, generally performing the analysis process as close to the data source as possible. Stream processing platforms are required to be malleable and absorb spikes generated by fluctuations of data generation rates. Data is usually produced as time series that have to be aggregated using multiple operators, with sliding windows being one of the most common abstractions used to process data in real time. To satisfy the above-mentioned demands, efficient stream processing techniques that aggregate data with minimal computational cost need to be developed. In this paper we present the Monoid Tree Aggregator general sliding window aggregation framework, which seamlessly combines the following features: amortized O(1) time complexity and a worst case of O(log n) between insertions; a window aggregation mechanism and a window slide policy that are both user programmable; enforcement of the window sliding policy with amortized O(1) computational cost for single evictions and support for bulk evictions with cost O(log n); and a local memory space requirement of O(log n). The framework can compute aggregations over multiple data dimensions, and has been designed to support decoupling computation and data storage through the use of distributed Key-Value Stores to keep window elements and partial aggregations. This project is partially supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595), by the Ministry of Economy of Spain under contract TIN2015-65316-P, by the Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493).
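
    The Monoid Tree Aggregator itself is not reproduced here; as a simpler illustration of why associativity (the monoid property) lets a sliding window be aggregated without rescanning it, the sketch below uses the classic two-stack trick, which achieves amortized O(1) insert and evict but only O(n) space, i.e., weaker guarantees than the O(log n) space and bulk evictions claimed for the framework. Class and method names are illustrative.

```cpp
// Associative (monoid-style) sliding-window aggregation with the two-stack technique:
// amortized O(1) insert and evict, O(n) space. NOT the Monoid Tree Aggregator, which
// additionally bounds memory and bulk evictions by O(log n).
#include <functional>
#include <initializer_list>
#include <iostream>
#include <stack>
#include <stdexcept>

template <typename T>
class TwoStackWindow {
public:
    // combine must be associative (a monoid operation); identity is its neutral element.
    TwoStackWindow(std::function<T(T, T)> combine, T identity)
        : op_(std::move(combine)), id_(identity) {}

    void insert(T v) {                       // push on the "back" stack with a running aggregate
        T agg = back_.empty() ? v : op_(back_.top().agg, v);
        back_.push({v, agg});
    }

    void evict() {                           // drop the oldest element from the "front" stack
        if (front_.empty()) flip();
        if (front_.empty()) throw std::runtime_error("window empty");
        front_.pop();
    }

    T query() const {                        // aggregate of the whole window
        T f = front_.empty() ? id_ : front_.top().agg;
        T b = back_.empty() ? id_ : back_.top().agg;
        return op_(f, b);
    }

private:
    struct Entry { T val; T agg; };
    void flip() {                            // move back -> front, recomputing suffix aggregates
        while (!back_.empty()) {
            T v = back_.top().val; back_.pop();
            T agg = front_.empty() ? v : op_(v, front_.top().agg);
            front_.push({v, agg});
        }
    }
    std::function<T(T, T)> op_;
    T id_;
    std::stack<Entry> front_, back_;
};

int main() {
    TwoStackWindow<long> w([](long a, long b) { return a + b; }, 0L);  // windowed sum
    for (long v : {3, 1, 4, 1, 5}) w.insert(v);
    std::cout << "sum of window = " << w.query() << "\n";   // 14
    w.evict();                                              // drop the oldest element (3)
    std::cout << "after evict    = " << w.query() << "\n";  // 11
    return 0;
}
```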

    Scalable processing of aggregate functions for data streams in resource-constrained environments

    The fast evolution of data analytics platforms has resulted in an increasing demand for real-time data stream processing. From Internet of Things applications to the monitoring of telemetry generated in large datacenters, a common demand for currently emerging scenarios is the need to process vast amounts of data with low latencies, generally performing the analysis process as close to the data source as possible. Devices and sensors generate streams of data across a diversity of locations and protocols. That data usually reaches a central platform that is used to store and process the streams. Processing can be done in real time, with transformations and enrichment happening on-the-fly, but it can also happen after data is stored and organized in repositories. In the former case, stream processing technologies are required to operate on the data; in the latter, batch analytics and queries are of common use. Stream processing platforms are required to be malleable and absorb spikes generated by fluctuations of data generation rates. Data is usually produced as time series that have to be aggregated using multiple operators, with sliding windows being one of the most common abstractions used to process data in real time. To satisfy the above-mentioned demands, efficient stream processing techniques that aggregate data with minimal computational cost need to be developed. However, data analytics might require aggregating extensive windows of data. Approximate computing has been a central paradigm in data analytics for decades, improving performance and reducing the resources needed, such as memory, computation time, bandwidth or energy. In exchange for these improvements, the aggregated results suffer from a level of inaccuracy that in some cases can be predicted and constrained. This doctoral thesis aims to demonstrate that it is possible to have constant-time, memory-efficient aggregation functions with approximate computing mechanisms for constrained environments. In order to achieve this goal, the work has been structured around three research challenges. First, we introduce a runtime to dynamically construct data stream processing topologies based on user-supplied code. These dynamic topologies are built on-the-fly using a data subscription model defined by the applications that consume data. The subscription-based programming model enables multiple users to deploy their own data-processing services. On top of this runtime, we present the Amortized Monoid Tree Aggregator general sliding window aggregation framework, which seamlessly combines the following features: amortized O(1) time complexity and a worst case of O(log n) between insertions; a window aggregation mechanism and a window slide policy that are both user programmable; enforcement of the window sliding policy with amortized O(1) computational cost for single evictions and support for bulk evictions with cost O(log n); and a local memory space requirement of O(log n). The framework can compute aggregations over multiple data dimensions, and has been designed to support decoupling computation and data storage through the use of distributed Key-Value Stores to keep window elements and partial aggregations. Especially motivated by edge computing scenarios, we contribute the Approximate and Amortized Monoid Tree Aggregator (A2MTA). It is, to our knowledge, the first general-purpose programmable sliding window framework that combines constant-time aggregations with error-bounded approximate computing techniques. A2MTA uses statistical analysis of the stream data to perform approximate aggregations, providing a critical reduction of the resources needed for massive stream data aggregation and an improvement in performance.
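
    The abstract does not spell out A2MTA's statistical machinery, so the following sketch only illustrates the general idea of error-bounded approximate aggregation: it estimates a window mean from a uniform sample and reports a confidence interval derived from Hoeffding's inequality. It is a generic stand-in, not the thesis's algorithm; approx_mean and its parameters are hypothetical.

```cpp
// Generic illustration of error-bounded approximate aggregation by uniform sampling.
// NOT the A2MTA algorithm; it only shows how a statistical bound (here, Hoeffding's
// inequality) turns a sampled aggregate into "estimate +/- epsilon".
#include <cmath>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

struct Estimate {
    double mean;     // sampled estimate of the window mean
    double epsilon;  // with probability >= 1 - delta, |true mean - mean| <= epsilon
};

// Sample k values uniformly (with replacement) from the window and apply Hoeffding's
// bound, assuming every value lies in [lo, hi].
Estimate approx_mean(const std::vector<double>& window, std::size_t k,
                     double lo, double hi, double delta, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, window.size() - 1);
    double sum = 0.0;
    for (std::size_t i = 0; i < k; ++i) sum += window[pick(rng)];
    double mean = sum / static_cast<double>(k);
    double eps = (hi - lo) * std::sqrt(std::log(2.0 / delta) / (2.0 * k));
    return {mean, eps};
}

int main() {
    std::mt19937 rng(42);
    std::vector<double> window(100000);
    std::uniform_real_distribution<double> gen(0.0, 10.0);
    for (auto& v : window) v = gen(rng);            // synthetic window contents in [0, 10]

    Estimate e = approx_mean(window, /*k=*/2000, /*lo=*/0.0, /*hi=*/10.0, /*delta=*/0.05, rng);
    std::cout << "approximate mean = " << e.mean
              << "  (+/- " << e.epsilon << " with 95% confidence)\n";
    return 0;
}
```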

    Proceedings of the International Workshop on Reactive Concepts in Knowledge Representation 2014

    These are the proceedings of the International Workshop on Reactive Concepts in Knowledge Representation (ReactKnow 2014), which took place on August 19th, 2014, in Prague, co-located with the 21st European Conference on Artificial Intelligence (ECAI 2014).

    Monitoring data streams

    Stream monitoring is concerned with analyzing data that is represented in the form of infinite streams. This field has gained prominence in recent years, as streaming data is generated in increasing volume and dimension in a variety of areas. It finds application in connection with monitoring industrial sensors, "smart" technology like smart houses and smart cars, and wearable devices used for medical and physiological monitoring, but also in environmental surveillance or finance. However, stream monitoring is a challenging task due to the diverse and changing nature of the streaming data and its high volume and high dimensionality, with thousands of sensors producing streams with millions of measurements over short time spans. Automated, scalable and efficient analysis of these streams can help to keep track of important events, highlight relevant aspects and provide better insights into the monitored system. In this thesis, we propose techniques adapted to these tasks in supervised and unsupervised settings, in particular Stream Classification and Stream Dependency Monitoring. After a motivating introduction, we introduce concepts related to streaming data and discuss technological frameworks that have emerged to deal with streaming data in the second chapter of this thesis. We introduce the notion of information-theoretic entropy as a useful basis for data monitoring in the third chapter. In the second part of the thesis, we present Probabilistic Hoeffding Trees, a novel approach to stream classification. We show how probabilistic learning greatly improves the flexibility of decision trees and their ability to adapt to changes in data streams. The general technique is applicable to a variety of classification models and is fast to compute without significantly greater memory cost compared to regular Hoeffding Trees. We show that our technique achieves results better than or on par with current state-of-the-art tree classification models on a variety of large, synthetic and real-life data sets. In the third part of the thesis, we concentrate on unsupervised monitoring of data streams. We use mutual information as an entropic measure to identify the most important relationships in a monitored system. By using the powerful concept of mutual information we can, first, capture relevant aspects in a great variety of data sources with different underlying concepts and possible relationships and, second, analyze theoretical and computational complexity. We present the MID and DIMID algorithms. They perform extremely efficiently on high-dimensional data streams and provide accurate results, outperforming state-of-the-art algorithms for dependency monitoring. In the fourth part of this thesis, we introduce delayed relationships as a further feature in the dependency analysis. In reality, the phenomenon monitored by, e.g., one type of sensor might depend on another, but the measurable effects can be delayed. This delay might be due to technical reasons, such as different stream processing speeds, or because the effects actually appear delayed over time. We present Loglag, the first algorithm that monitors dependency with respect to an optimal delay. It utilizes several approximation techniques to achieve competitive resource requirements. We demonstrate its scalability and accuracy on real-world data, and also give theoretical guarantees on its accuracy.
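
    As background for the dependency-monitoring part, the sketch below computes the naive plug-in estimate of mutual information I(X;Y) between two discretized streams over the most recent window, recomputed from a joint histogram. This only illustrates the quantity that algorithms such as MID, DIMID and Loglag track; the wholesale recomputation shown here is exactly what those (far more efficient) algorithms avoid. The WindowedMI class is hypothetical.

```cpp
// Naive plug-in estimator of mutual information between two discretized streams
// over the most recent window; an illustration of the monitored quantity, not of
// the efficient MID/DIMID algorithms.
#include <cmath>
#include <cstddef>
#include <deque>
#include <iostream>
#include <map>
#include <utility>

class WindowedMI {
public:
    explicit WindowedMI(std::size_t window) : window_(window) {}

    // Push one paired observation (x_t, y_t); evict the oldest once the window is full.
    void observe(int x, int y) {
        buf_.emplace_back(x, y);
        if (buf_.size() > window_) buf_.pop_front();
    }

    // I(X;Y) in nats, recomputed from scratch from the windowed joint histogram.
    double mutual_information() const {
        std::map<int, double> px, py;
        std::map<std::pair<int, int>, double> pxy;
        const double n = static_cast<double>(buf_.size());
        for (const auto& obs : buf_) {
            px[obs.first] += 1.0;
            py[obs.second] += 1.0;
            pxy[obs] += 1.0;
        }
        double mi = 0.0;
        for (const auto& kv : pxy) {
            double pj = kv.second / n;
            double pm = (px.at(kv.first.first) / n) * (py.at(kv.first.second) / n);
            mi += pj * std::log(pj / pm);
        }
        return mi;
    }

private:
    std::size_t window_;
    std::deque<std::pair<int, int>> buf_;
};

int main() {
    WindowedMI mi(1000);
    for (int t = 0; t < 1000; ++t)
        mi.observe(t % 4, (t % 4) / 2);   // y is a deterministic function of x
    std::cout << "I(X;Y) ~= " << mi.mutual_information() << " nats\n";  // about log 2 = 0.693
    return 0;
}
```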

    Raphtory: Modelling, Maintenance and Analysis of Distributed Temporal Graphs.

    Temporal graphs capture the development of relationships within data throughout time. This model fits naturally within a streaming architecture, where new events can be inserted directly into the graph upon arrival from a data source and be compared to related entities or historical state. However, the majority of graph processing systems only consider traditional graph analysis on static data, whilst those which do expand past this often only support batched updating and delta analysis across graph snapshots. In this work we define a temporal property graph model and the semantics for updating it in both a distributed and non-distributed context. We have built Raphtory, a distributed temporal graph analytics platform which maintains the full graph history in memory, leveraging the defined update semantics to insert streamed events directly into the model without batching or centralised ordering. In parallel with the ingestion, traditional and time-aware analytics may be performed on the most up-to-date version of the graph, as well as at any point throughout its history. The depth of history viewed from the perspective of a time point may also be varied to explore both short- and long-term patterns within the data. Through this we extract novel insights over a variety of use cases, including phenomena never seen before in social networks. Finally, we demonstrate Raphtory's ability to scale both vertically and horizontally, handling consistent throughput in excess of 100,000 updates a second alongside the ingestion and maintenance of graphs built from billions of events.
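
    As a minimal, hypothetical sketch of the temporal-graph idea (not Raphtory's distributed in-memory design), the code below keeps a per-vertex log of timestamped edge additions and removals and replays it to answer "neighbours of v at time t" queries at any point in the graph's history:

```cpp
// Minimal sketch of a temporal graph kept as a log of timestamped edge events,
// answering "who were v's neighbours at time t" queries. Raphtory's actual
// distributed design is far richer; names here are illustrative only.
#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <map>
#include <set>
#include <vector>

struct EdgeEvent {
    std::int64_t time;
    int src, dst;
    bool added;          // true = edge added, false = edge removed
};

class TemporalGraph {
public:
    // Streamed insertion: append the event to the source vertex's history.
    void ingest(EdgeEvent e) { events_[e.src].push_back(e); }

    // Replay the event history of v up to (and including) time t,
    // assuming events arrive per vertex in time order.
    std::set<int> neighbours_at(int v, std::int64_t t) const {
        std::set<int> out;
        auto it = events_.find(v);
        if (it == events_.end()) return out;
        for (const auto& e : it->second) {
            if (e.time > t) continue;
            if (e.added) out.insert(e.dst); else out.erase(e.dst);
        }
        return out;
    }

private:
    std::map<int, std::vector<EdgeEvent>> events_;   // per-vertex event log
};

int main() {
    TemporalGraph g;
    g.ingest({10, 1, 2, true});
    g.ingest({20, 1, 3, true});
    g.ingest({30, 1, 2, false});   // edge 1 -> 2 removed at t = 30
    for (std::int64_t t : {15, 25, 35}) {
        std::cout << "t=" << t << ": ";
        for (int n : g.neighbours_at(1, t)) std::cout << n << " ";
        std::cout << "\n";         // t=15: 2   t=25: 2 3   t=35: 3
    }
    return 0;
}
```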