130 research outputs found

    A Survey on the Evolution of Stream Processing Systems

    Full text link
    Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems.Comment: 34 pages, 15 figures, 5 table

    Quality-of-Service-Aware Data Stream Processing

    Get PDF
    Data stream processing in the industrial as well as in the academic field has gained more and more importance during the last years. Consider the monitoring of industrial processes as an example. There, sensors are mounted to gather lots of data within a short time range. Storing and post-processing these data may occasionally be useless or even impossible. On the one hand, only a small part of the monitored data is relevant. To efficiently use the storage capacity, only a preselection of the data should be considered. On the other hand, it may occur that the volume of incoming data is generally too high to be stored in time or–in other words–the technical efforts for storing the data in time would be out of scale. Processing data streams in the context of this thesis means to apply database operations to the stream in an on-the-fly manner (without explicitly storing the data). The challenges for this task lie in the limited amount of resources while data streams are potentially infinite. Furthermore, data stream processing must be fast and the results have to be disseminated as soon as possible. This thesis focuses on the latter issue. The goal is to provide a so-called Quality-of-Service (QoS) for the data stream processing task. Therefore, adequate QoS metrics like maximum output delay or minimum result data rate are defined. Thereafter, a cost model for obtaining the required processing resources from the specified QoS is presented. On that basis, the stream processing operations are scheduled. Depending on the required QoS and on the available resources, the weight can be shifted among the individual resources and QoS metrics, respectively. Calculating and scheduling resources requires a lot of expert knowledge regarding the characteristics of the stream operations and regarding the incoming data streams. Often, this knowledge is based on experience and thus, a revision of the resource calculation and reservation becomes necessary from time to time. This leads to occasional interruptions of the continuous data stream processing, of the delivery of the result, and thus, of the negotiated Quality-of-Service. The proposed robustness concept supports the user and facilitates a decrease in the number of interruptions by providing more resources

    Link-based similarity search to fight web spam

    Get PDF
    www.ilab.sztaki.hu/websearch We investigate the usability of similarity search in fighting Web spam based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances for the sole purpose of search engine ranking manipulation. The artificial nature and strong inside connectedness however gave rise to successful algorithms to identify search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, that yields spam classificators by spreading information along hyperlinks from white and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low dimensional projections and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms. 1

    Speculation in Parallel and Distributed Event Processing Systems

    Get PDF
    Event stream processing (ESP) applications enable the real-time processing of continuous flows of data. Algorithmic trading, network monitoring, and processing data from sensor networks are good examples of applications that traditionally rely upon ESP systems. In addition, technological advances are resulting in an increasing number of devices that are network enabled, producing information that can be automatically collected and processed. This increasing availability of on-line data motivates the development of new and more sophisticated applications that require low-latency processing of large volumes of data. ESP applications are composed of an acyclic graph of operators that is traversed by the data. Inside each operator, the events can be transformed, aggregated, enriched, or filtered out. Some of these operations depend only on the current input events, such operations are called stateless. Other operations, however, depend not only on the current event, but also on a state built during the processing of previous events. Such operations are, therefore, named stateful. As the number of ESP applications grows, there are increasingly strong requirements, which are often difficult to satisfy. In this dissertation, we address two challenges created by the use of stateful operations in a ESP application: (i) stateful operators can be bottlenecks because they are sensitive to the order of events and cannot be trivially parallelized by replication; and (ii), if failures are to be tolerated, the accumulated state of an stateful operator needs to be saved, saving this state traditionally imposes considerable performance costs. Our approach is to evaluate the use of speculation to address these two issues. For handling ordering and parallelization issues in a stateful operator, we propose a speculative approach that both reduces latency when the operator must wait for the correct ordering of the events and improves throughput when the operation in hand is parallelizable. In addition, our approach does not require that user understand concurrent programming or that he or she needs to consider out-of-order execution when writing the operations. For fault-tolerant applications, traditional approaches have imposed prohibitive performance costs due to pessimistic schemes. We extend such approaches, using speculation to mask the cost of fault tolerance.:1 Introduction 1 1.1 Event stream processing systems ......................... 1 1.2 Running example ................................. 3 1.3 Challenges and contributions ........................... 4 1.4 Outline ...................................... 6 2 Background 7 2.1 Event stream processing ............................. 7 2.1.1 State in operators: Windows and synopses ............................ 8 2.1.2 Types of operators ............................ 12 2.1.3 Our prototype system........................... 13 2.2 Software transactional memory.......................... 18 2.2.1 Overview ................................. 18 2.2.2 Memory operations............................ 19 2.3 Fault tolerance in distributed systems ...................................... 23 2.3.1 Failure model and failure detection ...................................... 23 2.3.2 Recovery semantics............................ 24 2.3.3 Active and passive replication ...................... 24 2.4 Summary ..................................... 26 3 Extending event stream processing systems with speculation 27 3.1 Motivation..................................... 27 3.2 Goals ....................................... 28 3.3 Local versus distributed speculation ....................... 29 3.4 Models and assumptions ............................. 29 3.4.1 Operators................................. 30 3.4.2 Events................................... 30 3.4.3 Failures .................................. 31 4 Local speculation 33 4.1 Overview ..................................... 33 4.2 Requirements ................................... 35 4.2.1 Order ................................... 35 4.2.2 Aborts................................... 37 4.2.3 Optimism control ............................. 38 4.2.4 Notifications ............................... 39 4.3 Applications.................................... 40 4.3.1 Out-of-order processing ......................... 40 4.3.2 Optimistic parallelization......................... 42 4.4 Extensions..................................... 44 4.4.1 Avoiding unnecessary aborts ....................... 44 4.4.2 Making aborts unnecessary........................ 45 4.5 Evaluation..................................... 47 4.5.1 Overhead of speculation ......................... 47 4.5.2 Cost of misspeculation .......................... 50 4.5.3 Out-of-order and parallel processing micro benchmarks ........... 53 4.5.4 Behavior with example operators .................... 57 4.6 Summary ..................................... 60 5 Distributed speculation 63 5.1 Overview ..................................... 63 5.2 Requirements ................................... 64 5.2.1 Speculative events ............................ 64 5.2.2 Speculative accesses ........................... 69 5.2.3 Reliable ordered broadcast with optimistic delivery .................. 72 5.3 Applications .................................... 75 5.3.1 Passive replication and rollback recovery ................................ 75 5.3.2 Active replication ............................. 80 5.4 Extensions ..................................... 82 5.4.1 Active replication and software bugs ..................................... 82 5.4.2 Enabling operators to output multiple events ........................ 87 5.5 Evaluation .................................... 87 5.5.1 Passive replication ............................ 88 5.5.2 Active replication ............................. 88 5.6 Summary ..................................... 93 6 Related work 95 6.1 Event stream processing engines ......................... 95 6.2 Parallelization and optimistic computing ................................ 97 6.2.1 Speculation ................................ 97 6.2.2 Optimistic parallelization ......................... 98 6.2.3 Parallelization in event processing .................................... 99 6.2.4 Speculation in event processing ..................... 99 6.3 Fault tolerance .................................. 100 6.3.1 Passive replication and rollback recovery ............................... 100 6.3.2 Active replication ............................ 101 6.3.3 Fault tolerance in event stream processing systems ............. 103 7 Conclusions 105 7.1 Summary of contributions ............................ 105 7.2 Challenges and future work ............................ 106 Appendices Publications 107 Pseudocode for the consensus protocol 10

    Handling Tradeoffs between Performance and Query-Result Quality in Data Stream Processing

    Get PDF
    Data streams in the form of potentially unbounded sequences of tuples arise naturally in a large variety of domains including finance markets, sensor networks, social media, and network traffic management. The increasing number of applications that require processing data streams with high throughput and low latency have promoted the development of data stream processing systems (DSPS). A DSPS processes data streams with continuous queries, which are issued once and return query results to users continuously as new tuples arrive. For stream-based applications, both the query-execution performance (in terms of, e.g., throughput and end-to-end latency) and the quality of produced query results (in terms of, e.g., accuracy and completeness) are important. However, a DSPS often needs to make tradeoffs between these two requirements, either because of the data imperfection within the streams, or because of the limited computation capacity of the DSPS itself. Performance versus result-quality tradeoffs caused by data imperfection are inevitable, because the quality of the incoming data is beyond the control of a DSPS, whereas tradeoffs caused by system limitations can be alleviated—even erased—by enhancing the DSPS itself. This dissertation seeks to advance the state of the art on handling the performance versus result-quality tradeoffs in data stream processing caused by the above two aspects of reasons. For tradeoffs caused by data imperfection, this dissertation focuses on the typical data-imperfection problem of stream disorder and proposes the concept of quality-driven disorder handling (QDDH). QDDH enables a DSPS to make flexible and user-configurable tradeoffs between the end-to-end latency and the query-result quality when dealing with stream disorder. Moreover, compared to existing disorder handling approaches, QDDH can significantly reduce the end-to-end latency, and at the same time provide users with desired query-result quality. In this dissertation, a generic buffer-based QDDH framework and three instantiations of the generic framework for distinct query types are presented. For tradeoffs caused by system limitations, this dissertation proposes a system-enhancement approach that combines the row-oriented and the column-oriented data layout and processing techniques in data stream processing to improve the throughput. To fully exploit the potential of such hybrid execution of continuous queries, a static, cost-based query optimizer is introduced. The optimizer works at the operator level and takes the unique property of execution plans of continuous queries—feasibility—into account

    Knowledge Modelling and Learning through Cognitive Networks

    Get PDF
    One of the most promising developments in modelling knowledge is cognitive network science, which aims to investigate cognitive phenomena driven by the networked, associative organization of knowledge. For example, investigating the structure of semantic memory via semantic networks has illuminated how memory recall patterns influence phenomena such as creativity, memory search, learning, and more generally, knowledge acquisition, exploration, and exploitation. In parallel, neural network models for artificial intelligence (AI) are also becoming more widespread as inferential models for understanding which features drive language-related phenomena such as meaning reconstruction, stance detection, and emotional profiling. Whereas cognitive networks map explicitly which entities engage in associative relationships, neural networks perform an implicit mapping of correlations in cognitive data as weights, obtained after training over labelled data and whose interpretation is not immediately evident to the experimenter. This book aims to bring together quantitative, innovative research that focuses on modelling knowledge through cognitive and neural networks to gain insight into mechanisms driving cognitive processes related to knowledge structuring, exploration, and learning. The book comprises a variety of publication types, including reviews and theoretical papers, empirical research, computational modelling, and big data analysis. All papers here share a commonality: they demonstrate how the application of network science and AI can extend and broaden cognitive science in ways that traditional approaches cannot

    Speculation in Parallel and Distributed Event Processing Systems

    Get PDF
    Event stream processing (ESP) applications enable the real-time processing of continuous flows of data. Algorithmic trading, network monitoring, and processing data from sensor networks are good examples of applications that traditionally rely upon ESP systems. In addition, technological advances are resulting in an increasing number of devices that are network enabled, producing information that can be automatically collected and processed. This increasing availability of on-line data motivates the development of new and more sophisticated applications that require low-latency processing of large volumes of data. ESP applications are composed of an acyclic graph of operators that is traversed by the data. Inside each operator, the events can be transformed, aggregated, enriched, or filtered out. Some of these operations depend only on the current input events, such operations are called stateless. Other operations, however, depend not only on the current event, but also on a state built during the processing of previous events. Such operations are, therefore, named stateful. As the number of ESP applications grows, there are increasingly strong requirements, which are often difficult to satisfy. In this dissertation, we address two challenges created by the use of stateful operations in a ESP application: (i) stateful operators can be bottlenecks because they are sensitive to the order of events and cannot be trivially parallelized by replication; and (ii), if failures are to be tolerated, the accumulated state of an stateful operator needs to be saved, saving this state traditionally imposes considerable performance costs. Our approach is to evaluate the use of speculation to address these two issues. For handling ordering and parallelization issues in a stateful operator, we propose a speculative approach that both reduces latency when the operator must wait for the correct ordering of the events and improves throughput when the operation in hand is parallelizable. In addition, our approach does not require that user understand concurrent programming or that he or she needs to consider out-of-order execution when writing the operations. For fault-tolerant applications, traditional approaches have imposed prohibitive performance costs due to pessimistic schemes. We extend such approaches, using speculation to mask the cost of fault tolerance.:1 Introduction 1 1.1 Event stream processing systems ......................... 1 1.2 Running example ................................. 3 1.3 Challenges and contributions ........................... 4 1.4 Outline ...................................... 6 2 Background 7 2.1 Event stream processing ............................. 7 2.1.1 State in operators: Windows and synopses ............................ 8 2.1.2 Types of operators ............................ 12 2.1.3 Our prototype system........................... 13 2.2 Software transactional memory.......................... 18 2.2.1 Overview ................................. 18 2.2.2 Memory operations............................ 19 2.3 Fault tolerance in distributed systems ...................................... 23 2.3.1 Failure model and failure detection ...................................... 23 2.3.2 Recovery semantics............................ 24 2.3.3 Active and passive replication ...................... 24 2.4 Summary ..................................... 26 3 Extending event stream processing systems with speculation 27 3.1 Motivation..................................... 27 3.2 Goals ....................................... 28 3.3 Local versus distributed speculation ....................... 29 3.4 Models and assumptions ............................. 29 3.4.1 Operators................................. 30 3.4.2 Events................................... 30 3.4.3 Failures .................................. 31 4 Local speculation 33 4.1 Overview ..................................... 33 4.2 Requirements ................................... 35 4.2.1 Order ................................... 35 4.2.2 Aborts................................... 37 4.2.3 Optimism control ............................. 38 4.2.4 Notifications ............................... 39 4.3 Applications.................................... 40 4.3.1 Out-of-order processing ......................... 40 4.3.2 Optimistic parallelization......................... 42 4.4 Extensions..................................... 44 4.4.1 Avoiding unnecessary aborts ....................... 44 4.4.2 Making aborts unnecessary........................ 45 4.5 Evaluation..................................... 47 4.5.1 Overhead of speculation ......................... 47 4.5.2 Cost of misspeculation .......................... 50 4.5.3 Out-of-order and parallel processing micro benchmarks ........... 53 4.5.4 Behavior with example operators .................... 57 4.6 Summary ..................................... 60 5 Distributed speculation 63 5.1 Overview ..................................... 63 5.2 Requirements ................................... 64 5.2.1 Speculative events ............................ 64 5.2.2 Speculative accesses ........................... 69 5.2.3 Reliable ordered broadcast with optimistic delivery .................. 72 5.3 Applications .................................... 75 5.3.1 Passive replication and rollback recovery ................................ 75 5.3.2 Active replication ............................. 80 5.4 Extensions ..................................... 82 5.4.1 Active replication and software bugs ..................................... 82 5.4.2 Enabling operators to output multiple events ........................ 87 5.5 Evaluation .................................... 87 5.5.1 Passive replication ............................ 88 5.5.2 Active replication ............................. 88 5.6 Summary ..................................... 93 6 Related work 95 6.1 Event stream processing engines ......................... 95 6.2 Parallelization and optimistic computing ................................ 97 6.2.1 Speculation ................................ 97 6.2.2 Optimistic parallelization ......................... 98 6.2.3 Parallelization in event processing .................................... 99 6.2.4 Speculation in event processing ..................... 99 6.3 Fault tolerance .................................. 100 6.3.1 Passive replication and rollback recovery ............................... 100 6.3.2 Active replication ............................ 101 6.3.3 Fault tolerance in event stream processing systems ............. 103 7 Conclusions 105 7.1 Summary of contributions ............................ 105 7.2 Challenges and future work ............................ 106 Appendices Publications 107 Pseudocode for the consensus protocol 10

    Allyn, A Recommender Assistant for Online Bookstores

    Get PDF
    Treballs Finals del Grau d'Economia i Estadística. Doble titulació interuniversitària, Universitat de Barcelona i Universitat Politècnica de Catalunya. Curs: 2017-2018. Tutors: Esteban Vegas Lozano; Salvador Torra Porras(eng) Recommender Systems are information filtering engines used to estimate user preferences on items they have not seen: books, movies, restaurants or other things for which individuals have different tastes. Collaborative and Content-based Filtering have been the two popular memory-based methods to retrieve recommendations but these suffer from some limitations and might fail to provide effective recommendations. In this project we present several variations of Artificial Neural Networks, and in particular, of Autoencoders to generate model-based predictions for the users. We empirically show that a hybrid approach combining this model with other filtering engines provides a promising solution when compared to a standalone memory-based Collaborative Filtering Recommender. To wrap up the project, a chatbot connected to an e-commerce platform has been implemented so that, using Artificial Intelligence, it can retrieve recommendations to users.(cat) Els Sistemes de Recomanació són motors de filtratge de la informació que permeten estimar les preferències dels usuaris sobre ítems que no coneixen a priori. Aquests poden ser des de llibres o películes fins a restaurants o qualsevol altre element en el qual els usuaris puguin presentar gustos diferenciats. El present projecte es centra en la recomanació de llibres. Es comença a parlar dels Sistemes de Recomanació al voltant de 1990 però és durant la darrera dècada amb el boom de la informació i les dades massives que comencen a tenir major repercussió. Tradicionalment, els mètodes utilitzats en aquests sistemes eren dos: el Filtratge Col·laboratiu i el Filtratge basat en Contingut. Tanmateix, ambdós són mètodes basats en memòria, fet que suposa diverses limitacions que poden arribar a portar a no propocionar recomanacions de manera eficient o precisa. En aquest projecte es presenten diverses variacions de Xarxes Neuronals Artificials per a generar prediccions basades en models. En concret, es desenvolupen Autoencoders, una estructura particular d’aquestes que es caracteritza per tenir la mateixa entrada i sortida. D’aquesta manera, els Autoencoders aprenen a descobrir els patrons subjacents en dades molt esparses. Tots aquests models s’implementen utilitzant dos marcs de programació: Keras i Tensorflow per a R. Es mostra empíricament que un enfocament híbrid que combina aquests models amb altres motors de filtratge proporciona una solució prometedora en comparació amb un recomanador que utilitza exclusivament Filtratge Col·laboratiu. D’altra banda, s’analitzen els sistemes de recomanació des d’un punt de vista econòmic, emfatitzant especialment el seu impacte en empreses de comerç electrònic. S’analitzen els sistemes de recomanació desenvolupats per quatre empreses pioneres del sector així com les tecnologies front-end en què s’implementen. En concret, s’analitza el seu ús en chatbots, programes informàtics de missatgeria instantània que, a través de la Intel·ligència Artificial simulen la conversa humana. Per tancar el projecte, es desenvolupa un chatbot propi implementat en una aplicació de missatgeria instantània i connectat a una empresa de comerç electrònic, capaç de donar recomanacions als usuaris fent ús del sistema de recomanació híbrid dut a terme

    Future Sight: Dynamic Story Generation with Large Pretrained Language Models

    Get PDF
    Automated story generation has been an open problem in computing for many decades. Only with the recent wave of deep learning research have neural networks been applied to automated story generation tasks. Current deep learning agents for automated story generation typically ingest a prompt or storyline on which to condition generated text. This approach lacks the dynamism to include elements of a story only decided by the model during inference. We build an interactive system using pretrained transformers finetuned on a novel objective to temporally interpolate between a story context c and a future plot event f. At inference time, users can suggest future plot events along with a distance, in sentences, to coerce a transformer decoder towards generating sentences that would both remain consistent with a story context and logically conclude with the future event. The results of our experiments demonstrate that there is a notion of adherence to both context and future in some, but not all, cases. We discuss in detail potential explanations as to why the model fails to condition on some contexts and futures with respect to the data and the parameters of our model. We include examples sampled from our model to motivate this discussion
    • …
    corecore