
    Sliding windows over uncertain data streams

    Uncertain data streams can have tuples with both value and existential uncertainty. A tuple has value uncertainty when it can assume multiple possible values. A tuple is existentially uncertain when the sum of the probabilities of its possible values is less than 1. A situation where existential uncertainty can arise is when applying relational operators to streams with value uncertainty. Several prior works have focused on querying and mining data streams with both value and existential uncertainty. However, none of them has studied, in depth, the implications of existential uncertainty for sliding window processing, even though it naturally arises when processing uncertain data. In this work, we study the challenges arising from existential uncertainty, more specifically the management of count-based sliding windows, which are a basic building block of stream processing applications. We extend the semantics of sliding windows to define the novel concept of uncertain sliding windows and provide both exact and approximate algorithms for managing windows under existential uncertainty. We also show how current state-of-the-art techniques for answering similarity join queries can be easily adapted to work with uncertain sliding windows. We evaluate our proposed techniques under a variety of configurations using real data. The results show that the algorithms used to maintain uncertain sliding windows operate efficiently while providing a high-quality approximation in query answering. In addition, we show that sort-based similarity join algorithms can perform better than index-based techniques (on 17 real datasets) when the number of possible values per tuple is low, as in many real-world applications. © 2014, Springer-Verlag London
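    As a concrete illustration of these semantics, the following is a minimal Python sketch (not the paper's algorithm) of a count-based window under existential uncertainty: the window retains the most recent tuples whose expected number of existing tuples, i.e., the sum of their existence probabilities, does not exceed the target size W. The class name and eviction rule are illustrative assumptions.

```python
from collections import deque

class UncertainCountWindow:
    """Count-based sliding window over tuples with existence probability.

    Keeps the most recent tuples whose expected number of existing tuples
    (the running sum of existence probabilities) does not exceed `size`,
    following the possible-worlds reading sketched above."""

    def __init__(self, size: int):
        self.size = size           # target expected tuple count W
        self.buffer = deque()      # (value, existence_probability)
        self.expected_count = 0.0  # running sum of probabilities

    def insert(self, value, prob: float):
        self.buffer.append((value, prob))
        self.expected_count += prob
        # Evict oldest tuples until the expected count fits the window.
        while self.expected_count > self.size and self.buffer:
            _, old_prob = self.buffer.popleft()
            self.expected_count -= old_prob

    def contents(self):
        return list(self.buffer)

# Usage: a tuple with probability 0.5 occupies half a slot in expectation.
w = UncertainCountWindow(size=2)
w.insert("a", 1.0)
w.insert("b", 0.5)
w.insert("c", 0.5)   # expected count is exactly 2; nothing is evicted
print(w.contents())  # all three tuples remain in the window
```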

    Designing Probabilistic Flow Counting over Sliding Windows

    Probabilistic approaches allow designing very efficient data structures and algorithms aimed at computing the number of flows within a given observation window. The practical applications are many, ranging from security to network monitoring and control. We focus our investigation on approaches tailored for sliding windows, which enable continuous-time measurements independently of the observation window. In particular, we show how to extend standard approaches, such as Probabilistic Counting with Stochastic Averaging (PCSA), to count over an observation window. The main idea is to modify the data structure to store a compact representation of the timestamp in the registers and to modify the related algorithms coherently. We propose a timestamp-augmented version of PCSA, denoted TS-PCSA, and compare it with state-of-the-art solutions based on HyperLogLog (HLL) counters that evaluate the cardinality over a sliding window, but without storing the timestamps. We show that TS-PCSA with a limited memory footprint achieves a different trade-off between memory and accuracy with respect to HLL-based solutions.
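    The following is a minimal sketch of the timestamp-in-register idea: a plain PCSA sketch whose bitmap cells store the last time they were set, so that a bit counts as set at query time only if its timestamp falls inside the sliding window. The register count, hash function, and bit-width are illustrative assumptions, not the paper's exact design.

```python
import hashlib
import time

PHI = 0.77351  # standard PCSA bias-correction constant

class TSPCSA:
    """Timestamp-augmented PCSA: each bitmap cell stores the last
    timestamp at which it was set, enabling sliding-window estimates."""

    def __init__(self, num_registers: int = 64, bits: int = 32):
        self.m = num_registers
        self.bits = bits
        # cells[i][j] = last time bit j of register i was set (None = never)
        self.cells = [[None] * bits for _ in range(num_registers)]

    def _hash(self, item) -> int:
        return int.from_bytes(
            hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item, ts: float = None):
        ts = time.time() if ts is None else ts
        h = self._hash(item)
        reg = h % self.m
        rest = h // self.m
        # bit position = number of trailing zeros of the remaining bits
        pos = (rest & -rest).bit_length() - 1 if rest else self.bits - 1
        self.cells[reg][min(pos, self.bits - 1)] = ts  # refresh timestamp

    def estimate(self, window: float, now: float = None) -> float:
        now = time.time() if now is None else now
        total = 0
        for reg in self.cells:
            # R = index of the lowest bit NOT set within the window
            r = 0
            while (r < self.bits and reg[r] is not None
                   and now - reg[r] <= window):
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)
```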

    In support of workload-aware streaming state management

    Modern distributed stream processors predominantly rely on LSM-based key-value stores to manage the state of long-running computations. We question the suitability of such general-purpose stores for streaming workloads and argue that they incur unnecessary overheads in exchange for state management capabilities. Since streaming operators are instantiated once and are long-running, state types, sizes, and access patterns can either be inferred at compile time or learned during execution. This paper surfaces the limitations of established practices for streaming state management and advocates for configurable streaming backends, tailored to the state requirements of each operator. Using workload-aware state management, we achieve an order of magnitude improvement in p99 latency and 2x higher throughput. https://www.usenix.org/system/files/hotstorage20_paper_kalavri.pdf (Published version)
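    A toy sketch of what such workload-aware configuration might look like: picking a state backend per operator from its compile-time known (or learned) state profile, instead of defaulting to a one-size-fits-all LSM store. The profile fields, thresholds, and backend names are hypothetical illustrations, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class OperatorProfile:
    """Hypothetical per-operator state profile, declared or learned."""
    state_size_bytes: int
    read_write_ratio: float  # reads per write
    keyed: bool

def choose_backend(p: OperatorProfile) -> str:
    # Small state fits comfortably on-heap: skip the LSM store entirely.
    if p.state_size_bytes < 64 * 2**20:
        return "in-memory-hashmap"
    # Write-heavy state rarely read back: an append-only log suffices.
    if p.read_write_ratio < 0.1:
        return "append-log"
    # Large state with mixed access: general-purpose LSM key-value store.
    return "lsm-kv-store"

print(choose_backend(OperatorProfile(2**20, 5.0, True)))  # in-memory-hashmap
```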

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them into a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.
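    One classic instance of computation sharing based on algebraic properties is the two-stacks algorithm, which supports sliding-window aggregation for any associative combine function in amortized O(1) time without requiring an inverse (so it also works for max/min). The sketch below is a generic illustration of the technique, not the thesis's specific implementation.

```python
class TwoStacksWindow:
    """Sliding-window aggregation for any associative combine function,
    via the classic two-stacks trick: amortized O(1) insert/evict/query."""

    def __init__(self, combine, identity):
        self.combine, self.identity = combine, identity
        self.front = []   # eviction side: (value, suffix_aggregate) pairs
        self.back = []    # insertion side
        self.back_agg = identity

    def insert(self, v):
        self.back.append(v)
        self.back_agg = self.combine(self.back_agg, v)

    def evict(self):
        """Remove the oldest element (assumes the window is non-empty)."""
        if not self.front:
            # Flip: move back stack to front, precomputing aggregates so
            # the top of `front` always carries the aggregate of all of it.
            agg = self.identity
            while self.back:
                v = self.back.pop()
                agg = self.combine(v, agg)
                self.front.append((v, agg))
            self.back_agg = self.identity
        self.front.pop()

    def query(self):
        front_agg = self.front[-1][1] if self.front else self.identity
        return self.combine(front_agg, self.back_agg)

# Usage: sliding max (non-invertible, so subtract-on-evict would not work).
w = TwoStacksWindow(max, float("-inf"))
for x in [3, 1, 4, 1, 5]:
    w.insert(x)
w.evict(); w.evict()   # window now holds [4, 1, 5]
print(w.query())       # 5
```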

    Temporal Sentence Grounding in Videos: A Survey and Future Directions

    Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey attempts to provide a summary of fundamental concepts in TSGV and current research status, as well as future research directions. As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment. Then we review the techniques for multimodal understanding and interaction, which is the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate the methods in different categories with their strengths and weaknesses. Lastly, we discuss issues with the current TSGV research and share our insights about promising research directions. Comment: 29 pages, 32 figures, 9 tables
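    A toy sketch of the proposal-style pipeline the survey describes, from pre-extracted features to answer prediction: score every candidate (start, end) moment against the query embedding and return the best one. The video and text encoders are assumed to exist upstream; all names, dimensions, and the mean-pooling/cosine-similarity choices are illustrative.

```python
import numpy as np

def ground_query(clip_feats: np.ndarray, query_emb: np.ndarray, max_len=8):
    """clip_feats: (num_clips, dim) pre-extracted video features;
    query_emb: (dim,) sentence embedding. Returns the best (start, end)."""
    n = len(clip_feats)
    best, best_score = None, -np.inf
    for s in range(n):
        for e in range(s + 1, min(n, s + max_len) + 1):
            moment = clip_feats[s:e].mean(axis=0)  # pool the candidate span
            score = moment @ query_emb / (
                np.linalg.norm(moment) * np.linalg.norm(query_emb) + 1e-8)
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

clips = np.random.randn(32, 512)   # 32 clips, 512-d features (toy data)
query = np.random.randn(512)       # sentence embedding (toy data)
print(ground_query(clips, query))
```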

    Balancing Privacy, Precision and Performance in Distributed Systems

    Privacy, Precision, and Performance (3Ps) are three fundamental design objectives in distributed systems. However, these properties tend to compete with one another and are not considered absolute properties or functions. They must be defined and justified in terms of a system, its resources, stakeholder concerns, and the security threat model. To date, distributed systems research has only considered the trade-offs of balancing privacy, precision, and performance in a pairwise fashion. However, this dissertation formally explores the space of trade-offs among all 3Ps by examining three representative classes of distributed systems, namely Wireless Sensor Networks (WSNs), cloud systems, and Data Stream Management Systems (DSMSs). These representative systems support a large part of modern and mission-critical distributed systems. WSNs are real-time systems characterized by unreliable network interconnections and highly constrained computational and power resources. The dissertation proposes a privacy-preserving in-network aggregation protocol for WSNs, demonstrating that the 3Ps can be navigated by adopting appropriate algorithms and cryptographic techniques that are not prohibitively expensive. Next, the dissertation highlights the privacy and precision issues that arise in cloud databases due to the eventual consistency models of the cloud. To address these issues, consistency enforcement techniques across cloud servers are proposed, and the trade-offs between the 3Ps are discussed to help guide cloud database users on how to balance these properties. Lastly, the 3Ps are examined in DSMSs, which are characterized by high volumes of unbounded input data streams and strict real-time processing constraints. Within this system, the 3Ps are balanced through a proposed simple and efficient technique that applies access control policies over shared operator networks to achieve privacy and precision without sacrificing the system's performance. Although this dissertation shows that, with the right set of protocols and algorithms, the desirable 3P properties can co-exist in a balanced way in well-established distributed systems, it also promotes a new 3Ps-by-design concept. This concept is meant to encourage distributed systems designers to proactively consider the interplay among the 3Ps from the initial stages of the system design lifecycle rather than treating them as add-on properties.
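    As a flavor of privacy-preserving in-network aggregation, here is a toy sketch using additive secret sharing: each sensor splits its reading into random shares that individually reveal nothing, yet sum to the true total. This illustrates the general idea only; the dissertation's actual protocol may differ.

```python
import random

MOD = 2**32  # arithmetic modulo a large value hides individual shares

def split(reading: int, num_shares: int):
    """Split a sensor reading into random additive shares (mod MOD)."""
    shares = [random.randrange(MOD) for _ in range(num_shares - 1)]
    shares.append((reading - sum(shares)) % MOD)
    return shares

readings = [17, 42, 5]                        # one reading per sensor
share_groups = [split(r, 3) for r in readings]

# Each aggregator sums the j-th share of every sensor; no aggregator
# ever sees a raw reading, only uniformly random-looking shares.
partials = [sum(g[j] for g in share_groups) % MOD for j in range(3)]
total = sum(partials) % MOD
print(total == sum(readings) % MOD)           # True: the sum is recovered
```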

    Distributed Time Series Analytics

    In recent years time series data has become ubiquitous thanks to affordable sensors and advances in embedded technology. Large amounts of time series data are continuously produced in a wide spectrum of applications, such as sensor networks, medical monitoring and so on. The availability of such large-scale time series data highlights the importance of scalable data management, efficient querying and analysis. Meanwhile, in the online setting time series carry invaluable information and knowledge about the real-time status of involved entities or monitored phenomena, which calls for online time series data mining to serve timely decision making or event detection. In this thesis we aim to address these important issues pertaining to scalable and distributed analytics techniques for massive time series data. Concretely, this thesis is centered around the following three topics: As the number of sensors that pervade our lives significantly increases (e.g., environmental sensors, mobile phone sensors, IoT applications, etc.), the efficient management of massive amounts of time series from such sensors is becoming increasingly important. The infinite nature of sensor data poses a serious challenge for query processing even in a cloud infrastructure. Traditional raw sensor data management systems based on relational databases lack the scalability to accommodate large-scale sensor data efficiently. Thus, distributed key-value stores in the cloud are becoming a prime tool to manage sensor data. However, currently there are no techniques for indexing and/or query optimization of model-view sensor time series data in the cloud. In Chapter 2, we propose an innovative index for modeled segments in key-value stores, namely the KVI-index. The KVI-index consists of two interval indices on the time and sensor value dimensions respectively, each of which has an in-memory search tree and a secondary list materialized in the key-value store. The dramatic increase in the availability of data streams fuels the development of many distributed real-time computation engines (e.g., Storm, Samza, Spark Streaming, S4, etc.). In Chapter 3, we focus on a fundamental time series mining task in this new computation paradigm, namely continuously mining dynamic (lagged) correlations in time series via a distributed real-time computation engine. Correlations reveal the hidden and temporal interactions across time series and are widely used in scientific data analysis, data-driven event detection, finance markets and so on. We propose the P2H framework, consisting of a parallelism-partitioning based data shuffling and a hypercube structure based computation pruning method, so as to enhance both the communication and computation efficiency for mining correlations in the distributed context. In numerous real-world applications large datasets collected from observations and measurements of physical entities are inevitably noisy and contain outliers. The outliers in such large and noisy datasets can dramatically degrade the performance of standard distributed machine learning approaches such as regression trees. In Chapter 4 we present a novel distributed regression tree approach that utilizes robust regression statistics, statistics that are more robust to outliers, for handling large and noisy datasets. Then we present an adaptive gradient learning method for recurrent neural networks (RNNs) to forecast streaming time series in the presence of both outliers and change points.
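    The core primitive parallelized in Chapter 3, a lagged Pearson correlation between two series, can be sketched as below; the P2H data shuffling and hypercube pruning that make it scale in a distributed engine are not shown, and the alignment convention is one of several reasonable choices.

```python
import numpy as np

def lagged_corr(x: np.ndarray, y: np.ndarray, lag: int) -> float:
    """Pearson correlation between x and y shifted by `lag` samples."""
    if lag > 0:
        x, y = x[lag:], y[:-lag]
    elif lag < 0:
        x, y = x[:lag], y[-lag:]
    return float(np.corrcoef(x, y)[0, 1])

t = np.linspace(0, 10, 500)
a = np.sin(t)
b = np.sin(t - 0.5)  # b trails a by roughly 25 samples at this resolution

# Scan a range of lags and keep the one with the highest correlation.
best = max(range(-50, 51), key=lambda k: lagged_corr(a, b, k))
print(best)          # about -25 under this alignment convention
```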

    Wearable Sensor Gait Analysis for Fall Detection Using Deep Learning Methods

    World Health Organization (WHO) data show that around 684,000 people die from falls yearly, making it the second-highest mortality rate after traffic accidents [1]. Early detection of falls, followed by pneumatic protection, is one of the most effective means of ensuring the safety of the elderly. In light of the recent widespread adoption of wearable sensors, it has become increasingly critical that fall detection models are developed that can effectively process large and sequential sensor signal data. Several researchers have recently developed fall detection algorithms based on wearable sensor data. However, real-time fall detection remains challenging because of the wide range of gait variations in older adults. Choosing the appropriate sensor and placing it in the most suitable location are essential components of a robust real-time fall detection system. This dissertation implements various detection models to analyze and mitigate injuries due to falls in the senior community. It presents different methods for detecting falls in real time using deep learning networks. Several sliding window segmentation techniques are developed and compared in the first study. As a next step, various methods are implemented and applied to prevent sampling imbalances caused by the real-world collection of fall data. A study is also conducted to determine whether accelerometers and gyroscopes can distinguish between falls and near-falls. According to the literature survey, machine learning algorithms produce varying degrees of accuracy when applied to various datasets. An algorithm's performance depends on several factors, including the type and location of the sensors, the fall pattern, the dataset's characteristics, and the methods used for preprocessing and sliding window segmentation. Other challenges associated with fall detection include the need for centralized datasets for comparing the results of different algorithms. This dissertation compares the performance of varying fall detection methods using deep learning algorithms across multiple datasets. Furthermore, deep learning has been explored in a second application: an ECG-based virtual pathology stethoscope detection system. A novel real-time virtual pathology stethoscope (VPS) detection method has been developed. Several deep learning methods are evaluated for classifying the location of the stethoscope by taking advantage of subtle differences in the ECG signals. This study would significantly extend the simulation capabilities of standardized patients by allowing medical students and trainees to perform realistic cardiac auscultation as they would in a clinical environment.
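    A minimal sketch of the sliding window segmentation step that typically precedes the deep models discussed here: fixed-length windows with configurable overlap over a multi-channel sensor signal. The window length and stride values below are illustrative, not the dissertation's chosen parameters.

```python
import numpy as np

def segment(signal: np.ndarray, win: int = 128, stride: int = 64):
    """signal: (timesteps, channels) array, e.g. a 3-axis accelerometer.
    Returns (num_windows, win, channels), ready to feed a CNN/RNN."""
    windows = [signal[s:s + win]
               for s in range(0, len(signal) - win + 1, stride)]
    return np.stack(windows)

acc = np.random.randn(1000, 3)   # 1000 samples of x/y/z acceleration
batch = segment(acc)             # 50% overlap with these defaults
print(batch.shape)               # (14, 128, 3)
```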

    Leveraging Watermarks to Improve Performance of Streaming Systems

    Modern stream processing engines (SPEs) process large volumes of events propagated at high velocity through multiple queries. By continuously receiving watermarks, which are marker events injected into the stream to signify that no further events are expected beyond a designated timestamp, SPEs can infer stream progress to correctly process window operators. While stream progress is useful information for query execution, it is only utilized to ensure input completion. We argue that to improve performance, stream progress should be leveraged in the design of SPE subsystems. In this thesis, we demonstrate the significant advantages of leveraging stream progress to solve two important SPE problems: query scheduling and sample query processing. First, existing SPE schedulers generally aim to minimize query output latency by minimizing, in turn, the mean propagation delay of events in query pipelines. However, for queries containing commonly used blocking operators such as windows, we show that a superior approach is to prioritize queries based on stream progress. Through the design and development of Klink, we leverage stream progress to unblock window operators and to rapidly propagate events to output operators. Second, sample query processing limits input to only a subset of events such that the sample is statistically representative of the input while ensuring output accuracy guarantees. However, output latency can be significantly increased because relevant watermarks can suffer from large ingestion delays due to long or bursty network latencies. Window computations that account for stragglers can add significant latency while providing inconsequential accuracy improvement. We propose Aion, an algorithm that utilizes sampling to provide approximate answers with low latency by minimizing the effect of stragglers through leveraging control over stream progress. We integrate Klink and Aion into the popular open-source SPE Apache Flink. We demonstrate that Klink delivers substantial performance gains on benchmark workloads, reducing mean and tail query latencies by up to 60% over existing scheduling policies. Similarly, using different benchmark workloads, we demonstrate that Aion reduces stream output latency by up to 85% while providing 95% accuracy guarantees.
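    A minimal sketch of the watermark-driven window firing that both Klink and Aion build on: a tumbling window buffers events per pane and emits a pane only once the watermark passes its end timestamp, i.e., once its input is known to be complete. The structure is illustrative, not Apache Flink's actual operator code.

```python
import collections

class TumblingWindow:
    def __init__(self, size_ms: int):
        self.size = size_ms
        self.panes = collections.defaultdict(list)  # window_end -> events

    def on_event(self, ts: int, value):
        # Assign the event to the pane whose interval contains ts.
        end = (ts // self.size + 1) * self.size
        self.panes[end].append(value)

    def on_watermark(self, wm: int):
        """Fire every window whose end <= watermark: input is complete."""
        ready = sorted(end for end in self.panes if end <= wm)
        return [(end, self.panes.pop(end)) for end in ready]

w = TumblingWindow(size_ms=1000)
w.on_event(120, "a"); w.on_event(980, "b"); w.on_event(1500, "c")
print(w.on_watermark(1000))   # [(1000, ['a', 'b'])] -- first pane fires
```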