14 research outputs found

    Distinct random sampling from a distributed stream

    We consider continuous maintenance of a random sample of distinct elements from a massive data stream, whose input elements are observed at multiple distributed sites that communicate via a central coordinator. At any point, when a query is received at the coordinator, it responds with a random sample from the set of all distinct elements observed at the different sites so far. We present the first algorithms for distinct random sampling on distributed streams. We also present a lower bound on the expected number of messages that must be transmitted by any distributed algorithm, showing that our algorithm is message optimal to within a factor of four. We present extensions to sliding windows, and detailed experimental results showing the performance of our algorithm on real-world data sets
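To illustrate the setting described in this abstract, the following is a minimal sketch of one well-known way to maintain a distinct sample across several sites: keep the `sample_size` distinct elements with the smallest hash values (often called bottom-k sampling). The abstract does not describe the paper's algorithm, so this is a swapped-in technique and not the authors' method. The names `Coordinator`, `Site`, and `hash01` are hypothetical.

```python
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-random value in [0, 1); equal items map to equal values."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Coordinator:
    """Keeps the sample_size distinct items with the smallest hash values."""

    def __init__(self, sample_size: int):
        self.sample_size = sample_size
        self.sample = {}   # item -> hash value
        self.messages = 0  # number of site-to-coordinator messages received

    def threshold(self) -> float:
        # Until the sample is full, every item is a candidate.
        if len(self.sample) < self.sample_size:
            return 1.0
        return max(self.sample.values())

    def receive(self, item: str, h: float) -> float:
        self.messages += 1
        if item not in self.sample:
            if len(self.sample) < self.sample_size:
                self.sample[item] = h
            elif h < self.threshold():
                # Evict the item with the largest hash to make room.
                evicted = max(self.sample, key=self.sample.get)
                del self.sample[evicted]
                self.sample[item] = h
        # Reply with the current threshold so the site can filter locally.
        return self.threshold()


class Site:
    """Forwards an item only if its hash is below the last threshold it was told."""

    def __init__(self, coordinator: Coordinator):
        self.coordinator = coordinator
        self.local_threshold = 1.0  # possibly stale, but never below the true threshold

    def observe(self, item: str) -> None:
        h = hash01(item)
        if h < self.local_threshold:
            self.local_threshold = self.coordinator.receive(item, h)


# Usage: two sites whose streams are dominated by the frequent element "x".
coordinator = Coordinator(sample_size=3)
site_a, site_b = Site(coordinator), Site(coordinator)
for element in ["x"] * 50 + ["a", "b", "c"]:
    site_a.observe(element)
for element in ["x"] * 50 + ["c", "d", "e"]:
    site_b.observe(element)
print(sorted(coordinator.sample), coordinator.messages)
```

Because the hash depends only on the element, repeated occurrences of "x" do not increase its chance of being sampled. A site's cached threshold can only be too large, never too small, so no qualifying element is suppressed; the cost is some redundant messages, which this sketch makes no attempt to minimize.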

    Geometric Monitoring in Action: a Systems Perspective for the Internet of Things

    Applications for IoT often continuously monitor sensor values and react if the network-wide aggregate exceeds a threshold. Previous work on Geometric monitoring (GM) has promised a several-fold reduction in communication but been limited to analytic or high-level simulation results. In this paper, we build and evaluate a full system design for GM on resource-constrained devices. In particular, we provide an algorithmic implementation for commodity IoT hardware and a detailed study regarding duty cycle reduction and energy savings. Our results, both from full-system simulations and a publicly available testbed, show that GM indeed provides several-fold energy savings in communication. We see up to 3x and 11x reduction in duty-cycle when monitoring the variance and average temperature of a real-world data set, but the results fall short compared to the reduction in communication (4.3x and 44x, respectively). Hence, we investigate the energy overhead imposed by the network stack and the communication pattern of the algorithm and summarize our findings. These insights may enable the design of protocols that will unlock more of the potential of GM and similar algorithms for IoT deployments

    Continuous Monitoring meets Synchronous Transmissions and In-Network Aggregation

    Continuously monitoring sensor readings is an important building block for many IoT applications. The literature offers resourceful methods that minimize the amount of communication required for continuous monitoring, where Geometric Monitoring (GM) is one of the most generally applicable ones. However, GM has unique communication requirements that require specialized network protocols to unlock the full potential of the algorithm. In this work, we show how application and protocol co-design can improve the real-life performance of GM, making it an application of practical value for real IoT deployments. We orchestrate the communication of GM to utilize the properties of a state-of-the-art wireless protocol (Crystal) that relies on synchronous transmissions and is designed for aperiodic traffic, as needed by GM. We bridge the existing gap between the capabilities of the protocol and the requirements of GM, especially in the case of periods of heavy communication. We do so by introducing an in-network aggregation technique relying on latent opportunities for aggregation that we exploit in Crystal's design, allowing us to reliably monitor duplicate-sensitive aggregate functions, such as sum, average or variance. Our results from testbed experiments with a publicly available dataset show that the combination of GM and Crystal results in a very small duty-cycle, a 2.2x - 3.2x improvement compared to the baseline and up to 10x compared to previous work. We also show that our in-network aggregation technique reduces the duty-cycle by up to 1.38x

    Hardware-Aware Algorithm Designs for Efficient Parallel and Distributed Processing

    The introduction and widespread adoption of the Internet of Things, together with emerging new industrial applications, bring new requirements in data processing. Specifically, the need for timely processing of data that arrives at high rates creates a challenge for the traditional cloud computing paradigm, where data collected at various sources is sent to the cloud for processing. As an approach to this challenge, processing algorithms and infrastructure are distributed from the cloud to multiple tiers of computing, closer to the sources of data. This creates a wide range of devices for algorithms to be deployed on and software designs to adapt to. In this thesis, we investigate how hardware-aware algorithm designs on a variety of platforms lead to algorithm implementations that efficiently utilize the underlying resources. We design, implement and evaluate new techniques for representative applications that involve the whole spectrum of devices, from resource-constrained sensors in the field, to highly parallel servers. At each tier of processing capability, we identify key architectural features that are relevant for applications and propose designs that make use of these features to achieve high-rate, timely and energy-efficient processing. In the first part of the thesis, we focus on high-end servers and utilize two main approaches to achieve high throughput processing: vectorization and thread parallelism. We employ vectorization for the case of pattern matching algorithms used in security applications. We show that re-thinking the design of algorithms to better utilize the resources available in the platforms they are deployed on, such as vector processing units, can bring significant speedups in processing throughput. We then show how thread-aware data distribution and proper inter-thread synchronization allow scalability, especially for the problem of high-rate network traffic monitoring.
We design a parallelization scheme for sketch-based algorithms that summarize traffic information, which allows them to handle incoming data at high rates and be able to answer queries on that data efficiently, without overheads. In the second part of the thesis, we target the intermediate tier of computing devices and focus on the typical examples of hardware that is found there. We show how single-board computers with embedded accelerators can be used to handle the computationally heavy part of applications and showcase it specifically for pattern matching for security-related processing. We further identify key hardware features that affect the performance of pattern matching algorithms on such devices, present a co-evaluation framework to compare algorithms, and design a new algorithm that efficiently utilizes the hardware features. In the last part of the thesis, we shift the focus to the low-power, resource-constrained tier of processing devices. We target wireless sensor networks and study distributed data processing algorithms where the processing happens on the same devices that generate the data. Specifically, we focus on a continuous monitoring algorithm (geometric monitoring) that aims to minimize communication between nodes. By deploying that algorithm in action, under realistic environments, we demonstrate that the interplay between the network protocol and the application plays an important role in this layer of devices. Based on that observation, we co-design a continuous monitoring application with a modern network stack and augment it further with an in-network aggregation technique. In this way, we show that awareness of the underlying network stack is important to realize the full potential of the continuous monitoring algorithm. The techniques and solutions presented in this thesis contribute to better utilization of hardware characteristics, across a wide spectrum of platforms.
We employ these techniques on problems that are representative examples of current and upcoming applications and offer an outlook on emerging possibilities that can build on the results of the thesis

    Topics in Data Stream Sampling and Insider Threat Detection

    With the current explosion in the speed and volume of data, conventional computation systems cannot process large data efficiently. In this project, we study data stream sampling methods and an application to insider threat detection. The goal of random sampling is to select a subset of the original population that can represent the whole population. In many real-world applications, sampling a subset of the original population lets us estimate global statistical properties such as the mean, variance, and probability distribution. The goal of random sampling from a distributed stream is to select a subset from the union of the streams such that each element in the distributed stream is sampled with equal probability. In some cases, the “heavy hitters”, the elements with high frequency, dominate the random sample. A distinct random sample can be used instead so that elements with low frequency are also seen. Distinct sampling from a distributed stream extracts a subset from the set of unique elements in the union of the distributed streams. In database query optimization, sampling a subset of the unique elements of a population is an important task. Random sampling and distinct sampling are among the fundamental techniques for large-scale data analysis and for query enhancement in database systems. We propose algorithms, theoretical analysis, and experimental evaluations for random sampling and distinct sampling from a distributed stream. With more and more attacks on computer systems, it is important to know how to protect computer systems and classified information from attackers. Among all attacks and data breaches, more and more cases originate inside an organization; this is called an “insider threat.” According to recent reports, malicious insiders are causing enormous damage to organizations.
We propose two insider threat detection frameworks that monitor system logs and detect anomalous behavior: a scenario-based insider threat detection method and a session-based insider threat detection method. We implement our frameworks in Java and present an experimental evaluation on a synthetic dataset

    Doctor of Philosophy

    In the era of big data, many applications generate continuous online data from distributed locations, scattered devices, and so on. Examples include data from social media, financial services, and sensor networks. Meanwhile, large volumes of data can be archived or stored offline in distributed locations for further data analysis. Challenges from data uncertainty, large-scale data size, and distributed data sources motivate us to revisit several classic problems for both online and offline data exploration. The problem of continuous threshold monitoring for distributed data is commonly encountered in many real-world applications. We study this problem for distributed probabilistic data. We show how to prune expensive threshold queries using various tail bounds and combine tail-bound techniques with adaptive algorithms for monitoring distributed deterministic data. We also show how to approximate threshold queries based on sampling techniques. Threshold monitoring can only tell whether a monitoring function is above or below a threshold constraint, not how far it is from the threshold. This motivates us to study the problem of continuously tracking functions over distributed data. We first investigate the tracking problem on a chain topology. Then we show how to solve tracking problems in a distributed setting using solutions for the chain model. We study online tracking of the max function on "broom" tree and general tree topologies in this work. Finally, we examine building scalable histograms for distributed probabilistic data. We show how to build approximate histograms based on a partition-and-merge principle on a centralized machine. Then, we show how to extend our solutions to distributed and parallel settings to further mitigate scalability bottlenecks and deal with distributed data
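The idea of pruning threshold queries with a tail bound can be sketched with a small example. Markov's inequality is used here only as one familiar tail bound; the abstract does not say which bounds the dissertation relies on. The function names and the alarm rule (raise an alarm when the probability that the total reaches `threshold` is at least `delta`) are assumptions made for illustration.

```python
def markov_upper_bound(expected_total: float, threshold: float) -> float:
    """Markov's inequality: Pr[Y >= threshold] <= E[Y] / threshold for Y >= 0."""
    if threshold <= 0:
        return 1.0  # the bound is uninformative for non-positive thresholds
    return min(1.0, expected_total / threshold)


def can_prune(site_means, threshold: float, delta: float) -> bool:
    """Return True if the bound alone proves Pr[sum >= threshold] < delta.

    site_means holds the expected value of each site's non-negative reading.
    By linearity of expectation, the expected total is the sum of the means,
    so no exact (and expensive) probability computation is needed when this
    function returns True.
    """
    return markov_upper_bound(sum(site_means), threshold) < delta


# Usage: the bound 3.5 / 100 = 0.035 is below delta = 0.1, so the exact
# threshold query can be skipped; with means 30 and 40 it cannot.
print(can_prune([1.0, 2.0, 0.5], threshold=100.0, delta=0.1))
print(can_prune([30.0, 40.0], threshold=100.0, delta=0.1))
```

A looser bound prunes less often but is cheap to evaluate; when it fails to prune, the monitor must fall back to an exact or sampled evaluation of the query.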