9 research outputs found

    Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster

    The widespread use of GPS-enabled smartphones along with the popularity of micro-blogging and social networking applications, e.g., Twitter and Facebook, has resulted in the generation of huge streams of geo-tagged textual data. Many applications require real-time processing of these streams. For example, location-based e-coupon and ad-targeting systems enable advertisers to register millions of ads to millions of users. The number of users is typically very high and they are continuously moving, and the ads change frequently as well. Hence, sending the right ad to the matching users is very challenging. Existing streaming systems are either centralized or are not spatial-keyword aware, and cannot efficiently support the processing of rapidly arriving spatial-keyword data streams. This paper presents Tornado, a distributed spatial-keyword stream processing system. Tornado features routing units to fairly distribute the workload and, furthermore, to co-locate the data objects and the corresponding queries at the same processing units. The routing units use the Augmented-Grid, a novel structure that is equipped with an efficient search algorithm for distributing the data objects and queries. Tornado uses evaluators to process the data objects against the queries. The routing units minimize redundant communication by not sending data updates for processing when these updates do not match any query. By applying dynamically evaluated cost formulae that continuously represent the processing overhead at each evaluator, Tornado is adaptive to changes in the workload. Extensive experimental evaluation using spatio-textual range queries over real Twitter data indicates that Tornado outperforms the non-spatio-textually aware approaches by up to two orders of magnitude in terms of the overall system throughput.
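The routing idea in this abstract (co-locating objects with the queries they may match, and dropping objects that match no query) can be illustrated with a minimal sketch. It uses a plain uniform grid, not the Augmented-Grid, whose details the abstract does not give. The class `GridRouter`, its methods, the cell-to-evaluator mapping, and the "all query keywords must appear in the object" rule are assumptions made for illustration only.

```python
from collections import defaultdict


class GridRouter:
    """Hypothetical router: a uniform grid stands in for the Augmented-Grid."""

    def __init__(self, cell_size, num_evaluators):
        self.cell_size = cell_size
        self.num_evaluators = num_evaluators
        # cell -> list of (query_id, rectangle, keyword set)
        self.cell_queries = defaultdict(list)

    def _cell(self, x, y):
        # Map a point to the integer coordinates of its grid cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def _evaluator(self, cell):
        # Assumed mapping: every cell is owned by exactly one evaluator,
        # so an object and the queries covering its cell meet at one place.
        return (cell[0] * 31 + cell[1]) % self.num_evaluators

    def register_query(self, query_id, rect, keywords):
        # rect is (xmin, ymin, xmax, ymax); index the query in every
        # cell that its rectangle overlaps.
        xmin, ymin, xmax, ymax = rect
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                self.cell_queries[(cx, cy)].append(
                    (query_id, rect, frozenset(keywords))
                )

    def route(self, x, y, words):
        """Return (evaluator_id, matching_query_ids), or None to drop."""
        cell = self._cell(x, y)
        words = set(words)
        matches = []
        for query_id, (xmin, ymin, xmax, ymax), keywords in self.cell_queries[cell]:
            inside = xmin <= x <= xmax and ymin <= y <= ymax
            if inside and keywords <= words:
                matches.append(query_id)
        if not matches:
            return None  # no query matches, so nothing is sent downstream
        return (self._evaluator(cell), matches)


router = GridRouter(cell_size=10.0, num_evaluators=4)
router.register_query("q1", (0, 0, 15, 15), {"coffee"})
print(router.route(5, 5, {"coffee", "sale"}))  # forwarded to one evaluator
print(router.route(5, 5, {"pizza"}))           # dropped: keyword mismatch
```

In this sketch the router performs the full match itself; a real system would more likely keep only a cheap filter at the router and leave exact matching to the evaluators.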

    Attack-Resilient Adaptive Load-Balancing in Distributed Spatial Data Streaming Systems

    The proliferation of GPS-enabled devices has led to the development of numerous location-based services. These services need to process massive amounts of spatial data in real time with high throughput and low response time. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial streaming systems. The performance of distributed streaming systems relies on how evenly the workload is distributed among their machines. However, the real-time streamed spatial data and queries follow non-uniform spatial distributions that are continuously changing over time. Therefore, distributed spatial streaming systems need to track the changes in the distribution of spatial data and queries and redistribute their workload accordingly. This thesis addresses the challenges of adapting to workload changes in distributed spatial streaming systems to improve the performance while preserving the system’s security. The thesis proposes TrioStat, an online workload estimation technique that relies on a probabilistic model for estimating the cost of partitions and machines of distributed spatial streaming systems. TrioStat has a decentralised technique to collect and maintain the required statistics in real time with minimal overhead. In addition, this thesis introduces SWARM, a light-weight adaptive load-balancing protocol that continuously monitors the data and query workloads across the distributed processes of spatial data streaming systems, and redistributes the workloads as soon as performance bottlenecks are detected. SWARM uses TrioStat to estimate the workload of the system’s machines. Although adaptive load-balancing techniques significantly improve the performance of distributed streaming systems, they make the system vulnerable to attacks. In this thesis, we introduce a novel attack model that targets adaptive load-balancing mechanisms of distributed streaming systems. The attack reduces the throughput and the availability of the system by making it stay in a continuous state of rebalancing. The thesis proposes Guard, a component that detects and blocks attacks that target the adaptive load balancing of distributed streaming systems. Guard is deployed in SWARM to develop an attack-resilient adaptive load-balancing mechanism for distributed spatial streaming systems.
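The monitor-then-redistribute loop described in this abstract can be illustrated with a generic greedy rebalancing step. This is not SWARM, TrioStat, or Guard, whose algorithms the abstract does not specify. The function names, the precomputed per-partition costs, and the `threshold` parameter are all hypothetical.

```python
def machine_loads(assignment, partition_cost):
    # Load of a machine = sum of the estimated costs of its partitions.
    return {
        machine: sum(partition_cost[p] for p in parts)
        for machine, parts in assignment.items()
    }


def rebalance_step(assignment, partition_cost, threshold=1.5):
    """Move one partition from the hottest to the coldest machine.

    Returns (partition, source, target), or None when the hottest
    machine is within `threshold` times the mean load.
    """
    loads = machine_loads(assignment, partition_cost)
    mean = sum(loads.values()) / len(loads)
    hot = max(loads, key=loads.get)
    cold = min(loads, key=loads.get)
    if mean == 0 or loads[hot] <= threshold * mean:
        return None  # no bottleneck detected

    gap = loads[hot] - loads[cold]
    # Only consider moves that do not simply swap the roles of hot and cold.
    candidates = [p for p in assignment[hot] if partition_cost[p] < gap]
    if not candidates:
        return None
    # After moving p the gap becomes |gap - 2 * cost(p)|; minimise it.
    best = min(candidates, key=lambda p: abs(gap - 2 * partition_cost[p]))
    assignment[hot].remove(best)
    assignment[cold].append(best)
    return (best, hot, cold)


assignment = {"m1": ["a", "b", "c"], "m2": ["d"]}
cost = {"a": 50.0, "b": 30.0, "c": 20.0, "d": 10.0}
print(rebalance_step(assignment, cost))  # one partition migrates
print(rebalance_step(assignment, cost))  # None: loads are now close
```

A step like this reacts to whatever load it observes, so a workload that keeps shifting its hot spot could trigger a migration on every round. That is the kind of continuous rebalancing the attack model in the abstract exploits.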

    A demonstration of Shahed: A MapReduce-based system for querying and visualizing satellite data

    Several space agencies such as NASA are continuously collecting datasets of earth dynamics (e.g., temperature, vegetation, and cloud coverage) through satellites. This data is stored in a publicly available archive for scientists and researchers and is very useful for studying climate, desertification, and land use change. The benefit of this data comes from its richness, as it provides an archived history of over 15 years of satellite observations. Unfortunately, the use of such data is very limited due to the huge size of the archives (> 500 TB) and the limited capabilities of traditional applications. In this demo, we present Shahed, an interactive system that provides an efficient way to index, query, and visualize satellite datasets available in the NASA archive. Shahed is composed of four main modules. The uncertainty module resolves data uncertainty imposed by the satellites. The indexing module organizes the data in a novel multi-resolution spatio-temporal index designed for satellite data. The querying module uses the indexes to answer both spatio-temporal selection and aggregate queries provided by the user. The visualization module generates images, videos, and multi-level images, which give insight into data distribution and dynamics over time. This demo gives users hands-on experience with Shahed through a map-based web interface in which users can browse the available datasets using the map, issue spatio-temporal queries, and visualize the results as images or videos.

    Tornado: A Distributed Spatio-Textual Stream Processing System

    The widespread use of location-aware devices together with the increased popularity of micro-blogging applications (e.g., Twitter) has led to the creation of large streams of spatio-textual data. In order to serve real-time applications, the processing of these large-scale spatio-textual streams needs to be distributed. However, existing distributed stream processing systems (e.g., Spark and Storm) are not optimized for spatial/textual content. In this demonstration, we introduce Tornado, a distributed in-memory spatio-textual stream processing server that extends Storm. To efficiently process spatio-textual streams, Tornado introduces a spatio-textual indexing layer to the architecture of Storm. The indexing layer is adaptive, i.e., it dynamically redistributes the processing across the system according to changes in the data distribution and/or query workload. In addition to keywords, higher-level textual concepts are identified and are semantically matched against spatio-textual queries. Tornado provides data deduplication and fusion to eliminate redundant textual data. We demonstrate a prototype of Tornado running against real Twitter streams, where the users can register continuous or snapshot spatio-textual queries using a map-assisted query interface.