Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster
The widespread use of GPS-enabled smartphones along with the popularity of
micro-blogging and social networking applications, e.g., Twitter and Facebook,
has resulted in the generation of huge streams of geo-tagged textual data. Many
applications require real-time processing of these streams. For example,
location-based e-coupon and ad-targeting systems enable advertisers to register
millions of ads to millions of users. The number of users is typically very
high and they are continuously moving, and the ads change frequently as well.
Hence sending the right ad to the matching users is very challenging. Existing
streaming systems are either centralized or are not spatial-keyword aware, and
cannot efficiently support the processing of rapidly arriving spatial-keyword
data streams. This paper presents Tornado, a distributed spatial-keyword stream
processing system. Tornado features routing units that distribute the workload
fairly and co-locate the data objects and the corresponding queries at the
same processing units. The routing units use the Augmented-Grid,
a novel structure that is equipped with an efficient search algorithm for
distributing the data objects and queries. Tornado uses evaluators to process
the data objects against the queries. The routing units minimize the redundant
communication by not sending data updates for processing when these updates do
not match any query. By applying dynamically evaluated cost formulae that
continuously represent the processing overhead at each evaluator, Tornado is
adaptive to changes in the workload. Extensive experimental evaluation using
spatio-textual range queries over real Twitter data indicates that Tornado
outperforms the non-spatio-textually aware approaches by up to two orders of
magnitude in terms of the overall system throughput.
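The routing idea described in this abstract (co-locating objects with the queries they may match, and dropping updates that match no query) can be illustrated with a minimal sketch. This is not Tornado's Augmented-Grid, whose structure and search algorithm the abstract does not detail. It assumes a plain uniform grid, a static cell-to-evaluator hash, and queries that match when the object lies inside the query rectangle and contains all query keywords. The class `GridRouter` and all of its method names are hypothetical.

```python
from collections import defaultdict


class GridRouter:
    """Toy uniform-grid router; a stand-in for, not a copy of, the Augmented-Grid."""

    def __init__(self, cell_size, num_evaluators):
        self.cell_size = cell_size
        self.num_evaluators = num_evaluators
        # Maps a grid cell to the queries whose rectangle overlaps that cell.
        self.queries_by_cell = defaultdict(list)

    def _cell(self, x, y):
        # Floor division keeps negative coordinates in the correct cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def _evaluator(self, cell):
        # Assumed static assignment of cells to evaluators (no adaptivity).
        return (cell[0] * 31 + cell[1]) % self.num_evaluators

    def register_query(self, query_id, rect, keywords):
        """Index a range query in every cell it overlaps; return the evaluators involved."""
        x1, y1, x2, y2 = rect
        cx1, cy1 = self._cell(x1, y1)
        cx2, cy2 = self._cell(x2, y2)
        evaluators = set()
        for cx in range(cx1, cx2 + 1):
            for cy in range(cy1, cy2 + 1):
                entry = (query_id, rect, frozenset(keywords))
                self.queries_by_cell[(cx, cy)].append(entry)
                evaluators.add(self._evaluator((cx, cy)))
        return evaluators

    def route_object(self, x, y, text_keywords):
        """Return the evaluator for this object, or None if no query can match it."""
        cell = self._cell(x, y)
        words = set(text_keywords)
        for _, (x1, y1, x2, y2), kws in self.queries_by_cell.get(cell, ()):
            if x1 <= x <= x2 and y1 <= y <= y2 and kws <= words:
                return self._evaluator(cell)
        return None  # Dropped at the router: nothing is sent to an evaluator.


router = GridRouter(cell_size=10.0, num_evaluators=4)
router.register_query("q1", (0, 0, 15, 5), {"coffee"})
print(router.route_object(12, 3, ["free", "coffee"]))  # forwarded to an evaluator
print(router.route_object(12, 3, ["tea"]))             # None: filtered at the router
```

Because an object and every query overlapping its cell resolve to the same evaluator, the evaluator can match them locally, and objects that match nothing never leave the router.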
Attack-Resilient Adaptive Load-Balancing in Distributed Spatial Data Streaming Systems
The proliferation of GPS-enabled devices has led to the development of numerous location-based services. These services need to process massive amounts of spatial data in real time with high throughput and low response time. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial streaming systems. The performance of distributed streaming systems relies on how evenly the workload is distributed among their machines. However, the real-time streamed spatial data and queries follow non-uniform spatial distributions that change continuously over time. Therefore, distributed spatial streaming systems need to track the changes in the distribution of spatial data and queries and redistribute their workload accordingly. This thesis addresses the challenges of adapting to workload changes in distributed spatial streaming systems to improve performance while preserving the system’s security. The thesis proposes TrioStat, an online workload estimation technique that relies on a probabilistic model for estimating the cost of partitions and machines of distributed spatial streaming systems. TrioStat has a decentralised technique to collect and maintain the required statistics in real time with minimal overhead. In addition, this thesis introduces SWARM, a lightweight adaptive load-balancing protocol that continuously monitors the data and query workloads across the distributed processes of spatial data streaming systems and redistributes the workloads as soon as performance bottlenecks are detected. SWARM uses TrioStat to estimate the workload of the system’s machines. Although adaptive load-balancing techniques significantly improve the performance of distributed streaming systems, they make the system vulnerable to attacks. In this thesis, we introduce a novel attack model that targets the adaptive load-balancing mechanisms of distributed streaming systems. The attack reduces the throughput and the availability of the system by keeping it in a continuous state of rebalancing. The thesis proposes Guard, a component that detects and blocks attacks that target the adaptive load balancing of distributed streaming systems. Guard is deployed in SWARM to develop an attack-resilient adaptive load-balancing mechanism for distributed spatial streaming systems.
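To make the notion of adaptive load balancing concrete, here is a minimal sketch of one threshold-triggered rebalancing step. It is not SWARM, TrioStat, or Guard, whose internals the abstract does not specify. It assumes per-partition costs are already known (the thesis estimates them probabilistically), that a machine counts as a bottleneck when its load exceeds `threshold` times the mean load, and that one partition is migrated per step. The function names and the `threshold` parameter are hypothetical.

```python
def machine_loads(assignment, partition_cost, machines):
    """Sum the (assumed known) partition costs per machine."""
    loads = {m: 0.0 for m in machines}
    for partition, machine in assignment.items():
        loads[machine] += partition_cost[partition]
    return loads


def rebalance_step(assignment, partition_cost, machines, threshold=1.5):
    """Move one partition from the most to the least loaded machine if needed.

    Returns (partition, source, target) when a move happens, else None.
    Mutates `assignment` in place.
    """
    loads = machine_loads(assignment, partition_cost, machines)
    mean = sum(loads.values()) / len(machines)
    hot = max(machines, key=loads.get)
    cold = min(machines, key=loads.get)
    if mean == 0 or loads[hot] <= threshold * mean:
        return None  # No bottleneck under the assumed threshold rule.

    gap = loads[hot] - loads[cold]
    # Only consider moves that strictly shrink the gap between hot and cold.
    candidates = [
        p for p, m in assignment.items()
        if m == hot and partition_cost[p] < gap
    ]
    if not candidates:
        return None
    # Prefer the partition whose cost is closest to half the gap.
    chosen = min(candidates, key=lambda p: abs(partition_cost[p] - gap / 2))
    assignment[chosen] = cold
    return (chosen, hot, cold)


machines = ["m0", "m1"]
assignment = {"a": "m0", "b": "m0", "c": "m0", "d": "m1"}
cost = {"a": 6.0, "b": 3.0, "c": 1.0, "d": 2.0}
print(rebalance_step(assignment, cost, machines))  # one partition leaves m0
print(rebalance_step(assignment, cost, machines))  # None: loads are now close
```

A purely reactive rule like this one migrates partitions whenever the observed skew crosses the threshold, so a workload crafted to keep shifting the skew could trigger migrations repeatedly. That is the kind of exposure the attack model in the abstract is concerned with.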
A demonstration of Shahed: A MapReduce-based system for querying and visualizing satellite data
Several space agencies such as NASA are continuously collecting datasets of earth dynamics (e.g., temperature, vegetation, and cloud coverage) through satellites. This data is stored in a publicly available archive for scientists and researchers and is very useful for studying climate, desertification, and land use change. The benefit of this data comes from its richness, as it provides an archived history of over 15 years of satellite observations. Unfortunately, the use of such data is very limited due to the huge size of the archives (> 500 TB) and the limited capabilities of traditional applications. In this demo, we present Shahed, an interactive system that provides an efficient way to index, query, and visualize the satellite datasets available in the NASA archive. Shahed is composed of four main modules. The uncertainty module resolves data uncertainty imposed by the satellites. The indexing module organizes the data in a novel multi-resolution spatio-temporal index designed for satellite data. The querying module uses the indexes to answer both spatio-temporal selection and aggregate queries provided by the user. The visualization module generates images, videos, and multi-level images that give insight into data distribution and dynamics over time. This demo gives users hands-on experience with Shahed through a map-based web interface in which users can browse the available datasets using the map, issue spatio-temporal queries, and visualize the results as images or videos.
Tornado: A Distributed Spatio-Textual Stream Processing System
The widespread use of location-aware devices together with the increased popularity of micro-blogging applications (e.g., Twitter) has led to the creation of large streams of spatio-textual data. In order to serve real-time applications, the processing of these large-scale spatio-textual streams needs to be distributed. However, existing distributed stream processing systems (e.g., Spark and Storm) are not optimized for spatial/textual content. In this demonstration, we introduce Tornado, a distributed in-memory spatio-textual stream processing server that extends Storm. To efficiently process spatio-textual streams, Tornado introduces a spatio-textual indexing layer to the architecture of Storm. The indexing layer is adaptive, i.e., it dynamically redistributes the processing across the system according to changes in the data distribution and/or query workload. In addition to keywords, higher-level textual concepts are identified and semantically matched against spatio-textual queries. Tornado provides data deduplication and fusion to eliminate redundant textual data. We demonstrate a prototype of Tornado running against real Twitter streams, where users can register continuous or snapshot spatio-textual queries using a map-assisted query interface.