3,109 research outputs found
Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster
The widespread use of GPS-enabled smartphones along with the popularity of
micro-blogging and social networking applications, e.g., Twitter and Facebook,
has resulted in the generation of huge streams of geo-tagged textual data. Many
applications require real-time processing of these streams. For example,
location-based e-coupon and ad-targeting systems enable advertisers to register
millions of ads to millions of users. The number of users is typically very
high and they are continuously moving, and the ads change frequently as well.
Hence sending the right ad to the matching users is very challenging. Existing
streaming systems are either centralized or are not spatial-keyword aware, and
cannot efficiently support the processing of rapidly arriving spatial-keyword
data streams. This paper presents Tornado, a distributed spatial-keyword stream
processing system. Tornado features routing units that fairly distribute the
workload and, furthermore, co-locate data objects and their corresponding
queries at the same processing units. The routing units use the Augmented-Grid,
a novel structure that is equipped with an efficient search algorithm for
distributing the data objects and queries. Tornado uses evaluators to process
the data objects against the queries. The routing units minimize the redundant
communication by not sending data updates for processing when these updates do
not match any query. By applying dynamically evaluated cost formulae that
continuously represent the processing overhead at each evaluator, Tornado is
adaptive to changes in the workload. Extensive experimental evaluation using
spatio-textual range queries over real Twitter data indicates that Tornado
outperforms the non-spatio-textually aware approaches by up to two orders of
magnitude in terms of the overall system throughput.
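The co-location idea behind the routing units can be sketched in a few lines. The uniform grid, the `Router` class, and all names below are our simplification for illustration; Tornado's actual Augmented-Grid and its search algorithm are more elaborate.

```python
# Sketch of grid-based co-location routing: data objects and range queries
# are mapped to the same grid cells, so a query and the objects it can match
# land at the same evaluator. A uniform grid stands in for Tornado's
# Augmented-Grid (our simplification, not the paper's structure).

GRID = 8  # 8x8 grid over the unit square

def cell_of(x, y):
    """Grid cell containing a point in [0,1] x [0,1]."""
    return (min(int(x * GRID), GRID - 1), min(int(y * GRID), GRID - 1))

def cells_of_range(x1, y1, x2, y2):
    """All grid cells overlapping a rectangular query range."""
    cx1, cy1 = cell_of(x1, y1)
    cx2, cy2 = cell_of(x2, y2)
    return [(cx, cy) for cx in range(cx1, cx2 + 1)
                     for cy in range(cy1, cy2 + 1)]

class Router:
    def __init__(self):
        self.queries = {}  # cell -> list of (range, keyword set)

    def register_query(self, rng, keywords):
        for c in cells_of_range(*rng):
            self.queries.setdefault(c, []).append((rng, set(keywords)))

    def route(self, x, y, words):
        """Return the queries matched by an incoming geo-tagged text object.
        Objects landing in cells with no registered queries match nothing,
        mirroring Tornado's suppression of non-matching updates."""
        hits = []
        for rng, kws in self.queries.get(cell_of(x, y), []):
            x1, y1, x2, y2 = rng
            if x1 <= x <= x2 and y1 <= y <= y2 and kws <= set(words):
                hits.append((rng, kws))
        return hits
```

Registering a query in every cell its range overlaps, while routing each object to exactly one cell, is what lets matching happen locally at a single processing unit.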
Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis
In this paper we propose a new parallel architecture based on Big Data
technologies for real-time sentiment analysis on microblogging posts. Polypus
is a modular framework that provides the following functionalities: (1) massive
text extraction from Twitter, (2) distributed non-relational storage optimized
for time range queries, (3) memory-based intermodule buffering, (4) real-time
sentiment classification, (5) near real-time keyword sentiment aggregation in
time series, (6) an HTTP API to interact with the Polypus cluster, and (7) a
web interface to analyze results visually. The whole architecture is
self-deployable and based on Docker containers.
StreetX: Spatio-Temporal Access Control Model for Data
Cities are a big source of spatio-temporal data that is shared across
entities to drive potential use cases. Many of the Spatio-temporal datasets are
confidential and are selectively shared. To allow selective sharing, several
access control models exist; however, none of them lets users express
arbitrary space and time constraints on data attributes. In this paper we
focus on a spatio-temporal access control model. Through a motivating example,
we show that the location and time attributes of data may decide its
confidentiality and can thus affect a user's access control policy. We present
StreetX, which enables users to express constraints over multiple arbitrary
space regions and time windows using a simple abstract language. StreetX is
scalable and is designed to handle large amounts of spatio-temporal data from
multiple users. Multiple space and time constraints can affect query
performance and may also result in conflicts. StreetX automatically resolves
conflicts and optimizes query evaluation with access control to improve
performance. We implemented and tested a prototype of StreetX using space
constraints, defining a region with 1,749 polygon coordinates over 10 million
data records. Our testing shows that StreetX extends current access control
with spatio-temporal capabilities. Comment: 10 pages
Query-driven Frequent Co-occurring Term Extraction over Relational Data using MapReduce
In this paper we study how to efficiently compute \textit{frequent
co-occurring terms} (FCT) in the results of a keyword query in parallel using
the popular MapReduce framework. Taking as input a keyword query q and an
integer k, an FCT query reports the k terms that are not in q, but appear most
frequently in the results of the keyword query q over multiple joined
relations. The returned terms can be used for query expansion and refinement
in traditional keyword search. Unlike single-platform FCT search methods, our
approach efficiently answers an FCT query in the MapReduce paradigm without
pre-computing the results of the original keyword query. The final FCT results
are produced by two MapReduce jobs: the first extracts statistical information
about the data, and the second computes the total frequency of each term from
the output of the first job. In both jobs, we balance the load of the mappers
and the computation of the reducers as much as possible. Analytical and
experimental evaluations demonstrate the efficiency and scalability of our
approach on TPC-H benchmark datasets of different sizes.
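The two-job data flow can be mimicked in memory with plain Python map and reduce functions. This is our toy rendering of the pipeline's shape on text rows, not the paper's relational, pre-computation-free algorithm; the row format and function names are assumptions.

```python
from collections import Counter
from itertools import chain

# Toy simulation of the two-job FCT pipeline on a list of text rows.
# Job 1 emits (term, 1) pairs for terms co-occurring with the query
# keywords; Job 2 sums the counts per term and reports the top-k.
# This mimics the map/shuffle/reduce flow in memory only.

def job1_map(row, query):
    terms = row.split()
    if all(q in terms for q in query):   # row matches the keyword query
        return [(t, 1) for t in terms if t not in query]
    return []

def job2_reduce(pairs, k):
    counts = Counter()
    for term, n in pairs:
        counts[term] += n
    return [t for t, _ in counts.most_common(k)]

def fct(rows, query, k):
    pairs = chain.from_iterable(job1_map(r, query) for r in rows)
    return job2_reduce(pairs, k)
```

In a real MapReduce deployment the shuffle phase would partition the (term, 1) pairs across reducers by term, which is where the mapper/reducer balancing mentioned above matters.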
StreamWorks - A system for Dynamic Graph Search
Acting on time-critical events by processing ever-growing social media, news,
or cyber data streams is a major technical challenge. Many of these data
sources can be modeled as multi-relational graphs. Mining and searching for
subgraph patterns in a continuous setting requires an efficient approach to
incremental graph search. The goal of our work is to enable real-time search
capabilities for graph databases. This demonstration will present a dynamic
graph query system that leverages the structural and semantic characteristics
of the underlying multi-relational graph. Comment: SIGMOD 2013: International Conference on Management of Data
ODYS: A Massively-Parallel Search Engine Using a DB-IR Tightly-Integrated Parallel DBMS
Recently, parallel search engines have been implemented based on scalable
distributed file systems such as Google File System. However, we claim that
building a massively-parallel search engine using a parallel DBMS can be an
attractive alternative since it supports a higher-level (i.e., SQL-level)
interface than that of a distributed file system for easy and less error-prone
application development while providing scalability. In this paper, we propose
a new approach of building a massively-parallel search engine using a DB-IR
tightly-integrated parallel DBMS and demonstrate its commercial-level
scalability and performance. In addition, we present a hybrid (i.e., analytic
and experimental) performance model for the parallel search engine. We have
built a five-node parallel search engine according to the proposed architecture
using a DB-IR tightly-integrated DBMS. Through extensive experiments, we show
the correctness of the model by comparing the projected output with the
experimental results of the five-node engine. Our model demonstrates that ODYS
is capable of handling 1 billion queries per day (81 queries/sec) for 30
billion web pages by using only 43,472 nodes with an average query response
time of 211 ms, which is equivalent to or better than those of commercial
search engines. We also show that, by using twice as many (86,944) nodes, ODYS
can provide an average query response time of 162 ms, which is significantly
lower than those of commercial search engines. Comment: 34 pages, 13 figures
Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
We propose a max-pooling based loss function for training Long Short-Term
Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low
CPU, memory, and latency requirements. The max-pooling loss training can be
further guided by initializing with a cross-entropy loss trained network. A
posterior smoothing based evaluation approach is employed to measure keyword
spotting performance. Our experimental results show that LSTM models trained
using cross-entropy loss or max-pooling loss outperform a cross-entropy loss
trained baseline feed-forward Deep Neural Network (DNN). In addition, a
max-pooling loss trained LSTM with a randomly initialized network outperforms
a cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM
initialized with a cross-entropy pre-trained network performs best, yielding a
relative reduction in the Area Under the Curve (AUC) measure compared to the
baseline feed-forward DNN.
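A minimal reading of the max-pooling loss for a single utterance can be sketched as follows. This is our simplification (one keyword posterior per frame, per-utterance loss), not the paper's full training recipe with cross-entropy initialization and posterior smoothing.

```python
import math

# Sketch of the max-pooling loss for one utterance: the network emits a
# per-frame posterior for the keyword; for a positive (keyword) utterance
# the cross-entropy is applied only at the single best-scoring frame,
# while for a negative utterance every frame is pushed toward zero.
# Our simplified reading, not the paper's exact formulation.

def max_pooling_loss(frame_posteriors, is_keyword):
    # clip to avoid log(0)
    p = [min(max(x, 1e-9), 1 - 1e-9) for x in frame_posteriors]
    if is_keyword:
        # positive utterance: only the peak frame contributes
        return -math.log(max(p))
    # negative utterance: all frames should score low
    return -sum(math.log(1 - q) for q in p)
```

The intuition is that a keyword only needs to be detected once somewhere in the utterance, so back-propagating through the peak frame avoids forcing high posteriors on every frame that overlaps the keyword.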
From Tweets to Events: Exploring a Scalable Solution for Twitter Streams
The unprecedented use of social media through smartphones and other
web-enabled mobile devices has enabled the rapid adoption of platforms like
Twitter. Event detection has found many applications on the web, including
breaking news identification and summarization. The recent increase in the
usage of Twitter during crises has attracted researchers to focus on detecting
events in tweets. However, current solutions have focused on static Twitter
data. The necessity to detect events in a streaming environment during fast
paced events such as a crisis presents new opportunities and challenges. In
this paper, we investigate event detection in the context of real-time Twitter
streams as observed in real-world crises. We highlight the key challenges in
this problem: the informal nature of text, and the high volume and high
velocity characteristics of Twitter streams. We present a novel approach to
address these challenges using single-pass clustering and the compression
distance to efficiently detect events in Twitter streams. Through experiments
on large Twitter datasets, we demonstrate that the proposed framework is able
to detect events in near real-time and can scale to large and noisy Twitter
streams.
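The compression-distance idea can be sketched with the normalized compression distance (NCD) and a one-pass assignment loop. Using zlib as the compressor, representing each cluster by its first tweet, and the threshold value are all our assumptions for illustration.

```python
import zlib

# Sketch of the normalized compression distance (NCD) as a text similarity
# measure: two texts that compress well together are likely about the same
# event. zlib stands in for whatever compressor is actually used (our
# assumption, not the paper's choice).

def ncd(a, b):
    ca = len(zlib.compress(a.encode()))
    cb = len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def single_pass_cluster(tweets, threshold=0.5):
    """Assign each tweet to the first cluster whose representative is
    within the NCD threshold; otherwise start a new cluster. Each tweet
    is examined exactly once, as a streaming setting requires."""
    clusters = []  # list of [representative, members]
    for t in tweets:
        for c in clusters:
            if ncd(c[0], t) < threshold:
                c[1].append(t)
                break
        else:
            clusters.append([t, [t]])
    return clusters
```

Because NCD needs no tokenization or feature engineering, it copes with the informal spelling of tweets; the single pass keeps per-tweet cost bounded as the stream grows.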
S3BD: Secure Semantic Search over Encrypted Big Data in the Cloud
Cloud storage is a widely utilized service for both personal and enterprise
demands. However, despite its advantages, many potential users with enormous
amounts of sensitive data (big data) refrain from fully utilizing the cloud
storage service due to valid concerns about data privacy. An established
solution to the cloud data privacy problem is to perform encryption on the
client-end. This approach, however, restricts data processing capabilities (e.g.,
searching over the data). Accordingly, the research problem we investigate is
how to enable real-time searching over the encrypted big data in the cloud. In
particular, semantic search is of interest to clients dealing with big data. To
address this problem, in this research, we develop a system (termed S3BD) for
searching big data using cloud services without exposing any data to cloud
providers. To maintain real-time responses over big data, S3BD proactively
prunes the search space to a subset of the whole dataset. For that purpose, we
propose a method to cluster the encrypted data. An abstract of each cluster is
maintained on the client-end to navigate the search operation to the
appropriate clusters at search time. Results of experiments carried out on
real-world big datasets demonstrate that the search operation is achieved in
real time and is significantly more efficient than its counterparts. In
addition, a fully functional prototype of S3BD is made publicly available.
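The abstract-guided pruning step can be sketched as routing a query to the clusters whose client-side abstracts best overlap its terms. The overlap score and all names below are our illustration, not S3BD's exact ranking or its handling of encrypted terms.

```python
# Sketch of abstract-guided search-space pruning: each cluster of
# (encrypted) documents is summarized by a small client-side abstract of
# representative terms; a query is routed only to the clusters whose
# abstracts best overlap its terms, so the search never touches the whole
# index. The overlap score is our stand-in for S3BD's method.

def route_query(query_terms, abstracts, top_k=2):
    """abstracts: {cluster_id: set of terms}. Return up to top_k cluster
    ids ranked by term overlap with the query; clusters with no overlap
    are pruned entirely."""
    q = set(query_terms)
    scored = sorted(abstracts.items(),
                    key=lambda item: len(q & item[1]),
                    reverse=True)
    return [cid for cid, terms in scored[:top_k] if q & terms]
```

Only the selected clusters are then searched on the server side, which is how response time stays bounded as the dataset grows.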
Processing Tweets for Cybersecurity Threat Awareness
Receiving timely and relevant security information is crucial for maintaining
a high-security level on an IT infrastructure. This information can be
extracted from Open Source Intelligence published daily by users, security
organisations, and researchers. In particular, Twitter has become an
information hub for obtaining cutting-edge information about many subjects,
including cybersecurity. This work proposes SYNAPSE, a Twitter-based streaming
threat monitor that generates a continuously updated summary of the threat
landscape related to a monitored infrastructure. Its tweet-processing pipeline
is composed of filtering, feature extraction, binary classification, an
innovative clustering strategy, and generation of Indicators of Compromise
(IoCs). A quantitative evaluation considering all tweets from 80 accounts over
more than 8 months (over 195,000 tweets) shows that our approach promptly and
successfully finds the majority of security-related tweets concerning an
example IT infrastructure (true positive rate above 90%), incorrectly selects
only a small number of tweets as relevant (false positive rate under 10%), and
summarises the results into very few IoCs per day. A qualitative evaluation of
the IoCs generated by SYNAPSE demonstrates their relevance (based on the CVSS
score and the availability of patches or exploits) and timeliness (based on
threat disclosure dates from the NVD).
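The front of such a pipeline can be sketched as an asset filter followed by grouping on CVE identifiers, standing in for the trained classifier and clustering stages. The asset list, the regex, and the grouping are our illustration, not SYNAPSE's actual classifier or clustering strategy.

```python
import re
from collections import defaultdict

# Sketch of a tweet-processing front end: keep only tweets mentioning a
# monitored asset, then group the survivors by the CVE identifiers they
# cite. This stands in for SYNAPSE's binary classification and clustering
# stages (our simplification, not the paper's pipeline).

CVE = re.compile(r"CVE-\d{4}-\d{4,7}")

def threat_summary(tweets, assets):
    """Map each mentioned CVE id to the relevant tweets."""
    iocs = defaultdict(list)
    for t in tweets:
        if not any(a.lower() in t.lower() for a in assets):
            continue  # not about our infrastructure
        for cve in CVE.findall(t):
            iocs[cve].append(t)
    return dict(iocs)
```

A keyword filter alone would miss the informally worded tweets a trained classifier catches, which is why the real pipeline replaces this stage with feature extraction and binary classification.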