3,109 research outputs found
Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster
The widespread use of GPS-enabled smartphones along with the popularity of
micro-blogging and social networking applications, e.g., Twitter and Facebook,
has resulted in the generation of huge streams of geo-tagged textual data. Many
applications require real-time processing of these streams. For example,
location-based e-coupon and ad-targeting systems enable advertisers to register
millions of ads to millions of users. The number of users is typically very
high and they are continuously moving, and the ads change frequently as well.
Hence sending the right ad to the matching users is very challenging. Existing
streaming systems are either centralized or are not spatial-keyword aware, and
cannot efficiently support the processing of rapidly arriving spatial-keyword
data streams. This paper presents Tornado, a distributed spatial-keyword stream
processing system. Tornado features routing units that fairly distribute the
workload and, furthermore, co-locate data objects and their corresponding
queries at the same processing units. The routing units use the Augmented-Grid,
a novel structure that is equipped with an efficient search algorithm for
distributing the data objects and queries. Tornado uses evaluators to process
the data objects against the queries. The routing units minimize the redundant
communication by not sending data updates for processing when these updates do
not match any query. By applying dynamically evaluated cost formulae that
continuously represent the processing overhead at each evaluator, Tornado is
adaptive to changes in the workload. Extensive experimental evaluation using
spatio-textual range queries over real Twitter data indicates that Tornado
outperforms the non-spatio-textually aware approaches by up to two orders of
magnitude in terms of the overall system throughput.
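The co-location idea behind the routing units can be sketched in a few lines. The uniform grid, the `Router` class, and all names below are our simplification for illustration; Tornado's actual Augmented-Grid and its search algorithm are more elaborate.

```python
# Sketch of grid-based co-location routing: data objects and range queries
# are mapped to the same grid cells, so a query and the objects it can match
# land at the same evaluator. A uniform grid stands in for Tornado's
# Augmented-Grid (our simplification, not the paper's structure).

GRID = 8  # 8x8 grid over the unit square

def cell_of(x, y):
    """Grid cell containing a point in [0,1] x [0,1]."""
    return (min(int(x * GRID), GRID - 1), min(int(y * GRID), GRID - 1))

def cells_of_range(x1, y1, x2, y2):
    """All grid cells overlapping a rectangular query range."""
    cx1, cy1 = cell_of(x1, y1)
    cx2, cy2 = cell_of(x2, y2)
    return [(cx, cy) for cx in range(cx1, cx2 + 1)
                     for cy in range(cy1, cy2 + 1)]

class Router:
    def __init__(self):
        self.queries = {}  # cell -> list of (range, keyword set)

    def register_query(self, rng, keywords):
        for c in cells_of_range(*rng):
            self.queries.setdefault(c, []).append((rng, set(keywords)))

    def route(self, x, y, words):
        """Return the queries matched by an incoming geo-tagged text object.
        Objects landing in cells with no registered queries match nothing,
        mirroring Tornado's suppression of non-matching updates."""
        hits = []
        for rng, kws in self.queries.get(cell_of(x, y), []):
            x1, y1, x2, y2 = rng
            if x1 <= x <= x2 and y1 <= y <= y2 and kws <= set(words):
                hits.append((rng, kws))
        return hits
```

Registering a query in every cell its range overlaps, while routing each object to exactly one cell, is what lets matching happen locally at a single processing unit.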
Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis
In this paper we propose a new parallel architecture based on Big Data
technologies for real-time sentiment analysis on microblogging posts. Polypus
is a modular framework that provides the following functionalities: (1) massive
text extraction from Twitter, (2) distributed non-relational storage optimized
for time range queries, (3) memory-based intermodule buffering, (4) real-time
sentiment classification, (5) near real-time keyword sentiment aggregation in
time series, (6) an HTTP API to interact with the Polypus cluster, and (7) a
web interface to analyze results visually. The whole architecture is
self-deployable and based on Docker containers.
StreetX: Spatio-Temporal Access Control Model for Data
Cities are a big source of spatio-temporal data that is shared across
entities to drive potential use cases. Many of the Spatio-temporal datasets are
confidential and are selectively shared. To allow selective sharing, several
access control models exist; however, none of them lets users express
arbitrary space and time constraints on data attributes. In this paper we
focus on a spatio-temporal access control model. Through a motivating example,
we show that the location and time attributes of data may decide its
confidentiality and can thus affect a user's access control policy. We present
StreetX, which enables users to express constraints over multiple arbitrary
space regions and time windows using a simple abstract language. StreetX is
scalable and is designed to handle large amounts of spatio-temporal data from
multiple users. Multiple space and time constraints can affect query
performance and may also result in conflicts. StreetX automatically resolves
conflicts and optimizes query evaluation with access control to improve
performance. We implemented and tested a prototype of StreetX using space
constraints, defining a region with 1,749 polygon coordinates over 10 million
data records. Our testing shows that StreetX extends current access control
with spatio-temporal capabilities. Comment: 10 pages
Query-driven Frequent Co-occurring Term Extraction over Relational Data using MapReduce
In this paper we study how to efficiently compute \textit{frequent
co-occurring terms} (FCT) in the results of a keyword query in parallel using
the popular MapReduce framework. Taking as input a keyword query q and an
integer k, an FCT query reports the k terms that are not in q, but appear most
frequently in the results of the keyword query q over multiple joined
relations. The returned terms can be used for query expansion and refinement
in traditional keyword search. Unlike single-platform FCT search methods, our
approach efficiently answers an FCT query in the MapReduce paradigm without
pre-computing the results of the original keyword query. The final FCT results
are produced by two MapReduce jobs: the first extracts statistical information
about the data, and the second computes the total frequency of each term from
the output of the first job. In both jobs, we balance the load of the mappers
and the computation of the reducers as much as possible. Analytical and
experimental evaluations demonstrate the efficiency and scalability of our
approach on TPC-H benchmark datasets of different sizes.
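The two-job data flow can be mimicked in memory with plain Python map and reduce functions. This is our toy rendering of the pipeline's shape on text rows, not the paper's relational, pre-computation-free algorithm; the row format and function names are assumptions.

```python
from collections import Counter
from itertools import chain

# Toy simulation of the two-job FCT pipeline on a list of text rows.
# Job 1 emits (term, 1) pairs for terms co-occurring with the query
# keywords; Job 2 sums the counts per term and reports the top-k.
# This mimics the map/shuffle/reduce flow in memory only.

def job1_map(row, query):
    terms = row.split()
    if all(q in terms for q in query):   # row matches the keyword query
        return [(t, 1) for t in terms if t not in query]
    return []

def job2_reduce(pairs, k):
    counts = Counter()
    for term, n in pairs:
        counts[term] += n
    return [t for t, _ in counts.most_common(k)]

def fct(rows, query, k):
    pairs = chain.from_iterable(job1_map(r, query) for r in rows)
    return job2_reduce(pairs, k)
```

In a real MapReduce deployment the shuffle phase would partition the (term, 1) pairs across reducers by term, which is where the mapper/reducer balancing mentioned above matters.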
StreamWorks - A system for Dynamic Graph Search
Acting on time-critical events by processing ever-growing social media, news,
or cyber data streams is a major technical challenge. Many of these data
sources can be modeled as multi-relational graphs. Mining and searching for
subgraph patterns in a continuous setting requires an efficient approach to
incremental graph search. The goal of our work is to enable real-time search
capabilities for graph databases. This demonstration will present a dynamic
graph query system that leverages the structural and semantic characteristics
of the underlying multi-relational graph. Comment: SIGMOD 2013: International Conference on Management of Data
ODYS: A Massively-Parallel Search Engine Using a DB-IR Tightly-Integrated Parallel DBMS
Recently, parallel search engines have been implemented based on scalable
distributed file systems such as Google File System. However, we claim that
building a massively-parallel search engine using a parallel DBMS can be an
attractive alternative since it supports a higher-level (i.e., SQL-level)
interface than that of a distributed file system for easy and less error-prone
application development while providing scalability. In this paper, we propose
a new approach of building a massively-parallel search engine using a DB-IR
tightly-integrated parallel DBMS and demonstrate its commercial-level
scalability and performance. In addition, we present a hybrid (i.e., analytic
and experimental) performance model for the parallel search engine. We have
built a five-node parallel search engine according to the proposed architecture
using a DB-IR tightly-integrated DBMS. Through extensive experiments, we show
the correctness of the model by comparing the projected output with the
experimental results of the five-node engine. Our model demonstrates that ODYS
is capable of handling 1 billion queries per day (81 queries/sec) for 30
billion web pages by using only 43,472 nodes with an average query response
time of 211 ms, which is equivalent to or better than those of commercial
search engines. We also show that, by using twice as many (86,944) nodes, ODYS
can provide an average query response time of 162 ms, which is significantly
lower than those of commercial search engines. Comment: 34 pages, 13 figures
Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
We propose a max-pooling based loss function for training Long Short-Term
Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low
CPU, memory, and latency requirements. The max-pooling loss training can be
further guided by initializing with a cross-entropy loss trained network. A
posterior smoothing based evaluation approach is employed to measure keyword
spotting performance. Our experimental results show that LSTM models trained
using cross-entropy loss or max-pooling loss outperform a cross-entropy loss
trained baseline feed-forward Deep Neural Network (DNN). In addition, a
max-pooling loss trained LSTM with a randomly initialized network outperforms
a cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM
initialized with a cross-entropy pre-trained network performs best, yielding a
relative reduction in the Area Under the Curve (AUC) measure compared to the
baseline feed-forward DNN.
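A minimal reading of the max-pooling loss for a single utterance can be sketched as follows. This is our simplification (one keyword posterior per frame, per-utterance loss), not the paper's full training recipe with cross-entropy initialization and posterior smoothing.

```python
import math

# Sketch of the max-pooling loss for one utterance: the network emits a
# per-frame posterior for the keyword; for a positive (keyword) utterance
# the cross-entropy is applied only at the single best-scoring frame,
# while for a negative utterance every frame is pushed toward zero.
# Our simplified reading, not the paper's exact formulation.

def max_pooling_loss(frame_posteriors, is_keyword):
    # clip to avoid log(0)
    p = [min(max(x, 1e-9), 1 - 1e-9) for x in frame_posteriors]
    if is_keyword:
        # positive utterance: only the peak frame contributes
        return -math.log(max(p))
    # negative utterance: all frames should score low
    return -sum(math.log(1 - q) for q in p)
```

The intuition is that a keyword only needs to be detected once somewhere in the utterance, so back-propagating through the peak frame avoids forcing high posteriors on every frame that overlaps the keyword.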
From Tweets to Events: Exploring a Scalable Solution for Twitter Streams
The unprecedented use of social media through smartphones and other
web-enabled mobile devices has enabled the rapid adoption of platforms like
Twitter. Event detection has found many applications on the web, including
breaking news identification and summarization. The recent increase in the
usage of Twitter during crises has attracted researchers to focus on detecting
events in tweets. However, current solutions have focused on static Twitter
data. The necessity to detect events in a streaming environment during fast
paced events such as a crisis presents new opportunities and challenges. In
this paper, we investigate event detection in the context of real-time Twitter
streams as observed in real-world crises. We highlight the key challenges in
this problem: the informal nature of text, and the high volume and high
velocity characteristics of Twitter streams. We present a novel approach to
address these challenges using single-pass clustering and the compression
distance to efficiently detect events in Twitter streams. Through experiments
on large Twitter datasets, we demonstrate that the proposed framework is able
to detect events in near real-time and can scale to large and noisy Twitter
streams.
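The compression-distance idea can be sketched with the normalized compression distance (NCD) and a one-pass assignment loop. Using zlib as the compressor, representing each cluster by its first tweet, and the threshold value are all our assumptions for illustration.

```python
import zlib

# Sketch of the normalized compression distance (NCD) as a text similarity
# measure: two texts that compress well together are likely about the same
# event. zlib stands in for whatever compressor is actually used (our
# assumption, not the paper's choice).

def ncd(a, b):
    ca = len(zlib.compress(a.encode()))
    cb = len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def single_pass_cluster(tweets, threshold=0.5):
    """Assign each tweet to the first cluster whose representative is
    within the NCD threshold; otherwise start a new cluster. Each tweet
    is examined exactly once, as a streaming setting requires."""
    clusters = []  # list of [representative, members]
    for t in tweets:
        for c in clusters:
            if ncd(c[0], t) < threshold:
                c[1].append(t)
                break
        else:
            clusters.append([t, [t]])
    return clusters
```

Because NCD needs no tokenization or feature engineering, it copes with the informal spelling of tweets; the single pass keeps per-tweet cost bounded as the stream grows.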
S3BD: Secure Semantic Search over Encrypted Big Data in the Cloud
Cloud storage is a widely utilized service for both personal and enterprise
demands. However, despite its advantages, many potential users with enormous
amounts of sensitive data (big data) refrain from fully utilizing the cloud
storage service due to valid concerns about data privacy. An established
solution to the cloud data privacy problem is to perform encryption on the
client-end. This approach, however, restricts data processing capabilities (e.g.,
searching over the data). Accordingly, the research problem we investigate is
how to enable real-time searching over the encrypted big data in the cloud. In
particular, semantic search is of interest to clients dealing with big data. To
address this problem, in this research, we develop a system (termed S3BD) for
searching big data using cloud services without exposing any data to cloud
providers. To maintain real-time responses over big data, S3BD proactively
prunes the search space to a subset of the whole dataset. For that purpose, we
propose a method to cluster the encrypted data. An abstract of each cluster is
maintained on the client-end to navigate the search operation to the
appropriate clusters at search time. Results of experiments carried out on
real-world big datasets demonstrate that the search operation is achieved in
real time and is significantly more efficient than its counterparts. In
addition, a fully functional prototype of S3BD is made publicly available.
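The abstract-guided pruning step can be sketched as routing a query to the clusters whose client-side abstracts best overlap its terms. The overlap score and all names below are our illustration, not S3BD's exact ranking or its handling of encrypted terms.

```python
# Sketch of abstract-guided search-space pruning: each cluster of
# (encrypted) documents is summarized by a small client-side abstract of
# representative terms; a query is routed only to the clusters whose
# abstracts best overlap its terms, so the search never touches the whole
# index. The overlap score is our stand-in for S3BD's method.

def route_query(query_terms, abstracts, top_k=2):
    """abstracts: {cluster_id: set of terms}. Return up to top_k cluster
    ids ranked by term overlap with the query; clusters with no overlap
    are pruned entirely."""
    q = set(query_terms)
    scored = sorted(abstracts.items(),
                    key=lambda item: len(q & item[1]),
                    reverse=True)
    return [cid for cid, terms in scored[:top_k] if q & terms]
```

Only the selected clusters are then searched on the server side, which is how response time stays bounded as the dataset grows.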
Processing Tweets for Cybersecurity Threat Awareness
Receiving timely and relevant security information is crucial for maintaining
a high-security level on an IT infrastructure. This information can be
extracted from Open Source Intelligence published daily by users, security
organisations, and researchers. In particular, Twitter has become an
information hub for obtaining cutting-edge information about many subjects,
including cybersecurity. This work proposes SYNAPSE, a Twitter-based streaming
threat monitor that generates a continuously updated summary of the threat
landscape related to a monitored infrastructure. Its tweet-processing pipeline
is composed of filtering, feature extraction, binary classification, an
innovative clustering strategy, and generation of Indicators of Compromise
(IoCs). A quantitative evaluation considering all tweets from 80 accounts over
more than 8 months (over 195,000 tweets) shows that our approach promptly and
successfully finds the majority of security-related tweets concerning an
example IT infrastructure (true positive rate above 90%), incorrectly selects
only a small number of tweets as relevant (false positive rate under 10%), and
summarises the results into very few IoCs per day. A qualitative evaluation of
the IoCs generated by SYNAPSE demonstrates their relevance (based on the CVSS
score and the availability of patches or exploits) and timeliness (based on
threat disclosure dates from the NVD).
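The front of such a pipeline can be sketched as an asset filter followed by grouping on CVE identifiers, standing in for the trained classifier and clustering stages. The asset list, the regex, and the grouping are our illustration, not SYNAPSE's actual classifier or clustering strategy.

```python
import re
from collections import defaultdict

# Sketch of a tweet-processing front end: keep only tweets mentioning a
# monitored asset, then group the survivors by the CVE identifiers they
# cite. This stands in for SYNAPSE's binary classification and clustering
# stages (our simplification, not the paper's pipeline).

CVE = re.compile(r"CVE-\d{4}-\d{4,7}")

def threat_summary(tweets, assets):
    """Map each mentioned CVE id to the relevant tweets."""
    iocs = defaultdict(list)
    for t in tweets:
        if not any(a.lower() in t.lower() for a in assets):
            continue  # not about our infrastructure
        for cve in CVE.findall(t):
            iocs[cve].append(t)
    return dict(iocs)
```

A keyword filter alone would miss the informally worded tweets a trained classifier catches, which is why the real pipeline replaces this stage with feature extraction and binary classification.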