Knowledge-infused and Consistent Complex Event Processing over Real-time and Persistent Streams
Emerging applications in Internet of Things (IoT) and Cyber-Physical Systems
(CPS) present novel challenges to Big Data platforms for performing online
analytics. Ubiquitous sensors from IoT deployments are able to generate data
streams at high velocity that include information from a variety of domains,
and accumulate to large volumes on disk. Complex Event Processing (CEP) is
recognized as an important real-time computing paradigm for analyzing
continuous data streams. However, existing work on CEP is largely limited to
relational query processing, exposing two distinctive gaps for query
specification and execution: (1) infusing the relational query model with
higher level knowledge semantics, and (2) seamless query evaluation across
temporal spaces that span past, present and future events. These allow
accessible analytics over data streams having properties from different
disciplines, and help span the velocity (real-time) and volume (persistent)
dimensions. In this article, we introduce a Knowledge-infused CEP (X-CEP)
framework that provides domain-aware knowledge query constructs along with
temporal operators that allow end-to-end queries to span across real-time and
persistent streams. We translate this query model to efficient query execution
over online and offline data streams, proposing several optimizations to
mitigate the overheads introduced by evaluating semantic predicates and in
accessing high-volume historic data streams. The proposed X-CEP query model and
execution approaches are implemented in our prototype semantic CEP engine,
SCEPter. We validate our query model using domain-aware CEP queries from a
real-world Smart Power Grid application, and experimentally analyze the
benefits of our optimizations for executing these queries, using event streams
from a campus-microgrid IoT deployment.
Comment: 34 pages, 16 figures, accepted in Future Generation Computer Systems,
October 27, 201
Manycore processing of repeated range queries over massive moving objects observations
The ability to timely process significant amounts of continuously updated
spatial data is mandatory for an increasing number of applications. Parallelism
enables such applications to face this data-intensive challenge and allows the
devised systems to feature low latency and high scalability. In this paper we
focus on a specific data-intensive problem, concerning the repeated processing
of huge amounts of range queries over massive sets of moving objects, where the
spatial extents of queries and objects are continuously modified over time. To
tackle this problem and significantly accelerate query processing we devise a
hybrid CPU/GPU pipeline that compresses data output and saves query processing
work. The devised system relies on an ad-hoc spatial index leading to a problem
decomposition that results in a set of independent data-parallel tasks. The
index is based on a point-region quadtree space decomposition and allows us to
effectively tackle a broad range of spatial object distributions, even highly
skewed ones. Also, to deal with the architectural peculiarities and limitations
of the GPUs, we adopt non-trivial GPU data structures that avoid the need of
locked memory accesses and favour coalesced memory accesses, thus enhancing the
overall memory throughput. To the best of our knowledge this is the first work
that exploits GPUs to efficiently solve repeated range queries over massive
sets of continuously moving objects, characterized by highly skewed spatial
distributions. In comparison with state-of-the-art CPU-based implementations,
our method achieves significant speedups on the order of 14x-20x, depending
on the datasets, even when considering very cheap GPUs
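
As a rough illustration of the point-region quadtree decomposition mentioned in this abstract, the following is a minimal, sequential Python sketch. It is not the authors' hybrid CPU/GPU pipeline: the class name `QuadTree`, the `capacity` parameter, and the inclusive query rectangle are assumptions made here for illustration, and none of the data-parallel or GPU-specific data structures are shown.

```python
class QuadTree:
    """Point-region quadtree over the half-open square [x, x+size) x [y, y+size).

    Hypothetical sketch: names and parameters are illustrative assumptions.
    """

    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size, self.capacity = x, y, size, capacity
        self.points = []      # points stored in this node while it is a leaf
        self.children = None  # four sub-quadrants once the node has split

    def insert(self, px, py):
        """Insert a point; return False if it lies outside this node's square."""
        inside_x = self.x <= px < self.x + self.size
        inside_y = self.y <= py < self.y + self.size
        if not (inside_x and inside_y):
            return False
        if self.children is None:
            # Tiny cells stop splitting so duplicate points cannot recurse forever.
            if len(self.points) < self.capacity or self.size <= 1e-9:
                self.points.append((px, py))
                return True
            self._split()
        return any(child.insert(px, py) for child in self.children)

    def _split(self):
        """Turn this leaf into an inner node with four equal quadrants."""
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half, self.capacity)
            for dx in (0, 1)
            for dy in (0, 1)
        ]
        for px, py in self.points:
            for child in self.children:
                if child.insert(px, py):
                    break
        self.points = []

    def query(self, x0, y0, x1, y1):
        """Return all points with x0 <= px <= x1 and y0 <= py <= y1."""
        if (x1 < self.x or x0 >= self.x + self.size
                or y1 < self.y or y0 >= self.y + self.size):
            return []  # the query rectangle misses this quadrant entirely
        if self.children is None:
            return [(px, py) for px, py in self.points
                    if x0 <= px <= x1 and y0 <= py <= y1]
        found = []
        for child in self.children:
            found.extend(child.query(x0, y0, x1, y1))
        return found
```

Because each quadrant can be searched without touching its siblings, a decomposition of this kind naturally yields independent tasks; how the paper maps those tasks onto the GPU is not described in the abstract and is not modelled here.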
LiteMat: a scalable, cost-efficient inference encoding scheme for large RDF graphs
The number of linked data sources and the size of the linked open data graph
keep growing every day. As a consequence, semantic RDF services are more and
more confronted with various "big data" problems. Query processing in the
presence of inferences is one of them. For instance, to complete the answer set of
SPARQL queries, RDF database systems evaluate semantic RDFS relationships
(subPropertyOf, subClassOf) through time-consuming query rewriting algorithms
or space-consuming data materialization solutions. To reduce the memory
footprint and ease the exchange of large datasets, these systems generally
apply a dictionary approach for compressing triple data sizes by replacing
resource identifiers (IRIs), blank nodes and literals with integer values. In
this article, we present a structured resource identification scheme using a
clever encoding of concepts and property hierarchies for efficiently evaluating
the main common RDFS entailment rules while minimizing triple materialization
and query rewriting. We will show how this encoding can be computed by a
scalable parallel algorithm and directly be implemented over the Apache Spark
framework. The efficiency of our encoding scheme is emphasized by an evaluation
conducted over both synthetic and real world datasets.
Comment: 8 pages, 1 figur
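
The abstract does not spell out the encoding itself, so the following Python sketch only illustrates the general idea of hierarchy-aware identifiers, under stated assumptions: the class hierarchy is a tree, and identifiers are bit strings (rather than the integers the abstract mentions) in which every class extends its parent's identifier. The function names and the toy hierarchy are hypothetical and are not taken from LiteMat.

```python
def encode_hierarchy(children, root):
    """Assign each class a bit-string ID that starts with its parent's ID.

    `children` maps a class name to the list of its direct subclasses.
    Illustrative assumption: the hierarchy is a tree rooted at `root`.
    """
    ids = {root: "1"}
    stack = [root]
    while stack:
        parent = stack.pop()
        kids = children.get(parent, [])
        # Fixed width per parent: enough bits for local indexes 1..len(kids).
        width = max(1, len(kids).bit_length())
        for local_index, kid in enumerate(kids, start=1):
            ids[kid] = ids[parent] + format(local_index, f"0{width}b")
            stack.append(kid)
    return ids


def is_subclass_of(ids, sub, sup):
    """Check subClassOf (reflexive, transitive) with a single prefix test."""
    return ids[sub].startswith(ids[sup])


# Toy hierarchy, invented for this example.
hierarchy = {
    "Thing": ["Agent", "Place"],
    "Agent": ["Person", "Organization"],
    "Person": ["Student"],
}
class_ids = encode_hierarchy(hierarchy, "Thing")
print(is_subclass_of(class_ids, "Student", "Agent"))  # True
print(is_subclass_of(class_ids, "Place", "Agent"))    # False
```

With identifiers of this shape, a subclass check needs neither materialized triples nor a rewritten query, which is the trade-off the abstract describes; the actual scheme and its Spark-based computation may differ substantially from this sketch.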
Practical Private Information Retrieval
In recent years, the subject of online privacy has been attracting much interest, especially as more Internet users than ever are beginning to care about the privacy of their online activities. Privacy concerns are even prompting legislators in some countries to demand from service providers a more privacy-friendly Internet experience for their citizens. These are welcome developments and in stark contrast to the practice of Internet censorship and surveillance that legislators in some nations have been known to promote. The development of Internet systems that are able to protect user privacy requires private information retrieval (PIR) schemes that are practical, because no other efficient techniques exist for preserving the confidentiality of the retrieval requests and responses of a user from an Internet system holding unencrypted data. This thesis studies how PIR schemes can be made more relevant and practical for the development of systems that are protective of users' privacy.
Private information retrieval schemes are cryptographic constructions for retrieving data from a database, without the database (or database administrator) being able to learn any information about the content of the query. PIR can be applied to preserve the confidentiality of queries to online data sources in many domains, such as online patents, real-time stock quotes, Internet domain names, location-based services, online behavioural profiling and advertising, search engines, and so on.
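To make the definition above concrete, here is a minimal Python sketch of the classic two-server XOR-based information-theoretic PIR construction. It is offered only as the simplest textbook illustration, not as one of the schemes developed in the thesis. It assumes two non-colluding servers holding identical copies of the database, with records represented as integers; all function names are hypothetical.

```python
import secrets


def make_queries(n, index):
    """Build two bit-vector queries that differ only at the wanted index.

    Each vector on its own is uniformly random, so a single server
    learns nothing about `index` (assuming the servers do not collude).
    """
    q1 = [secrets.randbelow(2) for _ in range(n)]
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2


def answer(database, query):
    """Server side: XOR together every record selected by the query bits."""
    acc = 0
    for record, bit in zip(database, query):
        if bit:
            acc ^= record
    return acc


def retrieve(database, index):
    """Client side: combine both answers to recover the wanted record.

    For brevity both servers are simulated with the same `database` list.
    """
    q1, q2 = make_queries(len(database), index)
    return answer(database, q1) ^ answer(database, q2)
```

All records selected by both queries cancel under XOR, leaving only the record at the position where the two queries differ. Note that each server still touches every record, which is why the computational cost of PIR is a central concern.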
In this thesis, we study private information retrieval and obtain results that seek to make PIR more relevant in practice than all previous treatments of the subject in the literature, which have been mostly theoretical. We also show that PIR is the most computationally efficient known technique for providing access privacy under realistic computation powers and network bandwidths. Our result covers all currently known varieties of PIR schemes. We provide a more detailed summary of our contributions below:
Our first result addresses an existing question regarding the computational practicality of private information retrieval schemes. We show that, contrary to previous arguments, recent lattice-based computational PIR schemes and multi-server information-theoretic PIR schemes are much more computationally efficient than a trivial transfer of the entire PIR database from the server to the client (i.e., trivial download). Our result shows that the end-to-end response times of these schemes are one to three orders of magnitude (10 to 1000 times) smaller than those of the trivial download of the database for realistic computation powers and network bandwidths. This result extends and clarifies the well-known result of Sion and Carbunar on the computational practicality of PIR.
Our second result is a novel approach for preserving the privacy of sensitive constants in an SQL query, which improves substantially upon the earlier work. Specifically, we provide an expressive SQL data access model atop the existing rudimentary index- and keyword-based data access models of PIR. The resulting SQL-based model improves query throughput by a factor of between 7 and 480 over previous work.
We then provide a PIR-based approach for preserving access privacy over large databases. Unlike previously published access privacy approaches, we explore new ideas about privacy-preserving constraint-based query transformations, offline data classification, and privacy-preserving queries to index structures much smaller than the databases. This work addresses an important open problem about how real systems can systematically apply existing PIR schemes for querying large databases.
In terms of applications, we apply PIR to solve user privacy problems in the domains of patent database querying and location-based services, user and database privacy problems in the domain of the online sales of digital goods, and a scalability problem for the Tor anonymous communication network.
We develop practical tools for most of our techniques, which can be useful for adding PIR support to existing and new Internet system designs
Dynamic Integration of Evolving Distributed Databases using Services
This thesis investigates the integration of many separate existing heterogeneous and distributed databases which, due to organizational changes, must be merged and appear as one database. A solution to some database evolution problems is presented. It presents an Evolution Adaptive Service-Oriented Data Integration Architecture (EA-SODIA) to dynamically integrate heterogeneous and distributed source databases, aiming to minimize the cost of the maintenance caused by database evolution.
An algorithm, named Relational Schema Mapping by Views (RSMV), is designed to integrate source databases that are exposed as services into a pre-designed global schema that is in a data integrator service. Instead of producing hard-coded programs, views are built using relational algebra operations to eliminate the heterogeneities among the source databases. More importantly, the definitions of those views are represented and stored in the meta-database with some constraints to test their validity. Consequently, the method, called Evolution Detection, is then able to identify in the meta-database the views affected by evolutions and then modify them automatically.
An evaluation is presented using a case study. Firstly, it is shown that most types of heterogeneity defined in this thesis can be eliminated by RSMV, except semantic conflict. Secondly, it is shown that little manual modification of the system is required as long as the evolutions follow the rules. For only three types of database evolution is human intervention required, and some existing views are discarded. Thirdly, the computational cost of the automatic modification shows slow linear growth in the number of source databases. Other characteristics addressed include EA-SODIA's scalability, domain independence, autonomy of source databases, and potential for involving other data sources (e.g., XML). Finally, a descriptive comparison with other data integration approaches is presented. It shows that although other approaches may provide better query processing performance in some circumstances, the service-oriented architecture provides better autonomy, flexibility and capability of evolution
Scene Understanding For Real Time Processing Of Queries Over Big Data Streaming Video
With heightened security concerns across the globe and the increasing need to monitor, preserve and protect infrastructure and public spaces to ensure proper operation, quality assurance and safety, numerous video cameras have been deployed. Accordingly, they also need to be monitored effectively and efficiently. However, relying on human operators to constantly monitor all the video streams is not scalable or cost effective. Humans can become subjective, fatigued, even exhibit bias and it is difficult to maintain high levels of vigilance when capturing, searching and recognizing events that occur infrequently or in isolation. These limitations are addressed in the Live Video Database Management System (LVDBMS), a framework for managing and processing live motion imagery data. It enables rapid development of video surveillance software much like traditional database applications are developed today. Such developed video stream processing applications and ad hoc queries are able to reuse advanced image processing techniques that have been developed. This results in lower software development and maintenance costs. Furthermore, the LVDBMS can be intensively tested to ensure consistent quality across all associated video database applications. Its intrinsic privacy framework facilitates a formalized approach to the specification and enforcement of verifiable privacy policies. This is an important step towards enabling a general privacy certification for video surveillance systems by leveraging a standardized privacy specification language. With the potential to impact many important fields ranging from security and assembly line monitoring to wildlife studies and the environment, the broader impact of this work is clear. The privacy framework protects the general public from abusive use of surveillance technology; success in addressing the trust issue will enable many new surveillance-related applications.
Although this research focuses on video surveillance, the proposed framework has the potential to support many video-based analytical applications