General Boolean Expressions in Publish-Subscribe Systems
The increasing amount of electronically available information in society today is undeniable. Examples include the numbers of general web pages, scientific publications, and items in online auctions. From a user's perspective, this trend will lead to information overflow. Moreover, information publishers are compromised by this situation, as users have greater difficulty in identifying useful information.
Publish-subscribe systems can be applied to cope with the reality of information overflow. In these systems, users specify their information interests as subscriptions and, subsequently, only matching information (event messages) is delivered; uninteresting information is filtered out before reaching users. In this dissertation, we consider content-based publish-subscribe systems, a sophisticated example of these systems. They perform the information-filtering task based on the content of provided information. In order to deal with high numbers of subscriptions and frequencies of event messages, publish-subscribe systems are realized as distributed systems. Advertisements (publisher specifications of potential future event messages) are optionally applied in these systems to reduce the internal distribution of subscriptions.
Existing work on content-based publish-subscribe concepts mainly focuses on subscriptions and advertisements as pure conjunctive expressions. Therefore, subscriptions or advertisements using operators other than conjunction need to be canonically converted to disjunctive normal form by these systems. Each conjunctive component is then treated as an individual subscription or advertisement. Unfortunately, the size of the converted expressions is exponential in the worst case.
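The worst-case growth mentioned above can be illustrated with a minimal sketch. It is not taken from the dissertation: the function names `distribute_to_dnf` and `paired_clauses` and the predicate labels `a1`, `b1`, ... are hypothetical, chosen only for illustration. The sketch converts a conjunction of n two-way disjunctions, (a1 OR b1) AND ... AND (an OR bn), to disjunctive normal form by distributing AND over OR, and counts the resulting conjunctive components.

```python
from itertools import product


def distribute_to_dnf(clauses):
    """Distribute AND over OR for a conjunction of OR-clauses.

    `clauses` is a list of lists of predicate names, read as
    (p OR q OR ...) AND (r OR s OR ...) AND ...
    The result is a list of conjunctions, each a frozenset of predicates,
    obtained by picking one predicate from every clause.
    """
    return [frozenset(choice) for choice in product(*clauses)]


def paired_clauses(n):
    """Build the hypothetical expression (a1 OR b1) AND ... AND (an OR bn)."""
    return [[f"a{i}", f"b{i}"] for i in range(1, n + 1)]


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        components = distribute_to_dnf(paired_clauses(n))
        # The original expression has 2n predicates; its DNF has 2**n components.
        print(n, len(components))
```

Under these assumptions, a subscription with 2n predicates would be registered as 2^n separate conjunctive subscriptions, which is the overhead that direct support for general Boolean expressions aims to avoid.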
In this dissertation, we show that the direct support of general Boolean subscriptions and advertisements improves the time and space efficiency of general-purpose content-based publish-subscribe systems. For this purpose, we develop suitable approaches for the filtering and routing of general Boolean expressions in these systems. Our approaches represent solutions to exactly those components of content-based publish-subscribe systems that currently restrict subscriptions and advertisements to conjunctive expressions.
On the subscription side, we present an effective generic filtering algorithm, and a novel approach to optimize event routing tables, which we call subscription pruning. To support advertisements, we show how to calculate the overlap between subscriptions and advertisements, and introduce the first designated subscription routing optimization, which we refer to as advertisement pruning. We integrate these approaches into our prototype BoP (BOolean Publish-subscribe) which allows for the full support of general Boolean expressions in its filtering and routing components.
In the evaluation part of this dissertation, we empirically analyze our prototypical implementation BoP and compare its algorithms to existing conjunctive solutions. First, we show that our general-purpose Boolean filtering algorithm is more space- and time-efficient than a general-purpose conjunctive filtering algorithm. Second, we illustrate the effectiveness of the subscription pruning routing optimization and compare it to the existing covering optimization approach. Finally, we demonstrate the optimization effect of advertisement pruning while maintaining the existing overlapping relationships in the system.
Doctor of Philosophy
Dissertation
Linked data are the de facto standard for publishing and sharing data on the web. To date, we have been inundated with large amounts of ever-increasing linked data in constantly evolving structures. The proliferation of the data and the need to access and harvest knowledge from distributed data sources motivate us to revisit several classic problems in query processing and query optimization. The problem of answering queries over views is commonly encountered in a number of settings, including when enforcing security policies for access to linked data and when integrating data from disparate sources. We approach this problem by efficiently rewriting queries over the views to equivalent queries over the underlying linked data, thus avoiding the costs entailed by view materialization and maintenance. An outstanding problem of query rewriting is that the number of rewritten queries is exponential in the size of the query and the views, which motivates us to study the problem of multiquery optimization in the context of linked data. Our solutions are declarative and make no assumptions about the underlying storage, i.e., they are store-independent. Unlike relational and XML data, linked data are schema-less. While tracking the evolution of schema for linked data is hard, keyword search is an ideal tool to perform data integration. Existing works make crippling assumptions about the data and hence fall short in handling massive linked data with tens to hundreds of millions of facts. Our study of keyword search on linked data brings together classical techniques from the literature and our novel ideas, leading to much better query efficiency and result quality. Linked data also contain rich temporal semantics. To cope with the ever-increasing data, we have investigated how to partition and store large temporal or multiversion linked data for distributed and parallel computation, in an effort to achieve load balancing and support scalable data analytics for massive linked data.
Enhancing Usability and Explainability of Data Systems
The recent growth of data science expanded its reach to an ever-growing user base of nonexperts, increasing the need for usability, understandability, and explainability in these systems. Enhancing usability makes data systems accessible to people with different skills and backgrounds alike, leading to democratization of data systems. Furthermore, proper understanding of data and data-driven systems is necessary for users to trust the function of systems that learn from data. Finally, data systems should be transparent: when a data system behaves unexpectedly or malfunctions, the users deserve a proper explanation of what caused the observed incident. Unfortunately, most existing data systems offer limited usability and support for explanations: these systems are usable only by experts with sound technical skills, and even expert users are hindered by the lack of transparency into the systems' inner workings and functions. The aim of my thesis is to bridge the usability gap between nonexpert users and complex data systems, aid all sorts of users, including expert ones, in data and system understanding, and provide explanations that help reason about unexpected outcomes involving data systems. Specifically, my thesis has the following three goals: (1) enhancing usability of data systems for nonexperts, (2) enabling data understanding that can assist users in a variety of tasks such as achieving trust in data-driven machine learning, gaining data understanding, and data cleaning, and (3) explaining causes of unexpected outcomes involving data and data systems.
To enhance usability, we focus on example-driven user intent discovery. We develop systems based on example-driven interactions in two different settings: querying relational databases and personalized document summarization. Towards data understanding, we develop a new data-profiling primitive that can characterize tuples for which a machine-learned model is likely to produce untrustworthy predictions. We also develop an explanation framework to explain causes of such untrustworthy predictions. Additionally, this new data-profiling primitive enables interactive data cleaning. Finally, we develop two explanation frameworks, tailored to provide explanations in debugging data system components, including the data itself. The explanation frameworks focus on explaining the root cause of a concurrent application's intermittent failure and exposing issues in the data that cause a data-driven system to malfunction.
Design and Implementation of a Middleware for Uniform, Federated and Dynamic Event Processing
In recent years, real-time processing of massive event streams has become an important topic in the area of data analytics. It will become even more important in the future due to cheap sensors, a growing number of devices, and their ubiquitous interconnection, also known as the Internet of Things (IoT). Academia, industry, and the open source community have developed several event processing (EP) systems that allow users to define, manage, and execute continuous queries over event streams. They achieve significantly better performance than the traditional "store-then-process" approach, in which events are first stored and indexed in a database. Because EP systems have different roots and because of the lack of standardization, the system landscape has become highly heterogeneous. Today's EP systems differ in APIs, execution behaviors, and query languages. This thesis presents the design and implementation of a novel middleware that abstracts from different EP systems and provides a uniform API, execution behavior, and query language to users and developers. As a consequence, the presented middleware overcomes the problem of vendor lock-in, and different EP systems are enabled to cooperate with each other. In practice, event streams differ dramatically in volume and velocity. We therefore show how the middleware can connect not only to different EP systems, but also to database systems and a native implementation. Emerging applications such as the IoT raise novel challenges and require EP to be more dynamic. We present extensions to the middleware that enable self-adaptivity, which is needed in context-sensitive applications and in those that deal with constantly varying sets of event producers and consumers. Lastly, we extend the middleware to fully support the processing of events containing spatial data and to run in a distributed fashion as a federation of heterogeneous EP systems.