Towards Scalable, Cloud Based, Confidential Data Stream Processing

Abstract

Increasing data availability, velocity, variability, and size have led to the development of new data processing paradigms that offer users different ways to process and manage data specific to their needs. One such paradigm is data stream processing, as managed by Data Stream Processing Systems (DSPS). In contrast to traditional database management systems, wherein data is stationary and queries are transient, in stream processing systems data is transient and queries are stationary (that is, continuous and long running). In such systems, users expect to process temporal data, where data is considered only for some period of time and discarded afterward. Often, as with many other software applications, those who employ such systems will outsource computation to third-party computation platforms such as Amazon, IBM, or Google. Using a third party outsources not only computation but also hardware and software maintenance costs, relieving the user of having to incur these costs themselves. Moreover, when a user outsources their DSPS, they often have a service level agreement that places guarantees on service availability and uptime. Given these benefits, it is clearly desirable for a user to outsource their DSPS computation. Such outsourcing, however, may violate the privacy constraints of those who provide the data stream. Specifically, they may not wish to share their plaintext data with a third party that they may not trust. This leads to an interesting dichotomy between the desire of the user to outsource as much of their computation as possible and the desire of the data stream providers to keep their data private and avoid leaking data to a third-party system.
Current work that explores linking the two poles of this dichotomy either limits the expressiveness of supported queries, requires the data provider to trust the third-party systems, or incurs computational or monetary overheads that are prohibitive for the querier. In this dissertation, we explore methods for shrinking the gap between the poles of this dichotomy and overcome the limitations of state-of-the-art systems by providing data providers and queriers with efficient access control enforcement on untrusted third-party systems over encrypted data. Specifically, we introduce our system PolyStream for executing queries on encrypted data using computation-enabling encryption, with an online key management system. We further introduce Sanctuary to provide computation on any data on third-party systems using trusted hardware. Finally, we introduce Shoal, our query optimizer that considers the heterogeneous nature of streaming systems at optimization time to improve query performance when access controls are enforced on the streaming data. Through the union of the contributions of this dissertation, we show that considering access controls at optimization time can lead to better utilization, performance, and protection for streaming data.
