New Data Protection Abstractions for Emerging Mobile and Big Data Workloads
Two recent shifts in computing are challenging the effectiveness of traditional approaches to data protection. First, emerging machine learning workloads have complex access patterns and unique leakage characteristics that existing protection approaches do not support well. Second, mobile operating systems do not provide sufficient support for fine-grained data protection, forcing users to rely on individual applications to correctly manage and protect data. My thesis is that these emerging workloads have unique characteristics that we can leverage to build new, more effective data protection abstractions.
This dissertation presents two new data protection systems for machine learning workloads and a new system for fine-grained data management and protection on mobile devices. The first is Sage, a differentially private machine learning platform that addresses the two primary challenges of differential privacy: running out of privacy budget and the privacy-utility tradeoff. The second system, Pyramid, is the first selective data system. Pyramid leverages count featurization to reduce the amount of data exposed while training classification models by two orders of magnitude. The final system, Pebbles, provides users with logical data objects as a new fine-grained data management and protection primitive, allowing data management at a higher level of abstraction. Pebbles leverages high-level storage abstractions in mobile operating systems to discover user-recognizable, application-level data objects in unmodified mobile applications.
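The count featurization idea that Pyramid builds on can be illustrated with a minimal sketch. The function names and example data below are hypothetical, not Pyramid's actual API: each raw categorical value is replaced by its per-class occurrence counts, so a classifier trains on count summaries rather than on the raw values themselves.

```python
from collections import defaultdict

def fit_count_featurizer(values, labels):
    """Build per-value, per-class occurrence counts for one
    categorical feature (an illustrative sketch of count
    featurization, not Pyramid's implementation)."""
    counts = defaultdict(lambda: defaultdict(int))
    for value, label in zip(values, labels):
        counts[value][label] += 1
    return counts

def transform(values, counts, classes):
    # Replace each raw value by its vector of per-class counts
    # (unseen values map to all-zero vectors).
    return [[counts[v][c] for c in classes] for v in values]

# Hypothetical click-prediction data: a site feature and click labels.
sites = ["ads.com", "news.org", "ads.com", "ads.com", "news.org"]
clicks = [1, 0, 1, 0, 0]
table = fit_count_featurizer(sites, clicks)
features = transform(sites, table, classes=[0, 1])
```

The model then sees only aggregate counts per value, which is what lets a selective data system expose far less raw training data.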
Fundamentals
Volume 1 establishes the foundations of this new field. It covers all the steps from data collection, through summarization and clustering, to the different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to their resource requirements and how to enhance scalability on diverse computing architectures, ranging from embedded systems to large computing clusters.
Profiling relational data: a survey
Profiling data to determine metadata about a given dataset is an important and frequent activity of IT professionals and researchers, and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
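Two of the profiling tasks mentioned above, simple single-column statistics and a multi-column functional dependency check, can be sketched in a few lines. The function names and the digit/letter pattern encoding are illustrative choices, not the survey's notation:

```python
from collections import Counter

def profile_column(values):
    """Single-column profiling sketch: null count, distinct count,
    and the most frequent value pattern (digits -> '9', letters -> 'a')."""
    nulls = sum(v is None for v in values)
    present = [v for v in values if v is not None]
    def pattern(v):
        return "".join("9" if c.isdigit() else "a" if c.isalpha() else c
                       for c in str(v))
    top = Counter(pattern(v) for v in present).most_common(1)
    return {"nulls": nulls,
            "distinct": len(set(present)),
            "top_pattern": top[0][0] if top else None}

def holds_fd(rows, lhs, rhs):
    """Multi-column profiling sketch: does the functional dependency
    lhs -> rhs hold in rows (each row a dict)? It holds iff equal
    lhs values always imply an equal rhs value."""
    seen = {}
    for r in rows:
        key = tuple(r[c] for c in lhs)
        if seen.setdefault(key, r[rhs]) != r[rhs]:
            return False
    return True
```

Real profiling systems compute such metadata for all columns and column combinations at once, which is where the algorithmic difficulty lies.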
Security, Privacy, and Transparency Guarantees for Machine Learning Systems
Machine learning (ML) is transforming a wide range of applications, promising immense economic and social benefits. However, it also raises substantial security, privacy, and transparency challenges. ML workloads push companies toward aggressive data collection and loose data access policies, placing troves of sensitive user information at risk if the company is hacked. ML also introduces new attack vectors, such as adversarial example attacks, which can completely nullify a model's accuracy under attack. Finally, ML models make complex data-driven decisions that are opaque to end users and difficult for programmers to inspect. This dissertation describes three systems we developed. Each addresses one of these challenges by combining new practical systems techniques with rigorous theory to achieve a guaranteed level of protection and to make systems easier to understand. First, we present Sage, a differentially private ML platform that enforces a meaningful protection semantic for the troves of personal information amassed by today's companies. Second, we describe PixelDP, a defense against adversarial examples that leverages differential privacy theory to provide a guaranteed level of accuracy under attack. Third, we introduce Sunlight, a tool to enhance the transparency of opaque targeting services, using rigorous causal inference theory to explain targeting decisions to end users.
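The Laplace mechanism at the core of differentially private platforms such as Sage can be sketched as follows. This is a simplified illustration under stated assumptions, not Sage's implementation; its budget accounting and training pipeline are far more involved.

```python
import random

def dp_count(values, predicate, epsilon):
    """Epsilon-differentially private count via the Laplace mechanism.
    A counting query has sensitivity 1 (one record changes the count
    by at most 1), so Laplace noise of scale 1/epsilon suffices.
    Illustrative sketch only."""
    true_count = sum(1 for v in values if predicate(v))
    # The difference of two iid Exp(rate=epsilon) draws is
    # Laplace-distributed with location 0 and scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Smaller epsilon means stronger privacy but noisier answers; "running out of budget" refers to the fact that each such query consumes part of a finite total epsilon.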
State Management for Efficient Event Pattern Detection
Event stream processing systems continuously evaluate queries over event streams to detect user-specified patterns with low latency. The challenge is that query processing is stateful: systems maintain partial matches, whose number grows exponentially in the number of processed events.
State management is complicated by the dynamicity of streams and the need to integrate remote data. First, heterogeneous event sources yield dynamic streams with unpredictable input rates, data distributions, and query selectivities. During peak times, exhaustive processing is infeasible, and systems must resort to best-effort processing. Second, queries may require remote data to select a specific event for a pattern. Such dependencies are problematic: fetching the remote data interrupts the stream processing, yet without event selection based on remote data, the growth of partial matches is amplified.
In this dissertation, I present strategies for optimised state management in event pattern detection. First, I enable best-effort processing with load shedding that discards both input events and partial matches. I carefully select the shedding elements to satisfy a latency bound while striving for a minimal loss in result quality. Second, to efficiently integrate remote data, I decouple the fetching of remote data from its use in query evaluation by a caching mechanism. To this end, I hide the transmission latency by prefetching remote data based on anticipated use and by lazy evaluation that postpones the event selection based on remote data to avoid interruptions. A cost model is used to determine when to fetch which remote data items and how long to keep them in the cache.
I evaluated the above techniques with queries over synthetic and real-world data. I show that the load shedding technique significantly improves the recall of pattern detection over baseline approaches, while the remote data integration technique significantly reduces pattern detection latency.
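The utility-based load shedding described above can be sketched as follows. The utility function here is a hypothetical stand-in for the dissertation's estimators of a partial match's chance of completing into a full match:

```python
import heapq

def shed(partial_matches, utility, budget):
    """Load-shedding sketch: when the number of partial matches
    exceeds the processing budget (derived from a latency bound),
    keep only the `budget` matches with the highest estimated
    utility and discard the rest, trading result quality for
    bounded latency."""
    if len(partial_matches) <= budget:
        return partial_matches
    return heapq.nlargest(budget, partial_matches, key=utility)

# Illustrative use: longer partial matches are closer to completion,
# so match length serves as a crude utility estimate here.
matches = [("A",), ("A", "B"), ("A", "B", "C")]
kept = shed(matches, utility=len, budget=2)
```

In practice the utility estimate combines arrival-rate statistics and query selectivities rather than match length alone.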