Recent Advances of Differential Privacy in Centralized Deep Learning: A Systematic Survey
Differential Privacy has become a widely popular method for data protection
in machine learning, especially since it allows rigorous mathematical privacy
guarantees to be formulated. This survey provides an overview of the state of
the art in differentially private centralized deep learning, a thorough
analysis of recent advances and open problems, and a discussion of potential
future developments in the field. Based on a systematic literature review, the
following topics are addressed: auditing and evaluation methods for private
models, improvements of privacy-utility trade-offs, protection against a broad
range of threats and attacks, differentially private generative models, and
emerging application domains. (35 pages, 2 figures)
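To make the core mechanism behind much of this literature concrete, below is
a minimal NumPy sketch of a DP-SGD-style update, the workhorse of
differentially private deep learning: per-example gradients are clipped and
Gaussian noise is added before the averaged gradient is applied. The clipping
norm and noise multiplier are illustrative defaults, not values taken from
the survey.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.1, lr=0.1, rng=None):
    """One DP-SGD update: clip each example's gradient to L2 norm
    `clip_norm`, average, then add Gaussian noise with standard
    deviation noise_multiplier * clip_norm / batch_size."""
    rng = rng or np.random.default_rng()
    batch_size = len(per_example_grads)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / batch_size,
                       size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```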
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to improve performance or
scalability by limiting the data needed to train ML workloads. We present
Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches the
accuracy of state-of-the-art models while training on less than 1% of the
raw data.
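For readers unfamiliar with the primitive Pyramid builds on, the sketch below
shows count featurization in its simplest form: a high-cardinality
categorical value is replaced by label-conditional counts, so models train on
compact count vectors rather than raw identifiers. Class and method names are
our own, and Pyramid's additional machinery (noise infusion for the count
tables, windowed retention) is omitted.

```python
from collections import defaultdict

class CountFeaturizer:
    """Replace a categorical feature value with label-conditional counts."""

    def __init__(self, n_labels):
        self.n_labels = n_labels
        self.counts = defaultdict(lambda: [0] * n_labels)

    def observe(self, value, label):
        # Update count tables from the stream of labeled data.
        self.counts[value][label] += 1

    def featurize(self, value):
        # Counts plus per-label rates; unseen values map to zeros.
        c = self.counts[value]
        total = sum(c) or 1
        return c + [ci / total for ci in c]
```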
Data Distillation: A Survey
The popularity of deep learning has led to the curation of a vast number of
massive and multifarious datasets. Although such models can reach
close-to-human performance on individual tasks, training parameter-hungry
models on large datasets poses
multi-faceted problems such as (a) high model-training time; (b) slow research
iteration; and (c) poor eco-sustainability. As an alternative, data
distillation approaches aim to synthesize terse data summaries, which can serve
as effective drop-in replacements of the original dataset for scenarios like
model training, inference, architecture search, etc. In this survey, we present
a formal framework for data distillation, along with providing a detailed
taxonomy of existing approaches. Additionally, we cover data distillation
approaches for different data modalities, namely images, graphs, and user-item
interactions (recommender systems), while also identifying current challenges
and future research directions. (Accepted at TMLR '23; 21 pages, 4 figures)
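As one concrete instance of the techniques such a survey covers, the sketch
below shows the gradient-matching flavor of data distillation in PyTorch: the
synthetic set is optimized so that the gradient it induces on a model matches
the gradient induced by real data. It assumes `syn_x` is a leaf tensor with
`requires_grad=True` that `syn_opt` optimizes; this is a generic
illustration, not the survey's formal framework.

```python
import torch
import torch.nn.functional as F

def gradient_matching_step(model, real_x, real_y, syn_x, syn_y, syn_opt):
    """Nudge the synthetic data so its induced gradient matches the
    real-data gradient on the current model."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Target gradient from real data (treated as a constant).
    g_real = torch.autograd.grad(
        F.cross_entropy(model(real_x), real_y), params)
    g_real = [g.detach() for g in g_real]

    # Synthetic-data gradient, kept differentiable w.r.t. syn_x.
    g_syn = torch.autograd.grad(
        F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)

    match_loss = sum(F.mse_loss(gs, gr, reduction="sum")
                     for gs, gr in zip(g_syn, g_real))
    syn_opt.zero_grad()
    match_loss.backward()
    syn_opt.step()
```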
Making Translators Privacy-aware on the User's Side
We propose PRISM to enable users of machine translation systems to preserve
the privacy of data on their own initiative. There is a growing demand to apply
machine translation systems to data that require privacy protection. While
several machine translation engines claim to prioritize privacy, the extent and
specifics of such protection are largely ambiguous. First, there is often a
lack of clarity on how and to what degree the data is protected. Even if
service providers believe they have sufficient safeguards in place,
sophisticated adversaries might still extract sensitive information. Second,
vulnerabilities may exist outside of these protective measures, such as within
communication channels, potentially leading to data leakage. As a result, users
are hesitant to utilize machine translation engines for data demanding high
levels of privacy protection, thereby missing out on their benefits. PRISM
resolves this problem. Instead of relying on the translation service to keep
data safe, PRISM provides the means to protect data on the user's side. This
approach ensures that even machine translation engines with inadequate privacy
measures can be used securely. For platforms already equipped with privacy
safeguards, PRISM acts as an additional protection layer, further
reinforcing their security. PRISM adds these privacy features without
significantly
compromising translation accuracy. Our experiments demonstrate the
effectiveness of PRISM using real-world translators, T5 and ChatGPT
(GPT-3.5-turbo), and datasets in two languages. PRISM effectively
balances privacy protection with translation accuracy.
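The abstract does not spell out PRISM's mechanism, so the sketch below
illustrates only the general idea of user-side protection that it argues for,
via a deliberately simple, hypothetical placeholder scheme: sensitive spans
are masked locally before the text reaches the engine and restored locally
afterwards. Real placeholders can be mangled by translation, which is part of
why a dedicated method such as PRISM is needed.

```python
import re

def mask_locally(text, patterns):
    """Replace user-defined sensitive spans with placeholders before the
    text leaves the user's machine; the mapping never leaves it."""
    mapping = {}
    for pat in patterns:
        for match in set(re.findall(pat, text)):
            token = f"[ENT{len(mapping)}]"
            mapping[token] = match
            text = text.replace(match, token)
    return text, mapping

def unmask_locally(translated, mapping):
    """Restore the original spans in the returned translation."""
    for token, original in mapping.items():
        translated = translated.replace(token, original)
    return translated

# Usage: masked, mapping = mask_locally(doc, [r"\b\d{3}-\d{4}\b"])
#        restored = unmask_locally(remote_translate(masked), mapping)
```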
MetaRec: Meta-Learning Meets Recommendation Systems
Artificial neural networks (ANNs) have recently received increasing attention as powerful modeling tools to improve the performance of recommendation systems. Meta-learning, on the other hand, is a paradigm that has re-surged in popularity within the broader machine learning community over the past several years. In this thesis, we will explore the intersection of these two domains and work on developing methods for integrating meta-learning to design more accurate and flexible recommendation systems.
In the present work, we propose a meta-learning framework for the design of collaborative filtering methods in recommendation systems, drawing on ideas, models, and solutions from modern approaches in both the meta-learning and recommendation system literature and applying them to recommendation tasks to obtain improved generalization performance.
Our proposed framework, MetaRec, includes and unifies the main state-of-the-art models in recommendation systems, extending them so that they can be flexibly configured and operate efficiently with limited data. We empirically test the architectures created under our MetaRec framework on several recommendation benchmark datasets using a plethora of evaluation metrics, and find that by taking a meta-learning approach to the collaborative filtering problem we observe notable gains in predictive performance.
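The thesis abstract stays high-level, so as an illustration of what
"meta-learning meets recommendation" can look like mechanically, here is a
first-order MAML sketch in PyTorch that treats each user as a task: a clone
of the shared model adapts on the user's support interactions, and the
query-set gradients update the shared weights. This is a generic sketch, not
MetaRec's actual architecture.

```python
import copy
import torch

def fomaml_step(model, tasks, loss_fn, inner_lr=0.01, meta_lr=0.001):
    """First-order MAML over users: adapt a clone on each user's support
    interactions, evaluate on the query set, and apply the averaged
    query gradients to the shared model."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for (x_s, y_s), (x_q, y_q) in tasks:        # one task per user
        clone = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(clone.parameters(), lr=inner_lr)
        inner_opt.zero_grad()
        loss_fn(clone(x_s), y_s).backward()     # inner adaptation
        inner_opt.step()
        clone.zero_grad()
        loss_fn(clone(x_q), y_q).backward()     # query-set gradient
        for mg, p in zip(meta_grads, clone.parameters()):
            mg += p.grad / len(tasks)
    with torch.no_grad():                       # outer (meta) update
        for p, mg in zip(model.parameters(), meta_grads):
            p -= meta_lr * mg
```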
Privacy-preserving Recommender Systems Facilitated By The Machine Learning Approach
Recommender systems, which play a critical role in e-business services, are closely linked to our daily life. For example, companies such as Youtube and Amazon are always trying to secure their profit by estimating personalized user preferences and recommending the most relevant items (e.g., products, news, etc.) to each user from a large number of candidates. State-of-the-art recommender systems are often built on top of collaborative filtering techniques, whose accuracy relies on precisely modeling user-item interactions by analyzing massive user historical data, such as browsing history, purchasing records, locations, and so on. Generally, more data leads to more accurate estimations and more effective commercial strategies; as such, service providers have incentives to collect and use more user data. On the one hand, recommender systems bring more income to service providers and more convenience to users; on the other hand, the user data can be abused, raising immediate privacy risks to the public. Therefore, how to preserve privacy while enjoying recommendation services becomes an increasingly important topic for both the research community and commercial practitioners.
The privacy concerns can be disparate when constructing recommender systems or providing recommendation services under different scenarios. In one scenario, a service provider wishes to protect its data from inference attacks, techniques that aim to infer more information about a database (e.g., whether a particular record is in it or not) by analyzing statistical outputs; in the other, multiple users agree to jointly perform a recommendation task, but none of them is willing to share their private data with any other user. Security primitives, such as homomorphic encryption, secure multiparty computation, and differential privacy, are immediate candidates to address these privacy concerns. A typical approach to building efficient and accurate privacy-preserving solutions is to improve the security primitives and then apply them to existing recommendation algorithms. However, this approach often yields a solution far from satisfactory in practice, as most users have a low tolerance for increased latency or reduced accuracy in recommendation services.
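Among the primitives just mentioned, differential privacy is the easiest to
illustrate compactly. The sketch below releases item popularity counts under
epsilon-differential privacy with the Laplace mechanism; capping each user's
contribution bounds the sensitivity. This is a generic illustration, not one
of the thesis's proposed solutions.

```python
import numpy as np

def dp_item_counts(ratings_by_user, n_items, epsilon=1.0, cap=5, rng=None):
    """epsilon-DP release of item popularity counts via the Laplace
    mechanism: each user contributes at most `cap` items, so the L1
    sensitivity of the count vector is `cap`, and Laplace(cap/epsilon)
    noise per coordinate yields epsilon-differential privacy."""
    rng = rng or np.random.default_rng()
    counts = np.zeros(n_items)
    for items in ratings_by_user:         # the items one user has rated
        for item in list(items)[:cap]:    # cap each user's contribution
            counts[item] += 1
    noise = rng.laplace(0.0, cap / epsilon, size=n_items)
    return counts + noise
```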
The PhD program explores machine-learning-aided approaches to building efficient privacy-preserving solutions for recommender systems. The results of each proposed solution demonstrate that machine learning can be a strong assistant for privacy preservation, rather than only a troublemaker.
Robust Representation Learning for Privacy-Preserving Machine Learning: A Multi-Objective Autoencoder Approach
Several domains increasingly rely on machine learning in their applications.
The resulting heavy dependence on data has led to the emergence of various laws
and regulations around data ethics and privacy and growing awareness of the
need for privacy-preserving machine learning (ppML). Current ppML techniques
utilize methods that are either purely based on cryptography, such as
homomorphic encryption, or that introduce noise into the input, such as
differential privacy. The main criticism of these techniques is that they
are either too slow or they trade off a model's performance for improved
confidentiality. To address this performance reduction, we aim to
leverage robust representation learning as a way of encoding our data while
optimizing the privacy-utility trade-off. Our method centers on training
autoencoders in a multi-objective manner and then concatenating the latent and
learned features from the encoding part as the encoded form of our data. Such a
deep learning-powered encoding can then safely be sent to a third party for
intensive training and hyperparameter tuning. With our proposed framework, we
can share our data and use third party tools without being under the threat of
revealing its original form. We empirically validate our approach in
unimodal and multimodal settings, the latter following a vertical splitting
scheme, and show improved performance over the state of the art.
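A minimal PyTorch sketch of the general recipe, with illustrative layer
sizes and loss weights (the paper's exact objectives and its concatenation
of latent and learned features are not reproduced): an autoencoder is
trained for reconstruction and a downstream task jointly, and the resulting
latent code is the representation one could share with a third party.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiObjectiveAE(nn.Module):
    """Autoencoder trained jointly for reconstruction and a task;
    the latent code z is the shareable encoding of the data."""

    def __init__(self, d_in, d_latent, n_classes):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                 nn.Linear(64, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(),
                                 nn.Linear(64, d_in))
        self.head = nn.Linear(d_latent, n_classes)

    def losses(self, x, y, w_rec=1.0, w_task=1.0):
        z = self.enc(x)
        rec = F.mse_loss(self.dec(z), x)          # keep z informative
        task = F.cross_entropy(self.head(z), y)   # keep z useful downstream
        return w_rec * rec + w_task * task, z
```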
Private Learning with Public Features
We study a class of private learning problems in which the data is a join of
private and public features. This is often the case in private personalization
tasks such as recommendation or ad prediction, in which features related to
individuals are sensitive, while features related to items (the movies or songs
to be recommended, or the ads to be shown to users) are publicly available and
do not require protection. A natural question is whether private algorithms can
achieve higher utility in the presence of public features. We give a positive
answer for multi-encoder models where one of the encoders operates on public
features. We develop new algorithms that take advantage of this separation by
only protecting certain sufficient statistics (instead of adding noise to the
gradient). This method has a guaranteed utility improvement for linear
regression, and importantly, achieves the state of the art on two standard
private recommendation benchmarks, demonstrating the importance of methods that
adapt to the private-public feature separation.
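To see why protecting sufficient statistics can beat gradient noising,
consider the linear-regression case the authors highlight: ordinary least
squares depends on the data only through X^T X and X^T y, so noise can be
added once to those fixed-size matrices instead of to every gradient step.
The sketch below shows the idea generically; the paper's actual mechanism,
sensitivity analysis, and calibration of the noise scale are not reproduced.

```python
import numpy as np

def regression_from_noisy_stats(X, y, sigma=1.0, reg=1e-3, rng=None):
    """Fit ridge regression from perturbed sufficient statistics
    A = X^T X and b = X^T y (noise added once, not per gradient step).
    Rows of X are assumed pre-clipped so the statistics' sensitivity
    is bounded; choosing sigma for a formal DP guarantee is omitted."""
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    A, b = X.T @ X, X.T @ y
    N = rng.normal(0.0, sigma, size=(d, d))
    A_noisy = A + (N + N.T) / 2          # symmetric noise keeps A symmetric
    b_noisy = b + rng.normal(0.0, sigma, size=d)
    return np.linalg.solve(A_noisy + reg * np.eye(d), b_noisy)
```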