RAB: Provable Robustness Against Backdoor Attacks
Recent studies have shown that deep neural networks (DNNs) are vulnerable to
adversarial attacks, including evasion and backdoor (poisoning) attacks. On the
defense side, there have been intensive efforts on improving both empirical and
provable robustness against evasion attacks; however, provable robustness
against backdoor attacks still remains largely unexplored. In this paper, we
focus on certifying the machine learning model robustness against general
threat models, especially backdoor attacks. We first provide a unified
framework via randomized smoothing techniques and show how it can be
instantiated to certify the robustness against both evasion and backdoor
attacks. We then propose the first robust training process, RAB, to smooth the
trained model and certify its robustness against backdoor attacks. We derive
the robustness bound for machine learning models trained with RAB, and prove
that our robustness bound is tight. In addition, we show that it is possible to
train the robust smoothed models efficiently for simple models such as
K-nearest neighbor classifiers, and we propose an exact smooth-training
algorithm which eliminates the need to sample from a noise distribution for
such models. Empirically, we conduct comprehensive experiments for different
machine learning (ML) models such as DNNs, differentially private DNNs, and
K-NN models on MNIST, CIFAR-10 and ImageNet datasets, and provide the first
benchmark for certified robustness against backdoor attacks. In addition, we
evaluate K-NN models on a spambase tabular dataset to demonstrate the
advantages of the proposed exact algorithm. Both the theoretical analysis and
the comprehensive evaluation on diverse ML models and datasets shed light on
further robust learning strategies against general training-time attacks.
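The smoothing idea the abstract describes can be sketched in a few lines. The following is a minimal, illustrative sketch of randomized smoothing applied to the training set: draw noisy copies of the training data, train one model per draw, and aggregate predictions by majority vote. The function train_fn, the Gaussian noise model, and all parameters are assumptions for illustration only, not the paper's actual RAB training procedure or its certification bound.

```python
import numpy as np

def smoothed_predict(X_train, y_train, x_test, train_fn,
                     sigma=0.5, n_models=100, n_classes=10, seed=0):
    """Illustrative training-set randomized smoothing: train one model per
    noisy copy of the training data and aggregate their predictions on
    x_test by majority vote. A backdoor perturbation of the training set
    can only change the smoothed prediction if it flips enough votes, which
    is the quantity a certification bound reasons about."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(n_classes, dtype=int)
    for _ in range(n_models):
        noisy_X = X_train + rng.normal(0.0, sigma, size=X_train.shape)
        model = train_fn(noisy_X, y_train)      # any object with .predict()
        pred = model.predict(x_test.reshape(1, -1))[0]
        votes[int(pred)] += 1
    return int(votes.argmax()), votes
```

For example, train_fn could be `lambda X, y: KNeighborsClassifier().fit(X, y)` from scikit-learn; the certified radius itself depends on the vote margin and the noise level and is not computed in this sketch.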
A Data Quality-Driven View of MLOps
Developing machine learning models can be seen as a process similar to the
one established for traditional software development. A key difference between
the two lies in the strong dependency between the quality of a machine learning
model and the quality of the data used to train or perform evaluations. In this
work, we demonstrate how different aspects of data quality propagate through
various stages of machine learning development. By performing a joint analysis
of the impact of well-known data quality dimensions and the downstream machine
learning process, we show that different components of a typical MLOps pipeline
can be efficiently designed, providing both a technical and theoretical
perspective.
Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions
Machine learning (ML) applications have been thriving recently, largely
attributed to the increasing availability of data. However, inconsistency and
incomplete information are ubiquitous in real-world datasets, and their impact
on ML applications remains elusive. In this paper, we present a formal study of
this impact by extending the notion of Certain Answers for Codd tables, which
has been explored by the database research community for decades, into the
field of machine learning. Specifically, we focus on classification problems
and propose the notion of "Certain Predictions" (CP) -- a test data example can
be certainly predicted (CP'ed) if all possible classifiers trained on top of
all possible worlds induced by the incompleteness of data would yield the same
prediction.
We study two fundamental CP queries: (Q1) a checking query that determines
whether a data example can be CP'ed; and (Q2) a counting query that computes
the number of classifiers that support a particular prediction (i.e., label).
Given that general solutions to CP queries are, not surprisingly, hard without
assumptions about the type of classifier, we further present a case study in the
context of nearest neighbor (NN) classifiers, where efficient solutions to CP
queries can be developed -- we show that it is possible to answer both queries
in linear or polynomial time over exponentially many possible worlds.
We demonstrate one example use case of CP in the important application of
"data cleaning for machine learning (DC for ML)." We show that our proposed
CPClean approach, built on top of CP, can often significantly outperform
existing techniques in terms of classification accuracy with mild manual
cleaning effort.
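To make the semantics of the two CP queries concrete, here is a brute-force sketch for a 1-NN classifier over a toy incomplete dataset. It enumerates every possible world explicitly, which is exponential in the number of missing cells, so it only illustrates what the paper's linear- and polynomial-time algorithms compute efficiently; the data layout and the function name are illustrative assumptions, not the paper's interface.

```python
from itertools import product
import numpy as np

def cp_queries(rows, labels, candidates, x_test):
    """Brute-force illustration of the two CP queries for a 1-NN classifier.
    rows:       list of feature vectors; a cell may be None (missing).
    candidates: dict mapping the (row, column) index of each missing cell
                to its list of possible completions.
    Enumerates every possible world (toy inputs only), predicts with 1-NN
    in each world, and tallies the predictions."""
    missing = [(i, j) for i, r in enumerate(rows)
               for j, v in enumerate(r) if v is None]
    x_test = np.asarray(x_test, dtype=float)
    counts = {}
    for choice in product(*(candidates[m] for m in missing)):
        world = [list(r) for r in rows]
        for (i, j), v in zip(missing, choice):
            world[i][j] = v
        W = np.asarray(world, dtype=float)
        nearest = int(np.argmin(np.linalg.norm(W - x_test, axis=1)))
        pred = labels[nearest]
        counts[pred] = counts.get(pred, 0) + 1
    certainly_predicted = len(counts) == 1   # Q1: same label in every world?
    return certainly_predicted, counts       # Q2: support per label over worlds
```

Each possible world induces one classifier, so counting worlds per label here plays the role of the counting query; avoiding the exponential enumeration for NN classifiers is exactly the gap the proposed algorithms close.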
DataPerf: Benchmarks for Data-Centric AI Development
Machine learning research has long focused on models rather than datasets,
and prominent datasets are used for common ML tasks without regard to the
breadth, difficulty, and faithfulness of the underlying problems. Neglecting
the fundamental importance of data has given rise to inaccuracy, bias, and
fragility in real-world applications, and research is hindered by saturation
across existing dataset benchmarks. In response, we present DataPerf, a
community-led benchmark suite for evaluating ML datasets and data-centric
algorithms. We aim to foster innovation in data-centric AI through competition,
comparability, and reproducibility. We enable the ML community to iterate on
datasets, instead of just architectures, and we provide an open, online
platform with multiple rounds of challenges to support this iterative
development. The first iteration of DataPerf contains five benchmarks covering
a wide spectrum of data-centric techniques, tasks, and modalities in vision,
speech, acquisition, debugging, and diffusion prompting, and we support hosting
new contributed benchmarks from the community. The benchmarks, online
evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry.
Data Systems for Managing and Debugging Machine Learning Workflows
As machine learning continues to become more ubiquitous in various areas of our lives, it will become impossible to imagine software development projects that do not involve some learned component. Consequently, we have an ever-increasing number of people developing ML applications, which drives the need for better development tools and processes. Unfortunately, even though tremendous effort has been spent on building various systems for machine learning, the development experience is still far from what regular software engineers enjoy. This is mainly because current ML tooling is very much focused on solving specific problems and covers only a part of the development workflow. Furthermore, end-to-end integration of these various tools is still quite limited. This very often leaves developers stuck without guidance as they try to make their way through a labyrinth of possible choices that could be made at each step.
This thesis aims to tackle the usability problem of modern machine learning systems. This involves taking a broader view which goes beyond the model training part of the ML workflow and developing a system for managing this workflow. This broader workflow includes the data preparation process, which comes before model training, as well as model management, which comes after. We seek to identify various pitfalls and pain points that developers encounter in these ML workflows. We then zoom into one particular kind of usability pain point -- labor-efficient data debugging. We pinpoint two categories of data errors (missing data and wrong data), and develop two methods for guiding the attention of the developer by helping them choose the instances of data errors that are the most important. We then empirically evaluate those methods in realistic data repair scenarios and demonstrate that they indeed improve the efficiency of the data debugging process, which in turn translates to greater usability. We finish up with some insights which could be applied to design more usable machine learning systems in the future.
Proactively Screening Machine Learning Pipelines with ARGUSEYES
Software systems that learn from data with machine learning (ML) are ubiquitous. ML pipelines in these applications often suffer from a variety of data-related issues, such as data leakage, label errors, or fairness violations, which require reasoning about complex dependencies between their inputs and outputs. These issues are usually only detected in hindsight after deployment, once they have caused harm in production. We demonstrate ArgusEyes, a system which enables data scientists to proactively screen their ML pipelines for data-related issues as part of continuous integration. ArgusEyes instruments, executes, and screens ML pipelines for declaratively specified pipeline issues, and analyzes data artifacts and their provenance to catch potential problems early before deployment to production. We demonstrate our system for three scenarios: detecting mislabeled images in a computer vision pipeline, spotting data leakage in a price prediction pipeline, and addressing fairness violations in a credit scoring pipeline.
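ArgusEyes itself works on declaratively specified issues and pipeline provenance; as a rough picture of the kind of check such screening performs, the sketch below flags overlap between training and test rows as a simple data-leakage signal. The function name and column handling are assumptions for illustration and do not reflect the ArgusEyes API.

```python
import pandas as pd

def flag_train_test_overlap(train_df: pd.DataFrame, test_df: pd.DataFrame,
                            key_columns=None) -> pd.DataFrame:
    """Return test rows that also occur in the training data -- one simple
    leakage signal a pipeline-screening step could raise before a model
    trained on train_df is evaluated on test_df. key_columns optionally
    restricts the comparison to identifying columns; by default all columns
    shared by the two frames are compared."""
    cols = list(key_columns) if key_columns else \
        [c for c in test_df.columns if c in train_df.columns]
    return test_df[cols].merge(train_df[cols].drop_duplicates(),
                               how="inner", on=cols)
```

A non-empty result would be surfaced as a warning in continuous integration rather than discovered after deployment.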
Ease.ml/ci and Ease.ml/meter in Action: Towards Data Management for Statistical Generalization
Developing machine learning (ML) applications is similar to developing traditional software -- it is often an iterative process in which developers navigate within a rich space of requirements, design decisions, implementations, empirical quality, and performance. In traditional software development, software engineering is the field of study which provides principled guidelines for this iterative process. However, as of today, the counterpart of "software engineering for ML" is largely missing: developers of ML applications are left with powerful tools (e.g., TensorFlow and PyTorch) but little guidance regarding the development lifecycle itself. In this paper, we view the management of ML development lifecycles from a data management perspective. We demonstrate two closely related systems, ease.ml/ci and ease.ml/meter, that provide some "principled guidelines" for ML application development: ci is a continuous integration engine for ML models and meter is a "profiler" for controlling overfitting of ML models. Both systems focus on managing the "statistical generalization power" of datasets used for assessing the quality of ML applications, namely, the validation set and the test set. By demonstrating these two systems we hope to spawn further discussions within our community on building this new type of data management system for statistical generalization.
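As a rough illustration of what managing the "statistical generalization power" of a test set can mean in a CI setting, the toy gate below accepts a candidate model only if its measured accuracy gain over the baseline exceeds the required improvement plus the statistical uncertainty implied by the test set size, via a Hoeffding bound. This is an assumed sketch, not the actual ease.ml/ci test-condition language or its adaptive test-set reuse machinery.

```python
import math

def ci_gate(new_correct: int, old_correct: int, n_test: int,
            min_improvement: float = 0.01, confidence: float = 0.95) -> bool:
    """Toy continuous-integration gate: pass only if the measured accuracy
    gain of the candidate model over the baseline exceeds the required
    improvement plus the statistical noise of an n_test-sized test set."""
    delta = 1.0 - confidence
    # Hoeffding: with prob >= 1 - delta, |true_acc - measured_acc| <= eps
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n_test))
    gain = (new_correct - old_correct) / n_test
    return gain >= min_improvement + 2.0 * eps   # uncertainty of both estimates
```

The point of such a gate is that a small test set simply cannot certify a small improvement, which is the data management question both systems revolve around.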
Online Active Model Selection for Pre-trained Classifiers
Given k pre-trained classifiers and a stream of unlabeled data examples, how can we actively decide when to query a label so that we can distinguish the best model from the rest while making a small number of queries? Answering this question has a profound impact on a range of practical scenarios. In this work, we design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams. We establish several theoretical guarantees for our algorithm and extensively demonstrate its effectiveness in our experimental studies.
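The selective-sampling idea can be sketched as follows: query a label only when the candidate models disagree on an incoming example, track mistakes on the queried labels, and output the model with the fewest. The paper's algorithm additionally uses randomized query probabilities and confidence bounds to handle adversarial streams; the loop below, with its hypothetical stream interface, only conveys the core mechanism.

```python
import numpy as np

def select_best_model(models, stream, query_budget):
    """Illustrative selective sampling over a stream of (x, get_label) pairs,
    where get_label() reveals the true label on demand. A label is queried
    only when the pre-trained models disagree, and each model's mistakes on
    the queried examples are counted; the model with the fewest mistakes is
    returned as the estimated best model."""
    mistakes = np.zeros(len(models))
    queries = 0
    for x, get_label in stream:
        preds = [m.predict([x])[0] for m in models]
        if len(set(preds)) > 1 and queries < query_budget:
            y = get_label()
            queries += 1
            mistakes += np.array([p != y for p in preds], dtype=float)
    return int(np.argmin(mistakes))
```

Skipping examples on which all models agree is what keeps the number of label queries small while still separating the best model from the rest.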