Compressive Sensing Using Iterative Hard Thresholding with Low Precision Data Representation: Theory and Applications
Modern scientific instruments produce vast amounts of data, which can
overwhelm the processing ability of computer systems. Lossy compression of data
is an intriguing solution, but comes with its own drawbacks, such as potential
signal loss, and the need for careful optimization of the compression ratio. In
this work, we focus on a setting where this problem is especially acute:
compressive sensing frameworks for interferometry and medical imaging. We ask
the following question: can the precision of the data representation be lowered
for all inputs, with recovery guarantees and practical performance? Our first
contribution is a theoretical analysis of the normalized Iterative Hard
Thresholding (IHT) algorithm when all input data, meaning both the measurement matrix and the observation vector, are quantized aggressively. We present a variant of low-precision normalized IHT that, under mild conditions, can
still provide recovery guarantees. The second contribution is the application
of our quantization framework to radio astronomy and magnetic resonance
imaging. We show that lowering the precision of the data can significantly
accelerate image recovery. We evaluate our approach on telescope data and
samples of brain images using CPU and FPGA implementations achieving up to a 9x
speed-up with negligible loss of recovery quality.
Comment: 19 pages, 5 figures, 1 table, in IEEE Transactions on Signal Processing
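As a rough illustration of the setup this abstract describes, the sketch below runs normalized IHT on a synthetic sparse-recovery problem after uniformly quantizing both the measurement matrix and the observation vector. The quantizer, bit widths, step-size rule, and problem sizes are illustrative stand-ins, not the paper's exact scheme.

```python
import numpy as np

def quantize(x, bits=6):
    # Uniform quantizer: scale to the dynamic range of x, round to 2^(bits-1)-1 levels.
    scale = float(np.max(np.abs(x))) or 1.0
    levels = 2 ** (bits - 1) - 1
    return np.round(x / scale * levels) / levels * scale

def hard_threshold(x, k):
    # Keep the k largest-magnitude entries, zero out the rest.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def normalized_iht(A, y, k, iters=200):
    # Normalized IHT: step size computed from the gradient restricted to
    # its k largest entries (one common variant of the normalization).
    x = np.zeros(A.shape[1])
    best, best_res = x, np.inf
    for _ in range(iters):
        g = A.T @ (y - A @ x)
        S = np.argsort(np.abs(g))[-k:]
        gS = np.zeros_like(g)
        gS[S] = g[S]
        mu = (gS @ gS) / max((A @ gS) @ (A @ gS), 1e-12)
        x = hard_threshold(x + mu * g, k)
        res = np.linalg.norm(y - A @ x)
        if res < best_res:              # track the best iterate seen
            best, best_res = x, res
    return best

rng = np.random.default_rng(0)
n, m, k = 50, 200, 5
A = rng.standard_normal((n, m)) / np.sqrt(n)
x_true = np.zeros(m)
support = rng.choice(m, k, replace=False)
x_true[support] = rng.standard_normal(k)
y = A @ x_true

# Quantize both the measurement matrix and the observations before recovery.
Aq, yq = quantize(A), quantize(y)
x_hat = normalized_iht(Aq, yq, k)
```

Even with both inputs quantized, recovery remains close to the true sparse signal on this toy instance, which is the qualitative behavior the abstract's guarantees formalize.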
A Data Quality-Driven View of MLOps
Developing machine learning models can be seen as a process similar to the
one established for traditional software development. A key difference between
the two lies in the strong dependency between the quality of a machine learning
model and the quality of the data used to train or perform evaluations. In this
work, we demonstrate how different aspects of data quality propagate through
various stages of machine learning development. By performing a joint analysis
of the impact of well-known data quality dimensions and the downstream machine
learning process, we show that different components of a typical MLOps pipeline
can be efficiently designed, providing both a technical and a theoretical perspective.
Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions
Machine learning (ML) applications have been thriving recently, largely
attributed to the increasing availability of data. However, inconsistency and
incomplete information are ubiquitous in real-world datasets, and their impact
on ML applications remains elusive. In this paper, we present a formal study of
this impact by extending the notion of Certain Answers for Codd tables, which
has been explored by the database research community for decades, into the
field of machine learning. Specifically, we focus on classification problems
and propose the notion of "Certain Predictions" (CP) -- a test data example can
be certainly predicted (CP'ed) if all possible classifiers trained on top of
all possible worlds induced by the incompleteness of data would yield the same
prediction.
We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the
context of nearest neighbor (NN) classifiers, where efficient solutions to CP
queries can be developed -- we show that it is possible to answer both queries
in linear or polynomial time over exponentially many possible worlds.
We demonstrate one example use case of CP in the important application of
"data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques in terms of classification accuracy with mild manual cleaning effort.
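The CP definition above can be made concrete with a brute-force sketch: enumerate every possible world induced by the missing cells, run a 1-NN classifier in each, and check whether all worlds yield the same label (Q1) while tallying per-label support (Q2). This exponential enumeration is exactly what the paper's linear/polynomial-time algorithms avoid; the toy dataset and candidate value sets below are hypothetical.

```python
import itertools
import numpy as np

def knn_predict(X, y, x_test):
    # Plain 1-NN by Euclidean distance.
    d = np.linalg.norm(X - x_test, axis=1)
    return y[np.argmin(d)]

def certain_prediction(X, y, x_test, missing, candidates):
    """Brute-force CP check (Q1) and per-label counts (Q2).

    missing:    list of (row, col) positions with unknown values.
    candidates: one list of candidate values per missing cell.
    Enumerates every possible world; the paper's contribution is
    answering these queries without this enumeration.
    """
    counts = {}
    for world in itertools.product(*candidates):
        Xw = X.copy()
        for (r, c), v in zip(missing, world):
            Xw[r, c] = v                      # complete this possible world
        pred = knn_predict(Xw, y, x_test)
        counts[pred] = counts.get(pred, 0) + 1
    return len(counts) == 1, counts

# Tiny incomplete table: one cell of row 2 is unknown (np.nan as placeholder).
X = np.array([[0.0, 0.0], [5.0, 5.0], [np.nan, 0.2]])
y = np.array([0, 1, 0])
is_cp, counts = certain_prediction(
    X, y, x_test=np.array([0.1, 0.0]),
    missing=[(2, 0)], candidates=[[0.0, 0.5, 6.0]])
```

Here every completion of the missing cell leads 1-NN to the same label, so the test example is CP'ed regardless of how the dirty cell is cleaned.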
Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning
Methods for carefully selecting or generating a small set of training data to
learn from, i.e., data pruning, coreset selection, and data distillation, have
been shown to be effective in reducing the ever-increasing cost of training
neural networks. Behind this success are rigorously designed strategies for
identifying informative training examples out of large datasets. However, these
strategies come with additional computational costs associated with subset
selection or data distillation before training begins, and furthermore, many
are shown to even under-perform random sampling in high data compression
regimes. As such, many data pruning, coreset selection, or distillation methods
may not reduce 'time-to-accuracy', which has become a critical efficiency
measure of training deep neural networks over large datasets. In this work, we
revisit a powerful yet overlooked random sampling strategy to address these
challenges and introduce an approach called Repeated Sampling of Random Subsets
(RSRS or RS2), where we randomly sample the subset of training data for each
epoch of model training. We test RS2 against thirty state-of-the-art data
pruning and data distillation methods across four datasets including ImageNet.
Our results demonstrate that RS2 significantly reduces time-to-accuracy
compared to existing techniques. For example, when training on ImageNet in the
high-compression regime (using less than 10% of the dataset each epoch), RS2
yields accuracy improvements up to 29% compared to competing pruning methods
while offering a runtime reduction of 7x. Beyond the above meta-study, we
provide a convergence analysis for RS2 and discuss its generalization
capability. The primary goal of our work is to establish RS2 as a competitive
baseline for future data selection or distillation techniques aimed at
efficient training.
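The method itself is simple enough to sketch in a few lines: draw a fresh random subset of the training indices at every epoch, with no scoring, ranking, or distillation step. The function name and usage below are hypothetical; the yielded index arrays would be fed to any ordinary training loop.

```python
import numpy as np

def rs2_epochs(n_train, epochs, frac, seed=0):
    """Yield a fresh random subset of training indices for each epoch.

    Sampling is without replacement within an epoch, and independent
    across epochs, so the model eventually sees most of the data even
    though each epoch uses only a small fraction of it.
    """
    rng = np.random.default_rng(seed)
    subset_size = max(1, int(frac * n_train))
    for _ in range(epochs):
        yield rng.choice(n_train, size=subset_size, replace=False)

# Hypothetical usage: 5 epochs, each on a random 10% of 1000 examples.
subsets = list(rs2_epochs(n_train=1000, epochs=5, frac=0.1))
```

Because a new subset is drawn per epoch (rather than fixing one pruned subset up front), the per-epoch cost matches aggressive pruning while coverage of the full dataset grows over training.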
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and
meetings prior, in this report we outline the relevance of community engagement
and infrastructure development for the creation of next-generation public
datasets that will advance machine learning science. We chart a path forward as
a collective effort to sustain the creation and maintenance of these datasets
and methods towards positive scientific, societal and business impact.
Comment: This editorial report accompanies the inaugural Data-centric Machine Learning Research (DMLR) Workshop that took place at ICML 2023. https://dmlr.ai
Knowledge Enhanced Machine Learning Pipeline against Diverse Adversarial Attacks
Despite the great successes achieved by deep neural networks (DNNs), recent studies show that they are vulnerable to adversarial examples, which aim to mislead DNNs by adding small adversarial perturbations. Several defenses have been proposed against such attacks, while many of them have been adaptively attacked. In this work, we aim to enhance ML robustness from a different perspective by leveraging domain knowledge: We propose a Knowledge Enhanced Machine Learning Pipeline (KEMLP) to integrate domain knowledge (i.e., logic relationships among different predictions) into a probabilistic graphical model via first-order logic rules. In particular, we develop KEMLP by integrating a diverse set of weak auxiliary models based on their logical relationships to the main DNN model that performs the target task. Theoretically, we provide convergence results and prove that, under mild conditions, the prediction of KEMLP is more robust than that of the main DNN model. Empirically, we take road sign recognition as an example and leverage the relationships between road signs and their shapes and contents as domain knowledge. We show that compared with adversarial training and other baselines, KEMLP achieves higher robustness against physical attacks, Lp-bounded attacks, unforeseen attacks, and natural corruptions under both white-box and black-box settings, while still maintaining high clean accuracy.
ISSN: 2640-349
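A heavily simplified, hypothetical stand-in for the idea: treat each first-order rule (e.g. "a stop sign is octagonal") as a weighted agreement term between a candidate label and a weak auxiliary model's output, then pick the label with the highest combined score. The actual KEMLP performs inference in a probabilistic graphical model with learned weights; every name, weight, and rule below is illustrative.

```python
def kemlp_vote(main_pred, aux_obs, rules, labels, w_main=2.0, w_aux=1.5):
    """Toy log-linear vote combining a main model with weak auxiliary models.

    aux_obs: observed outputs of auxiliary models, e.g. {"shape": "octagon"}.
    rules:   per-label expected attributes encoding first-order-style rules,
             e.g. {"stop": {"shape": "octagon"}} for "stop => octagonal".
    """
    def score(label):
        s = w_main * (label == main_pred)          # main DNN's vote
        for attr, obs in aux_obs.items():
            s += w_aux * (rules[label].get(attr) == obs)  # rule agreement
        return s
    return max(labels, key=score)

# Illustrative rules for two road-sign classes.
rules = {"stop":  {"shape": "octagon",  "text": "STOP"},
         "yield": {"shape": "triangle", "text": ""}}

# An attacked main model says "yield", but the shape and text detectors
# both agree with "stop", so the combined vote flips the prediction back.
pred = kemlp_vote(main_pred="yield",
                  aux_obs={"shape": "octagon", "text": "STOP"},
                  rules=rules, labels=["stop", "yield"])
```

When the auxiliary observations instead agree with the main model, the main prediction is kept, which mirrors the paper's claim that clean accuracy is preserved while robustness improves.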