26 research outputs found
A framework for the metrification of input image quality in deep networks
Deep Neural Networks (DNNs) are critical for real-time imaging applications, including autonomous vehicles. DNNs are often trained and validated with images that originate from a limited number of cameras, each of which has its own hardware and image signal processing (ISP) characteristics. However, in most real-time embedded systems, the input images come from a variety of cameras with different ISP pipelines and often include perturbations due to a variety of scene conditions. Data augmentation methods are commonly exploited to enhance the robustness of such systems. Alternatively, methods are employed to detect input images that are unfamiliar to the trained networks, including out-of-distribution detection. Despite these efforts, DNNs remain systems whose operational boundaries cannot be easily defined. One reason is that, while training and benchmark image datasets include samples with a variety of perturbations, there is a lack of research on metrics of input image quality suitable for DNNs, and on a universal method that relates quality to DNN performance using meaningful quality metrics. This paper addresses this lack of metrification specific to DNN systems and introduces a framework that systematically modifies image quality attributes and relates input image quality to DNN performance.
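A minimal sketch of the kind of systematic quality modification such a framework relies on, assuming Gaussian noise strength as the example quality attribute and a trivial accuracy stub in place of a real DNN (both are illustrative assumptions, not the paper's framework):

```python
import numpy as np

def degrade(images, sigma, rng):
    """Add Gaussian noise of strength sigma to a batch of images in [0, 1]."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def quality_sweep(model_accuracy, images, sigmas, seed=0):
    """Relate one quality attribute to performance: (sigma, accuracy) pairs."""
    rng = np.random.default_rng(seed)
    return [(s, model_accuracy(degrade(images, s, rng))) for s in sigmas]

# Stand-in "model": fraction of pixels still within 0.1 of the clean image.
rng = np.random.default_rng(0)
images = rng.random((16, 8, 8))
clean = images.copy()
def model_accuracy(batch):
    return float(np.mean(np.abs(batch - clean) < 0.1))

curve = quality_sweep(model_accuracy, images, [0.0, 0.1, 0.3])
```

Fitting a curve to such (attribute level, performance) pairs is one way to turn a sweep like this into an operational boundary for a deployed network.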
MixRL: Data Mixing Augmentation for Regression using Reinforcement Learning
Data augmentation is becoming essential for improving regression accuracy in
critical applications including manufacturing and finance. Existing techniques
for data augmentation largely focus on classification tasks and do not readily
apply to regression tasks. In particular, the recent Mixup techniques for
classification rely on the key assumption that linearity holds among training
examples, which is reasonable if the label space is discrete, but has
limitations when the label space is continuous, as in regression. We show that
mixing examples with a large data or label distance can have an increasingly
negative effect on model performance. Hence, we use the stricter assumption
that linearity only holds within certain data or label distances for
regression, where those distances may vary by example. We then propose MixRL, a
data augmentation meta learning framework for regression that learns for each
example how many nearest neighbors it should be mixed with for the best model
performance using a small validation set. MixRL achieves these objectives using
Monte Carlo policy gradient reinforcement learning. Our experiments conducted
both on synthetic and real datasets show that MixRL significantly outperforms
state-of-the-art data augmentation baselines. MixRL can also be integrated with
other classification Mixup techniques for better results. Comment: 15 pages, 9 figures, 7 tables.
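The idea of restricting mixing to nearby examples can be sketched with a fixed number of neighbours; note that MixRL itself *learns* a per-example neighbour count with reinforcement learning, which this simplification deliberately omits:

```python
import numpy as np

def knn_mixup(X, y, k=5, alpha=0.2, seed=0):
    """Mix each example only with one of its k nearest neighbours in feature space."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise distances; exclude self-pairs by setting the diagonal to infinity.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    # One Beta-distributed mixing coefficient and one sampled neighbour per example.
    lam = rng.beta(alpha, alpha, size=n)
    j = neighbours[np.arange(n), rng.integers(0, k, size=n)]
    X_mix = lam[:, None] * X + (1.0 - lam[:, None]) * X[j]
    y_mix = lam * y + (1.0 - lam) * y[j]
    return X_mix, y_mix

# Augment a small regression set whose labels are linear in the features.
X = np.linspace(0.0, 1.0, 20)[:, None]
y = 3.0 * X[:, 0] + 0.5
X_aug, y_aug = knn_mixup(X, y, k=3)
```

Because the labels here are exactly linear in the features, the mixed labels stay consistent with the mixed features; for nonlinear targets, mixing distant pairs is what breaks this consistency.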
OpenDataVal: a Unified Benchmark for Data Valuation
Assessing the quality and impact of individual data points is critical for
improving model performance and mitigating undesirable biases within the
training dataset. Several data valuation algorithms have been proposed to
quantify data quality; however, a systematic and standardized benchmarking
system for data valuation is lacking. In this paper, we introduce
OpenDataVal, an easy-to-use and unified benchmark framework that empowers
researchers and practitioners to apply and compare various data valuation
algorithms. OpenDataVal provides an integrated environment that includes (i) a
diverse collection of image, natural language, and tabular datasets, (ii)
implementations of eleven different state-of-the-art data valuation algorithms,
and (iii) a prediction model API that can import any scikit-learn model.
Furthermore, we propose four downstream machine learning tasks for evaluating
the quality of data values. We perform benchmarking analysis using OpenDataVal,
quantifying and comparing the efficacy of state-of-the-art data valuation
approaches. We find that no single algorithm performs uniformly best across all
tasks, and an appropriate algorithm should be employed for a user's downstream
task. OpenDataVal is publicly available at https://opendataval.github.io with
comprehensive documentation. Furthermore, we provide a leaderboard where
researchers can evaluate the effectiveness of their own data valuation
algorithms. Comment: 25 pages, NeurIPS 2023 Track on Datasets and Benchmarks.
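As a hedged illustration of what a data valuation algorithm computes (this is generic leave-one-out valuation, not OpenDataVal's actual API), the sketch below scores each training point by the change in validation accuracy when that point is removed; a nearest-centroid classifier stands in for a real prediction model:

```python
import numpy as np

def accuracy(train_X, train_y, val_X, val_y):
    """Tiny nearest-centroid classifier used as a stand-in prediction model."""
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(val_X[:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[np.argmin(dists, axis=1)]
    return float(np.mean(preds == val_y))

def leave_one_out_values(train_X, train_y, val_X, val_y):
    """Value of point i = accuracy(full set) - accuracy(set without point i)."""
    base = accuracy(train_X, train_y, val_X, val_y)
    values = []
    for i in range(len(train_X)):
        mask = np.arange(len(train_X)) != i
        values.append(base - accuracy(train_X[mask], train_y[mask], val_X, val_y))
    return np.array(values)
```

A mislabelled or outlying point typically receives a negative value under this scheme, since removing it improves validation accuracy; benchmarking frameworks compare many such scoring rules on shared downstream tasks.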
Boundary-RL: Reinforcement Learning for Weakly-Supervised Prostate Segmentation in TRUS Images
We propose Boundary-RL, a novel weakly supervised segmentation method that
utilises only patch-level labels for training. We envision the segmentation as
a boundary detection problem, rather than a pixel-level classification as in
previous works. This view of segmentation may allow boundary delineation in
challenging scenarios, such as when noise artefacts are present within the
region-of-interest (ROI) boundaries, where traditional pixel-level
classification-based weakly supervised methods may fail to segment the ROI
effectively. Ultrasound images, whose intensity values represent acoustic
impedance differences at boundaries, are of particular interest and may also
benefit from the boundary delineation approach. Our method uses reinforcement
learning to train a controller function to localise boundaries of ROIs using a
reward derived from a pre-trained boundary-presence classifier. The classifier
indicates when an object boundary is encountered within a patch, as the
controller modifies the patch location in a sequential Markov decision process.
The classifier itself is trained using only binary patch-level labels of object
presence, which are the only labels used during training of the entire boundary
delineation framework, and serves as a weak signal to inform the boundary
delineation. The use of a controller function ensures that a sliding window
over the entire image is not necessary. It also reduces possible
false-positive or false-negative cases by minimising the number of patches passed to the
boundary-presence classifier. We evaluate our proposed approach for a
clinically relevant task of prostate gland segmentation on trans-rectal
ultrasound images. We show improved performance compared to other tested weakly
supervised methods that use the same labels, e.g., multiple instance learning. Comment: Accepted to MICCAI Workshop MLMI 2023 (14th International Conference on Machine Learning in Medical Imaging).
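The interaction loop described above can be sketched on a 1-D toy "image": a stand-in boundary-presence classifier (an intensity-jump test, assumed here purely for illustration) rewards the controller as it repositions a patch. A random policy replaces the learned controller for brevity; Boundary-RL trains both the controller (by reinforcement learning) and the classifier (from patch-level labels).

```python
import numpy as np

def boundary_in_patch(signal, pos, width, thresh=0.5):
    """Stand-in boundary-presence classifier: does the patch contain a jump?"""
    patch = signal[pos:pos + width]
    return float(np.abs(np.diff(patch)).max() > thresh)

def controller_episode(signal, width=8, steps=20, seed=0):
    """Sequential MDP: the controller moves the patch left or right and is
    rewarded when the classifier reports a boundary inside the patch."""
    rng = np.random.default_rng(seed)
    pos = int(rng.integers(0, len(signal) - width))
    trajectory = []
    for _ in range(steps):
        action = int(rng.choice([-4, 4]))        # random policy for the sketch
        pos = int(np.clip(pos + action, 0, len(signal) - width))
        reward = boundary_in_patch(signal, pos, width)
        trajectory.append((pos, reward))
        if reward:                               # boundary localised; stop early
            break
    return trajectory

signal = np.concatenate([np.zeros(32), np.ones(32)])  # one boundary at index 32
trajectory = controller_episode(signal)
```

Because the controller only queries the classifier along its own trajectory, far fewer patches are evaluated than an exhaustive sliding window would require.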
DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems
While there have been a number of remarkable breakthroughs in machine
learning (ML), much of the focus has been placed on model development. However,
to truly realize the potential of machine learning in real-world settings,
additional aspects must be considered across the ML pipeline. Data-centric AI
is emerging as a unifying paradigm that could enable such reliable end-to-end
pipelines. However, this remains a nascent area with no standardized framework
to guide practitioners toward the necessary data-centric considerations or to
communicate the design of data-centric ML systems. To address this gap,
we propose DC-Check, an actionable checklist-style framework to elicit
data-centric considerations at different stages of the ML pipeline: Data,
Training, Testing, and Deployment. This data-centric lens on development aims
to promote thoughtfulness and transparency prior to system development.
Additionally, we highlight specific data-centric AI challenges and research
opportunities. DC-Check is aimed at both practitioners and researchers to guide
day-to-day development. As such, to easily engage with and use DC-Check and
associated resources, we provide a DC-Check companion website
(https://www.vanderschaar-lab.com/dc-check/). The website will also serve as an
updated resource as methods and tooling evolve over time. Comment: Main paper: 11 pages; supplementary material and case studies follow.
Boosting Backdoor Attack with A Learnable Poisoning Sample Selection Strategy
Data-poisoning-based backdoor attacks aim to insert a backdoor into models by
manipulating training datasets without controlling the training process of the
target model. Existing attack methods mainly focus on designing triggers or
fusion strategies between triggers and benign samples. However, they often
randomly select samples to be poisoned, disregarding the varying importance of
each poisoning sample in terms of backdoor injection. A recent selection
strategy filters a fixed-size poisoning sample pool by recording forgetting
events, but it fails to consider the remaining samples outside the pool from a
global perspective. Moreover, computing forgetting events requires significant
additional computing resources. Therefore, how to efficiently and effectively
select poisoning samples from the entire dataset is an urgent problem in
backdoor attacks. To address it, we first introduce a poisoning mask into the
regular backdoor training loss. We suppose that a backdoored model trained
with hard poisoning samples has a stronger backdoor effect on easy ones, which
can be implemented by hindering the normal training process (i.e., maximizing
the loss w.r.t. the mask). To further integrate it with the normal training
process, we then propose a learnable poisoning sample selection strategy that
learns the mask together with the model parameters through a min-max
optimization. Specifically,
the outer loop aims to achieve the backdoor attack goal by minimizing the loss
based on the selected samples, while the inner loop selects hard poisoning
samples that impede this goal by maximizing the loss. After several rounds of
adversarial training, we finally select effective poisoning samples with high
contribution. Extensive experiments on benchmark datasets demonstrate the
effectiveness and efficiency of our approach in boosting backdoor attack
performance.
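The min-max alternation can be sketched on a toy logistic-regression "victim": the inner step re-selects the k highest-loss (hardest) poisoning candidates, and the outer step takes a gradient step minimizing the loss on clean data plus the selection. Trigger design and the actual backdoor objective are abstracted away; this is only the selection loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def minmax_select(X_clean, y_clean, X_poison, y_poison, k, rounds=50, lr=0.5):
    """Alternate inner maximization (pick the k hardest poisoning candidates)
    with outer minimization (train on clean data plus the selection)."""
    w = np.zeros(X_clean.shape[1])
    for _ in range(rounds):
        # Inner step: per-candidate loss under the current model; keep the top k.
        p = sigmoid(X_poison @ w)
        losses = -(y_poison * np.log(p + 1e-9)
                   + (1.0 - y_poison) * np.log(1.0 - p + 1e-9))
        sel = np.argsort(losses)[-k:]
        # Outer step: one gradient step on clean + currently selected samples.
        X = np.vstack([X_clean, X_poison[sel]])
        y = np.concatenate([y_clean, y_poison[sel]])
        w -= lr * (X.T @ (sigmoid(X @ w) - y)) / len(y)
    return w, sel
```

After several rounds the selection stabilizes on candidates the model finds persistently hard, mirroring the adversarial-training intuition in the abstract.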
A Cooperative Game Approach
νμλ
Όλ¬Έ (μμ¬) -- μμΈλνκ΅ λνμ : 곡과λν μ°μ
As machine learning thrives in both academia and industry, data plays a salient role in training and validating models. Meanwhile, few works have addressed the economic valuation of data in data exchange markets. The contribution of our work is two-fold. First, we take advantage of semi-values from cooperative game theory to model the revenue distribution problem. Second, we construct a model consisting of a provider, a firm, and a market while considering the privacy and fairness of machine learning. We show that the Banzhaf value can be a reliable alternative to the Shapley value in calculating the contribution of each datum. We also formulate the firm's revenue maximization problem and present a numerical analysis for the case of a binary classifier with classical data examples. Assuming the firm uses only high-quality data, we analyze its behavior in four scenarios that vary the data's fairness and the compensation cost for data providers' privacy. It turns out that the Banzhaf value is more sensitive to the fairness of the data than the Shapley value. We analyze the maximum revenue proportion that the firm gives away to data providers, as well as the range of the number of data points the firm would acquire.
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Problem Description 2
1.3 Organization of the Thesis 3
Chapter 2 Literature Review 4
2.1 Fair Machine Learning 4
2.2 Private Machine Learning 5
2.3 Data Valuation 6
2.3.1 Dataset Price Estimation 6
2.3.2 Equitable Price Estimation 7
Chapter 3 Data Market Model 8
3.1 Basic Assumptions and Model Settings 8
3.2 Firm's Profit Maximizing Problem 10
3.3 Data Valuation 12
3.4 Binary Classification Setting 14
Chapter 4 Analysis 17
4.1 Semi-value Approximation 17
4.1.1 Convergence Analysis 17
4.1.2 Group Data Calculation 20
4.2 Binary Classification 22
4.2.1 Parameter Analysis 22
4.2.2 Scenario Analysis 24
4.2.2.1 Description 24
4.2.2.2 Synthetic Data 25
4.2.2.3 Shapley Value Based Valuation 26
4.2.2.4 Banzhaf Value Based Valuation 28
4.2.2.5 Comparative Analysis 30
4.3 Data Pricing 33
Chapter 5 Conclusion 35
Bibliography 38
Abstract in Korean 43
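The Shapley and Banzhaf semi-values compared in the thesis can be computed exactly for tiny player sets by enumerating coalitions; in data valuation the "players" are training points and v(S) is the performance of a model trained on subset S. The 3-player majority game below (an illustrative characteristic function, not the thesis's model) shows that the two values genuinely differ:

```python
from itertools import combinations
from math import factorial

def shapley(players, value):
    """Exact Shapley values by enumerating all coalitions (tiny n only)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

def banzhaf(players, value):
    """Exact Banzhaf values: average marginal contribution over all coalitions."""
    n = len(players)
    beta = {}
    for i in players:
        others = [p for p in players if p != i]
        marginals = [value(set(S) | {i}) - value(set(S))
                     for r in range(n) for S in combinations(others, r)]
        beta[i] = sum(marginals) / 2 ** (n - 1)
    return beta

# 3-player majority game: a coalition has value 1 once it reaches size 2.
def majority(S):
    return 1.0 if len(S) >= 2 else 0.0

players = ["a", "b", "c"]
sh = shapley(players, majority)  # each player gets 1/3; sums to v(grand coalition)
bz = banzhaf(players, majority)  # each player gets 1/2; does not sum to v(N)
```

The Banzhaf value weights all coalitions equally while the Shapley value weights them by size, which is why the thesis can observe different sensitivity to data fairness; both are exponential to compute exactly, motivating the semi-value approximation analysed in Chapter 4.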
HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks
The behaviors of deep neural networks (DNNs) are notoriously resistant to
human interpretations. In this paper, we propose Hypergradient Data Relevance
Analysis, or HYDRA, which interprets the predictions made by DNNs as effects of
their training data. Existing approaches generally estimate data contributions
around the final model parameters and ignore how the training data shape the
optimization trajectory. By unrolling the hypergradient of test loss w.r.t. the
weights of training data, HYDRA assesses the contribution of training data
toward test data points throughout the training trajectory. In order to
accelerate computation, we remove the Hessian from the calculation and prove
that, under moderate conditions, the approximation error is bounded.
Corroborating this theoretical claim, empirical results indicate the error is
indeed small. In addition, we quantitatively demonstrate that HYDRA outperforms
influence functions in accurately estimating data contribution and detecting
noisy data labels. The source code is available at
https://github.com/cyyever/aaai_hydra_8686
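A toy, Hessian-free influence score in the spirit of the approximation described above (per-step gradient alignment accumulated along the training trajectory, closer in flavour to TracIn than to HYDRA's exact hypergradient unrolling) can be sketched for a linear model trained by gradient descent:

```python
import numpy as np

def hessian_free_influence(X, y, x_test, y_test, epochs=50, lr=0.01):
    """Accumulate, along the training trajectory, the alignment between each
    training example's gradient and the test-loss gradient (squared loss,
    linear model). Positive score: training on the example reduced test loss."""
    n, d = X.shape
    w = np.zeros(d)
    scores = np.zeros(n)
    for _ in range(epochs):
        residuals = X @ w - y
        grads = 2.0 * residuals[:, None] * X           # per-example gradients
        g_test = 2.0 * (x_test @ w - y_test) * x_test  # test-loss gradient
        scores += lr * grads @ g_test                  # Hessian term dropped
        w -= lr * grads.mean(axis=0)                   # full-batch gradient step
    return scores

# y = 2x with one mislabelled point at index 2; it should score lowest.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, -6.0, 8.0, 10.0])
scores = hessian_free_influence(X, y, np.array([3.0]), 6.0)
```

The mislabelled point's gradient consistently opposes the test gradient, so its accumulated score is the most negative, which is exactly the noisy-label-detection behaviour the abstract evaluates.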