26 research outputs found
A framework for the metrification of input image quality in deep networks
Deep Neural Networks (DNNs) are critical for real-time imaging applications, including autonomous vehicles. DNNs are often trained and validated with images that originate from a limited number of cameras, each of which has its own hardware and image signal processing (ISP) characteristics. However, in most real-time embedded systems, the input images come from a variety of cameras with different ISP pipelines and often include perturbations due to a variety of scene conditions. Data augmentation methods are commonly exploited to enhance the robustness of such systems. Alternatively, methods are employed to detect input images that are unfamiliar to the trained networks, including out-of-distribution detection. Despite these efforts, DNNs remain systems whose operational boundaries cannot be easily defined. One reason is that, while training and benchmark image datasets include samples with a variety of perturbations, there is a lack of research on metrics of input image quality suitable for DNNs, and on a universal method that relates quality to DNN performance using meaningful quality metrics. This paper addresses this lack of metrification specific to DNN systems and introduces a framework that systematically modifies image quality attributes and relates input image quality to DNN performance.
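A minimal sketch of the kind of systematic quality modification such a framework relies on, assuming Gaussian noise strength as the example quality attribute and a trivial accuracy stub in place of a real DNN (both are illustrative assumptions, not the paper's framework):

```python
import numpy as np

def degrade(images, sigma, rng):
    """Add Gaussian noise of strength sigma to a batch of images in [0, 1]."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def quality_sweep(model_accuracy, images, sigmas, seed=0):
    """Relate one quality attribute to performance: (sigma, accuracy) pairs."""
    rng = np.random.default_rng(seed)
    return [(s, model_accuracy(degrade(images, s, rng))) for s in sigmas]

# Stand-in "model": fraction of pixels still within 0.1 of the clean image.
rng = np.random.default_rng(0)
images = rng.random((16, 8, 8))
clean = images.copy()
def model_accuracy(batch):
    return float(np.mean(np.abs(batch - clean) < 0.1))

curve = quality_sweep(model_accuracy, images, [0.0, 0.1, 0.3])
```

Fitting a curve to such (attribute level, performance) pairs is one way to turn a sweep like this into an operational boundary for a deployed network.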
MixRL: Data Mixing Augmentation for Regression using Reinforcement Learning
Data augmentation is becoming essential for improving regression accuracy in
critical applications including manufacturing and finance. Existing techniques
for data augmentation largely focus on classification tasks and do not readily
apply to regression tasks. In particular, the recent Mixup techniques for
classification rely on the key assumption that linearity holds among training
examples, which is reasonable if the label space is discrete, but has
limitations when the label space is continuous, as in regression. We show that
mixing examples with a large data or label distance can have an increasingly
negative effect on model performance. Hence, we use the stricter assumption
that linearity only holds within certain data or label distances for
regression, where those distances may vary by example. We then propose MixRL, a
data augmentation meta learning framework for regression that learns for each
example how many nearest neighbors it should be mixed with for the best model
performance using a small validation set. MixRL achieves these objectives using
Monte Carlo policy gradient reinforcement learning. Our experiments conducted
both on synthetic and real datasets show that MixRL significantly outperforms
state-of-the-art data augmentation baselines. MixRL can also be integrated with
other classification Mixup techniques for better results. Comment: 15 pages, 9 figures, 7 tables.
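The idea of restricting mixing to nearby examples can be sketched with a fixed number of neighbours; note that MixRL itself *learns* a per-example neighbour count with reinforcement learning, which this simplification deliberately omits:

```python
import numpy as np

def knn_mixup(X, y, k=5, alpha=0.2, seed=0):
    """Mix each example only with one of its k nearest neighbours in feature space."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise distances; exclude self-pairs by setting the diagonal to infinity.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    # One Beta-distributed mixing coefficient and one sampled neighbour per example.
    lam = rng.beta(alpha, alpha, size=n)
    j = neighbours[np.arange(n), rng.integers(0, k, size=n)]
    X_mix = lam[:, None] * X + (1.0 - lam[:, None]) * X[j]
    y_mix = lam * y + (1.0 - lam) * y[j]
    return X_mix, y_mix

# Augment a small regression set whose labels are linear in the features.
X = np.linspace(0.0, 1.0, 20)[:, None]
y = 3.0 * X[:, 0] + 0.5
X_aug, y_aug = knn_mixup(X, y, k=3)
```

Because the labels here are exactly linear in the features, the mixed labels stay consistent with the mixed features; for nonlinear targets, mixing distant pairs is what breaks this consistency.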
OpenDataVal: a Unified Benchmark for Data Valuation
Assessing the quality and impact of individual data points is critical for
improving model performance and mitigating undesirable biases within the
training dataset. Several data valuation algorithms have been proposed to
quantify data quality; however, a systematic and standardized benchmarking
system for data valuation is lacking. In this paper, we introduce
OpenDataVal, an easy-to-use and unified benchmark framework that empowers
researchers and practitioners to apply and compare various data valuation
algorithms. OpenDataVal provides an integrated environment that includes (i) a
diverse collection of image, natural language, and tabular datasets, (ii)
implementations of eleven different state-of-the-art data valuation algorithms,
and (iii) a prediction model API that can import any scikit-learn model.
Furthermore, we propose four downstream machine learning tasks for evaluating
the quality of data values. We perform benchmarking analysis using OpenDataVal,
quantifying and comparing the efficacy of state-of-the-art data valuation
approaches. We find that no single algorithm performs uniformly best across all
tasks, and an appropriate algorithm should be employed for a user's downstream
task. OpenDataVal is publicly available at https://opendataval.github.io with
comprehensive documentation. Furthermore, we provide a leaderboard where
researchers can evaluate the effectiveness of their own data valuation
algorithms. Comment: 25 pages, NeurIPS 2023 Track on Datasets and Benchmarks.
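As a hedged illustration of what a data valuation algorithm computes (this is generic leave-one-out valuation, not OpenDataVal's actual API), the sketch below scores each training point by the change in validation accuracy when that point is removed; a nearest-centroid classifier stands in for a real prediction model:

```python
import numpy as np

def accuracy(train_X, train_y, val_X, val_y):
    """Tiny nearest-centroid classifier used as a stand-in prediction model."""
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(val_X[:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[np.argmin(dists, axis=1)]
    return float(np.mean(preds == val_y))

def leave_one_out_values(train_X, train_y, val_X, val_y):
    """Value of point i = accuracy(full set) - accuracy(set without point i)."""
    base = accuracy(train_X, train_y, val_X, val_y)
    values = []
    for i in range(len(train_X)):
        mask = np.arange(len(train_X)) != i
        values.append(base - accuracy(train_X[mask], train_y[mask], val_X, val_y))
    return np.array(values)
```

A mislabelled or outlying point typically receives a negative value under this scheme, since removing it improves validation accuracy; benchmarking frameworks compare many such scoring rules on shared downstream tasks.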
Boundary-RL: Reinforcement Learning for Weakly-Supervised Prostate Segmentation in TRUS Images
We propose Boundary-RL, a novel weakly supervised segmentation method that
utilises only patch-level labels for training. We envision the segmentation as
a boundary detection problem, rather than a pixel-level classification as in
previous works. This view of segmentation may allow boundary delineation in
challenging scenarios, such as when noise artefacts are present within the
region-of-interest (ROI) boundaries, where traditional pixel-level
classification-based weakly supervised methods may fail to segment the ROI
effectively. Ultrasound images, whose intensity values represent acoustic
impedance differences at boundaries, are of particular interest and may also
benefit from the boundary delineation approach. Our method uses reinforcement
learning to train a controller function to localise boundaries of ROIs using a
reward derived from a pre-trained boundary-presence classifier. The classifier
indicates when an object boundary is encountered within a patch, as the
controller modifies the patch location in a sequential Markov decision process.
The classifier itself is trained using only binary patch-level labels of object
presence, which are the only labels used during training of the entire boundary
delineation framework, and serves as a weak signal to inform the boundary
delineation. The use of a controller function ensures that a sliding window
over the entire image is not necessary. It also reduces possible
false-positive or false-negative cases by minimising the number of patches passed to the
boundary-presence classifier. We evaluate our proposed approach for a
clinically relevant task of prostate gland segmentation on trans-rectal
ultrasound images. We show improved performance compared to other tested weakly
supervised methods that use the same labels, e.g., multiple instance learning. Comment: Accepted to MICCAI Workshop MLMI 2023 (14th International Conference on Machine Learning in Medical Imaging).
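The interaction loop described above can be sketched on a 1-D toy "image": a stand-in boundary-presence classifier (an intensity-jump test, assumed here purely for illustration) rewards the controller as it repositions a patch. A random policy replaces the learned controller for brevity; Boundary-RL trains both the controller (by reinforcement learning) and the classifier (from patch-level labels).

```python
import numpy as np

def boundary_in_patch(signal, pos, width, thresh=0.5):
    """Stand-in boundary-presence classifier: does the patch contain a jump?"""
    patch = signal[pos:pos + width]
    return float(np.abs(np.diff(patch)).max() > thresh)

def controller_episode(signal, width=8, steps=20, seed=0):
    """Sequential MDP: the controller moves the patch left or right and is
    rewarded when the classifier reports a boundary inside the patch."""
    rng = np.random.default_rng(seed)
    pos = int(rng.integers(0, len(signal) - width))
    trajectory = []
    for _ in range(steps):
        action = int(rng.choice([-4, 4]))        # random policy for the sketch
        pos = int(np.clip(pos + action, 0, len(signal) - width))
        reward = boundary_in_patch(signal, pos, width)
        trajectory.append((pos, reward))
        if reward:                               # boundary localised; stop early
            break
    return trajectory

signal = np.concatenate([np.zeros(32), np.ones(32)])  # one boundary at index 32
trajectory = controller_episode(signal)
```

Because the controller only queries the classifier along its own trajectory, far fewer patches are evaluated than an exhaustive sliding window would require.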
DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems
While there have been a number of remarkable breakthroughs in machine
learning (ML), much of the focus has been placed on model development. However,
to truly realize the potential of machine learning in real-world settings,
additional aspects must be considered across the ML pipeline. Data-centric AI
is emerging as a unifying paradigm that could enable such reliable end-to-end
pipelines. However, this remains a nascent area with no standardized framework
to guide practitioners toward the necessary data-centric considerations or to
communicate the design of data-centric ML systems. To address this gap,
we propose DC-Check, an actionable checklist-style framework to elicit
data-centric considerations at different stages of the ML pipeline: Data,
Training, Testing, and Deployment. This data-centric lens on development aims
to promote thoughtfulness and transparency prior to system development.
Additionally, we highlight specific data-centric AI challenges and research
opportunities. DC-Check is aimed at both practitioners and researchers to guide
day-to-day development. As such, to easily engage with and use DC-Check and
associated resources, we provide a DC-Check companion website
(https://www.vanderschaar-lab.com/dc-check/). The website will also serve as an
updated resource as methods and tooling evolve over time. Comment: Main paper: 11 pages; supplementary material and case studies follow.
Boosting Backdoor Attack with A Learnable Poisoning Sample Selection Strategy
Data-poisoning-based backdoor attacks aim to insert a backdoor into models by
manipulating training datasets without controlling the training process of the
target model. Existing attack methods mainly focus on designing triggers or
fusion strategies between triggers and benign samples. However, they often
randomly select samples to be poisoned, disregarding the varying importance of
each poisoning sample in terms of backdoor injection. A recent selection
strategy filters a fixed-size poisoning sample pool by recording forgetting
events, but it fails to consider the remaining samples outside the pool from a
global perspective. Moreover, computing forgetting events requires significant
additional computing resources. Therefore, how to efficiently and effectively
select poisoning samples from the entire dataset is an urgent problem in
backdoor attacks. To address it, we first introduce a poisoning mask into the
regular backdoor training loss. We suppose that a backdoored model trained
with hard poisoning samples has a stronger backdoor effect on easy ones, which
can be implemented by hindering the normal training process (i.e., maximizing
the loss w.r.t. the mask). To further integrate it with the normal training
process, we then propose a learnable poisoning sample selection strategy that
learns the mask together with the model parameters through a min-max
optimization. Specifically,
the outer loop aims to achieve the backdoor attack goal by minimizing the loss
based on the selected samples, while the inner loop selects hard poisoning
samples that impede this goal by maximizing the loss. After several rounds of
adversarial training, we finally select effective poisoning samples with high
contribution. Extensive experiments on benchmark datasets demonstrate the
effectiveness and efficiency of our approach in boosting backdoor attack
performance.
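The min-max alternation can be sketched on a toy logistic-regression "victim": the inner step re-selects the k highest-loss (hardest) poisoning candidates, and the outer step takes a gradient step minimizing the loss on clean data plus the selection. Trigger design and the actual backdoor objective are abstracted away; this is only the selection loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def minmax_select(X_clean, y_clean, X_poison, y_poison, k, rounds=50, lr=0.5):
    """Alternate inner maximization (pick the k hardest poisoning candidates)
    with outer minimization (train on clean data plus the selection)."""
    w = np.zeros(X_clean.shape[1])
    for _ in range(rounds):
        # Inner step: per-candidate loss under the current model; keep the top k.
        p = sigmoid(X_poison @ w)
        losses = -(y_poison * np.log(p + 1e-9)
                   + (1.0 - y_poison) * np.log(1.0 - p + 1e-9))
        sel = np.argsort(losses)[-k:]
        # Outer step: one gradient step on clean + currently selected samples.
        X = np.vstack([X_clean, X_poison[sel]])
        y = np.concatenate([y_clean, y_poison[sel]])
        w -= lr * (X.T @ (sigmoid(X @ w) - y)) / len(y)
    return w, sel
```

After several rounds the selection stabilizes on candidates the model finds persistently hard, mirroring the adversarial-training intuition in the abstract.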
A Cooperative Game Approach
νμλ
Όλ¬Έ (μμ¬) -- μμΈλνκ΅ λνμ : 곡과λν μ°μ
As machine learning thrives in both academia and industry, data plays a salient role in training and validating models. Meanwhile, few works have addressed the economic valuation of data in data exchange markets. The contribution of our work is two-fold. First, we take advantage of semi-values from cooperative game theory to model the revenue distribution problem. Second, we construct a model consisting of a provider, a firm, and a market while considering the privacy and fairness of machine learning. We show that the Banzhaf value can be a reliable alternative to the Shapley value in calculating the contribution of each datum. We also formulate the firm's revenue maximization problem and present a numerical analysis for the case of a binary classifier with classical data examples. Assuming the firm uses only high-quality data, we analyze its behavior in four scenarios that vary the data's fairness and the compensation cost for data providers' privacy. It turns out that the Banzhaf value is more sensitive to the fairness of the data than the Shapley value. We analyze the maximum revenue proportion that the firm gives away to data providers, as well as the range of the number of data points the firm would acquire.
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Problem Description 2
1.3 Organization of the Thesis 3
Chapter 2 Literature Review 4
2.1 Fair Machine Learning 4
2.2 Private Machine Learning 5
2.3 Data Valuation 6
2.3.1 Dataset Price Estimation 6
2.3.2 Equitable Price Estimation 7
Chapter 3 Data Market Model 8
3.1 Basic Assumptions and Model Settings 8
3.2 Firm's Profit Maximizing Problem 10
3.3 Data Valuation 12
3.4 Binary Classification Setting 14
Chapter 4 Analysis 17
4.1 Semi-value Approximation 17
4.1.1 Convergence Analysis 17
4.1.2 Group Data Calculation 20
4.2 Binary Classification 22
4.2.1 Parameter Analysis 22
4.2.2 Scenario Analysis 24
4.2.2.1 Description 24
4.2.2.2 Synthetic Data 25
4.2.2.3 Shapley Value Based Valuation 26
4.2.2.4 Banzhaf Value Based Valuation 28
4.2.2.5 Comparative Analysis 30
4.3 Data Pricing 33
Chapter 5 Conclusion 35
Bibliography 38
Abstract in Korean 43
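The Shapley and Banzhaf semi-values compared in the thesis can be computed exactly for tiny player sets by enumerating coalitions; in data valuation the "players" are training points and v(S) is the performance of a model trained on subset S. The 3-player majority game below (an illustrative characteristic function, not the thesis's model) shows that the two values genuinely differ:

```python
from itertools import combinations
from math import factorial

def shapley(players, value):
    """Exact Shapley values by enumerating all coalitions (tiny n only)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

def banzhaf(players, value):
    """Exact Banzhaf values: average marginal contribution over all coalitions."""
    n = len(players)
    beta = {}
    for i in players:
        others = [p for p in players if p != i]
        marginals = [value(set(S) | {i}) - value(set(S))
                     for r in range(n) for S in combinations(others, r)]
        beta[i] = sum(marginals) / 2 ** (n - 1)
    return beta

# 3-player majority game: a coalition has value 1 once it reaches size 2.
def majority(S):
    return 1.0 if len(S) >= 2 else 0.0

players = ["a", "b", "c"]
sh = shapley(players, majority)  # each player gets 1/3; sums to v(grand coalition)
bz = banzhaf(players, majority)  # each player gets 1/2; does not sum to v(N)
```

The Banzhaf value weights all coalitions equally while the Shapley value weights them by size, which is why the thesis can observe different sensitivity to data fairness; both are exponential to compute exactly, motivating the semi-value approximation analysed in Chapter 4.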
HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks
The behaviors of deep neural networks (DNNs) are notoriously resistant to
human interpretations. In this paper, we propose Hypergradient Data Relevance
Analysis, or HYDRA, which interprets the predictions made by DNNs as effects of
their training data. Existing approaches generally estimate data contributions
around the final model parameters and ignore how the training data shape the
optimization trajectory. By unrolling the hypergradient of test loss w.r.t. the
weights of training data, HYDRA assesses the contribution of training data
toward test data points throughout the training trajectory. In order to
accelerate computation, we remove the Hessian from the calculation and prove
that, under moderate conditions, the approximation error is bounded.
Corroborating this theoretical claim, empirical results indicate the error is
indeed small. In addition, we quantitatively demonstrate that HYDRA outperforms
influence functions in accurately estimating data contribution and detecting
noisy data labels. The source code is available at
https://github.com/cyyever/aaai_hydra_8686
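A toy, Hessian-free influence score in the spirit of the approximation described above (per-step gradient alignment accumulated along the training trajectory, closer in flavour to TracIn than to HYDRA's exact hypergradient unrolling) can be sketched for a linear model trained by gradient descent:

```python
import numpy as np

def hessian_free_influence(X, y, x_test, y_test, epochs=50, lr=0.01):
    """Accumulate, along the training trajectory, the alignment between each
    training example's gradient and the test-loss gradient (squared loss,
    linear model). Positive score: training on the example reduced test loss."""
    n, d = X.shape
    w = np.zeros(d)
    scores = np.zeros(n)
    for _ in range(epochs):
        residuals = X @ w - y
        grads = 2.0 * residuals[:, None] * X           # per-example gradients
        g_test = 2.0 * (x_test @ w - y_test) * x_test  # test-loss gradient
        scores += lr * grads @ g_test                  # Hessian term dropped
        w -= lr * grads.mean(axis=0)                   # full-batch gradient step
    return scores

# y = 2x with one mislabelled point at index 2; it should score lowest.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, -6.0, 8.0, 10.0])
scores = hessian_free_influence(X, y, np.array([3.0]), 6.0)
```

The mislabelled point's gradient consistently opposes the test gradient, so its accumulated score is the most negative, which is exactly the noisy-label-detection behaviour the abstract evaluates.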