26 research outputs found

    A framework for the metrification of input image quality in deep networks

    Deep Neural Networks (DNNs) are critical for real-time imaging applications, including autonomous vehicles. DNNs are often trained and validated with images that originate from a limited number of cameras, each of which has its own hardware and image signal processing (ISP) characteristics. However, in most real-time embedded systems, the input images come from a variety of cameras with different ISP pipelines and often include perturbations due to a variety of scene conditions. Data augmentation methods are commonly exploited to enhance the robustness of such systems. Alternatively, methods are employed to detect input images that are unfamiliar to the trained networks, including out-of-distribution detection. Despite these efforts, DNNs remain systems whose operational boundaries cannot be easily defined. One reason is that, while training and benchmark image datasets include samples with a variety of perturbations, there is a lack of research on the metrification of input image quality suited to DNNs and on a universal method for relating quality to DNN performance through meaningful quality metrics. This paper addresses this lack of metrification specific to DNN systems and introduces a framework that systematically modifies image quality attributes and relates input image quality to DNN performance.
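The framework's core loop, relating a systematically varied quality attribute to task performance, can be sketched on a toy problem. Everything below is an illustrative assumption, not the paper's method: a 1-D two-class dataset, additive Gaussian noise as the single quality attribute, and a midpoint-threshold classifier standing in for a DNN.

```python
import random

random.seed(0)

def make_data(n=500):
    # Hypothetical 1-D two-class data: class 0 near 0.0, class 1 near 2.0.
    xs, ys = [], []
    for _ in range(n):
        label = random.randint(0, 1)
        xs.append(random.gauss(2.0 * label, 0.5))
        ys.append(label)
    return xs, ys

def accuracy(xs, ys):
    # Stand-in "DNN": threshold at the midpoint of the two class means.
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    thr = (m0 + m1) / 2.0
    return sum((x > thr) == (y == 1) for x, y in zip(xs, ys)) / len(ys)

xs, ys = make_data()
results = {}
for sigma in [0.0, 0.5, 1.0, 2.0]:     # systematic quality degradation
    noisy = [x + random.gauss(0.0, sigma) for x in xs]
    results[sigma] = accuracy(noisy, ys)
    print(f"noise sigma={sigma}: accuracy={results[sigma]:.3f}")
```

The same sweep generalises to any measurable ISP attribute (blur, compression, exposure) by swapping the perturbation function.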

    MixRL: Data Mixing Augmentation for Regression using Reinforcement Learning

    Data augmentation is becoming essential for improving regression accuracy in critical applications including manufacturing and finance. Existing techniques for data augmentation largely focus on classification tasks and do not readily apply to regression tasks. In particular, the recent Mixup techniques for classification rely on the key assumption that linearity holds among training examples, which is reasonable if the label space is discrete, but has limitations when the label space is continuous, as in regression. We show that mixing examples with a large data or label distance may have an increasingly negative effect on model performance. Hence, we use the stricter assumption that linearity only holds within certain data or label distances for regression, where the degree may vary by example. We then propose MixRL, a data augmentation meta-learning framework for regression that learns, for each example, how many nearest neighbors it should be mixed with for the best model performance, using a small validation set. MixRL achieves these objectives using Monte Carlo policy gradient reinforcement learning. Our experiments on both synthetic and real datasets show that MixRL significantly outperforms state-of-the-art data augmentation baselines. MixRL can also be integrated with other classification Mixup techniques for better results. Comment: 15 pages, 9 figures, 7 tables
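The restriction MixRL learns can be sketched without the RL component: mix each example only with one of its nearest neighbors in label distance, with a fixed k in place of the learned, per-example one. The names and the fixed k=3 are our own illustrative choices, not the authors' code.

```python
import random

random.seed(0)

def knn_mixup(data, k=3, alpha=0.2):
    """Mixup restricted to label-space neighbors. data: list of (x, y) pairs."""
    augmented = []
    for i, (xi, yi) in enumerate(data):
        # neighbors by label distance, excluding the example itself
        neighbors = sorted(
            (j for j in range(len(data)) if j != i),
            key=lambda j: abs(data[j][1] - yi),
        )[:k]
        j = random.choice(neighbors)
        lam = random.betavariate(alpha, alpha)   # standard Mixup coefficient
        xj, yj = data[j]
        augmented.append((lam * xi + (1 - lam) * xj,
                          lam * yi + (1 - lam) * yj))
    return augmented

data = [(float(i), 2.0 * i) for i in range(10)]   # exactly linear: y = 2x
aug = knn_mixup(data)
print(aug[:3])
```

Because the toy data is globally linear, every mixed pair stays on the regression line; MixRL's point is that real label spaces are only *locally* linear, which is exactly what the neighbor restriction (and the learned k) protects.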

    OpenDataVal: a Unified Benchmark for Data Valuation

    Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, a systematic and standardized benchmark for data valuation has been lacking. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model in scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, and that an appropriate algorithm should be chosen for a user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. Furthermore, we provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms. Comment: 25 pages, NeurIPS 2023 Track on Datasets and Benchmarks
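Among the algorithms such benchmarks cover, leave-one-out (LOO) valuation is the simplest to state: a point's value is the drop in validation utility when it is removed from the training set. The sketch below is our own toy version (a 1-nearest-neighbour "model" on hand-made 1-D data), not OpenDataVal's actual API.

```python
def utility(train, val):
    # Hypothetical model: 1-nearest-neighbour classifier; utility = accuracy.
    correct = 0
    for xv, yv in val:
        pred = min(train, key=lambda p: abs(p[0] - xv))[1]
        correct += pred == yv
    return correct / len(val)

def loo_values(train, val):
    base = utility(train, val)
    # value of point i = utility drop when i is left out
    return [base - utility(train[:i] + train[i + 1:], val)
            for i in range(len(train))]

# Made-up data; the last training point is deliberately mislabeled.
train = [(0.0, 0), (0.1, 0), (0.2, 0), (1.0, 1), (1.1, 1), (0.15, 1)]
val = [(0.14, 0), (0.05, 0), (1.05, 1), (1.15, 1)]
vals = loo_values(train, val)
print(vals)
```

The mislabeled point receives a negative value (removing it improves validation accuracy), which is exactly the signal the benchmark's noisy-label-detection task probes.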

    Boundary-RL: Reinforcement Learning for Weakly-Supervised Prostate Segmentation in TRUS Images

    We propose Boundary-RL, a novel weakly supervised segmentation method that utilises only patch-level labels for training. We envision segmentation as a boundary detection problem, rather than pixel-level classification as in previous works. This outlook on segmentation may allow for boundary delineation under challenging scenarios, such as when noise artefacts are present within the region-of-interest (ROI) boundaries, where traditional pixel-level classification-based weakly supervised methods may not be able to effectively segment the ROI. Ultrasound images in particular, where intensity values represent acoustic impedance differences between boundaries, may benefit from the boundary delineation approach. Our method uses reinforcement learning to train a controller function to localise boundaries of ROIs using a reward derived from a pre-trained boundary-presence classifier. The classifier indicates when an object boundary is encountered within a patch, as the controller modifies the patch location in a sequential Markov decision process. The classifier itself is trained using only binary patch-level labels of object presence, which are the only labels used during training of the entire boundary delineation framework, and serves as a weak signal to inform the boundary delineation. The use of a controller function ensures that a sliding window over the entire image is not necessary. It also prevents possible false-positive or -negative cases by minimising the number of patches passed to the boundary-presence classifier. We evaluate our proposed approach on the clinically relevant task of prostate gland segmentation in trans-rectal ultrasound images. We show improved performance compared to other tested weakly supervised methods using the same labels, e.g., multiple instance learning. Comment: Accepted to MICCAI Workshop MLMI 2023 (14th International Conference on Machine Learning in Medical Imaging)
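A heavily simplified, self-contained sketch of the controller-plus-reward idea (our own construction, not the authors' code): a hand-coded patch rule plays the role of the pre-trained boundary-presence classifier, and a greedy epsilon-controller on a 1-D "image" stands in for the learned RL policy.

```python
import random

random.seed(0)

image = [0] * 40 + [1] * 60          # object occupies positions 40..99
PATCH = 8

def boundary_present(pos):
    # Weak reward signal: the patch contains a boundary iff it covers
    # a mix of background (0) and object (1) values.
    patch = image[pos:pos + PATCH]
    return 1.0 if 0 < sum(patch) < PATCH else 0.0

pos = 0
for _ in range(100):                  # sequential Markov decision process
    # epsilon-greedy stand-in for the controller: mostly step right
    action = random.choice([-1, +1]) if random.random() < 0.2 else +1
    nxt = max(0, min(len(image) - PATCH, pos + action))
    # accept the move only if the reward does not decrease
    if boundary_present(nxt) >= boundary_present(pos):
        pos = nxt
print(pos)
```

Once the patch reaches the boundary region, the reward makes that region absorbing: any move that would lose the boundary is rejected, so no sliding window over the whole image is needed.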

    DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems

    While there have been a number of remarkable breakthroughs in machine learning (ML), much of the focus has been placed on model development. However, to truly realize the potential of machine learning in real-world settings, additional aspects must be considered across the ML pipeline. Data-centric AI is emerging as a unifying paradigm that could enable such reliable end-to-end pipelines. However, this remains a nascent area with no standardized framework to guide practitioners through the necessary data-centric considerations or to communicate the design of data-centric ML systems. To address this gap, we propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations at different stages of the ML pipeline: Data, Training, Testing, and Deployment. This data-centric lens on development aims to promote thoughtfulness and transparency prior to system development. Additionally, we highlight specific data-centric AI challenges and research opportunities. DC-Check is aimed at both practitioners and researchers to guide day-to-day development. To make it easy to engage with and use DC-Check and the associated resources, we provide a DC-Check companion website (https://www.vanderschaar-lab.com/dc-check/). The website will also serve as an updated resource as methods and tooling evolve over time. Comment: Main paper: 11 pages; supplementary & case studies follow

    Boosting Backdoor Attack with A Learnable Poisoning Sample Selection Strategy

    Data-poisoning-based backdoor attacks aim to insert a backdoor into a model by manipulating the training dataset, without controlling the training process of the target model. Existing attack methods mainly focus on designing triggers or fusion strategies between triggers and benign samples. However, they often select the samples to be poisoned at random, disregarding the varying importance of each poisoning sample for backdoor injection. A recent selection strategy filters a fixed-size poisoning sample pool by recording forgetting events, but it fails to consider the remaining samples outside the pool from a global perspective. Moreover, computing forgetting events requires significant additional computing resources. Therefore, how to efficiently and effectively select poisoning samples from the entire dataset is an urgent problem in backdoor attacks. To address it, we first introduce a poisoning mask into the regular backdoor training loss. We suppose that a backdoored model trained with hard poisoning samples has a stronger backdoor effect on easy ones, which can be implemented by hindering the normal training process (i.e., maximizing the loss w.r.t. the mask). To further integrate this with the normal training process, we then propose a learnable poisoning sample selection strategy that learns the mask together with the model parameters through a min-max optimization. Specifically, the outer loop aims to achieve the backdoor attack goal by minimizing the loss based on the selected samples, while the inner loop selects hard poisoning samples that impede this goal by maximizing the loss. After several rounds of adversarial training, we finally select effective poisoning samples with high contribution. Extensive experiments on benchmark datasets demonstrate the effectiveness and efficiency of our approach in boosting backdoor attack performance.
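The min-max structure of the selection strategy can be illustrated in isolation, with none of the backdoor machinery: the inner step selects the k samples with maximal loss under the current model, and the outer step takes a gradient step only on them. This is our own toy 1-D least-squares sketch of the alternating optimization pattern; all numbers are made up.

```python
# Made-up data: y roughly 2x, except the last sample, which is "hard".
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 30.0]
w, k, lr = 0.0, 2, 0.01

for _ in range(200):
    losses = [(ys[i] - w * xs[i]) ** 2 for i in range(len(xs))]
    # inner step: select the k samples that currently maximise the loss
    sel = sorted(range(len(xs)), key=lambda i: -losses[i])[:k]
    # outer step: one gradient step of the squared loss on selected samples only
    grad = sum(-2 * xs[i] * (ys[i] - w * xs[i]) for i in sel) / k
    w -= lr * grad
print(round(w, 2), sel)
```

The alternation settles on a model fit to the hardest samples: the fit is pulled away from the consensus slope of 2 toward the outlier, which is the "hard samples dominate training" effect the selection strategy exploits.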

    A Cooperative Game Approach

    Master's thesis (M.S.) -- Seoul National University, Graduate School, Department of Industrial Engineering, College of Engineering, February 2021. Advisor: Deok-Joo Lee.
    As machine learning thrives in both academia and industry, data plays a salient role in training and validating machines. Meanwhile, few works have addressed the economic evaluation of data in the data exchange market. The contribution of our work is two-fold. First, we take advantage of semi-values from cooperative game theory to model the revenue distribution problem. Second, we construct a model consisting of provider, firm, and market while considering the privacy and fairness of machine learning. We show that the Banzhaf value can be a reliable alternative to the Shapley value in calculating the contribution of each datum. We also formulate the firm's revenue maximization problem and present a numerical analysis for the case of a binary classifier with classical data examples. Assuming the firm only uses high-quality data, we analyze its behavior in four different scenarios that vary the data's fairness and the compensation cost for data providers' privacy. It turns out that the Banzhaf value is more sensitive to the fairness of the data than the Shapley value. We analyze the maximum revenue proportion that the firm gives away to data providers, as well as the range of the amount of data the firm would acquire. We found that the firm guarantees a larger share of revenue to data providers the fairer the data is, and that as fixed costs decrease, the firm shares more revenue with data providers through variable costs.

    HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks

    The behaviors of deep neural networks (DNNs) are notoriously resistant to human interpretation. In this paper, we propose Hypergradient Data Relevance Analysis, or HYDRA, which interprets the predictions made by DNNs as effects of their training data. Existing approaches generally estimate data contributions around the final model parameters and ignore how the training data shape the optimization trajectory. By unrolling the hypergradient of the test loss w.r.t. the weights of the training data, HYDRA assesses the contribution of training data toward test data points throughout the training trajectory. To accelerate computation, we remove the Hessian from the calculation and prove that, under moderate conditions, the approximation error is bounded. Corroborating this theoretical claim, empirical results indicate the error is indeed small. In addition, we quantitatively demonstrate that HYDRA outperforms influence functions in accurately estimating data contributions and detecting noisy data labels. The source code is available at https://github.com/cyyever/aaai_hydra_8686
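The "contribution throughout the training trajectory" idea can be sketched to first order with a TracIn-style score (summing, over SGD steps, the product of a training example's gradient and the test gradient), standing in here for HYDRA's exact unrolled hypergradient. Toy 1-D linear regression; all data is made up.

```python
# Training data: two points consistent with y = 2x, one conflicting point.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, -9.0)]
test = (4.0, 8.0)
w, lr = 0.0, 0.02
scores = [0.0] * len(train)

for _ in range(100):
    # test-loss gradient at the current parameters (checkpoint of the epoch)
    gt = -2 * test[0] * (test[1] - w * test[0])
    for i, (x, y) in enumerate(train):
        gi = -2 * x * (y - w * x)      # this example's training gradient
        scores[i] += lr * gi * gt      # first-order credit for this step
        w -= lr * gi / len(train)      # plain SGD update on the example

print([round(s, 1) for s in scores])
```

A positive score means the example's updates pushed the test loss down along the trajectory; the conflicting third point accumulates a negative score, the same signal HYDRA uses to flag noisy labels.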