2,141 research outputs found

    Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation

    Full text link
    Black-box risk scoring models permeate our lives, yet are typically proprietary or opaque. We propose Distill-and-Compare, a model distillation and comparison approach to audit such models. To gain insight into black-box models, we treat them as teachers, training transparent student models to mimic the risk scores assigned by black-box models. We compare the student model trained with distillation to a second un-distilled transparent model trained on ground-truth outcomes, and use differences between the two models to gain insight into the black-box model. Our approach can be applied in a realistic setting, without probing the black-box model API. We demonstrate the approach on four public data sets: COMPAS, Stop-and-Frisk, Chicago Police, and Lending Club. We also propose a statistical test to determine if a data set is missing key features used to train the black-box model. Our test finds that the ProPublica data is likely missing key feature(s) used in COMPAS.Comment: Camera-ready version for AAAI/ACM AIES 2018. Data and pseudocode at https://github.com/shftan/auditblackbox. Previously titled "Detecting Bias in Black-Box Models Using Transparent Model Distillation". A short version was presented at NIPS 2017 Symposium on Interpretable Machine Learnin

    Intrinsic Evaluation of Grammatical Information within Word Embeddings

    Get PDF

    Defense against Universal Adversarial Perturbations

    Full text link
    Recent advances in Deep Learning show the existence of image-agnostic quasi-imperceptible perturbations that when applied to `any' image can fool a state-of-the-art network classifier to change its prediction about the image label. These `Universal Adversarial Perturbations' pose a serious threat to the success of Deep Learning in practice. We present the first dedicated framework to effectively defend the networks against such perturbations. Our approach learns a Perturbation Rectifying Network (PRN) as `pre-input' layers to a targeted model, such that the targeted model needs no modification. The PRN is learned from real and synthetic image-agnostic perturbations, where an efficient method to compute the latter is also proposed. A perturbation detector is separately trained on the Discrete Cosine Transform of the input-output difference of the PRN. A query image is first passed through the PRN and verified by the detector. If a perturbation is detected, the output of the PRN is used for label prediction instead of the actual image. A rigorous evaluation shows that our framework can defend the network classifiers against unseen adversarial perturbations in the real-world scenarios with up to 97.5% success rate. The PRN also generalizes well in the sense that training for one targeted network defends another network with a comparable success rate.Comment: Accepted in IEEE CVPR 201
    • …
    corecore