Towards Interpretability and Robustness of Machine Learning Models

Abstract

Modern machine learning models can be difficult to probe and understand after they have been trained. This is a major problem for the field, with consequences for trustworthiness, diagnostics, debugging, robustness, and a range of other engineering and human-interaction issues surrounding the deployment of a model. Another problem with modern machine learning models is their vulnerability to small adversarial perturbations of the input, which poses a security risk when these models are applied in critical areas. In this thesis, we develop systematic and efficient tools for interpreting machine learning models and evaluating their adversarial robustness.

Part I focuses on model interpretation. We derive an efficient feature-scoring method by exploiting the graph structure in data. We also develop a learning-based method under an information-based framework. To leverage prior knowledge about what constitutes a satisfactory interpretation in a given domain, we propose a systematic approach that exploits syntactic constituency structure, using a parse tree to interpret models on linguistic data.

Part II focuses on the evaluation of adversarial robustness. We first propose a probabilistic framework for generating adversarial examples on discrete data and develop two algorithms to implement it. We also introduce a novel attack method for the setting in which the attacker has access to model decisions alone, and we investigate the robustness of various machine learning models and existing defense mechanisms under this attack.

In Part III, we build a connection between the two fields by developing a method for detecting adversarial examples using tools from model interpretation.
