A Generic Approach for Reproducible Model Distillation
Model distillation has been a popular method for producing interpretable
machine learning. It uses an interpretable "student" model to mimic the
predictions made by the black box "teacher" model. However, when the student
model is sensitive to the variability of the data sets used for training even
when keeping the teacher fixed, the corresponding interpretation is not
reliable. Existing strategies stabilize model distillation by checking whether
a large enough corpus of pseudo-data is generated to reliably reproduce student
models, but methods to do so have so far been developed for a specific student
model. In this paper, we develop a generic approach for stable model
distillation based on the central limit theorem for the average loss. We start with
a collection of candidate student models and search for candidates that
reasonably agree with the teacher. Then we construct a multiple testing
framework to select a corpus size such that a consistent student model would
be selected under different pseudo samples. We demonstrate the application of
our proposed approach on three commonly used intelligible models: decision
trees, falling rule lists and symbolic regression. Finally, we conduct
simulation experiments on the Mammographic Mass and Breast Cancer datasets and
illustrate the testing procedure through a theoretical analysis with a Markov
process. The code is publicly available at
https://github.com/yunzhe-zhou/GenericDistillation.
Comment: 31 pages, 8 figures
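The corpus-size idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes a toy one-dimensional teacher, threshold "stump" students, 0-1 loss, and a Bonferroni-corrected one-sided z-test as a simplified stand-in for the multiple testing framework. All function names and parameters here are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def teacher(x):
    # Hypothetical black-box teacher: predicts 1 when x exceeds 0.5.
    return 1 if x > 0.5 else 0

def make_stump(threshold):
    # Hypothetical interpretable student: a one-split decision stump.
    return lambda x: 1 if x > threshold else 0

def select_corpus_size(candidates, n_start=50, n_max=100_000, alpha=0.05, seed=0):
    """Double the pseudo-data size until the best candidate is clearly separated.

    Assumes at least two candidates. Returns (n, best_index), or (None, None)
    if no size up to n_max passes the test.
    """
    rng = random.Random(seed)
    # Bonferroni-corrected one-sided critical value (an assumption of this sketch).
    z_crit = NormalDist().inv_cdf(1 - alpha / (len(candidates) - 1))
    n = n_start
    while n <= n_max:
        xs = [rng.random() for _ in range(n)]  # fresh pseudo-sample
        # losses[j][i] is the 0-1 loss of candidate j against the teacher on sample i.
        losses = [[float(c(x) != teacher(x)) for x in xs] for c in candidates]
        avg = [mean(l) for l in losses]
        best = min(range(len(candidates)), key=avg.__getitem__)
        stable = True
        for j in range(len(candidates)):
            if j == best:
                continue
            # Paired loss differences; by the CLT their mean is roughly normal.
            diffs = [a - b for a, b in zip(losses[j], losses[best])]
            sd = stdev(diffs)
            if sd == 0:
                # Constant differences: separated only if the gap is positive.
                if mean(diffs) <= 0:
                    stable = False
                continue
            z = mean(diffs) / (sd / math.sqrt(n))
            if z < z_crit:
                stable = False
        if stable:
            return n, best
        n *= 2
    return None, None

if __name__ == "__main__":
    students = [make_stump(0.5), make_stump(0.3), make_stump(0.45)]
    print(select_corpus_size(students))
```

The closer a competing student is to the best one, the larger the pseudo-sample needed before the test separates them, which is the intuition behind choosing the corpus size by testing rather than fixing it in advance.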
Mining Explicit and Implicit Relationships in Data Using Symbolic Regression
Identification of implicit and explicit relations within observed data is a generic problem commonly encountered in several domains including science, engineering, finance, and more. It forms the core component of data analytics, a process of discovering useful information from data sets that are potentially huge and otherwise incomprehensible. In industry, such information is often instrumental for profitable decision making, whereas in science and engineering it is used to build empirical models, propose new or verify existing theories and explain natural phenomena. In recent times, digital and internet-based technologies have proliferated, making it viable to generate and collect large amounts of data at low cost. This in turn has resulted in an ever-growing need for methods to analyse and draw interpretations from such data quickly and reliably. With this overarching goal, this thesis attempts to make contributions towards developing accurate and efficient methods for discovering such relations through evolutionary search, a method commonly referred to as Symbolic Regression (SR).
A data set of input variables x and a corresponding observed response y is given. The aim is to find an explicit function y = f(x) or an implicit function f(x, y) = 0 that represents the data set. While seemingly simple, the problem is challenging for several reasons. Some conventional regression methods "guess" a functional form, such as linear, quadratic or polynomial, and fit the data to that equation, which may limit the possibility of discovering more complex relations, if they exist. On the other hand, there are meta-modelling techniques such as the response surface method, Kriging, etc., that model the given data accurately, but provide a "black-box" predictor instead of an expression. Such approximations convey little or no insight into how the variables and responses depend on each other, or into their relative contribution to the output. SR attempts to bridge these two extremes by providing a structure which evolves mathematical expressions instead of assuming them. Thus, it is flexible enough to represent the data, but at the same time provides useful insights instead of a black-box predictor. SR can be categorized as part of Explainable Artificial Intelligence and can contribute to Trustworthy Artificial Intelligence.
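To make the idea of evolving an expression rather than assuming its form concrete, here is a minimal sketch for the explicit case y = f(x). It is a toy hill-climbing search with subtree mutation over small expression trees, far simpler than Genetic Programming or Evolutionary Feature Synthesis, and it does not reproduce any operator from the thesis. The primitive set, the function names and the target data are assumptions made for illustration.

```python
import random

# Primitive set (an assumption of this sketch): three binary operators,
# the input variable "x", and two constants.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
TERMINALS = ["x", 1.0, 2.0]

def random_tree(rng, depth):
    # A tree is either a terminal or a tuple (operator, left, right).
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(list(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def mutate(rng, tree, depth=2):
    # Replace one randomly chosen subtree with a fresh random subtree.
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, depth)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(rng, left, depth), right)
    return (op, left, mutate(rng, right, depth))

def mse(tree, data):
    total = 0.0
    for x, y in data:
        d = evaluate(tree, x) - y
        total += d * d  # multiplication avoids OverflowError from float ** 2
    return total / len(data)

def to_string(tree):
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({to_string(left)} {op} {to_string(right)})"
    return str(tree)

def symbolic_regression(data, generations=200, offspring=20, seed=0):
    rng = random.Random(seed)
    best = random_tree(rng, 3)
    best_err = mse(best, data)
    for _ in range(generations):
        for _ in range(offspring):
            child = mutate(rng, best)
            err = mse(child, data)
            # Accept strict improvements only, so trees do not grow for free.
            if err < best_err:
                best, best_err = child, err
    return best, best_err

if __name__ == "__main__":
    # Hypothetical observations generated from y = x^2 + 1.
    data = [(x / 2.0, (x / 2.0) ** 2 + 1.0) for x in range(-4, 5)]
    tree, err = symbolic_regression(data)
    print(to_string(tree), err)
```

Unlike a black-box surrogate, the result is a readable expression whose structure can be inspected directly; whether the search recovers the generating formula depends on the primitives, the budget and the random seed.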
The work proposed in this thesis aims to integrate the concept of "semantics" deeper into Genetic Programming (GP) and Evolutionary Feature Synthesis, the two algorithms usually employed for conducting SR. The semantics will be integrated into well-known components of the algorithms such as compactness, diversity, recombination, constant optimization, etc. The main contribution of this thesis is the proposal of two novel operators that generate expressions based on Linear Programming and Mixed Integer Programming, with the aim of controlling the length of the discovered expressions without compromising accuracy. In the experiments, these operators are shown to discover expressions with better accuracy and interpretability on many explicit and implicit benchmarks. Moreover, some applications of SR to real-world data sets are shown to demonstrate the practicality of the proposed approaches. In addition, in relation to practical problems, the thesis presents how GP can be applied to effectively solve Resource Constrained Scheduling Problems.