Differentiable Genetic Programming for High-dimensional Symbolic Regression
Symbolic regression (SR) is the process of discovering hidden relationships
from data with mathematical expressions, which is considered an effective way
to reach interpretable machine learning (ML). Genetic programming (GP) has
long been the dominant approach to solving SR problems. However, as the scale
of SR problems increases, GP often performs poorly and cannot effectively
address real-world high-dimensional problems. This limitation stems mainly
from the stochastic evolutionary manner in which traditional GP constructs its trees. In
this paper, we propose a differentiable approach named DGP to construct GP
trees towards high-dimensional SR for the first time. Specifically, a new data
structure called differentiable symbolic tree is proposed to relax the discrete
structure into a continuous one, so that a gradient-based optimizer can be
applied for efficient optimization. In addition, a sampling method is proposed to
eliminate the discrepancy caused by the above relaxation for valid symbolic
expressions. Furthermore, a diversification mechanism is introduced to help
the optimizer escape local optima and reach globally better solutions. With
these designs, the proposed DGP method can efficiently search for
higher-performing GP trees and is thus capable of dealing with high-dimensional
SR. To demonstrate the effectiveness of DGP, we conducted various experiments
against state-of-the-art methods based on both GP and deep neural networks. The
experimental results reveal that DGP outperforms the chosen peer competitors
on high-dimensional regression benchmarks with dimensions varying from tens to
thousands. In addition, on the synthetic SR problems, the proposed DGP method
can also achieve the best recovery rate even under different noise levels. We
believe this work can help establish SR as a powerful route to
interpretable ML for a broader range of real-world problems.
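The relaxation-plus-sampling idea can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the operator set, the class name `RelaxedNode`, and the single-node scope are all our own assumptions. Each node holds a continuous weight vector over candidate operators; a softmax mixes all operators into a differentiable output, and sampling recovers a valid discrete operator choice.

```python
import math
import random

# Illustrative operator set (an assumption, not the paper's vocabulary).
OPERATORS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
}

def softmax(weights):
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    s = sum(exps)
    return [e / s for e in exps]

class RelaxedNode:
    """One tree node with a continuous weight per candidate operator."""

    def __init__(self):
        self.weights = [random.gauss(0.0, 0.1) for _ in OPERATORS]

    def expected_output(self, a, b):
        # Continuous relaxation: a probability-weighted mix of all
        # operators, differentiable with respect to the weights.
        probs = softmax(self.weights)
        ops = list(OPERATORS.values())
        return sum(p * op(a, b) for p, op in zip(probs, ops))

    def sample_operator(self):
        # Discretisation step: sample one operator so the node emits a
        # valid symbolic expression rather than a blended one.
        probs = softmax(self.weights)
        name = random.choices(list(OPERATORS), weights=probs, k=1)[0]
        return name, OPERATORS[name]

random.seed(0)
node = RelaxedNode()
print(node.expected_output(2.0, 3.0))  # lies between min and max operator outputs
name, op = node.sample_operator()
print(name, op(2.0, 3.0))
```

In the paper's setting, gradients flow through the relaxed (mixed) output to update the weights, while the sampling step removes the discrepancy between the relaxed tree and a valid discrete expression.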
Mining Feature Relationships in Data
When faced with a new dataset, most practitioners begin by performing
exploratory data analysis to discover interesting patterns and characteristics
within data. Techniques such as association rule mining are commonly applied to
uncover relationships between features (attributes) of the data. However,
association rules are primarily designed for use on binary or categorical data,
due to their use of rule-based machine learning. A large proportion of
real-world data is continuous in nature, and discretisation of such data leads
to inaccurate and less informative association rules. In this paper, we propose
an alternative approach called feature relationship mining (FRM), which uses a
genetic programming approach to automatically discover symbolic relationships
between continuous or categorical features in data. To the best of our
knowledge, our proposed approach is the first such symbolic approach with the
goal of explicitly discovering relationships between features. Empirical
testing on a variety of real-world datasets shows the proposed method is able
to find high-quality, simple feature relationships which can be easily
interpreted and which provide clear and non-trivial insight into data.
Comment: 16 pages, accepted in EuroGP '2
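To make the idea of scoring a candidate feature relationship concrete, here is a hedged sketch. The paper's actual fitness function is not reproduced here; as one plausible stand-in, a candidate expression f(features) can be scored by the absolute Pearson correlation between its output and a target feature across all instances. All names (`relationship_fitness`, the toy features `w`, `h`, `area`) are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def relationship_fitness(expr, rows, target):
    # expr: callable mapping a row (dict of feature values) to a number.
    outputs = [expr(row) for row in rows]
    targets = [row[target] for row in rows]
    return abs(pearson(outputs, targets))

# Toy data where area = width * height holds exactly.
rows = [{"w": w, "h": h, "area": w * h} for w in (1, 2, 3) for h in (2, 4)]
fit = relationship_fitness(lambda r: r["w"] * r["h"], rows, "area")
print(round(fit, 3))  # → 1.0, a perfect relationship
```

A GP search over expression trees would then evolve `expr` candidates, favouring simple, high-fitness relationships.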
Gradient Information and Regularization for Gene Expression Programming to Develop Data-Driven Physics Closure Models
Learning accurate numerical constants when developing algebraic models is a
known challenge for evolutionary algorithms, such as Gene Expression
Programming (GEP). This paper introduces the concept of adaptive symbols to the
GEP framework by Weatheritt and Sandberg (2016) to develop advanced physics
closure models. Adaptive symbols utilize gradient information to learn locally
optimal numerical constants during model training, for which we investigate two
types of nonlinear optimization algorithms. The second contribution of this
work is implementing two regularization techniques to incentivize the
development of implementable and interpretable closure models. We apply
regularization to ensure small magnitude numerical constants and devise a novel
complexity metric that supports the development of low complexity models via
custom symbol complexities and multi-objective optimization. This extended
framework is applied to four use cases, namely rediscovering Sutherland's
viscosity law, developing laminar flame speed combustion models, and training
two types of fluid dynamics turbulence models. Model prediction accuracy and
training convergence speed are improved significantly for the more and the
less complex use cases, respectively. The two regularization
methods are essential for developing implementable closure models and we
demonstrate that the developed turbulence models substantially improve
simulations over state-of-the-art models.
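The core of the adaptive-symbol idea can be sketched in a minimal form, assuming a fixed candidate structure. This is not the paper's GEP implementation: here "adaptive symbols" are simply the constants (a, b) of a fixed expression a*x + b, tuned by plain gradient descent on the mean squared error, whereas the paper investigates two nonlinear optimization algorithms inside an evolving GEP framework.

```python
def fit_constants(xs, ys, steps=2000, lr=0.01):
    """Gradient-based tuning of the constants of a fixed model a*x + b."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Analytic gradient of the mean squared error w.r.t. a and b.
        ga = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
a, b = fit_constants(xs, ys)
print(round(a, 2), round(b, 2))  # → 2.0 1.0
```

The point of doing this inside an evolutionary loop is that the search over expression structures no longer has to also discover numerical constants by random mutation; each candidate structure gets locally optimal constants before it is scored.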
Priors for symbolic regression
When choosing between competing symbolic models for a data set, a human will
naturally prefer the "simpler" expression or the one which more closely
resembles equations previously seen in a similar context. This suggests a
non-uniform prior on functions, which is, however, rarely considered within a
symbolic regression (SR) framework. In this paper we develop methods to
incorporate detailed prior information on both functions and their parameters
into SR. Our prior on the structure of a function is based on an n-gram
language model, which is sensitive to the arrangement of operators relative to
one another in addition to the frequency of occurrence of each operator. We
also develop a formalism based on the Fractional Bayes Factor to treat
numerical parameter priors in such a way that models may be fairly compared
through the Bayesian evidence, and we explicitly compare Bayesian, Minimum
Description Length and heuristic methods for model selection. We demonstrate
the performance of our priors relative to literature standards on benchmarks
and a real-world dataset from the field of cosmology.
Comment: 8+2 pages, 2 figures. Submitted to The Genetic and Evolutionary
Computation Conference (GECCO) 2023 Workshop on Symbolic Regression
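The n-gram structure prior can be illustrated with a simplified bigram version over flattened operator sequences; the paper's prior operates on operators in expression trees, and the toy corpus below merely stands in for "equations previously seen in a similar context". All names and the smoothing choice (Laplace) are our own assumptions.

```python
import math
from collections import Counter

# Toy corpus of operator sequences from previously seen equations.
corpus = [
    ["add", "mul", "pow"],
    ["add", "mul", "mul"],
    ["mul", "pow", "add"],
]

def bigram_log_prior(seq, corpus, alpha=1.0):
    """Log-prior of an operator sequence under a smoothed bigram model."""
    vocab = {op for s in corpus for op in s}
    unigrams = Counter(op for s in corpus for op in s)
    bigrams = Counter(
        (s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1)
    )
    logp = 0.0
    for prev, cur in zip(seq, seq[1:]):
        # Additive (Laplace) smoothing keeps unseen bigrams at nonzero mass.
        num = bigrams[(prev, cur)] + alpha
        den = unigrams[prev] + alpha * len(vocab)
        logp += math.log(num / den)
    return logp

# A sequence whose operator arrangement resembles the corpus scores higher
# than one with arrangements never seen before.
print(bigram_log_prior(["add", "mul", "pow"], corpus) >
      bigram_log_prior(["pow", "pow", "pow"], corpus))  # → True
```

This captures the abstract's point that such a prior is sensitive to how operators are arranged relative to one another, not just to each operator's overall frequency.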
Unconstrained Learning Machines
With the use of information technology in industries, a new need has arisen to analyze large-scale data sets and to automate data analysis that was once performed by human intuition and simple analog processing machines. The new generation of computer programs now has to outperform its predecessors in detecting complex and non-trivial patterns buried in data warehouses. Improved Machine Learning (ML) techniques such as Neural Networks (NNs) and Support Vector Machines (SVMs) have shown remarkable performance on supervised learning problems over the past couple of decades (e.g. anomaly detection, classification and identification, interpolation and extrapolation, etc.). Nevertheless, many such techniques have ill-conditioned structures which lack adaptability for processing exotic data or very large amounts of data. Some techniques cannot even process data in an on-line fashion. Furthermore, as the processing power of computers increases, there is a pressing need for ML algorithms to perform supervised learning tasks in less time than previously required over even larger sets of data, which means that the time and memory complexities of these algorithms must be improved.

The aim of this research is to construct an improved type of SVM-like algorithm for tasks such as nonlinear classification and interpolation that is more scalable, error-tolerant and accurate. Additionally, this family of algorithms must be able to compute solutions within a controlled time, preferably small with respect to modern computational technologies. These new algorithms should also be versatile enough to have useful applications in engineering, meteorology or quality control.

This dissertation introduces a family of SVM-based algorithms named Unconstrained Learning Machines (ULMs) which attempt to solve the robustness, scalability and timing issues of traditional supervised learning algorithms. ULMs are not based on geometrical analogies (e.g. SVMs) or on the replication of biological models (e.g. NNs). Their construction is strictly based on statistical considerations taken from the recently developed statistical learning theory. Like SVMs, ULMs use kernel methods extensively in order to process exotic and/or non-numerical objects stored in databases and to search for hidden patterns in data with tailored measures of similarity.

ULMs are applied to a variety of problems in manufacturing engineering and in meteorology. The robust nonlinear nonparametric interpolation abilities of ULMs allow for the representation of sub-millimetric deformations on the surface of manufactured parts, the selection of conforming objects, and the diagnosis and modeling of manufacturing processes. ULMs also play a role in assimilating the system states of computational weather models, removing the intrinsic noise without any knowledge of the underlying mathematical models and helping to establish more accurate forecasts.
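The abstract does not specify ULMs beyond their extensive use of kernel methods with tailored similarity measures, so as a generic illustration only, the sketch below shows how a kernel turns pairwise similarities into a nonlinear, nonparametric interpolator. It uses Nadaraya-Watson kernel regression with a Gaussian (RBF) kernel; this is a stand-in technique, not the ULM algorithm itself.

```python
import math

def rbf(x, y, gamma=10.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    return math.exp(-gamma * (x - y) ** 2)

def kernel_predict(x, train_x, train_y, gamma=10.0):
    # Nadaraya-Watson: a similarity-weighted average of training targets.
    weights = [rbf(x, xi, gamma) for xi in train_x]
    total = sum(weights)
    return sum(w * yi for w, yi in zip(weights, train_y)) / total

train_x = [0.0, 0.5, 1.0, 1.5, 2.0]
train_y = [math.sin(x) for x in train_x]
pred = kernel_predict(0.75, train_x, train_y)
print(round(pred, 2))  # close to sin(0.75) ≈ 0.68
```

Swapping `rbf` for a kernel tailored to non-numerical objects (strings, graphs, manufactured-part scans) is what lets kernel methods of this family handle exotic data, which is the property the dissertation emphasizes.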