953 research outputs found

    Where are we now? A large benchmark study of recent symbolic regression methods

    Full text link
    In this paper we provide a broad benchmarking of recent genetic programming approaches to symbolic regression in the context of state of the art machine learning approaches. We use a set of nearly 100 regression benchmark problems culled from open source repositories across the web. We conduct a rigorous benchmarking of four recent symbolic regression approaches as well as nine machine learning approaches from scikit-learn. The results suggest that symbolic regression performs strongly compared to state-of-the-art gradient boosting algorithms, although in terms of running times is among the slowest of the available methodologies. We discuss the results in detail and point to future research directions that may allow symbolic regression to gain wider adoption in the machine learning community.Comment: 8 pages, 4 figures. GECCO 201

    Differentiable Genetic Programming for High-dimensional Symbolic Regression

    Full text link
    Symbolic regression (SR) is the process of discovering hidden relationships from data with mathematical expressions, which is considered an effective way to reach interpretable machine learning (ML). Genetic programming (GP) has been the dominator in solving SR problems. However, as the scale of SR problems increases, GP often poorly demonstrates and cannot effectively address the real-world high-dimensional problems. This limitation is mainly caused by the stochastic evolutionary nature of traditional GP in constructing the trees. In this paper, we propose a differentiable approach named DGP to construct GP trees towards high-dimensional SR for the first time. Specifically, a new data structure called differentiable symbolic tree is proposed to relax the discrete structure to be continuous, thus a gradient-based optimizer can be presented for the efficient optimization. In addition, a sampling method is proposed to eliminate the discrepancy caused by the above relaxation for valid symbolic expressions. Furthermore, a diversification mechanism is introduced to promote the optimizer escaping from local optima for globally better solutions. With these designs, the proposed DGP method can efficiently search for the GP trees with higher performance, thus being capable of dealing with high-dimensional SR. To demonstrate the effectiveness of DGP, we conducted various experiments against the state of the arts based on both GP and deep neural networks. The experiment results reveal that DGP can outperform these chosen peer competitors on high-dimensional regression benchmarks with dimensions varying from tens to thousands. In addition, on the synthetic SR problems, the proposed DGP method can also achieve the best recovery rate even with different noisy levels. It is believed this work can facilitate SR being a powerful alternative to interpretable ML for a broader range of real-world problems

    Mining Explicit and Implicit Relationships in Data Using Symbolic Regression

    Full text link
    Identification of implicit and explicit relations within observed data is a generic problem commonly encountered in several domains including science, engineering, finance, and more. It forms the core component of data analytics, a process of discovering useful information from data sets that are potentially huge and otherwise incomprehensible. In industries, such information is often instrumental for profitable decision making, whereas in science and engineering it is used to build empirical models, propose new or verify existing theories and explain natural phenomena. In recent times, digital and internet based technologies have proliferated, making it viable to generate and collect large amount of data at low cost. This inturn has resulted in an ever growing need for methods to analyse and draw interpretations from such data quickly and reliably. With this overarching goal, this thesis attempts to make contributions towards developing accurate and efficient methods for discovering such relations through evolutionary search, a method commonly referred to as Symbolic Regression (SR). A data set of input variables x and a corresponding observed response y is given. The aim is to find an explicit function y = f (x) or an implicit function f (x, y) = 0, which represents the data set. While seemingly simple, the problem is challenging for several reasons. Some of the conventional regression methods try to “guess” a functional form such as linear/quadratic/polynomial, and attempt to do a curve-fitting of the data to the equation, which may limit the possibility of discovering more complex relations, if they exist. On the other hand, there are meta-modelling techniques such as response surface method, Kriging, etc., that model the given data accurately, but provide a “black-box” predictor instead of an expression. Such approximations convey little or no insights about how the variables and responses are dependent on each other, or their relative contribution to the output. SR attempts to alleviate the above two extremes by providing a structure which evolves mathematical expressions instead of assuming them. Thus, it is flexible enough to represent the data, but at the same time provides useful insights instead of a black-box predictor. SR can be categorized as part of Explainable Artificial Intelligence and can contribute to Trustworthy Artificial Intelligence. The works proposed in this thesis aims to integrate the concept of “semantics” deeper into Genetic Programming (GP) and Evolutionary Feature Synthesis, which are the two algorithms usually employed for conducting SR. The semantics will be integrated into well-known components of the algorithms such as compactness, diversity, recombination, constant optimization, etc. The main contribution of this thesis is the proposal of two novel operators to generate expressions based on Linear Programming and Mixed Integer Programming with the aim of controlling the length of the discovered expressions without compromising on the accuracy. In the experiments, these operators are proven to be able to discover expressions with better accuracy and interpretability on many explicit and implicit benchmarks. Moreover, some applications of SR on real-world data sets are shown to demonstrate the practicality of the proposed approaches. Besides, in related to practical problems, how GP can be applied to effectively solve the Resource Constrained Scheduling Problems is also presented

    Nonlinear Dynamic System Identification and Model Predictive Control Using Genetic Programming

    Get PDF
    During the last century, a lot of developments have been made in research of complex nonlinear process control. As a powerful control methodology, model predictive control (MPC) has been extensively applied to chemical industrial applications. Core to MPC is a predictive model of the dynamics of the system being controlled. Most practical systems exhibit complex nonlinear dynamics, which imposes big challenges in system modelling. Being able to automatically evolve both model structure and numeric parameters, Genetic Programming (GP) shows great potential in identifying nonlinear dynamic systems. This thesis is devoted to GP based system identification and model-based control of nonlinear systems. To improve the generalization ability of GP models, a series of experiments that use semantic-based local search within a multiobjective GP framework are reported. The influence of various ways of selecting target subtrees for local search as well as different methods for performing that search were investigated; a comparison with the Random Desired Operator (RDO) of Pawlak et al. was made by statistical hypothesis testing. Compared with the corresponding baseline GP algorithms, models produced by a standard steady state or generational GP followed by a carefully-designed single-objective GP implementing semantic-based local search are statistically more accurate and with smaller (or equal) tree size, compared with the RDO-based GP algorithms. Considering the practical application, how to correctly and efficiently apply an evolved GP model to other larger systems is a critical research concern. Currently, the replication of GP models is normally done by repeating other’s work given the necessary algorithm parameters. However, due to the empirical and stochastic nature of GP, it is difficult to completely reproduce research findings. An XML-based standard file format, named Genetic Programming Markup Language (GPML), is proposed for the interchange of GP trees. A formal definition of this standard and details of implementation are described. GPML provides convenience and modularity for further applications based on GP models. The large-scale adoption of MPC in buildings is not economically viable due to the time and cost involved in designing and adjusting predictive models by expert control engineers. A GP-based control framework is proposed for automatically evolving dynamic nonlinear models for the MPC of buildings. An open-loop system identification was conducted using the data generated by a building simulator, and the obtained GP model was then employed to construct the predictive model for the MPC. The experimental result shows GP is able to produce models that allow the MPC of building to achieve the desired temperature band in a single zone space

    Taylor Genetic Programming for Symbolic Regression

    Get PDF
    Genetic programming (GP) is a commonly used approach to solve symbolic regression (SR) problems. Compared with the machine learning or deep learning methods that depend on the pre-defined model and the training dataset for solving SR problems, GP is more focused on finding the solution in a search space. Although GP has good performance on large-scale benchmarks, it randomly transforms individuals to search results without taking advantage of the characteristics of the dataset. So, the search process of GP is usually slow, and the final results could be unstable. To guide GP by these characteristics, we propose a new method for SR, called Taylor genetic programming (TaylorGP). TaylorGP leverages a Taylor polynomial to approximate the symbolic equation that fits the dataset. It also utilizes the Taylor polynomial to extract the features of the symbolic equation: low order polynomial discrimination, variable separability, boundary, monotonic, and parity. GP is enhanced by these Taylor polynomial techniques. Experiments are conducted on three kinds of benchmarks: classical SR, machine learning, and physics. The experimental results show that TaylorGP not only has higher accuracy than the nine baseline methods, but also is faster in finding stable results

    New Fundamental Technologies in Data Mining

    Get PDF
    The progress of data mining technology and large public popularity establish a need for a comprehensive text on the subject. The series of books entitled by "Data Mining" address the need by presenting in-depth description of novel mining algorithms and many useful applications. In addition to understanding each section deeply, the two books present useful hints and strategies to solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining
    • …
    corecore