335 research outputs found

    Temporal Feature Selection with Symbolic Regression

    Building and discovering useful features when constructing machine learning models is the central task for the machine learning practitioner. Good features are useful not only in increasing the predictive power of a model but also in illuminating the underlying drivers of a target variable. In this research we propose a novel feature learning technique in which symbolic regression is endowed with a "Range Terminal" that allows it to explore functions of the aggregate of variables over time. We test the Range Terminal on a synthetic data set and a real-world data set in which we predict seasonal greenness using satellite-derived temperature and snow data over a portion of the Arctic. On the synthetic data set we find that symbolic regression with the Range Terminal outperforms standard symbolic regression and Lasso regression. On the Arctic data set we find that it outperforms standard symbolic regression and fails to beat Lasso regression, but finds useful features describing the interaction between Land Surface Temperature, Snow, and seasonal vegetative growth in the Arctic.
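
    The abstract above does not define the Range Terminal beyond "functions of the aggregate of variables over time", so the following is only a minimal sketch of one plausible reading: a terminal that aggregates a variable over a time window and returns a single feature value. The function name, the window convention, and the temperature values are all hypothetical.

```python
from statistics import mean


def range_terminal(series, start, end, aggregate=mean):
    """Aggregate a time series over the half-open window [start, end).

    Hypothetical helper: the paper's actual terminal may differ.
    """
    window = series[start:end]
    if not window:
        raise ValueError("empty window")
    return aggregate(window)


# Illustrative weekly temperatures (made-up values, not from the paper).
temperature = [-12.0, -8.0, -3.0, 1.0, 4.0, 9.0]

# One candidate feature: mean temperature over indices 2, 3 and 4.
feature = range_terminal(temperature, 2, 5)

# Another candidate: total temperature over the last three steps.
late_sum = range_terminal(temperature, 3, 6, aggregate=sum)
```

    In a symbolic regression run, the window bounds and the aggregate would presumably be evolved along with the rest of the expression; here they are fixed by hand.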

    Model-Based Problem Solving through Symbolic Regression via Pareto Genetic Programming.

    Pareto genetic programming methodology is extended by additional generic model selection and generation strategies that (1) drive the modeling engine to creation of models of reduced non-linearity and increased generalization capabilities, and (2) improve the effectiveness of the search for robust models by goal softening and adaptive fitness evaluations. In addition to the new strategies for model development and model selection, this dissertation presents a new approach for analysis, ranking, and compression of given multi-dimensional input-response data for the purpose of balancing the information content of undesigned data sets.
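
    As background for the Pareto-based selection mentioned above, here is a minimal sketch of extracting a Pareto front from candidate models. It assumes two objectives to minimize, prediction error and a complexity score; the dissertation's actual objectives (for example its non-linearity measure) are not given in the abstract, and the candidate names and values are hypothetical.

```python
def pareto_front(models):
    """Return the models not dominated on (error, complexity).

    Each model is a (name, error, complexity) tuple; lower is better
    for both objectives. A model is dominated if another model is at
    least as good on both objectives and strictly better on one.
    """
    front = []
    for name, err, size in models:
        dominated = any(
            (e <= err and s <= size) and (e < err or s < size)
            for _, e, s in models
        )
        if not dominated:
            front.append((name, err, size))
    return front


# Hypothetical candidates: (name, error, complexity).
candidates = [
    ("m1", 0.10, 25),
    ("m2", 0.12, 9),
    ("m3", 0.30, 3),
    ("m4", 0.15, 12),  # dominated by m2
    ("m5", 0.10, 30),  # dominated by m1
]
front = pareto_front(candidates)
```

    The quadratic scan is adequate for small populations; larger ones would normally use a sort-based method.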

    How Noisy Data Affects Geometric Semantic Genetic Programming

    Noise is a consequence of acquiring and pre-processing data from the environment, and shows fluctuations from different sources, e.g., sensors, signal processing technology, or even human error. As a machine learning technique, Genetic Programming (GP) is not immune to this problem, which the field has frequently addressed. Recently, Geometric Semantic Genetic Programming (GSGP), a semantic-aware branch of GP, has shown robustness and high generalization capability. Researchers believe these characteristics may be associated with a lower sensitivity to noisy data. However, there is no systematic study on this matter. This paper performs a deep analysis of GSGP performance in the presence of noise. Using 15 synthetic datasets where noise can be controlled, we added different ratios of noise to the data and compared the results obtained with those of a canonical GP. The results show that, as we increase the percentage of noisy instances, the generalization performance degradation is more pronounced in GSGP than in GP. However, in general, GSGP is more robust to noise than GP in the presence of up to 10% of noise, and presents no statistical difference for values higher than that in the test bed. Comment: 8 pages. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2017), Berlin, Germany
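
    The abstract above does not state how the noise was generated, so the following is only a sketch of one way to corrupt a controlled ratio of instances: Gaussian perturbation of the target values of a randomly chosen subset. The function name, the noise model, and the data are assumptions for illustration.

```python
import random


def add_noise(targets, ratio, scale, seed=0):
    """Perturb a fraction `ratio` of the targets with Gaussian noise.

    Assumed noise model (not taken from the paper): zero-mean Gaussian
    with standard deviation `scale`, added to randomly chosen instances.
    Returns the noisy targets and the set of perturbed indices.
    """
    rng = random.Random(seed)
    n_noisy = round(ratio * len(targets))
    noisy_idx = set(rng.sample(range(len(targets)), n_noisy))
    noisy = [
        y + rng.gauss(0.0, scale) if i in noisy_idx else y
        for i, y in enumerate(targets)
    ]
    return noisy, noisy_idx


# Hypothetical clean targets; 10% of the instances are perturbed.
clean = [float(i) for i in range(100)]
noisy, noisy_idx = add_noise(clean, ratio=0.10, scale=1.0)
```

    Repeating this for several values of `ratio` yields a family of datasets on which two learners can be compared at each noise level.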

    Understanding Climate-Vegetation Interactions in Global Rainforests Through a GP-Tree Analysis

    The tropical rainforests are the largest reserves of terrestrial carbon and, therefore, the future of these rainforests is a question of immense importance in the geoscience research community. With the recent severe Amazonian droughts in 2005 and 2010 and the ongoing drought in the Congo region for more than two decades, there is growing concern that these forests could succumb to precipitation reduction, causing extensive carbon release and feedback to the carbon cycle. However, there is no single ecosystem model that quantifies the relationship between vegetation health in these rainforests and climatic factors. Small-scale studies have used statistical correlation measures and simple linear regression to model climate-vegetation interactions, but suffer from the lack of comprehensive data representation as well as simplistic assumptions about the dependency of the target on the covariates. In this paper we use genetic programming (GP) based symbolic regression for discovering equations that govern the vegetation-climate dynamics in the rainforests. Expecting micro-regions within the rainforests to have unique characteristics compared to the overall general characteristics, we use a modified regression-tree based hierarchical partitioning of the space to build individual models for each partition. The discovery of these equations reveals very interesting characteristics about the Amazon and the Congo rainforests. Our method, GP-tree, shows that the rainforests exhibit tremendous resiliency in the face of extreme climatic events by adapting to changing conditions.

    Where are we now? A large benchmark study of recent symbolic regression methods

    In this paper we provide a broad benchmarking of recent genetic programming approaches to symbolic regression in the context of state-of-the-art machine learning approaches. We use a set of nearly 100 regression benchmark problems culled from open source repositories across the web. We conduct a rigorous benchmarking of four recent symbolic regression approaches as well as nine machine learning approaches from scikit-learn. The results suggest that symbolic regression performs strongly compared to state-of-the-art gradient boosting algorithms, although in terms of running time it is among the slowest of the available methodologies. We discuss the results in detail and point to future research directions that may allow symbolic regression to gain wider adoption in the machine learning community. Comment: 8 pages, 4 figures. GECCO 201

    Automatic Development and Adaptation of Concise Nonlinear Models for System Identification

    Mathematical descriptions of natural and man-made processes are the bedrock of science, used by humans to understand, estimate, predict and control the natural and built world around them. The goal of system identification is to enable the inference of mathematical descriptions of the true behavior and dynamics of processes from their measured observations. The crux of this task is the identification of the dynamic model form (topology) in addition to its parameters. Model structures must be concise to offer insight to the user about the process in question. To that end, this dissertation proposes three methods to improve the ability of system identification to identify succinct nonlinear model structures. The first is a model structure adaptation method (MSAM) that modifies first principles models to increase their predictive ability while maintaining intelligibility. Model structure identification is achieved by this method despite the presence of parametric error through a novel means of estimating the gradient of model structure perturbations. I demonstrate MSAM's ability to identify underlying nonlinear dynamic models starting from linear models in the presence of parametric uncertainty. The main contribution of this method is the ability to adapt the structure of existing models of processes such that they more closely match the process observations. The second method, known as epigenetic linear genetic programming (ELGP), conducts symbolic regression without a priori knowledge of the form of the model or its parameters. ELGP incorporates a layer of genetic regulation into genetic programming (GP) and adapts it by local search to tune the resultant model structures for accuracy and conciseness. The introduction of epigenetics is made simple by the use of a stack-based program representation. This method, tested on hundreds of dynamics problems, demonstrates the ability of epigenetic local search to improve GP by producing simpler and more accurate models. The third method relies on a multidimensional GP approach (M4GP) for solving multiclass classification problems. The proposed method uses stack-based GP to conduct nonlinear feature transformations to optimize the clustering of data according to their classes. In comparison to several state-of-the-art methods, M4GP is able to classify test data better on several real-world problems. The main contribution of M4GP is its demonstrated ability to combine the strengths of GP (e.g. nonlinear feature transformations and feature selection) with the strengths of distance-based classification. MSAM, ELGP and M4GP improve the identification of succinct nonlinear model structures for continuous dynamic processes with starting models, continuous dynamic processes without starting models, and multiclass dynamic processes without starting models, respectively. A considerable portion of this dissertation is devoted to the application of these methods to these three classes of real-world dynamic modeling problems. MSAM is applied to the restructuring of controllers to improve the closed-loop system response of nonlinear plants. ELGP is used to identify the closed-loop dynamics of an industrial scale wind turbine and to define a reduced-order model of fluid-structure interaction. Lastly, M4GP is used to identify a dynamic behavioral model of bald eagles from collected data. The methods are analyzed alongside many other state-of-the-art system identification methods in the context of model accuracy and conciseness.

    Predicting Ordinary Differential Equations with Transformers

    We develop a transformer-based sequence-to-sequence model that recovers scalar ordinary differential equations (ODEs) in symbolic form from irregularly sampled and noisy observations of a single solution trajectory. We demonstrate in extensive empirical evaluations that our model performs better than or on par with existing methods in terms of accurate recovery across various settings. Moreover, our method is efficiently scalable: after one-time pretraining on a large set of ODEs, we can infer the governing law of a new observed solution in a few forward passes of the model. Comment: Published at ICML 202