Multiobjective parsimony enforcement for superior generalisation performance
Program bloat - the phenomenon of ever-increasing program size during a GP run - is a recognised and widespread problem. Traditional techniques to combat program bloat are program size limitations or parsimony pressure (penalty functions). These techniques suffer from a number of problems, in particular their reliance on parameters whose optimal values are difficult to determine a priori. In this paper, we introduce POPE-GP, a system that uses the NSGA-II multiobjective evolutionary algorithm as an alternative, parameter-free technique for eliminating program bloat. We test it on a classification problem and find that, while vastly reducing program size, it also improves generalisation performance.
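The core idea behind a multiobjective approach like the one this abstract describes can be sketched as treating error and program size as two objectives and keeping non-dominated individuals, rather than tuning a parsimony coefficient. A minimal illustration of Pareto dominance on (error, size) pairs follows; the names `Individual` and `pareto_front` are illustrative, not from the paper, and a real NSGA-II run would add non-dominated sorting ranks and crowding distance.

```python
# Two-objective view of bloat control: minimise error AND program size.
# Non-dominated sorting replaces a hand-tuned parsimony coefficient.
from dataclasses import dataclass

@dataclass
class Individual:
    error: float  # classification error on the training set
    size: int     # number of nodes in the program tree

def dominates(a: Individual, b: Individual) -> bool:
    """a dominates b if it is no worse in both objectives and better in one."""
    return (a.error <= b.error and a.size <= b.size) and (
        a.error < b.error or a.size < b.size
    )

def pareto_front(pop: list[Individual]) -> list[Individual]:
    """Individuals that no other member of the population dominates."""
    return [p for p in pop if not any(dominates(q, p) for q in pop)]

pop = [Individual(0.10, 50), Individual(0.10, 20), Individual(0.25, 5)]
front = pareto_front(pop)  # the size-50 individual is dominated by the size-20 one
```

Selection pressure then favours small programs automatically whenever size can be reduced without hurting error, which is what makes the method parameter-free.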
CES-480 Covariant Parsimony Pressure for Genetic Programming
The parsimony pressure method is perhaps the simplest and most frequently used method to control bloat in genetic programming. In this paper we first reconsider the size evolution equation for genetic programming developed in [24] and rewrite it in a form that shows its direct relationship to Price's theorem. We then use this new formulation to derive theoretical results that show how to practically and optimally set the parsimony coefficient dynamically during a run so as to achieve complete control over the growth of the programs in a population. Experimental results confirm the effectiveness of the method, as we are able to tightly control the average program size under a variety of conditions. These include such unusual cases as dynamically varying target sizes such that the mean program size is allowed to grow during some phases of a run, while being forced to shrink in others.
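A key result of this line of work is that the parsimony coefficient can be recomputed each generation from population statistics, with c = Cov(size, fitness) / Var(size) holding the expected mean program size steady. A small sketch of that computation, with illustrative toy fitness values (not the paper's experiments):

```python
# Covariant parsimony pressure sketch: instead of a fixed penalty, recompute
# the coefficient c each generation so that selection on (fitness - c * size)
# cancels the drift in mean program size.
import statistics

def covariant_coefficient(sizes, fitnesses):
    """c = Cov(size, fitness) / Var(size), computed over the population."""
    mean_s = statistics.fmean(sizes)
    mean_f = statistics.fmean(fitnesses)
    cov = statistics.fmean((s - mean_s) * (f - mean_f)
                           for s, f in zip(sizes, fitnesses))
    var = statistics.fmean((s - mean_s) ** 2 for s in sizes)
    return cov / var

def penalised(fitness, size, c):
    # Selection acts on the penalised value rather than raw fitness.
    return fitness - c * size

sizes = [10, 20, 30, 40]
fits = [0.5, 0.6, 0.8, 0.9]
c = covariant_coefficient(sizes, fits)
```

Because c tracks the current covariance between size and fitness, the penalty automatically strengthens when bigger programs start being rewarded and relaxes otherwise.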
Analyzing Inefficiency Using a Frontier Search Approach
Efficiency measurement naturally requires the definition of a frontier as a benchmark indicating efficiency. Usually a measure reflecting the distance of a data point to the frontier indicates the level of efficiency. One of the crucial characteristics distinguishing efficiency measurement tools is the way in which they construct the frontier. The class of deterministic and non-parametric tools for constructing the frontier mainly comprises tools associated with Data Envelopment Analysis. Coming in various flavors, all DEA frontiers suffer from their piecewise construction, which gives rise to numerous vertices. Those vertices do not allow convenient analysis of frontier properties such as computing elasticities and the like. In this paper we want to contribute to the class of deterministic and non-parametric tools for constructing the frontier in a one-output and n-input setting. We suggest a new empirical approach drawing on functional search in the fashion of Koza's (1992) genetic programming. The frontier search algorithm employed evolves the functional form of the frontier and its parameters simultaneously. The frontier exhibits the neat property that it is smooth and differentiable, enabling the computation of elasticities, for example. In particular, we introduce both the idea and the algorithm of the frontier search procedure. We discuss the advantages and shortcomings with respect to empirical problems. The arguments brought forth in the preceding sections are illustrated by the investigation of an artificial example.
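One way the search described above can be scored (an illustrative sketch, not the authors' algorithm) is to require a candidate smooth frontier to envelop the data from above in the one-output setting, penalising infeasible candidates heavily and otherwise rewarding tightness:

```python
# Toy scoring of a candidate frontier f(x): every data point must lie on or
# below the frontier; among feasible candidates, smaller total gap is better.
def frontier_score(f, xs, ys, infeasible_penalty=1e6):
    """Lower is better; a valid frontier envelops the data from above."""
    score = 0.0
    for x, y in zip(xs, ys):
        gap = f(x) - y
        if gap < 0:            # data point above the frontier: infeasible
            score += infeasible_penalty
        else:                  # feasible: prefer frontiers hugging the data
            score += gap
    return score

xs = [1.0, 2.0, 3.0]
ys = [1.0, 1.9, 2.5]
loose = frontier_score(lambda x: x + 1.0, xs, ys)   # envelops, but loosely
below = frontier_score(lambda x: 0.8 * x, xs, ys)   # cuts below the data
```

Because GP evolves the functional form itself, a fitness of this shape steers the search toward smooth, tight envelopes whose derivatives (and hence elasticities) are well defined everywhere.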
Statistical learning and genetic programming: is code growth inevitable?
Universal Consistency, the convergence to the minimum possible error rate in learning through genetic programming (GP), and code bloat, the excessive increase of code size, are important issues in GP. This paper proposes a theoretical analysis of universal consistency and code bloat in the framework of symbolic regression in GP, from the viewpoint of Statistical Learning Theory, a well-grounded mathematical toolbox for Machine Learning. Two kinds of bloat must be distinguished in this context, depending on whether the target function has finite description length or not. Then, the Vapnik-Chervonenkis dimension of programs is computed, and we prove that a parsimonious fitness ensures Universal Consistency (i.e. the fact that the solution minimizing the empirical error does converge to the best possible error when the number of examples goes to infinity). However, it is proved that the standard method, consisting in choosing a maximal program size depending on the number of examples, might still result in programs whose size increases without bound along with their accuracy; a fitness biased by parsimony pressure is proposed. This fitness avoids unnecessary bloat while nevertheless preserving Universal Consistency.
Universal Consistency and Bloat in GP
In this paper, we provide an analysis of Genetic Programming (GP) from the Statistical Learning Theory viewpoint in the scope of symbolic regression. Firstly, we are interested in Universal Consistency, i.e. the fact that the solution minimizing the empirical error does converge to the best possible error when the number of examples goes to infinity, and secondly, we focus our attention on the uncontrolled growth of program length (i.e. bloat), which is a well-known problem in GP. Results show that (1) several kinds of code bloat may be identified and that (2) Universal Consistency can be obtained while avoiding bloat under some conditions. We conclude by describing an ad hoc method that makes it possible simultaneously to avoid bloat and to ensure Universal Consistency.
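The general shape of a parsimony-biased fitness of the kind these two abstracts discuss can be illustrated as empirical error plus a size penalty that fades as the number of training examples n grows, so the penalty cannot override consistency in the limit. The 1/sqrt(n) decay below is an illustrative choice, not the papers' exact penalty term:

```python
# Hedged sketch: empirical error plus a size penalty that vanishes as n grows.
# With fixed program size, the penalty tends to 0, preserving consistency;
# at fixed n, larger programs pay more, discouraging bloat.
import math

def parsimony_fitness(empirical_error, program_size, n):
    """Lower is better; penalty decays like 1/sqrt(n) (illustrative)."""
    return empirical_error + program_size / math.sqrt(n)

small = parsimony_fitness(0.20, 10, 10_000)   # modest error, small program
big = parsimony_fitness(0.18, 500, 10_000)    # slightly better error, huge program
```

At n = 10,000 the small program wins despite its slightly worse empirical error, which is exactly the trade-off such a fitness is meant to enforce.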
EDEN: Evolutionary Deep Networks for Efficient Machine Learning
Deep neural networks continue to show improved performance with increasing depth, an encouraging trend that implies an explosion in the possible permutations of network architectures and hyperparameters for which there is little intuitive guidance. To address this increasing complexity, we propose Evolutionary DEep Networks (EDEN), a computationally efficient neuro-evolutionary algorithm which interfaces to any deep neural network platform, such as TensorFlow. We show that EDEN evolves simple yet successful architectures built from embedding, 1D and 2D convolutional, max pooling and fully connected layers along with their hyperparameters. Evaluation of EDEN across seven image and sentiment classification datasets shows that it reliably finds good networks -- and in three cases achieves state-of-the-art results -- even on a single GPU, in just 6-24 hours. Our study provides a first attempt at applying neuro-evolution to the creation of 1D convolutional networks for sentiment analysis, including the optimisation of the embedding layer.
Comment: 7 pages, 3 figures, 3 tables; see video at https://vimeo.com/23451009
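The kind of neuro-evolutionary search the abstract describes can be illustrated with a toy loop (not EDEN's actual algorithm): a genome encodes layer widths and a learning rate, genomes are mutated, and the best survive under an evaluation function that a real system would replace with training the network on the target platform:

```python
# Toy neuro-evolution sketch. A genome = list of layer widths + learning rate.
# score() is a stand-in for "train the network, return validation accuracy".
import random

random.seed(0)  # reproducible toy run

def random_genome():
    depth = random.randint(1, 4)
    return {"layers": [random.choice([32, 64, 128]) for _ in range(depth)],
            "lr": random.choice([1e-2, 1e-3, 1e-4])}

def mutate(g):
    child = {"layers": list(g["layers"]), "lr": g["lr"]}
    if random.random() < 0.5 and len(child["layers"]) < 6:
        child["layers"].append(random.choice([32, 64, 128]))  # grow the net
    else:
        child["lr"] = random.choice([1e-2, 1e-3, 1e-4])       # retune lr
    return child

def score(g):
    # Stand-in fitness: favours ~3 layers and lr=1e-3, purely for demonstration.
    return -abs(len(g["layers"]) - 3) - abs(g["lr"] - 1e-3) * 100

pop = [random_genome() for _ in range(8)]
for _ in range(20):                      # truncation selection + mutation
    pop.sort(key=score, reverse=True)
    pop = pop[:4] + [mutate(random.choice(pop[:4])) for _ in range(4)]
best = max(pop, key=score)
```

The expensive part in practice is `score` (one full training run per evaluation), which is why the abstract's 6-24 hour single-GPU budget is the headline result.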
Genome-wide inference of ancestral recombination graphs
The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age at sites under directional selection. Preliminary results also indicate that our methods can be used to gain insight into complex features of human population structure, even with a noninformative prior distribution.
Comment: 88 pages, 7 main figures, 22 supplementary figures. This version contains a substantially expanded genomic data analysis.
Population Subset Selection for the Use of a Validation Dataset for Overfitting Control in Genetic Programming
Genetic Programming (GP) is a technique which is able to solve different problems through the evolution of mathematical expressions. However, its tendency to overfit the data is one of its main issues. The use of a validation dataset is a common alternative to prevent overfitting in many Machine Learning (ML) techniques, including GP. But there is one key point which differentiates GP from other ML techniques: instead of training a single model, GP evolves a population of models. Therefore, the validation dataset can be used in several ways, because any of those evolved models could be evaluated on it. This work explores the possibility of using the validation dataset not only on the training-best individual but also on a subset containing the training-best individuals of the population. The study has been conducted with 5 well-known databases performing regression or classification tasks. In most of the cases, the results of the study point to an improvement when the validation dataset is used on a subset of the population instead of only on the training-best individual, which also induces a reduction in the number of nodes and, consequently, a lower complexity of the expressions.
Funding: Xunta de Galicia ED431G/01; Xunta de Galicia ED431D 2017/16; Xunta de Galicia ED431C 2018/49; Xunta de Galicia ED431D 2017/23; Instituto de Salud Carlos III PI17/0182
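The selection scheme the abstract studies can be sketched in a few lines (names illustrative): instead of validating only the training-best individual, evaluate the top-k individuals by training error on the validation set and return the one with the lowest validation error:

```python
# Validation on a population subset rather than a single training-best model.
def select_with_validation(population, train_error, val_error, k=5):
    """population: list of models; *_error: model -> float. Lower is better."""
    top_k = sorted(population, key=train_error)[:k]   # best k on training data
    return min(top_k, key=val_error)                  # pick by validation error

# Toy example: model "b" overfits (best train error, worse validation error),
# so the subset scheme picks "a" instead.
train = {"a": 0.10, "b": 0.05, "c": 0.30}
val = {"a": 0.12, "b": 0.25, "c": 0.31}
models = list(train)
chosen = select_with_validation(models, train.__getitem__, val.__getitem__, k=2)
```

With k=1 this reduces to the standard scheme (validate only the training-best model), so k directly controls how much of the evolved population the validation set gets to see.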