A modular genetic programming system
Genetic Programming (GP) is an evolutionary algorithm for the automatic
discovery of symbolic expressions, e.g. computer programs or mathematical
formulae, that encode solutions to a user-defined task. Recent advances in GP
systems and computer performance made it possible to successfully apply this
algorithm to real-world applications.
This work offers three main contributions to the state of the art in GP
systems:
(I) The documentation of RGP, a state-of-the-art GP system implemented as an
extension package to the popular R environment for statistical computation and
graphics. GP and RGP are introduced both formally and with a series of tutorial
examples. Like R itself, RGP is available under an open source license.
(II) A comprehensive empirical analysis of modern GP heuristics based on the
methodology of Sequential Parameter Optimization. The effects and interactions
of the most important GP algorithm parameters are analyzed and recommendations
for good parameter settings are given.
(III) Two extensive case studies based on real-world industrial applications.
The first application involves process control models in steel production,
while the second is about meta-model-based optimization of cyclone dust
separators. A comparison with traditional and modern regression methods
reveals that GP offers equal or superior performance in both applications,
with the additional benefit of understandable and easy to deploy models.
The main motivation of this work is the advancement of GP in real-world application
areas. The focus lies on a subset of application areas that are known to be
practical for GP, first of all symbolic regression and classification. It has
been written with practitioners from academia and industry in mind.
Evolvability and rate of evolution in evolutionary computation
Evolvability has emerged as a research topic in both natural and computational evolution. It is a notion put forward to investigate the fundamental mechanisms that enable a system to evolve. A number of hypotheses have been proposed in modern biological research based on the examination of various mechanisms in the biosphere for their contribution to evolvability. It is therefore intriguing to transfer new discoveries from biology to Evolutionary Computation (EC) systems and test them there, so that computational models can be improved and a better understanding of general evolutionary mechanisms achieved. Rate of evolution comes in different flavors in natural and computational evolution. Specifically, we distinguish the rate of fitness progression from that of genetic substitutions. The former is a common concept in EC, since the ability to explicitly quantify the fitness of an evolutionary individual is one of the most important differences between computational systems and natural systems. Within the biological research community, the definition of rate of evolution varies depending on the objects being examined, such as gene sequences, proteins, tissues, etc. For instance, molecular biologists tend to use the rate of genetic substitutions to quantify how fast evolution proceeds at the genetic level. This concept of rate of evolution focuses on the evolutionary dynamics underlying fitness development, due to the inability to mathematically define fitness in a natural system. In EC, the rate of genetic substitutions suggests an unconventional and potentially powerful method to measure the rate of evolution by accessing lower levels of evolutionary dynamics. Central to this thesis is our new definition of rate of evolution in EC. We transfer the method of measurement of the rate of genetic substitutions from molecular biology to EC.
The implementation in a Genetic Programming (GP) system shows that such measurements can indeed be performed and reflect well how evolution proceeds. Below the level of fitness development, it provides observables at the genetic level of a GP population during evolution. We apply this measurement method to investigate the effects of four major configuration parameters in EC, i.e., mutation rate, crossover rate, tournament selection size, and population size, and show that some insights can be gained into the effectiveness of these parameters with respect to evolution acceleration. Further, we observe that population size plays an important role in determining the rate of evolution. We formulate a new indicator based on this rate of evolution measurement to adjust population size dynamically during evolution. Such a strategy can stabilize the rate of genetic substitutions and effectively improve the performance of a GP system over fixed-size populations. This rate of evolution measure also provides an avenue to study evolvability, since it captures how the two sides of evolvability, i.e., variability and neutrality, interact and cooperate with each other during evolution. We show that evolvability can be better understood in the light of this interplay and how this can be used to generate adaptive phenotypic variation via harnessing random genetic variation. The rate of evolution measure and the adaptive population size scheme are further transferred to a Genetic Algorithm (GA) to solve a real-world application problem, the wireless network planning problem. Computer simulation of such an application shows that the adaptive population size scheme is able to improve a GA's performance against conventional fixed population size algorithms.
Towards an Information Theoretic Framework for Evolutionary Learning
The vital essence of evolutionary learning consists of information flows between the environment and the entities differentially surviving and reproducing therein. Gain or loss of information in individuals and populations due to evolutionary steps should be considered in evolutionary algorithm theory and practice. Information theory has rarely been applied to evolutionary computation - a lacuna that this dissertation addresses, with an emphasis on objectively and explicitly evaluating the ensemble models implicit in evolutionary learning. Information theoretic functionals can provide objective, justifiable, general, computable, commensurate measures of fitness and diversity.
We identify information transmission channels implicit in evolutionary learning. We define information distance metrics and indices for ensembles. We extend Price's Theorem to non-random mating, give it an effective fitness interpretation and decompose it to show the key factors influencing heritability and evolvability. We argue that heritability and evolvability of our information theoretic indicators are high. We illustrate use of our indices for reproductive and survival selection. We develop algorithms to estimate information theoretic quantities on mixed continuous and discrete data via the empirical copula and information dimension. We extend statistical resampling. We present experimental and real-world application results: chaotic time series prediction; parity; complex continuous functions; industrial process control; and small sample social science data. We formalize conjectures regarding evolutionary learning and information geometry.
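For orientation, the standard form of Price's Theorem that the abstract refers to can be written as below. This is the textbook statement only, not the non-random-mating extension developed in the dissertation, which the abstract does not spell out. The symbols are assumptions of this note: $w_i$ is the fitness of individual $i$, $z_i$ its trait value, $\Delta z_i$ the parent-to-offspring change in that trait, and bars denote population means.

```latex
\bar{w}\,\Delta\bar{z} \;=\; \operatorname{Cov}(w_i, z_i) \;+\; \operatorname{E}\!\left(w_i\,\Delta z_i\right)
```

The covariance term captures the change due to selection, and the expectation term captures the change due to transmission from parents to offspring.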
Weighted Hierarchical Grammatical Evolution
Grammatical evolution (GE) is one of the most widespread techniques in evolutionary computation. Genotypes in GE are bit strings, while phenotypes are strings of a language defined by a user-provided context-free grammar. In this paper, we propose a novel procedure for mapping genotypes to phenotypes that we call weighted hierarchical GE (WHGE). WHGE imposes a form of hierarchy on the genotype and encodes grammar symbols with a varying number of bits based on the relative expressive power of those symbols. WHGE does not impose any constraint on the overall GE framework: in particular, WHGE may handle recursive grammars, uses the classical genetic operators, and does not need any bound to be defined in advance on the size of phenotypes. We experimentally assessed our proposal in depth on a set of challenging and carefully selected benchmarks, comparing it against the standard GE framework as well as two of the most significant enhancements proposed in the literature: 1) position-independent GE and 2) structured GE. Our results show that WHGE delivers very good results in terms of fitness as well as in terms of the properties of the genotype-phenotype mapping procedure.
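To make the genotype-to-phenotype idea concrete, here is a minimal Python sketch of the standard GE mapping (the baseline, not WHGE, whose details the abstract does not give). The toy grammar, the 8-bit codon size, the wrap limit, and the names `GRAMMAR`, `bits_to_codons` and `map_genotype` are all assumptions made for illustration. As a simplification, every non-terminal expansion consumes one codon.

```python
from typing import Dict, List, Optional

# Hypothetical toy grammar: each non-terminal maps to its list of productions.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<expr>": [["<expr>", "+", "<expr>"], ["<expr>", "*", "<expr>"], ["x"], ["1"]],
}


def bits_to_codons(bits: str, codon_size: int = 8) -> List[int]:
    """Split a bit string into fixed-size integer codons (trailing bits are dropped)."""
    last_start = len(bits) - codon_size
    return [int(bits[i:i + codon_size], 2) for i in range(0, last_start + 1, codon_size)]


def map_genotype(bits: str, start: str = "<expr>", max_wraps: int = 2) -> Optional[str]:
    """Map a bit-string genotype to a phenotype via leftmost derivation.

    Each expansion picks production number (codon mod number_of_productions).
    Codons are reused (wrapping) up to max_wraps times; if the derivation is
    still incomplete after that, the mapping fails and None is returned.
    """
    codons = bits_to_codons(bits)
    if not codons:
        return None
    limit = len(codons) * (max_wraps + 1)
    symbols = [start]          # pending symbols, leftmost first
    output: List[str] = []     # terminals emitted so far
    used = 0                   # number of codons consumed
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in GRAMMAR:
            output.append(symbol)          # terminal: emit it
            continue
        if used >= limit:
            return None                    # ran out of codons, invalid individual
        options = GRAMMAR[symbol]
        choice = options[codons[used % len(codons)] % len(options)]
        used += 1
        symbols = choice + symbols         # expand the leftmost non-terminal
    return " ".join(output)


if __name__ == "__main__":
    # Codons 0, 2, 3 select: <expr> + <expr>, then x, then 1.
    print(map_genotype("00000000" "00000010" "00000011"))
```

The second usage case worth noting is a genotype such as `"00000000"`, whose single codon always selects the recursive production; the wrap limit is what stops the derivation and marks the individual as invalid.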
Computational Intelligence for Life Sciences
Computational Intelligence (CI) is a computer science discipline encompassing the theory, design, development and application of biologically and linguistically derived computational paradigms. Traditionally, the main elements of CI are Evolutionary Computation, Swarm Intelligence, Fuzzy Logic, and Neural Networks. CI aims at proposing new algorithms able to solve complex computational problems by taking inspiration from natural phenomena. In an intriguing turn of events, these nature-inspired methods have been widely adopted to investigate a plethora of problems related to nature itself. In this paper we present a variety of CI methods applied to three problems in life sciences, highlighting their effectiveness: we describe how protein folding can be addressed using Genetic Programming, how the inference of haplotypes can be tackled using Genetic Algorithms, and how the estimation of biochemical kinetic parameters can be performed by means of Swarm Intelligence. We show that CI methods can generate very high quality solutions, providing a sound methodology to solve complex optimization problems in life sciences.
Strong Selection Significantly Increases Epistatic Interactions in the Long-Term Evolution of a Protein
Epistatic interactions between residues determine a protein's adaptability
and shape its evolutionary trajectory. When a protein experiences a changed
environment, it is under strong selection to find a peak in the new fitness
landscape. It has been shown that strong selection increases epistatic
interactions as well as the ruggedness of the fitness landscape, but little is
known about how the epistatic interactions change under selection in the
long-term evolution of a protein. Here we analyze the evolution of epistasis in
the protease of the human immunodeficiency virus type 1 (HIV-1) using protease
sequences collected for almost a decade from both treated and untreated
patients, to understand how epistasis changes and how those changes impact the
long-term evolvability of a protein. We use an information-theoretic proxy for
epistasis that quantifies the co-variation between sites, and show that
positive information is a necessary (but not sufficient) condition that detects
epistasis in most cases. We analyze the "fossils" of the evolutionary
trajectories of the protein contained in the sequence data, and show that
epistasis continues to enrich under strong selection, but not for proteins
whose environment is unchanged. The increase in epistasis compensates for the
information loss due to sequence variability brought about by treatment, and
facilitates adaptation in the increasingly rugged fitness landscape of
treatment. While epistasis is thought to enhance evolvability via
valley-crossing early-on in adaptation, it can hinder adaptation later when the
landscape has turned rugged. However, we find no evidence that the HIV-1
protease has reached its potential for evolution after 9 years of adapting to a
drug environment that itself is constantly changing.
Comment: 25 pages, 9 figures, plus Supplementary Material including
Supplementary Text S1-S7, Supplementary Tables S1-S2, and Supplementary
Figures S1-S2. Version that appears in PLoS Genetics.
Association mapping in tetraploid potato
The results of a four year project within the Centre for BioSystems Genomics (www.cbsg.nl), entitled “Association mapping and family genotyping in potato” are described in this thesis. This project was intended to investigate whether a recently emerged methodology, association mapping, could provide the means to improve potato breeding efficiency. In an attempt to answer this research question a set of potato cultivars representative for the commercial potato germplasm was selected. In total 240 cultivars and progenitor clones were chosen. In a later stage this set was expanded with 190 recent breeds contributed by five participating breeding companies which resulted in a total of 430 genotypes. In a pilot experiment, the results of which are reported in Chapter 2, a subset of 220 of the abovementioned 240 cultivars and progenitor clones was used. Phenotypic data was retrieved through contributions of the participating breeding companies and represented summary statistics of recent observations for a number of traits across years and locations, calculated following company specific procedures. With AFLP marker data, in the form of normalised log-transformed band intensities, obtained from five well-known primer combinations, the extent of linkage disequilibrium (LD), using the r2 statistic, was estimated. Population structure within the set of 220 cultivars was analysed by deploying a clustering approach. No apparent, nor statistically supported population structure was revealed and the LD seemed to decay below the threshold of 0.1 at a genetic distance of about 3cM with this set of marker data. Furthermore, marker-trait associations were investigated by fitting single marker regression models for phenotypic traits on marker band intensities with and without correction for population structure. 
Population structure correction was performed in a straightforward way by incorporating a design matrix into the model assuming that each breeding company represented a different breeding germplasm pool. The potential of association mapping in tetraploid potato has been demonstrated in this pilot experiment, because existing phenotypic data, a modest number of AFLP markers, and a relatively straightforward statistical analysis allowed identification of interesting associations for a number of agro-morphological and quality traits. These promising results encouraged us to engage into an encompassing genome-wide association mapping study in potato. Two association mapping panels were compiled. One panel comprising 205 genotypes, all of which were also present in the set used for the pilot experiment, and another panel containing in total 299 genotypes including the entire set of 190 recent breeds together with a series of standard cultivars, about 100 of which are in common with the first panel. Phenotypic data for the association panel with 205 genotypes were obtained in a field trial performed in 2006 in Wageningen at two locations with two replicates. We will refer to this set as the “2006 field trial”. Phenotypic data for the other panel with 299 genotypes was contributed by the five participating breeding companies and consisted of multi-year-multi-location data obtained during generations of clonal selection. The 2006 data were nicely balanced, because the trial was designed in that way. The historical breeding dataset was highly unbalanced. Analysis of these two differing phenotypic datasets was performed to deliver insight in variance components for the genotypic main effects and the genotype by environment interaction (GEI), besides estimated genotype main effects across environments. Both phenotypic datasets were analysed separately within a mixed model framework including terms for GEI. 
In Chapter 3 we describe both phenotypic datasets by comparing variance components, heritabilities (=repeatabilities), intra-dataset relationships and inter-dataset relationships. Broader aspects related to phenotypic datasets and their analysis are discussed as well. To retrieve information about hidden population structure and genetic relatedness, and to estimate the extent of LD in potato germplasm, we used marker information generated with 41 AFLP primer combinations and 53 microsatellite loci on a collection of 430 genotypes. These 430 genotypes contain all genotypes present in the two association mapping panels introduced before plus a few extra genotypes to increase potato germplasm coverage. Two methods were used: a Bayesian approach and a distance-based clustering approach. Chapter 4 describes the results of this exercise. Both strategies revealed a weak level of structure in our material. Groups were detected which complied with criteria such as their intended market segment, as well as groups differing in their year of first registration on a national list. Linkage disequilibrium, using the r2 statistic, appeared to decay below the threshold of 0.1 across linkage groups at a genetic distance of about 5cM on average. The results described in Chapter 4 are promising for association mapping research in potato. The odds are reasonable that useful marker-trait associations can be detected and that the potential mapping resolution will suffice for detection of QTL in an association mapping context. In Chapter 5 a comprehensive genome-wide association mapping study is presented. The adjusted genotypic means obtained from two association mapping panels as a result of phenotypic analysis performed in Chapter 3 were combined with marker information in two association mapping models. Marker information consisted of normalised log-transformed band intensities of 41 AFLP primer combinations and allele dosage information from 53 microsatellites. 
A baseline model without correction for population structure and a more advanced model with correction for population structure and genetic relatedness were applied. Population structure and genetic relatedness were estimated using available marker information. Interesting QTL could be identified for 19 agro-morphological and quality traits. The observed QTL partly confirm previous studies, e.g. for tuber shape and frying colour, but new QTL have also been detected, e.g. for after-baking darkening and enzymatic browning. In the final chapter, the general discussion, results of preceding chapters are evaluated and their implications for research as well as breeding are discussed.
2009 Undergraduate Research Symposium Abstract Book
Abstract book from the 2009 UMM Undergraduate Research Symposium (URS), which celebrates student scholarly achievement and creative activities.
Automatic Development and Adaptation of Concise Nonlinear Models for System Identification
Mathematical descriptions of natural and man-made processes are the bedrock of science, used by humans to understand, estimate, predict and control the natural and built world around them. The goal of system identification is to enable the inference of mathematical descriptions of the true behavior and dynamics of processes from their measured observations. The crux of this task is the identification of the dynamic model form (topology) in addition to its parameters. Model structures must be concise to offer insight to the user about the process in question. To that end, this dissertation proposes three methods to improve the ability of system identification to identify succinct nonlinear model structures.
The first is a model structure adaptation method (MSAM) that modifies first principles models to increase their predictive ability while maintaining intelligibility. Model structure identification is achieved by this method despite the presence of parametric error through a novel means of estimating the gradient of model structure perturbations. I demonstrate MSAM's ability to identify underlying nonlinear dynamic models starting from linear models in the presence of parametric uncertainty. The main contribution of this method is the ability to adapt the structure of existing models of processes such that they more closely match the process observations.
The second method, known as epigenetic linear genetic programming (ELGP), conducts symbolic regression without a priori knowledge of the form of the model or its parameters. ELGP incorporates a layer of genetic regulation into genetic programming (GP) and adapts it by local search to tune the resultant model structures for accuracy and conciseness. The introduction of epigenetics is made simple by the use of a stack-based program representation. This method, tested on hundreds of dynamics problems, demonstrates the ability of epigenetic local search to improve GP by producing simpler and more accurate models.
The third method relies on a multidimensional GP approach (M4GP) for solving multiclass classification problems. The proposed method uses stack-based GP to conduct nonlinear feature transformations to optimize the clustering of data according to their classes. In comparison to several state-of-the-art methods, M4GP is able to classify test data better on several real-world problems. The main contribution of M4GP is its demonstrated ability to combine the strengths of GP (e.g. nonlinear feature transformations and feature selection) with the strengths of distance-based classification.
MSAM, ELGP and M4GP improve the identification of succinct nonlinear model structures for continuous dynamic processes with starting models, continuous dynamic processes without starting models, and multiclass dynamic processes without starting models, respectively. A considerable portion of this dissertation is devoted to the application of these methods to these three classes of real-world dynamic modeling problems. MSAM is applied to the restructuring of controllers to improve the closed-loop system response of nonlinear plants. ELGP is used to identify the closed-loop dynamics of an industrial-scale wind turbine and to define a reduced-order model of fluid-structure interaction. Lastly, M4GP is used to identify a dynamic behavioral model of bald eagles from collected data. The methods are analyzed alongside many other state-of-the-art system identification methods in the context of model accuracy and conciseness.