
    A Study of Dynamic Populations in Geometric Semantic Genetic Programming

    Farinati, D., Bakurov, I., & Vanneschi, L. (2023). A Study of Dynamic Populations in Geometric Semantic Genetic Programming. Information Sciences, 648(November), 1-21. [119513]. https://doi.org/10.1016/j.ins.2023.119513 --- This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. --- Allowing the population size to vary during the evolution can bring advantages to evolutionary algorithms (EAs), saving computational effort during the evolution process. Dynamic populations use computational resources wisely in several types of EAs, including genetic programming. However, a thorough study of the use of dynamic populations in Geometric Semantic Genetic Programming (GSGP) has so far been missing, even though GSGP is a resource-greedy algorithm for which dynamic populations seem particularly appropriate. This paper adapts to GSGP several algorithms for managing dynamic populations that were successful in other types of EAs, and introduces two novel algorithms that exploit the concept of semantic neighbourhood. These methods are assessed and compared on a set of eight regression problems. The results indicate that the algorithms outperform standard GSGP, confirming the suitability of dynamic populations for GSGP. Interestingly, the novel algorithms that use the semantic neighbourhood to manage variation in population size are particularly effective in generating robust models, even for the most difficult of the studied test problems.
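To make the idea concrete, below is a minimal sketch of an evolutionary loop whose population size varies during the run: it shrinks after an improvement to save evaluations and grows on stagnation. This illustrates dynamic populations in general, not the semantic-neighbourhood algorithms proposed in the paper; `fitness` and `make_offspring` are user-supplied stand-ins.

```python
import random

def evolve_dynamic(init_pop, fitness, make_offspring, generations=50,
                   min_size=20, max_size=200, shrink=0.9, grow=1.1):
    # Illustrative evolutionary loop with a dynamic population size:
    # shrink after an improvement (saving evaluations), grow on stagnation.
    pop = sorted(init_pop, key=fitness)
    size, best = len(pop), fitness(pop[0])
    for _ in range(generations):
        children = [make_offspring(pop) for _ in range(size)]
        pop = sorted(pop + children, key=fitness)
        if fitness(pop[0]) < best:
            best, size = fitness(pop[0]), int(size * shrink)  # improved: shrink
        else:
            size = int(size * grow)                           # stagnated: grow
        size = max(min_size, min(max_size, size))
        pop = pop[:size]            # truncation selection down to the new size
    return pop[0]

# toy usage: minimise x**2 with mutation-only offspring
best = evolve_dynamic([random.uniform(-5, 5) for _ in range(50)],
                      fitness=lambda x: x * x,
                      make_offspring=lambda pop: random.choice(pop[:10]) + random.gauss(0, 0.1))
print(best)
```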

    Simplification of genetic programs: a literature survey

    Genetic programming (GP), a widely used evolutionary computing technique, suffers from bloat: the problem of excessive growth in individuals' sizes. As a result, its ability to efficiently explore complex search spaces is reduced, the resulting solutions are less robust and generalisable, and models which contain bloat are difficult to understand and explain. This phenomenon is well researched, primarily from the angle of controlling bloat; our focus in this paper is instead to review the literature from an explainability point of view, by looking at how simplification can make GP models more explainable by reducing their sizes. Simplification is a code editing technique whose primary purpose is to make GP models more explainable; however, it can offer bloat control as an additional benefit when implemented and applied with caution. Researchers have proposed several simplification techniques and adopted various strategies to implement them. We organise the literature along multiple axes to identify the relative strengths and weaknesses of simplification techniques and to identify emerging trends and areas for future exploration. We highlight design and integration challenges and propose several avenues for research. One of them is to consider simplification as a standalone operator, rather than an extension of the standard crossover or mutation operators; its role is then more clearly complementary to other GP operators, and it can be integrated as an optional feature into an existing GP setup. Another proposed avenue is to explore the under-utilisation of complexity measures in simplification: so far, size is the most discussed measure, with only two pieces of prior work pointing out the benefits of using time as a measure when controlling bloat.
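As a concrete illustration of what simplification does, the sketch below applies a few algebraic rewrite rules to a tuple-encoded GP tree. Real simplifiers use far richer rule sets (and sometimes behavioural or numerical checks); the encoding here is an assumption for the example.

```python
# Rule-based simplification over a tuple-encoded expression tree,
# e.g. ('+', ('*', 'x', 1.0), 0.0) reduces to 'x'.

def simplify(node):
    if not isinstance(node, tuple):              # leaf: variable or constant
        return node
    op, left, right = node
    left, right = simplify(left), simplify(right)  # simplify bottom-up
    if op == '+' and right == 0.0: return left     # x + 0 -> x
    if op == '+' and left == 0.0: return right     # 0 + x -> x
    if op == '*' and right == 1.0: return left     # x * 1 -> x
    if op == '*' and left == 1.0: return right     # 1 * x -> x
    if op == '*' and (left == 0.0 or right == 0.0): return 0.0  # x * 0 -> 0
    if op == '-' and right == 0.0: return left     # x - 0 -> x
    if op == '-' and left == right: return 0.0     # x - x -> 0
    return (op, left, right)

print(simplify(('+', ('*', 'x', 1.0), ('-', 'y', 'y'))))  # -> 'x'
```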

    Mining Explicit and Implicit Relationships in Data Using Symbolic Regression

    Identification of implicit and explicit relations within observed data is a generic problem commonly encountered in several domains, including science, engineering, and finance. It forms the core component of data analytics, the process of discovering useful information from data sets that are potentially huge and otherwise incomprehensible. In industry, such information is often instrumental for profitable decision making, whereas in science and engineering it is used to build empirical models, propose new theories or verify existing ones, and explain natural phenomena. In recent times, digital and internet-based technologies have proliferated, making it viable to generate and collect large amounts of data at low cost. This in turn has resulted in an ever-growing need for methods to analyse and draw interpretations from such data quickly and reliably. With this overarching goal, this thesis attempts to contribute towards developing accurate and efficient methods for discovering such relations through evolutionary search, a method commonly referred to as Symbolic Regression (SR). Given a data set of input variables x and a corresponding observed response y, the aim is to find an explicit function y = f(x), or an implicit function f(x, y) = 0, which represents the data set. While seemingly simple, the problem is challenging for several reasons. Some conventional regression methods try to "guess" a functional form, such as linear/quadratic/polynomial, and attempt to curve-fit the data to the equation, which may limit the possibility of discovering more complex relations, if they exist. On the other hand, there are meta-modelling techniques, such as the response surface method and Kriging, that model the given data accurately but provide a "black-box" predictor instead of an expression. Such approximations convey little or no insight about how the variables and responses depend on each other, or their relative contributions to the output. SR alleviates these two extremes by evolving mathematical expressions instead of assuming them; it is thus flexible enough to represent the data, while at the same time providing useful insights instead of a black-box predictor. SR can be categorized as part of Explainable Artificial Intelligence and can contribute to Trustworthy Artificial Intelligence. The work proposed in this thesis aims to integrate the concept of "semantics" more deeply into Genetic Programming (GP) and Evolutionary Feature Synthesis, the two algorithms usually employed for conducting SR. The semantics are integrated into well-known components of the algorithms, such as compactness, diversity, recombination, and constant optimization. The main contribution of this thesis is the proposal of two novel operators, based on Linear Programming and Mixed Integer Programming, to generate expressions with the aim of controlling the length of the discovered expressions without compromising accuracy. In the experiments, these operators are shown to discover expressions with better accuracy and interpretability on many explicit and implicit benchmarks. Moreover, applications of SR on real-world data sets are shown to demonstrate the practicality of the proposed approaches. Finally, in relation to practical problems, the thesis also presents how GP can be applied to effectively solve Resource Constrained Scheduling Problems.
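The explicit/implicit distinction can be illustrated by the fitness functions an SR system might minimise. The sketch below is a simplification: the naive implicit residual is vulnerable to the trivial solution f(x, y) = 0, which is why practical implicit SR uses derivative-based measures instead.

```python
import numpy as np

# How candidate expressions could be scored in explicit vs implicit SR.
# `model` is any callable evolved by the search; data are numpy arrays.

def explicit_fitness(model, X, y):
    # mean squared error of y_hat = f(x) against the observed response
    return np.mean((model(X) - y) ** 2)

def implicit_fitness(model, X, y):
    # naive residual |f(x, y)|, minimised when the relation holds on the data
    return np.mean(np.abs(model(X, y)))

X = np.linspace(0, 1, 50)
y = 2 * X + 1
print(explicit_fitness(lambda X: 2 * X + 1, X, y))         # 0.0: exact fit
print(implicit_fitness(lambda X, y: y - 2 * X - 1, X, y))  # 0.0: relation holds
```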

    Evolutionary Computation

    This book presents several recent advances in Evolutionary Computation, especially evolution-based optimization methods and hybrid algorithms for several applications, from optimization and learning to pattern recognition and bioinformatics. It also presents new algorithms based on several analogies and metaphors, one of which draws on philosophy, specifically the philosophy of praxis and dialectics. The book further presents interesting applications in bioinformatics, especially the use of particle swarms to discover gene expression patterns in DNA microarrays. It therefore features representative work in the field of evolutionary computation and the applied sciences. The intended audience is graduate and undergraduate students, researchers, and anyone who wishes to become familiar with the latest research in this field.

    Time Control or Size Control? Reducing Complexity and Improving Accuracy of Genetic Programming Models

    Complexity of evolving models in genetic programming (GP) can impact both the quality of the models and the evolutionary search. While previous studies have proposed several notions of GP model complexity, the size of a GP model is by far the most researched measure. However, previous studies have also shown that controlling size does not automatically improve the accuracy of GP models, especially the accuracy on out-of-sample (test) data. Furthermore, size does not represent the functional composition of a model, which is often related to its accuracy on test data. In this study, we explore the evaluation time of GP models as a measure of their complexity; we define the evaluation time as the time taken to evaluate a model over some data. We demonstrate that the evaluation time reflects both a model's size and its composition, and we show how to measure the evaluation time reliably. To validate our proposal, we take four well-known size-control methods and use them to control evaluation time instead of tree size; we thus compare size-control with time-control. The results show that time-control, with its more nuanced notion of complexity, produces more accurate models on 17 out of 20 problem scenarios. Even when the models have slightly greater evaluation times and sizes, time-control counterbalances this with superior accuracy on both training and test data. The paper also argues that time-control can differentiate functional complexity even better in an identically-sized population. To facilitate this, the paper proposes Fixed Length Initialisation (FLI), which creates an identically-sized but functionally diverse population. The results show that while FLI particularly suits time-control, it also generally improves the performance of size-control. Overall, the paper poses evaluation time as a viable alternative to tree size for measuring complexity in GP.
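A hedged sketch of the core measurement: timing a model's evaluation over data and taking the median of repeated runs to damp timer noise. The paper's actual protocol for reliable measurement is more involved; this only illustrates the idea.

```python
import statistics
import timeit

def evaluation_time(model, X, repeats=30):
    # time one full evaluation pass over the data, repeated several times;
    # the median is more robust to scheduler/timer noise than a single run
    times = timeit.repeat(lambda: [model(x) for x in X],
                          number=1, repeat=repeats)
    return statistics.median(times)

X = [0.1 * i for i in range(1000)]
small = lambda x: x + 1.0                        # shallow model
large = lambda x: ((x + 1.0) * (x - 2.0)) ** 3   # deeper, costlier model
print(evaluation_time(small, X) < evaluation_time(large, X))  # usually True
```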

    Learning a formula of interpretability to learn interpretable formulas

    Many risk-sensitive applications require Machine Learning (ML) models to be interpretable. Attempts to obtain interpretable models typically rely on tuning, by trial and error, hyper-parameters of model complexity that are only loosely related to interpretability. We show that it is instead possible to take a meta-learning approach: an ML model of non-trivial Proxies of Human Interpretability (PHIs) can be learned from human feedback, and this model can then be incorporated within an ML training process to directly optimize for interpretability. We show this for evolutionary symbolic regression. We first design and distribute a survey aimed at finding a link between features of mathematical formulas and two established PHIs, simulatability and decomposability. Next, we use the resulting dataset to learn an ML model of interpretability. Lastly, we query this model to estimate the interpretability of evolving solutions within bi-objective genetic programming. We perform experiments on five synthetic and eight real-world symbolic regression problems, comparing against the traditional use of solution size minimization. The results show that the use of our model leads to formulas that, for the same level of accuracy-interpretability trade-off, are either significantly more accurate or equally accurate, and arguably more interpretable. Given these very positive results, we believe that our approach represents an important stepping stone for the design of next-generation interpretable (evolutionary) ML algorithms.
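A rough sketch of how a learned interpretability estimator might be queried inside bi-objective GP. The feature set, the scikit-learn-style `phi_model.predict` interface, and the stub model are all assumptions; the paper's actual model is trained on survey data with its own features.

```python
import re

def formula_features(expr):
    # assumed simplified features: formula length, number of operations,
    # and number of distinct operator types
    ops = re.findall(r'[+\-*/^]|sin|cos|exp|log', expr)
    return [len(expr), len(ops), len(set(ops))]

def objectives(expr, error, phi_model):
    # bi-objective vector for Pareto-based selection: prediction error and
    # estimated interpretability, negated so that both are minimised
    phi = phi_model.predict([formula_features(expr)])[0]
    return (error, -phi)

class StubPHI:                                    # stand-in for the learned model
    def predict(self, feats):
        return [-f[0] - f[1] for f in feats]      # longer -> less interpretable

print(objectives("sin(x)+0.5*x", error=0.12, phi_model=StubPHI()))
```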

    Evolving meaning: using genetic programming to learn similarity perspectives for mining biomedical data

    Master's thesis, Bioinformática e Biologia Computacional, Universidade de Lisboa, Faculdade de Ciências, 2019. --- In recent years, biomedical ontologies have become important for describing existing biological knowledge in the form of knowledge graphs. Data mining approaches that work with knowledge graphs have been proposed, but they are based on vector representations that do not capture the full underlying semantics. An alternative is to use machine learning approaches that explore semantic similarity. However, since ontologies can model multiple perspectives, semantic similarity computations for a given learning task need to be fine-tuned to account for this. Obtaining the best combination of semantic similarity aspects for each learning task is not trivial and typically depends on expert knowledge. In this dissertation, we developed a novel approach that applies Genetic Programming over a set of semantic similarity features, each based on a semantic aspect of the data, to obtain the best combination for a given supervised learning task. The methodology includes three sequential steps: compute the semantic similarity for each semantic aspect; learn the best combination of those aspects using Genetic Programming; integrate the best combination with a classification algorithm. The approach was evaluated on nine benchmark datasets of protein-protein interaction prediction, with the quality of the classifications measured using the weighted average F-measure for each dataset. The Gene Ontology was used as the knowledge graph to support semantic similarity. As a baseline, we employed a variation of the proposed methodology that uses static, manually selected combinations instead of evolved ones; the evolved combinations outperformed the manually selected combinations of semantic aspects emulating expert knowledge. Our approach was also able to learn species-agnostic models with different combinations of species for training and testing, effectively addressing the limitations of predicting protein-protein interactions for species with fewer known interactions. This dissertation thus overcomes one of the limitations of knowledge graph-based semantic similarity applications: the need to expertly select which aspects should be taken into account for a given application. The methodology is particularly important for biomedical applications, where data is often complex and multi-domain. Applying it to protein-protein interaction prediction proved successful, paving the way to broader applications.
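The three-step methodology can be pictured as follows. The aspect names echo the Gene Ontology branches, but the toy similarity values, the evolved combination, and the threshold classifier are illustrative stand-ins, not the thesis's actual pipeline.

```python
import numpy as np

# step 1: semantic similarity per aspect for each protein pair (toy values)
sims = {"biological_process": np.array([0.9, 0.2, 0.7]),
        "molecular_function": np.array([0.8, 0.1, 0.4]),
        "cellular_component": np.array([0.5, 0.3, 0.6])}

# step 2: a combination that GP might evolve over the aspect features
evolved = lambda s: np.maximum(s["biological_process"],
                               0.5 * (s["molecular_function"]
                                      + s["cellular_component"]))

# step 3: the combined score becomes the input of a classifier
combined = evolved(sims)
predictions = (combined > 0.5).astype(int)  # threshold stands in for a classifier
print(predictions)                          # -> [1 0 1]
```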

    Nonlinear Dynamic System Identification and Model Predictive Control Using Genetic Programming

    Over the last century, many developments have been made in the research of complex nonlinear process control. As a powerful control methodology, model predictive control (MPC) has been extensively applied to chemical industrial applications. Core to MPC is a predictive model of the dynamics of the system being controlled. Most practical systems exhibit complex nonlinear dynamics, which poses significant challenges for system modelling. Being able to automatically evolve both model structure and numeric parameters, Genetic Programming (GP) shows great potential for identifying nonlinear dynamic systems. This thesis is devoted to GP-based system identification and model-based control of nonlinear systems. To improve the generalization ability of GP models, a series of experiments that use semantic-based local search within a multiobjective GP framework are reported. The influence of various ways of selecting target subtrees for local search, as well as different methods for performing that search, was investigated; a comparison with the Random Desired Operator (RDO) of Pawlak et al. was made by statistical hypothesis testing. Models produced by a standard steady-state or generational GP, followed by a carefully designed single-objective GP implementing semantic-based local search, are statistically more accurate than, and of smaller (or equal) tree size compared with, the corresponding baseline and RDO-based GP algorithms. From a practical standpoint, how to correctly and efficiently apply an evolved GP model to other, larger systems is a critical research concern. Currently, the replication of GP models is normally done by repeating others' work given the necessary algorithm parameters; however, due to the empirical and stochastic nature of GP, it is difficult to completely reproduce research findings. An XML-based standard file format, named Genetic Programming Markup Language (GPML), is therefore proposed for the interchange of GP trees. A formal definition of this standard and details of its implementation are described. GPML provides convenience and modularity for further applications based on GP models. Finally, the large-scale adoption of MPC in buildings is not economically viable due to the time and cost involved in designing and adjusting predictive models by expert control engineers. A GP-based control framework is proposed for automatically evolving dynamic nonlinear models for the MPC of buildings. An open-loop system identification was conducted using data generated by a building simulator, and the obtained GP model was then employed to construct the predictive model for the MPC. The experimental results show that GP is able to produce models that allow the MPC of a building to achieve the desired temperature band in a single-zone space.
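As an illustration of the GPML idea, the sketch below serialises a tuple-encoded GP tree to XML with Python's standard library. The tag and attribute names here are assumptions made for the example; the actual schema is defined in the thesis.

```python
import xml.etree.ElementTree as ET

def tree_to_xml(node):
    # internal node: (operator, child, ...); leaf: variable name or constant
    if isinstance(node, tuple):
        el = ET.Element("node", op=node[0])
        for child in node[1:]:
            el.append(tree_to_xml(child))
    else:
        el = ET.Element("leaf", value=str(node))
    return el

model = ('+', ('*', 'x1', 0.5), ('sin', 'x2'))
print(ET.tostring(tree_to_xml(model), encoding="unicode"))
# <node op="+"><node op="*"><leaf value="x1" /><leaf value="0.5" /></node>
# <node op="sin"><leaf value="x2" /></node></node>
```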

    Local Search is Underused in Genetic Programming

    Trujillo, L., Z-Flores, E., Juárez-Smith, P. S., Legrand, P., Silva, S., Castelli, M., ... Muñoz, L. (2018). Local Search is Underused in Genetic Programming. In R. Riolo, B. Worzel, B. Goldman, & B. Tozier (Eds.), Genetic Programming Theory and Practice XIV (pp. 119-137). [8] (Genetic and Evolutionary Computation). Springer. https://doi.org/10.1007/978-3-319-97088-2_8 --- There are two important limitations of standard tree-based genetic programming (GP). First, GP tends to evolve unnecessarily large programs, a phenomenon referred to as bloat. Second, GP uses inefficient search operators that focus on modifying program syntax. The first problem has been studied extensively, with many works proposing bloat control methods. Regarding the second problem, one approach is to use alternative search operators, for instance geometric semantic operators, to improve convergence. In this work, our goal is to show experimentally that both problems can be effectively addressed by incorporating a local search optimizer as an additional search operator. Using real-world problems, we show that this rather simple strategy can improve the convergence and performance of tree-based GP, while also reducing program size. Given these results, a question arises: why are local search strategies so uncommon in GP? A small survey of popular GP libraries suggests that local search is underused in GP systems. We conclude by outlining plausible answers to this question and highlighting future work.
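A minimal sketch of the strategy: keep an evolved tree's structure fixed and let a local optimiser tune its numeric constants against the training data. The hard-coded tree shape and the use of scipy's Nelder-Mead are illustrative choices under stated assumptions, not the chapter's exact setup.

```python
import numpy as np
from scipy.optimize import minimize

X = np.linspace(-1, 1, 100)
y = 3.0 * X ** 2 + 0.5                 # target data

def evaluate(consts, X):
    # fixed evolved structure a*x^2 + b; only the constants are tuned
    a, b = consts
    return a * X ** 2 + b

def local_search(consts, X, y):
    # local search as an extra operator: minimise training MSE over constants
    mse = lambda c: np.mean((evaluate(c, X) - y) ** 2)
    return minimize(mse, consts, method="Nelder-Mead").x

print(local_search(np.array([1.0, 0.0]), X, y))  # -> approximately [3.0, 0.5]
```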