5,991 research outputs found
Semantic variation operators for multidimensional genetic programming
Multidimensional genetic programming represents candidate solutions as sets
of programs, and thereby provides an interesting framework for exploiting
building block identification. Towards this goal, we investigate the use of
machine learning as a way to bias which components of programs are promoted,
and propose two semantic operators to choose where useful building blocks are
placed during crossover. A forward stagewise crossover operator we propose
leads to significant improvements on a set of regression problems, and produces
state-of-the-art results in a large benchmark study. We discuss this
architecture and others in terms of their propensity for allowing heuristic
search to utilize information during the evolutionary process. Finally, we look
at the collinearity and complexity of the data representations that result from
these architectures, with a view towards disentangling factors of variation in
application. Comment: 9 pages, 8 figures, GECCO 201
A Study of Geometric Semantic Genetic Programming with Linear Scaling
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
Machine Learning (ML) is a scientific discipline that endeavors to enable computers
to learn without the need for explicit programming. Evolutionary Algorithms (EAs),
a subset of ML algorithms, mimic Darwin’s Theory of Evolution by using natural
selection mechanisms (i.e., survival of the fittest) to evolve a group of individuals
(i.e., possible solutions to a given problem). Genetic Programming (GP) is the most
recent type of EA and it evolves computer programs (i.e., individuals) to map a set of
input data into known expected outputs. Geometric Semantic Genetic Programming
(GSGP) extends this concept by allowing individuals to evolve and vary in the semantic
space, where the output vectors are located, rather than being constrained by syntax-based
structures. Linear Scaling (LS) is a method introduced to ease GP's task of
searching for the best function matching a set of known data. GSGP
and LS have both, independently, shown the ability to outperform standard GP for
symbolic regression. GSGP uses Geometric Semantic Operators (GSOs), different
from the standard ones, without altering the fitness, while LS modifies the fitness
without altering the genetic operators. To the best of our knowledge, there has been
no prior utilization of the combined methodology of GSGP and LS for classification
problems. Furthermore, despite the fact that they have been used together in one
practical regression application, a methodological evaluation of the advantages and
disadvantages of integrating these methods for regression or classification problems
has never been performed. In this dissertation, a study of a system that integrates both
GSGP and LS (GSGP-LS) is presented. The performance of the proposed method, GSGP-LS,
was tested on six hand-tailored regression benchmarks, nine real-life regression
problems and three real-life classification problems. The obtained results indicate that
GSGP-LS outperforms GSGP in the majority of the cases, confirming the expected
benefit of this integration. However, for some particularly hard regression datasets,
GSGP-LS overfits training data, being outperformed by GSGP on unseen data. This
contradicts the idea that LS is always beneficial for GP, warning the practitioners about
its risk of overfitting in some specific cases.
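The linear scaling idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual least-squares formulation, in which a program's raw outputs are fitted to the targets with an intercept `a` and a slope `b` before the error is computed. The function names are hypothetical and are not taken from the dissertation.

```python
from statistics import fmean


def linear_scaling(outputs, targets):
    """Return (a, b) minimising the squared error between targets and a + b * outputs."""
    mean_o = fmean(outputs)
    mean_t = fmean(targets)
    var_o = sum((o - mean_o) ** 2 for o in outputs)
    if var_o == 0.0:
        # A constant program output carries no slope information:
        # the best fit is simply the mean of the targets.
        return mean_t, 0.0
    cov = sum((o - mean_o) * (t - mean_t) for o, t in zip(outputs, targets))
    b = cov / var_o
    a = mean_t - b * mean_o
    return a, b


def scaled_mse(outputs, targets):
    """Mean squared error after applying the best linear scaling (illustrative fitness)."""
    a, b = linear_scaling(outputs, targets)
    return fmean((t - (a + b * o)) ** 2 for o, t in zip(outputs, targets))


# A program whose outputs have the right "shape" but the wrong scale and offset
# gets zero scaled error, because the targets equal 3 + 2 * outputs.
a, b = linear_scaling([1.0, 2.0, 3.0], [5.0, 7.0, 9.0])
print(a, b, scaled_mse([1.0, 2.0, 3.0], [5.0, 7.0, 9.0]))
```

Because `a` and `b` are fitted on the training data only, the scaled error can look better on training data than on unseen data, which is consistent with the overfitting risk the abstract reports.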
Advanced Genetic Programming vs. State-of-the-Art AutoML in Imbalanced Binary Classification
The objective of this article is to provide a comparative analysis of two novel genetic programming (GP) techniques, differentiable Cartesian genetic programming for artificial neural networks (DCGPANN) and geometric semantic genetic programming (GSGP), with state-of-the-art automated machine learning (AutoML) tools, namely Auto-Keras, Auto-PyTorch and Auto-Sklearn. While all these techniques were compared to several baseline algorithms upon their introduction, research still lacks direct comparisons between them, especially of the GP approaches with state-of-the-art AutoML. This study intends to fill this gap in order to analyze the true potential of GP for AutoML. The performance of the different tools is assessed by applying them to 20 benchmark datasets from imbalanced binary classification, a frequent and challenging problem area. The tools are compared across four categories: average performance, maximum performance, standard deviation of performance, and generalization ability, using F1-score, G-mean, and AUC as evaluation metrics. The analysis finds that the GP techniques, while unable to completely outperform state-of-the-art AutoML, are already a very competitive alternative. These advanced GP tools therefore offer a new and promising approach for practitioners developing machine learning (ML) models. Doi: 10.28991/ESJ-2023-07-04-021
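Two of the metrics named in this abstract, F1-score and G-mean, can be computed directly from a confusion matrix. The sketch below uses their standard definitions; the function names and the toy labels are illustrative assumptions, not the article's evaluation code, and AUC is left out because it needs ranked scores rather than hard predictions.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(sensitivity * specificity)


# Toy imbalanced data: 2 positives among 10 samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
print(f1_score(y_true, always_negative), g_mean(y_true, always_negative))
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy data yet scores zero on both metrics, which is why such metrics are preferred over accuracy for imbalanced problems.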
Evolving Decision Rules with Geometric Semantic Genetic Programming
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
Due to the ever-increasing amount of data available in today’s world, a variety of
methods to harness this information are continuously being created, refined and
utilized, drawing inspiration from a multitude of sources. Relevant to this work are
Supervised Learning techniques, which attempt to discover the relationship between the
characteristics of data and a certain feature, to uncover the function that maps input
to output. Among these, Genetic Programming (GP) attempts to replicate the concept
of evolution as defined by Charles Darwin, mimicking natural selection and genetic
operators to generate and improve a population of solutions for a given prediction
problem.
Among the possible variants of GP, Geometric Semantic Genetic Programming
(GSGP) stands out, due to its focus on the meaning of each individual it creates, rather
than their structure. It achieves this by imagining a hypothetical and perfect model,
evaluating the performance of others by measuring how much their behaviour differs
from it, and using a set of genetic operators that have a specific effect on the individual’s
semantics (i.e., its predictions for training data), with the goal of reaching ever closer
to the so-called perfect specimen.
This thesis conceptualizes and evaluates the performance of a GSGP implementation
made specifically to deal with multi-class classification problems, using tree-based
individuals that are composed of a set of rules to allow the categorization of data. This
is achieved through the careful translation of GSGP’s theoretical foundation, first into
algorithms and then into an actual code library, able to tackle problems of this domain.
The results demonstrate that the implementation works successfully and respects the
properties of the original technique, allowing us to obtain excellent results on
training data, although performance on unseen data is slightly worse than that of
other state-of-the-art algorithms.
On the Hybridization of Geometric Semantic GP with Gradient-based Optimizers
Pietropolli, G., Manzoni, L., Paoletti, A., & Castelli, M. (2023). On the Hybridization of Geometric Semantic GP with Gradient-based Optimizers. Genetic Programming And Evolvable Machines, 24(2 Special Issue on Highlights of Genetic Programming 2022 Events), 1-20. [16]. https://doi.org/10.21203/rs.3.rs-2229748/v1, https://doi.org/10.1007/s10710-023-09463-1
Open access funding provided by Università degli Studi di Trieste within the CRUI-CARE Agreement. This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the Project UIDB/04152/2020, Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS.
Geometric semantic genetic programming (GSGP) is a popular form of GP where the effect of crossover and mutation can be expressed as geometric operations on a semantic space. A recent study showed that GSGP can be hybridized with a standard gradient-based optimizer, Adam, commonly used in training artificial neural networks. We expand upon that work by considering more gradient-based optimizers, a deeper investigation of their parameters, how the hybridization is performed, and a more comprehensive set of benchmark problems. With the correct choice of hyperparameters, this hybridization improves the performance of GSGP and allows it to reach the same fitness values with fewer fitness evaluations.
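The statement that crossover and mutation act as geometric operations on a semantic space can be made concrete with a small sketch. It works directly on semantic vectors (a program's outputs on the training cases) and assumes the commonly used real-valued operators: a convex combination of two parents for crossover, and a bounded perturbation built from two random programs for mutation. The function names, the single random constant `r`, and the toy vectors are illustrative assumptions. This is not the paper's implementation and it does not show the gradient-based hybridization.

```python
import random
from math import dist


def gs_crossover(sem_a, sem_b, rng):
    """Offspring semantics as a convex combination of the two parents' semantics."""
    r = rng.random()  # assumed: one random constant in [0, 1) shared by all cases
    return [r * a + (1.0 - r) * b for a, b in zip(sem_a, sem_b)]


def gs_mutation(sem, rand_sem_1, rand_sem_2, step):
    """Perturb semantics by step * (r1 - r2), with r1 and r2 from two random programs."""
    return [s + step * (r1 - r2) for s, r1, r2 in zip(sem, rand_sem_1, rand_sem_2)]


rng = random.Random(0)
target = [1.0, 2.0, 3.0]      # desired outputs on three training cases
parent_1 = [0.0, 0.0, 0.0]
parent_2 = [2.0, 5.0, 3.0]
child = gs_crossover(parent_1, parent_2, rng)

# The child lies on the segment between the parents, so it cannot be
# farther from the target than the worse of the two parents.
worst_parent = max(dist(parent_1, target), dist(parent_2, target))
print(dist(child, target) <= worst_parent)
```

Since every crossover offspring lies on the segment between its parents in semantic space, its error is bounded by that of the worse parent, which is the geometric property these operators rely on.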
Improving Tree-based Pipeline Optimization Tool with Geometric Semantic Genetic Programming
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
Machine Learning (ML) is becoming part of our lives, from face recognition to the sensors of the latest cars. However, building ML pipelines is a time-consuming and expensive process, even for experts with knowledge of ML algorithms, because of the many options available at each step. To overcome this issue, Automated ML (AutoML) was introduced, automating some steps of this process. One of its recent algorithms is the Tree-Based Pipeline Optimization Tool (TPOT), an Evolutionary Algorithm (EA) that automatically designs and optimizes ML pipelines using Genetic Programming (GP). Another recent algorithm is Geometric Semantic Genetic Programming (GSGP), an EA characterized by its use of semantics (the vector of a program's outputs on the training data) and by searching directly in the semantic space of programs through geometric semantic operators, which leads to a unimodal fitness landscape. In this work, a new version of TPOT was created, called TPOT-GSGP, where GSGP is one of the options for model selection. This new algorithm was implemented in Python, for regression problems only, using Negative Mean Absolute Error as the error measure. Five case studies were used to compare the performance of three algorithms: TPOT-GSGP, the original TPOT, and GSGP. Additionally, the statistical significance of the difference in the last generation’s score for each pair of algorithms was checked with Wilcoxon tests. No single algorithm outperformed the others on all datasets: sometimes TPOT-GSGP was best and sometimes TPOT, depending on the case study and on the score analysed (learning or test). It was concluded that whenever GSGP is chosen as root 50% of the time or more, TPOT-GSGP outperformed TPOT on the test set.
Therefore, the advantages of this new algorithm could be extraordinary given further development and adjustment in future work.
Ensemble learning with GSGP
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
The purpose of this thesis is to conduct comparative research between Genetic Programming
(GP) and Geometric Semantic Genetic Programming (GSGP), with different
initialization (RHH and EDDA) and selection (Tournament and Epsilon-Lexicase)
strategies, in the context of a model-ensemble in order to solve regression optimization
problems.
A model-ensemble is a combination of base learners used in different ways to solve
a problem. The most common ensemble is the mean, where the base learners are combined
in a linear fashion, all having the same weights. However, more sophisticated
ensembles can be inferred, providing higher generalization ability.
GSGP is a variant of GP using different genetic operators. No previous research has
been conducted to see if GSGP can perform better than GP in model-ensemble learning.
The evolutionary process of GP and GSGP should allow us to learn about the strength
of each of those base models to provide a more accurate and robust solution. The
base-models used for this analysis were Linear Regression, Random Forest, Support
Vector Machine and Multi-Layer Perceptron. This analysis has been conducted using 7
different optimization problems and 4 real-world datasets. The results obtained with
GSGP are statistically significantly better than GP for most cases.
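The mean ensemble described in this abstract, where base learners are combined linearly with equal weights, can be sketched as follows. The weighted variant is included only as the simplest example of a more flexible combination. In the thesis the combination is evolved by GP or GSGP, which this sketch does not reproduce, and the function names and toy predictions are hypothetical.

```python
def mean_ensemble(predictions):
    """Average the predictions of all base learners with equal weights.

    predictions: one list of predicted values per base learner.
    """
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]


def weighted_ensemble(predictions, weights):
    """Linear combination of base learners with normalised, unequal weights."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, column)) / total
        for column in zip(*predictions)
    ]


# Toy predictions from two base learners on two samples.
preds = [[1.0, 2.0], [3.0, 4.0]]
print(mean_ensemble(preds))                  # equal weights
print(weighted_ensemble(preds, [1.0, 3.0]))  # second learner trusted more
```

With equal weights the weighted form reduces to the mean ensemble, so the mean is a special case of the linear combination.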
Credit scoring using genetic programming
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Growing numbers of e-commerce orders increase the need for risk management to prevent default in payment. Default in payment is the failure of a customer to settle a bill within 90 days of receipt. Credit scoring is frequently employed to estimate customers’ default probability. It has been widely studied, and many different methods from different fields of research have been proposed.
The primary aim of this work is to develop a credit scoring model as a replacement for the pre risk check of the e-commerce risk management system risk solution services (rss). The pre risk check uses data from the order process and includes exclusion rules and a generic credit scoring model. The new model is intended to replace the whole pre risk check and has to work both on its own and together with the rss main risk check. An application of Genetic Programming to credit scoring is presented. The model is developed on a real-world data set provided by Arvato Financial Solutions. The data set contains order requests processed by rss. Results show that Genetic Programming outperforms the generic credit scoring model of the pre risk check in both classification accuracy and profit. Compared with Logistic Regression, Support Vector Machines and Boosted Trees, Genetic Programming achieved a similar classification accuracy. Furthermore, the Genetic Programming model can be used in combination with the rss main risk check to create a model with higher discriminatory power than either of its individual models.