A Multi-Gene Genetic Programming Application for Predicting Students Failure at School
Despite several efforts, accurately predicting the student failure rate (SFR) at
school remains a core problem faced by many in the educational sector. The
procedures for forecasting SFR are rigid and often require data
scaling or conversion into binary form, as is the case with the logistic
model, which may lead to loss of information and effect size attenuation. Also,
the high number of factors, incomplete and unbalanced datasets, and black-box
issues, as in Artificial Neural Networks and fuzzy logic systems, expose the
need for more efficient tools. Currently, the application of Genetic Programming
(GP) holds great promise and has produced strongly positive results in
different sectors. In this regard, this study developed GPSFARPS, a software
application that provides a robust solution to the prediction of SFR using an
evolutionary algorithm known as multi-gene genetic programming. The approach is
validated by feeding a testing data set to the evolved GP models. Results
obtained from GPSFARPS simulations show its ability to evolve a suitable
failure rate expression with fast convergence at 30 generations from a
maximum specified generation of 500. The multi-gene system was also able to
minimize the evolved model expression and accurately predict student failure
rate using a subset of the original expression.
Comment: 14 pages, 9 figures, Journal paper. arXiv admin note: text overlap
with arXiv:1403.0623 by other author
Genetic Programming for Object Detection : a Two-Phase Approach with an Improved Fitness Function
This paper describes two innovations that improve the efficiency and effectiveness of a genetic programming approach to object detection problems. The approach uses genetic programming to construct object detection programs that are applied, in a moving window fashion, to the large images to locate the objects of interest. The first innovation is to break the GP search into two phases with the first phase applied to a selected subset of the training data, and a simplified fitness function. The second phase is initialised with the programs from the first phase, and uses the full set of training data with a complete fitness function to construct the final detection programs. The second innovation is to add a program size component to the fitness function. This approach is examined and compared with a neural network approach on three object detection problems of increasing difficulty. The results suggest that the innovations increase both the effectiveness and the efficiency of the genetic programming search, and also that the genetic programming approach outperforms a neural network approach for the most difficult data set in terms of the object detection accuracy
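As a rough illustration of the second innovation, the sketch below adds a program-size term to a detection-error fitness. The function names, the weight `size_weight`, and the nested-tuple program encoding are assumptions made for this example; the abstract does not give the paper's actual fitness function.

```python
def program_size(program):
    """Count nodes in a program encoded as nested tuples: (op, child, ...)."""
    if not isinstance(program, tuple):
        return 1  # a terminal (feature name or constant)
    return 1 + sum(program_size(child) for child in program[1:])


def fitness(detection_error, program, size_weight=0.001):
    """Lower is better: detection error plus a small penalty per node.

    size_weight is a hypothetical tuning parameter, not a value from the paper.
    """
    return detection_error + size_weight * program_size(program)


# Two hypothetical programs with the same detection error.
small = ("add", "f1", "f2")
large = ("add", ("mul", "f1", "f1"), ("sub", "f2", ("mul", "f3", 2.0)))

# With equal error, the smaller program gets the better (lower) fitness.
print(fitness(0.10, small) < fitness(0.10, large))
```

With a penalty of this kind, two programs that detect objects equally well are ranked by size, which is one common way to discourage needlessly large programs.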
Automated retrieval and extraction of training course information from unstructured web pages
Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages and there is much need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. There are some commercial, automated web extraction software packages; however, their success comes from heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages and manually dealing with the evaluation and storage of the extracted results.
This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; after much consideration, Naïve Bayes Networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) for the generation of web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web, such as course names, prices, dates and locations.
The experimental results indicate that all three aspects of this research perform very well, with the Web Crawler outperforming existing crawling systems, the Web Classifier performing with an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieving an accuracy of over 94% for the extraction of course titles and an accuracy of just under 67% for the extraction of other course attributes such as dates, prices and locations. Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, often no, human assistance
A Field Guide to Genetic Programming
xiv, 233 p. : ill. ; 23 cm. Electronic book.
A Field Guide to Genetic Programming (ISBN 978-1-4092-0073-4) is an introduction to genetic programming (GP). GP is a systematic, domain-independent method for getting computers to solve problems automatically starting from a high-level statement of what needs to be done. Using ideas from natural evolution, GP starts from an ooze of random computer programs, and progressively refines them through processes of mutation and sexual recombination, until solutions emerge. All this without the user having to know or specify the form or structure of solutions in advance. GP has generated a plethora of human-competitive results and applications, including novel scientific discoveries and patentable inventions.
Introduction --
Representation, initialisation and operators in Tree-based GP --
Getting ready to run genetic programming --
Example genetic programming run --
Alternative initialisations and operators in Tree-based GP --
Modular, grammatical and developmental Tree-based GP --
Linear and graph genetic programming --
Probabilistic genetic programming --
Multi-objective genetic programming --
Fast and distributed genetic programming --
GP theory and its applications --
Applications --
Troubleshooting GP --
Conclusions.
Contents
1 Introduction
1.1 Genetic Programming in a Nutshell
1.2 Getting Started
1.3 Prerequisites
1.4 Overview of this Field Guide
I Basics
2 Representation, Initialisation and Operators in Tree-based GP
2.1 Representation
2.2 Initialising the Population
2.3 Selection
2.4 Recombination and Mutation
3 Getting Ready to Run Genetic Programming
3.1 Step 1: Terminal Set
3.2 Step 2: Function Set
3.2.1 Closure
3.2.2 Sufficiency
3.2.3 Evolving Structures other than Programs
3.3 Step 3: Fitness Function
3.4 Step 4: GP Parameters
3.5 Step 5: Termination and solution designation
4 Example Genetic Programming Run
4.1 Preparatory Steps
4.2 Step-by-Step Sample Run
4.2.1 Initialisation
4.2.2 Fitness Evaluation
4.2.3 Selection, Crossover and Mutation
4.2.4 Termination and Solution Designation
II Advanced Genetic Programming
5 Alternative Initialisations and Operators in Tree-based GP
5.1 Constructing the Initial Population
5.1.1 Uniform Initialisation
5.1.2 Initialisation may Affect Bloat
5.1.3 Seeding
5.2 GP Mutation
5.2.1 Is Mutation Necessary?
5.2.2 Mutation Cookbook
5.3 GP Crossover
5.4 Other Techniques
6 Modular, Grammatical and Developmental Tree-based GP
6.1 Evolving Modular and Hierarchical Structures
6.1.1 Automatically Defined Functions
6.1.2 Program Architecture and Architecture-Altering
6.2 Constraining Structures
6.2.1 Enforcing Particular Structures
6.2.2 Strongly Typed GP
6.2.3 Grammar-based Constraints
6.2.4 Constraints and Bias
6.3 Developmental Genetic Programming
6.4 Strongly Typed Autoconstructive GP with PushGP
7 Linear and Graph Genetic Programming
7.1 Linear Genetic Programming
7.1.1 Motivations
7.1.2 Linear GP Representations
7.1.3 Linear GP Operators
7.2 Graph-Based Genetic Programming
7.2.1 Parallel Distributed GP (PDGP)
7.2.2 PADO
7.2.3 Cartesian GP
7.2.4 Evolving Parallel Programs using Indirect Encodings
8 Probabilistic Genetic Programming
8.1 Estimation of Distribution Algorithms
8.2 Pure EDA GP
8.3 Mixing Grammars and Probabilities
9 Multi-objective Genetic Programming
9.1 Combining Multiple Objectives into a Scalar Fitness Function
9.2 Keeping the Objectives Separate
9.2.1 Multi-objective Bloat and Complexity Control
9.2.2 Other Objectives
9.2.3 Non-Pareto Criteria
9.3 Multiple Objectives via Dynamic and Staged Fitness Functions
9.4 Multi-objective Optimisation via Operator Bias
10 Fast and Distributed Genetic Programming
10.1 Reducing Fitness Evaluations/Increasing their Effectiveness
10.2 Reducing Cost of Fitness with Caches
10.3 Parallel and Distributed GP are Not Equivalent
10.4 Running GP on Parallel Hardware
10.4.1 Master–slave GP
10.4.2 GP Running on GPUs
10.4.3 GP on FPGAs
10.4.4 Sub-machine-code GP
10.5 Geographically Distributed GP
11 GP Theory and its Applications
11.1 Mathematical Models
11.2 Search Spaces
11.3 Bloat
11.3.1 Bloat in Theory
11.3.2 Bloat Control in Practice
III Practical Genetic Programming
12 Applications
12.1 Where GP has Done Well
12.2 Curve Fitting, Data Modelling and Symbolic Regression
12.3 Human Competitive Results – the Humies
12.4 Image and Signal Processing
12.5 Financial Trading, Time Series, and Economic Modelling
12.6 Industrial Process Control
12.7 Medicine, Biology and Bioinformatics
12.8 GP to Create Searchers and Solvers – Hyper-heuristics
12.9 Entertainment and Computer Games
12.10 The Arts
12.11 Compression
13 Troubleshooting GP
13.1 Is there a Bug in the Code?
13.2 Can you Trust your Results?
13.3 There are No Silver Bullets
13.4 Small Changes can have Big Effects
13.5 Big Changes can have No Effect
13.6 Study your Populations
13.7 Encourage Diversity
13.8 Embrace Approximation
13.9 Control Bloat
13.10 Checkpoint Results
13.11 Report Well
13.12 Convince your Customers
14 Conclusions
IV Tricks of the Trade
A Resources
A.1 Key Books
A.2 Key Journals
A.3 Key International Meetings
A.4 GP Implementations
A.5 On-Line Resources
B TinyGP
B.1 Overview of TinyGP
B.2 Input Data Files for TinyGP
B.3 Source Code
B.4 Compiling and Running TinyGP
Bibliography
Index
A Survey on Evolutionary Computation for Computer Vision and Image Analysis: Past, Present, and Future Trends
Computer vision (CV) is a big and important field
in artificial intelligence covering a wide range of applications.
Image analysis is a major task in CV aiming to extract, analyse
and understand the visual content of images. However, image-related
tasks are very challenging due to many factors, e.g., high
variations across images, high dimensionality, domain expertise
requirement, and image distortions. Evolutionary computation
(EC) approaches have been widely used for image analysis with
significant achievement. However, there is no comprehensive
survey of existing EC approaches to image analysis. To fill
this gap, this paper provides a comprehensive survey covering
all essential EC approaches to important image analysis tasks
including edge detection, image segmentation, image feature
analysis, image classification, object detection, and others. This
survey aims to provide a better understanding of evolutionary
computer vision (ECV) by discussing the contributions of different
approaches and exploring how and why EC is used for
CV and image analysis. The applications, challenges, issues, and
trends associated with this research field are also discussed and
summarised to provide further guidelines and opportunities for
future research
Optimization of feature learning through grammar-guided genetic programming
Master's thesis, Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências
Machine Learning (ML) is becoming more prominent in daily life. A key aspect in ML is Feature Engineering (FE), which can entail a long and tedious process. Therefore, the automation of FE, known as
Feature Learning (FL), can be highly rewarding. FL methods need not only have high prediction performance, but should also produce interpretable methods. Many current high-performance ML methods
that can be considered FL methods, such as Neural Networks and PCA, lack interpretability.
A popular ML method used for FL that produces interpretable models is Genetic Programming (GP), with
multiple successful applications and methods like M3GP. In this thesis, I present two new GP-based FL
methods, namely M3GP with Domain Knowledge (DK-M3GP) and DK-M3GP with feature Aggregation
(DKA-M3GP). Both use grammars to enhance the search process of GP, in a method called Grammar-Guided GP (GGGP). DK-M3GP uses grammars to incorporate domain knowledge in the search process.
In particular, I use DK-M3GP to define what solutions are humanly valid, in this case by disallowing
operating arithmetically on categorical features. For example, the multiplication of the postal code of an
individual with their wage is not deemed sensible and thus disallowed.
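To make the postal-code example concrete, here is a minimal sketch of the kind of constraint described. It is a plain type check, not the grammar machinery of Genetic Engine or DK-M3GP; the `Feature` class, the `kind` labels, and the helper names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    kind: str  # "numeric" or "categorical" (labels assumed for this sketch)


def can_multiply(left, right):
    """Allow arithmetic only when both operands are numeric."""
    return left.kind == "numeric" and right.kind == "numeric"


def valid_products(features):
    """List the feature pairs whose product would be considered sensible."""
    return [
        (a.name, b.name)
        for i, a in enumerate(features)
        for b in features[i + 1:]
        if can_multiply(a, b)
    ]


features = [
    Feature("postal_code", "categorical"),
    Feature("wage", "numeric"),
    Feature("age", "numeric"),
]

# postal_code * wage is rejected; wage * age is allowed.
print(valid_products(features))
```

Restricting the candidate set in this way shrinks the search space to combinations a human would accept as meaningful.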
In DKA-M3GP, I use grammars to include a feature aggregation method in the search space. This
method can be used for time series and panel datasets, to aggregate the target value of historic data based
on a known feature value of a new data point. For example, if I want to predict the number of bikes seen
daily in a city, it is interesting to know how many were seen on average in the last week. Furthermore,
DKA-M3GP allows for filtering the aggregation based on some other feature value. For example, we can
include the average number of bikes seen on past Sundays.
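The bike example can be sketched as a small aggregation function. This is an illustrative reading of the idea, not the DKA-M3GP implementation; the function name, the `(date, count)` record layout, and the sample counts are made up for the example.

```python
from datetime import date
from statistics import mean


def aggregate_target(history, new_day, same_weekday=False):
    """Mean of target values observed before new_day.

    history is a list of (date, count) pairs. If same_weekday is True,
    only past days falling on the same weekday as new_day are used.
    Returns None when no past observation qualifies.
    """
    past = [
        count
        for day, count in history
        if day < new_day
        and (not same_weekday or day.weekday() == new_day.weekday())
    ]
    return mean(past) if past else None


# Hypothetical daily bike counts; 1, 8 and 15 May 2022 are Sundays.
history = [
    (date(2022, 5, 1), 100),
    (date(2022, 5, 4), 60),
    (date(2022, 5, 8), 120),
    (date(2022, 5, 11), 80),
]

new_day = date(2022, 5, 15)
print(aggregate_target(history, new_day))                     # all past days
print(aggregate_target(history, new_day, same_weekday=True))  # past Sundays only
```

Filtering on `day < new_day` keeps the aggregate limited to historic data, so the new data point's own target value never leaks into its feature.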
I evaluated my FL methods for two ML problems in two environments. First, I evaluate the independent FL process, and, after that, I evaluate the FL steps within four ML pipelines. Independently,
DK-M3GP shows a two-fold advantage over normal M3GP: better interpretability in general, and higher
prediction performance for one problem. DKA-M3GP has a much better prediction performance than
M3GP for one problem, and a slightly better one for the other. Furthermore, within the ML pipelines it
performed well in one of two problems. Overall, my methods show potential for FL.
Both methods are implemented in Genetic Engine, an individual-representation-independent GGGP
framework, created as part of this thesis. Genetic Engine is completely implemented in Python and shows
competing performance with the mature GGGP framework PonyGE2.
Artificial Intelligence (AI) and its subfield Machine Learning (ML) are becoming more important to our lives with each passing day. Both areas are present in everyday life in applications such as automatic speech recognition, self-driving cars, and image recognition and object detection. ML has been applied successfully in many areas, such as healthcare, finance and marketing.
In a supervised setting, ML models are trained on data and later used to predict the behaviour of future data. The combination of steps carried out to build a fully trained and evaluated ML model is called an ML pipeline, or simply a pipeline. All pipelines follow mandatory steps, namely data retrieval, cleaning and manipulation, feature selection and construction, model selection and parameter optimisation, and finally model evaluation. Building ML pipelines is a challenging task, with specifics that depend on the problem domain. There are challenges on the design side, in hyperparameter optimisation, and on the implementation side.
When designing pipelines, choices must be made about which components to use and in what order. Even for ML experts, designing pipelines is a tedious task. Design choices require ML experience and knowledge of the problem domain, which makes pipeline construction a resource-intensive process.
After the pipeline is designed, its parameters must be optimised to improve its performance. Parameter optimisation generally requires running and evaluating the pipeline sequentially, which is costly. On the implementation side, programmers may introduce bugs during development. These bugs can cost time and money to fix and, if undetected, can compromise the robustness and correctness of the model or introduce performance problems. To address these design and implementation problems, a new line of research called AutoML (Automated Machine Learning) has emerged. AutoML aims to automate the design of ML pipelines, the optimisation of their parameters, and their implementation. An important part of ML pipelines is how the features of the data are manipulated. Data manipulation has many aspects, gathered under the umbrella term Feature Engineering (FE). In short, FE aims to improve the quality of the solution space by selecting the most important features and constructing new relevant ones. However, this process consumes many resources, so automating it is a highly rewarding sub-area of AutoML. In this thesis, I define Feature Learning (FL) as the area of automated FE.
An important metric for FE, and therefore for FL, is the interpretability of the learned features. Interpretability, which falls under Explainable AI (XAI), refers to how easy it is to understand the meaning of a feature. Several AI scandals, such as racist and sexist models, have led the European Union to propose legislation on models that lack interpretability. Many classical, and therefore widely used, methods lack interpretability, giving rise to the renewed interest in XAI. Current FL research treats existing feature values without relating them to their semantic meaning. For example, engineering a feature that represents the multiplication of a person's postal code by their age is not a logical use of the postal code. Although postal codes can be represented as integers, they should be treated as categorical values. Preventing this kind of interaction between features improves pipeline performance, since it reduces the search space of possible features to those that make semantic sense. In addition, this process yields features that are intrinsically interpretable. In this way, knowledge of the problem domain prevents the engineering of meaningless features during the FE process.
Another aspect of FL not usually considered in existing methods is the aggregation of the values of a single feature across several data entities. For example, consider a dataset on credit card fraud. The average amount of a card's previous transactions is a potentially interesting feature to include, as it conveys what a 'normal' transaction looks like. However, this generally cannot be inferred directly by existing FL methods. I refer to this FL method as entity aggregation, or simply aggregation.
Finally, despite the unpredictable nature of real-life datasets, existing methods mostly require features with homogeneous data. This forces data scientists to preprocess the dataset. Often this means transforming categories into integers or applying some kind of encoding, such as one-hot encoding. However, as discussed above, this can reduce the interpretability and performance of the pipeline.
Genetic Programming (GP), an ML method, is also used for FL and allows the creation of models that are more interpretable than most traditional methods. GP is a search-based method that evolves programs or, in the case of FL, mappings between feature spaces. Existing GP-based FL methods do not incorporate the three aspects mentioned above: domain knowledge, aggregation, and support for heterogeneous data types. Some approaches incorporate parts of these aspects, mainly by using grammars to guide the search process. The goal of this work is to explore whether GP can use grammars to improve the quality of FL, in terms of either predictive performance or interpretability. First, we built Genetic Engine, a Grammar-Guided GP (GGGP) framework. Genetic Engine is an easy-to-use GGGP framework that allows complex grammars to be expressed. We show that Genetic Engine performs well when compared with the state-of-the-art Python framework, PonyGE2.
Second, I propose two new GGGP-based FL methods implemented in Genetic Engine. Both methods extend M3GP, the state-of-the-art GP-based FL method. The first incorporates domain knowledge and is called M3GP with Domain Knowledge (DK-M3GP). It restricts the behaviour of features, allowing only sensible interactions, by means of conditions and declarations. The second method extends DK-M3GP by introducing aggregation into the search space, and is called DK-M3GP with Aggregation (DKA-M3GP). DKA-M3GP makes full use of Genetic Engine's ease of implementation, since it requires implementing a complex grammar.
In this work, DK-M3GP and DKA-M3GP were evaluated against traditional GP, M3GP and numerous classical FL methods on two ML problems. The new approaches were evaluated both as standalone FL methods and as part of a larger pipeline. As standalone FL methods, both show good predictive performance on at least one of the two problems. As part of the pipeline, the methods show little advantage over classical methods in predictive performance. After analysing the results, one possible explanation lies in the FL methods overfitting to the fitness function and to the training dataset.
In this work, I also discuss the improvement in interpretability after incorporating domain knowledge into the search process. A preliminary evaluation of DK-M3GP indicates that, using the Expression Size (ES) complexity measure, an improvement in interpretability can be obtained. However, I also found that the complexity measure used may not be the most suitable, because of the tree-shaped structure of the features constructed by DK-M3GP, which drives up ES. I consider that a more complex interpretability evaluation method should point this out
Pattern Learning for Detecting Defect Reports and Improvement Requests in App Reviews
Online reviews are an important source of feedback for understanding
customers. In this study, we follow novel approaches that target the absence
of actionable insights by classifying reviews as defect reports and requests
for improvement. Unlike traditional classification methods based on expert
rules, we reduce the manual labour by employing a supervised system that is
capable of learning lexico-semantic patterns through genetic programming.
Additionally, we experiment with a distantly-supervised SVM that makes use of
noisy labels generated by patterns. Using a real-world dataset of app reviews,
we show that the automatically learned patterns outperform the manually created
ones. Also, the distantly-supervised SVM models are not far
behind the pattern-based solutions, showing the usefulness of this approach
when the amount of annotated data is limited.
Comment: Accepted for publication in the 25th International Conference on
Natural Language & Information Systems (NLDB 2020), DFKI Saarbrücken,
Germany, June 24-26, 2020