5,618 research outputs found

    A Multi-Gene Genetic Programming Application for Predicting Students Failure at School

    Full text link
    Accurately predicting student failure rate (SFR) at school remains a core problem for many in the educational sector. Existing procedures for forecasting SFR are rigid and often require data scaling or conversion into binary form, as in the logistic model, which may lead to loss of information and effect-size attenuation. The high number of factors, incomplete and unbalanced datasets, and the black-box nature of Artificial Neural Networks and fuzzy logic systems further expose the need for more efficient tools. The application of Genetic Programming (GP) currently holds great promise and has produced strong results in different sectors. In this regard, this study developed GPSFARPS, a software application that provides a robust solution to the prediction of SFR using an evolutionary algorithm known as multi-gene genetic programming. The approach is validated by feeding a testing data set to the evolved GP models. Results obtained from GPSFARPS simulations show its ability to evolve a suitable failure rate expression with fast convergence at 30 generations out of a specified maximum of 500. The multi-gene system was also able to minimise the evolved model expression and accurately predict student failure rate using a subset of the original expression. Comment: 14 pages, 9 figures, journal paper. arXiv admin note: text overlap with arXiv:1403.0623 by other authors.
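    GPSFARPS itself is not publicly available, but the multi-gene idea described above — an individual made of several evolved gene expressions whose outputs are combined into a single prediction — can be sketched in a few lines. In the sketch below, the gene expressions, the synthetic data, and the least-squares weighting are illustrative assumptions rather than the authors' implementation; a real run would evolve the gene trees instead of hard-coding them.

```python
import numpy as np

# Minimal sketch of multi-gene GP model evaluation (illustrative, not GPSFARPS):
# each individual holds several gene expressions; the model output is a
# least-squares-weighted sum of the gene outputs plus a bias term.

def gene_outputs(genes, X):
    """Evaluate every gene expression on the feature matrix X, plus a bias column."""
    return np.column_stack([np.ones(len(X))] + [g(X) for g in genes])

def fit_gene_weights(genes, X, y):
    """Solve for the linear combination of gene outputs that best fits y."""
    w, *_ = np.linalg.lstsq(gene_outputs(genes, X), y, rcond=None)
    return w

def rmse(genes, w, X, y):
    """Fitness used for selection: root-mean-square error on training data."""
    pred = gene_outputs(genes, X) @ w
    return np.sqrt(np.mean((pred - y) ** 2))

# Toy usage with hand-written "genes"; a real GP run evolves these expressions.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 2))            # e.g. attendance, test score
y = 0.1 + 0.6 * X[:, 0] + 0.3 * X[:, 0] * X[:, 1]   # synthetic failure rate
genes = [lambda X: X[:, 0], lambda X: X[:, 0] * X[:, 1]]
w = fit_gene_weights(genes, X, y)
print("weights:", w, " train RMSE:", rmse(genes, w, X, y))
```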

    Genetic Programming for Object Detection: A Two-Phase Approach with an Improved Fitness Function

    Get PDF
    This paper describes two innovations that improve the efficiency and effectiveness of a genetic programming approach to object detection problems. The approach uses genetic programming to construct object detection programs that are applied, in a moving-window fashion, to large images to locate the objects of interest. The first innovation is to break the GP search into two phases, with the first phase applied to a selected subset of the training data using a simplified fitness function. The second phase is initialised with the programs from the first phase and uses the full set of training data with a complete fitness function to construct the final detection programs. The second innovation is to add a program-size component to the fitness function. This approach is examined and compared with a neural network approach on three object detection problems of increasing difficulty. The results suggest that the innovations increase both the effectiveness and the efficiency of the genetic programming search, and that the genetic programming approach outperforms the neural network approach on the most difficult data set in terms of object detection accuracy.
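    The second innovation, a program-size component in the fitness function, amounts to adding a parsimony penalty to the detection error. The sketch below illustrates that idea only; the weights, the error terms, and the function name are assumptions, not the authors' exact formulation.

```python
# Illustrative fitness with a program-size (parsimony) component; the weights
# and error terms are assumptions, not the paper's exact formulation.

def detection_fitness(false_alarm_rate, miss_rate, program_size,
                      w_fa=0.5, w_miss=0.5, w_size=0.001):
    """Lower is better: weighted detection errors plus a size penalty."""
    return w_fa * false_alarm_rate + w_miss * miss_rate + w_size * program_size

# Two programs with identical detection errors: the smaller program gets the
# better (lower) fitness, which discourages bloat during the GP search.
print(detection_fitness(0.10, 0.05, program_size=40))   # ~0.115
print(detection_fitness(0.10, 0.05, program_size=200))  # ~0.275
```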

    Automated retrieval and extraction of training course information from unstructured web pages

    Get PDF
    Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages, and there is much need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. There are some commercial, automated web extraction software packages; however, their success depends on heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages, and manually dealing with the evaluation and storage of the extracted results. This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; after much consideration, Naïve Bayes networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) for the generation of web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web, such as course names, prices, dates and locations. The experimental results indicate that all three aspects of this research perform very well, with the Web Crawler outperforming existing crawling systems, the Web Classifier performing with an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieving an accuracy of over 94% for the extraction of course titles and an accuracy of just under 67% for the extraction of other course attributes such as dates, prices and locations. Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, often no, human assistance.
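    As a rough illustration of the extraction step, the sketch below scores a candidate regular expression — the kind of individual the GP component would evolve — against a handful of labelled page snippets. The pattern, the snippets, and the fitness definition are illustrative assumptions, not taken from the thesis.

```python
import re

# Scoring a candidate regular expression for extracting course prices.
# Malformed individuals (invalid regexes) simply receive zero fitness.

def regex_fitness(pattern, labelled_pages):
    """Fraction of pages where the first match equals the expected value."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return 0.0
    hits = 0
    for text, expected in labelled_pages:
        match = compiled.search(text)
        if match and match.group(0) == expected:
            hits += 1
    return hits / len(labelled_pages)

labelled_pages = [
    ("Intro to Python - price: £350, starts 3 May", "£350"),
    ("Advanced Excel (2 days) £1,200 inc. VAT", "£1,200"),
]
candidate = r"£[\d,]+"
print(regex_fitness(candidate, labelled_pages))   # 1.0 on this toy data
```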

    A Field Guide to Genetic Programming

    Get PDF
    xiv, 233 p. : ill. ; 23 cm. Electronic book. A Field Guide to Genetic Programming (ISBN 978-1-4092-0073-4) is an introduction to genetic programming (GP). GP is a systematic, domain-independent method for getting computers to solve problems automatically, starting from a high-level statement of what needs to be done. Using ideas from natural evolution, GP starts from an ooze of random computer programs and progressively refines them through processes of mutation and sexual recombination until solutions emerge, all without the user having to know or specify the form or structure of solutions in advance. GP has generated a plethora of human-competitive results and applications, including novel scientific discoveries and patentable inventions.
Contents: Introduction -- Representation, initialisation and operators in tree-based GP -- Getting ready to run genetic programming -- Example genetic programming run -- Alternative initialisations and operators in tree-based GP -- Modular, grammatical and developmental tree-based GP -- Linear and graph genetic programming -- Probabilistic genetic programming -- Multi-objective genetic programming -- Fast and distributed genetic programming -- GP theory and its applications -- Applications -- Troubleshooting GP -- Conclusions -- Appendix A: Resources -- Appendix B: TinyGP -- Bibliography -- Index.
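    The nutshell description above boils down to a short evolutionary loop over program trees. The following is a deliberately tiny sketch of that loop for a toy symbolic-regression target, written in the spirit of the book's TinyGP appendix rather than reproducing its actual source code; the function set, target function, and operator choices are illustrative assumptions.

```python
import random

# Tiny tree-based GP sketch: programs are nested tuples, fitness is total
# absolute error on a toy symbolic-regression target, and evolution uses
# tournament selection with simple subtree crossover and mutation.

FUNCS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
         "mul": lambda a, b: a * b}
TERMS = ["x", 1.0, 2.0]

def random_tree(depth=3):
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    return (random.choice(list(FUNCS)),
            random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    f, left, right = tree
    return FUNCS[f](evaluate(left, x), evaluate(right, x))

def fitness(tree):
    # Target function: x**2 + x; lower total error is better.
    return sum(abs(evaluate(tree, x) - (x * x + x)) for x in range(-5, 6))

def mutate(tree):
    # "Headless chicken" mutation: replace the individual with a fresh subtree.
    return random_tree(depth=2)

def crossover(a, b):
    # Minimal subtree crossover: graft a random child of b into the root of a.
    if not isinstance(a, tuple) or not isinstance(b, tuple):
        return b
    f, left, right = a
    donor = random.choice([b[1], b[2]])
    return (f, left, donor) if random.random() < 0.5 else (f, donor, right)

def tournament(pop, k=3):
    return min(random.sample(pop, k), key=fitness)

random.seed(1)
pop = [random_tree() for _ in range(60)]
for _ in range(20):
    pop = [mutate(tournament(pop)) if random.random() < 0.1
           else crossover(tournament(pop), tournament(pop))
           for _ in range(len(pop))]
best = min(pop, key=fitness)
print("best error:", fitness(best), "best tree:", best)
```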

    A Survey on Evolutionary Computation for Computer Vision and Image Analysis: Past, Present, and Future Trends

    Get PDF
    Computer vision (CV) is a large and important field in artificial intelligence covering a wide range of applications. Image analysis is a major task in CV, aiming to extract, analyse and understand the visual content of images. However, image-related tasks are very challenging due to many factors, e.g., high variations across images, high dimensionality, domain expertise requirements, and image distortions. Evolutionary computation (EC) approaches have been widely used for image analysis with significant achievements. However, there is no comprehensive survey of existing EC approaches to image analysis. To fill this gap, this paper provides a comprehensive survey covering all essential EC approaches to important image analysis tasks, including edge detection, image segmentation, image feature analysis, image classification, object detection, and others. This survey aims to provide a better understanding of evolutionary computer vision (ECV) by discussing the contributions of different approaches and exploring how and why EC is used for CV and image analysis. The applications, challenges, issues, and trends associated with this research field are also discussed and summarised to provide further guidelines and opportunities for future research.

    Optimization of feature learning through grammar-guided genetic programming

    Get PDF
    Master's thesis, Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências. Machine Learning (ML) is becoming more prominent in daily life. A key aspect of ML is Feature Engineering (FE), which can entail a long and tedious process. Therefore, the automation of FE, known as Feature Learning (FL), can be highly rewarding. FL methods need not only have high prediction performance, but should also produce interpretable models. Many current high-performance ML methods that can be considered FL methods, such as Neural Networks and PCA, lack interpretability. A popular ML method used for FL that produces interpretable models is Genetic Programming (GP), with multiple successful applications and methods like M3GP. In this thesis, I present two new GP-based FL methods, namely M3GP with Domain Knowledge (DK-M3GP) and DK-M3GP with feature Aggregation (DKA-M3GP). Both use grammars to enhance the search process of GP, an approach called Grammar-Guided GP (GGGP). DK-M3GP uses grammars to incorporate domain knowledge in the search process. In particular, I use DK-M3GP to define which solutions are humanly valid, in this case by disallowing arithmetic operations on categorical features. For example, multiplying the postal code of an individual by their wage is not deemed sensible and is thus disallowed. In DKA-M3GP, I use grammars to include a feature aggregation method in the search space. This method can be used for time series and panel datasets, to aggregate the target value of historic data based on a known feature value of a new data point. For example, if I want to predict the number of bikes seen daily in a city, it is interesting to know how many were seen on average in the last week. Furthermore, DKA-M3GP allows for filtering the aggregation based on some other feature value; for example, we can include the average number of bikes seen on past Sundays. I evaluated my FL methods for two ML problems in two environments: first as independent FL processes, and then as FL steps within four ML pipelines. Independently, DK-M3GP shows a two-fold advantage over normal M3GP: better interpretability in general, and higher prediction performance for one problem. DKA-M3GP has a much better prediction performance than M3GP for one problem, and a slightly better one for the other. Furthermore, within the ML pipelines it performed well in one of the two problems. Overall, my methods show potential for FL. Both methods are implemented in Genetic Engine, an individual-representation-independent GGGP framework created as part of this thesis. Genetic Engine is completely implemented in Python and shows competitive performance compared with the mature GGGP framework PonyGE2.
Artificial Intelligence (AI) and its subfield Machine Learning (ML) are becoming more important to our lives with every passing day. Both areas are present in everyday applications such as automatic speech recognition, autonomous cars, image recognition and object detection. ML has been applied successfully in many areas, such as healthcare, finance and marketing. In a supervised context, ML models are trained on data and subsequently used to predict the behaviour of future data. The combination of steps carried out to build a fully trained and evaluated ML model is called an ML pipeline, or simply a pipeline.
All pipelines follow mandatory steps, namely data retrieval, cleaning and manipulation, feature selection and construction, model selection and parameter optimisation, and finally model evaluation. Building ML pipelines is a challenging task, with specifics that depend on the problem domain. There are challenges on the design side, in hyperparameter optimisation, and on the implementation side. When designing pipelines, choices must be made about which components to use and in what order. Even for ML experts, designing pipelines is a tedious task. Design choices require ML experience and knowledge of the problem domain, which makes pipeline construction a resource-intensive process. After the pipeline has been designed, its parameters must be optimised to improve its performance. Parameter optimisation usually requires repeatedly executing and evaluating the pipeline, which involves high costs. On the implementation side, developers can introduce bugs during the development process. These bugs cost time and money to fix and, if undetected, can compromise the robustness and correctness of the model or introduce performance problems. To address these design and implementation problems, a new line of research called AutoML (Automated Machine Learning) has emerged. AutoML aims to automate the design of ML pipelines, parameter optimisation, and their implementation. An important part of ML pipelines is the way data features are manipulated. Data manipulation has many aspects, gathered under the generic term Feature Engineering (FE). In short, FE aims to improve the quality of the solution space by selecting the most important features and constructing new, relevant ones. However, this is a resource-intensive process, so its automation is a highly rewarding sub-area of AutoML. In this thesis, I define Feature Learning (FL) as the area of automated FE. An important metric of FE, and therefore of FL, is the interpretability of the learned features. Interpretability, which falls within the area of Explainable AI (XAI), refers to how easily the meaning of a feature can be understood. Several scandals in AI, such as racist and sexist models, have led the European Union to propose legislation on models that lack interpretability. Many classical, and therefore widely used, methods lack interpretability, giving rise to the renewed interest in XAI. Current FL research treats existing feature values without relating them to their semantic meaning. For example, engineering a feature that multiplies a person's postal code by their age is not a logical use of the postal code. Although postal codes can be represented as integers, they should be treated as categorical values. Preventing this kind of interaction between features improves pipeline performance, since it reduces the search space of possible features to those that make semantic sense. Moreover, this process yields features that are intrinsically interpretable. In this way, knowledge about the problem domain prevents the engineering of meaningless features during the FE process.
Another aspect of FL not usually considered in existing methods is the aggregation of values of a single feature across several data entities. Consider, for example, a credit card fraud dataset. The average amount of a card's previous transactions is potentially an interesting feature to include, as it conveys what a 'normal' transaction looks like. However, this is generally not directly expressible in existing FL methods. I refer to this kind of FL as entity aggregation, or simply aggregation. Finally, despite the unpredictable nature of real-world datasets, existing methods mostly require features with homogeneous data. This forces data scientists to pre-process the dataset, often by converting categories into integers or applying some kind of encoding, such as one-hot encoding. However, as discussed above, this can reduce the interpretability and performance of the pipeline. Genetic Programming (GP), an ML method, is also used for FL and allows the creation of models that are more interpretable than most traditional methods. GP is a search-based method that evolves programs or, in the case of FL, mappings between feature spaces. Existing GP-based FL methods do not incorporate the three aspects mentioned above: domain knowledge, aggregation, and support for heterogeneous data types. Some approaches incorporate parts of these aspects, mainly by using grammars to guide the search process. The goal of this work is to explore whether GP can use grammars to improve the quality of FL, in terms of predictive performance or interpretability. First, we built Genetic Engine, a Grammar-Guided GP (GGGP) framework. Genetic Engine is an easy-to-use GGGP framework that allows complex grammars to be expressed. We show that Genetic Engine performs well compared to the state-of-the-art Python framework PonyGE2. Second, I propose two new GGGP-based FL methods implemented in Genetic Engine. Both methods extend M3GP, the state-of-the-art GP-based FL method. The first incorporates domain knowledge and is called M3GP with Domain Knowledge (DK-M3GP); it restricts feature behaviour by allowing only sensible interactions, through conditions and statements. The second method extends DK-M3GP by introducing aggregation into the search space and is called DK-M3GP with Aggregation (DKA-M3GP). DKA-M3GP makes full use of Genetic Engine's ease of implementation, as it requires a complex grammar. In this work, DK-M3GP and DKA-M3GP were evaluated against traditional GP, M3GP and several classical FL methods on two ML problems. The new approaches were evaluated both as standalone FL methods and as part of a larger pipeline. As standalone FL methods, both show good predictive performance on at least one of the two problems. As part of a pipeline, the methods show little advantage over classical methods in predictive performance.
After analysing the results, a possible explanation lies in the FL methods overfitting to the fitness function and the training dataset. In this work, I also discuss the improvement in interpretability obtained by incorporating domain knowledge into the search process. A preliminary evaluation of DK-M3GP indicates that, using the Expression Size (ES) complexity measure, an improvement in interpretability is achievable. However, I also found that this complexity measure may not be the most suitable, because the tree-shaped structure of the features constructed by DK-M3GP inflates the ES. I consider that a more sophisticated interpretability evaluation method should account for this.
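    The aggregation feature that DKA-M3GP adds to the search space can be illustrated outside the evolutionary setting with plain pandas, using the bike-count example from the abstract. The sketch below is not the thesis' Genetic Engine grammar or API; the column names and the helper function are made up for illustration.

```python
import pandas as pd

# Plain-pandas illustration of the aggregation idea (not Genetic Engine code):
# aggregate the historic target, optionally filtered on another feature value,
# e.g. "average number of bikes seen on past Sundays".

df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=28, freq="D"),
    "bikes_seen": [50 + (i % 7) * 10 for i in range(28)],
})
df["weekday"] = df["date"].dt.day_name()

def past_target_mean(df, row_idx, target="bikes_seen",
                     filter_col=None, filter_val=None):
    """Mean of the target over strictly earlier rows, optionally keeping only
    rows where filter_col equals filter_val (e.g. only past Sundays)."""
    past = df.iloc[:row_idx]
    if filter_col is not None:
        past = past[past[filter_col] == filter_val]
    return past[target].mean()

# Candidate features for row 14: mean over all earlier days vs. past Sundays only.
print(past_target_mean(df, 14))
print(past_target_mean(df, 14, filter_col="weekday", filter_val="Sunday"))
```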

    Acta Cybernetica: Volume 16, Number 2.

    Get PDF

    Pattern Learning for Detecting Defect Reports and Improvement Requests in App Reviews

    Full text link
    Online reviews are an important source of feedback for understanding customers. In this study, we follow novel approaches that target the absence of actionable insights in raw reviews by classifying reviews as defect reports and requests for improvement. Unlike traditional classification methods based on expert rules, we reduce the manual labour by employing a supervised system that is capable of learning lexico-semantic patterns through genetic programming. Additionally, we experiment with a distantly-supervised SVM that makes use of noisy labels generated by the patterns. Using a real-world dataset of app reviews, we show that the automatically learned patterns outperform the manually created ones. The distantly-supervised SVM models are also not far behind the pattern-based solutions, showing the usefulness of this approach when the amount of annotated data is limited. Comment: Accepted for publication in the 25th International Conference on Natural Language & Information Systems (NLDB 2020), DFKI Saarbrücken, Germany, June 24-26, 2020.
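    The distant-supervision step can be sketched as follows: simple hand-written keyword patterns stand in for the GP-learned lexico-semantic patterns, their noisy labels train an SVM, and the trained model is then applied to unseen reviews. The cue lists, reviews, and pipeline below are illustrative assumptions, not the paper's patterns or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Noisy labels from crude keyword patterns (stand-ins for GP-learned patterns),
# then a distantly-supervised SVM trained on those labels.

DEFECT_CUES = ("crash", "freezes", "bug", "error")
IMPROVEMENT_CUES = ("please add", "would be nice", "wish", "feature request")

def noisy_label(review):
    text = review.lower()
    if any(cue in text for cue in DEFECT_CUES):
        return "defect_report"
    if any(cue in text for cue in IMPROVEMENT_CUES):
        return "improvement_request"
    return "other"

reviews = [
    "App crashes every time I open the camera",
    "Please add a dark mode, would be nice",
    "Love it, works great",
    "Constant error 500 when syncing, obvious bug",
    "I wish there was an offline mode",
    "Nice clean design",
]
labels = [noisy_label(r) for r in reviews]          # weak, pattern-based labels

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(reviews, labels)                            # distantly-supervised training
print(clf.predict(["The app freezes on startup", "Could you add widgets?"]))
```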

    Hierarchically organised genetic algorithm for fuzzy network synthesis

    Get PDF