Evolutionary Computation and QSAR Research
[Abstract] High-throughput screening of molecule libraries for a specific biological property is one of the main advances in drug discovery. Virtual molecular filtering and screening rely heavily on quantitative structure-activity relationship (QSAR) analysis, a mathematical model that correlates the activity of a molecule with molecular descriptors. QSAR models can reduce the costly failure of drug candidates in advanced (clinical) stages by filtering combinatorial libraries, eliminating candidates with predicted toxic effects or poor pharmacokinetic profiles, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists draw on methods from fields such as molecular modeling, pattern recognition, machine learning, and artificial intelligence. QSAR modeling relies on three main steps: codification of the molecular structure into molecular descriptors, selection of the variables relevant to the analyzed activity, and a search for the optimal mathematical model correlating the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid the variable-selection and model-building steps, this review focuses on the evolutionary computation methods supporting these tasks. It explains the basics of genetic algorithms and genetic programming as evolutionary computation approaches, selection methods for high-dimensional data in QSAR, methods for building QSAR models, current evolutionary feature selection methods and their applications in QSAR, and future trends in joint or multi-task feature selection methods.
Funding: Instituto de Salud Carlos III, PIO52048; Instituto de Salud Carlos III, RD07/0067/0005; Ministerio de Industria, Comercio y Turismo, TSI-020110-2009-53; Galicia, Consellería de Economía e Industria, 10SIN105004P.
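The variable-selection step described above is commonly implemented with a genetic algorithm that evolves bit masks over the descriptor pool. The following is a minimal, self-contained sketch of that idea; the descriptor pool, the "relevant" set, and the size-penalised fitness are all invented stand-ins for a fitted QSAR model:

```python
import random

random.seed(0)

N_DESCRIPTORS = 20          # size of the descriptor pool (synthetic)
RELEVANT = {2, 5, 11}       # descriptors that truly drive the activity (toy setup)

def fitness(mask):
    """Toy fitness: reward selecting the relevant descriptors and
    penalise model size (a stand-in for a real QSAR quality statistic)."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & RELEVANT) - 0.05 * len(chosen)

def evolve(pop_size=30, generations=40, p_mut=0.05):
    pop = [[random.randint(0, 1) for _ in range(N_DESCRIPTORS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]               # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_DESCRIPTORS)  # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(N_DESCRIPTORS):            # bit-flip mutation
                if random.random() < p_mut:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    best = max(pop, key=fitness)
    return {i for i, bit in enumerate(best) if bit}

selected = evolve()
```

In a real QSAR pipeline the fitness would be a cross-validated model-quality statistic (e.g. q2) computed from a model fitted on the selected descriptors, rather than this synthetic score.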
Multivariate data mining for estimating the rate of discolouration material accumulation in drinking water distribution systems
Particulate material accumulates over time as cohesive layers on internal pipe surfaces in water distribution systems (WDS). When mobilised, this material can cause discolouration. This paper explores the factors expected to be involved in this accumulation process. Two complementary machine learning methodologies are applied to a large amount of real-world field data, from both a qualitative and a quantitative perspective. First, Kohonen self-organising maps were used for integrative and interpretative multivariate data mining of potential factors affecting accumulation. Second, evolutionary polynomial regression (EPR), a hybrid data-driven technique that combines genetic algorithms with numerical regression, was applied to develop easily interpretable mathematical expressions highlighting important accumulation factors. Three case studies are presented: a UK national study and two Dutch local studies. The results highlight bulk-water iron concentration, pipe material, and looped network areas as key descriptive parameters for the UK study. At the local level, a substantially larger third data set allowed K-fold cross-validation. For an equation using the amount of material mobilised and soil temperature to estimate the daily regeneration rate, the mean cross-validation coefficient of determination was 0.945 on training data and 0.930 on testing data. The approach shows promise for developing transferable expressions usable for pro-active WDS management.
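The K-fold protocol reported above can be sketched in a few lines. This toy example fits an ordinary-least-squares line relating soil temperature to a synthetic "regeneration rate" and reports the mean cross-validation R-squared; the data and the linear form are invented stand-ins for the paper's EPR expressions:

```python
import random

random.seed(1)

# Synthetic stand-in for the field data: daily regeneration rate as a
# noisy linear function of soil temperature.
data = [(t, 0.4 * t + 2.0 + random.gauss(0, 0.1)) for t in range(30)]

def fit_line(points):
    """Ordinary least squares for y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def r_squared(model, points):
    a, b = model
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for x, y in points)
    return 1 - ss_res / ss_tot

def k_fold_cv(points, k=5):
    """Shuffle once, split into k folds, train on k-1 and test on the rest."""
    random.shuffle(points)
    folds = [points[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        scores.append(r_squared(fit_line(train), folds[i]))
    return sum(scores) / k

cv_r2 = k_fold_cv(data)
```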
A Field Guide to Genetic Programming
xiv, 233 p. : il. ; 23 cm. Electronic book. A Field Guide to Genetic Programming (ISBN 978-1-4092-0073-4) is an introduction to genetic programming (GP). GP is a systematic, domain-independent method for getting computers to solve problems automatically, starting from a high-level statement of what needs to be done. Using ideas from natural evolution, GP starts from an ooze of random computer programs and progressively refines them through processes of mutation and sexual recombination until solutions emerge, all without the user having to know or specify the form or structure of solutions in advance. GP has generated a plethora of human-competitive results and applications, including novel scientific discoveries and patentable inventions.
Contents: Introduction --
Representation, initialisation and operators in Tree-based GP --
Getting ready to run genetic programming --
Example genetic programming run --
Alternative initialisations and operators in Tree-based GP --
Modular, grammatical and developmental Tree-based GP --
Linear and graph genetic programming --
Probabilistic genetic programming --
Multi-objective genetic programming --
Fast and distributed genetic programming --
GP theory and its applications --
Applications --
Troubleshooting GP --
Conclusions.

Contents
1 Introduction
1.1 Genetic Programming in a Nutshell
1.2 Getting Started
1.3 Prerequisites
1.4 Overview of this Field Guide
I Basics
2 Representation, Initialisation and Operators in Tree-based GP
2.1 Representation
2.2 Initialising the Population
2.3 Selection
2.4 Recombination and Mutation
3 Getting Ready to Run Genetic Programming
3.1 Step 1: Terminal Set
3.2 Step 2: Function Set
3.2.1 Closure
3.2.2 Sufficiency
3.2.3 Evolving Structures other than Programs
3.3 Step 3: Fitness Function
3.4 Step 4: GP Parameters
3.5 Step 5: Termination and Solution Designation
4 Example Genetic Programming Run
4.1 Preparatory Steps
4.2 Step-by-Step Sample Run
4.2.1 Initialisation
4.2.2 Fitness Evaluation
4.2.3 Selection, Crossover and Mutation
4.2.4 Termination and Solution Designation
II Advanced Genetic Programming
5 Alternative Initialisations and Operators in Tree-based GP
5.1 Constructing the Initial Population
5.1.1 Uniform Initialisation
5.1.2 Initialisation may Affect Bloat
5.1.3 Seeding
5.2 GP Mutation
5.2.1 Is Mutation Necessary?
5.2.2 Mutation Cookbook
5.3 GP Crossover
5.4 Other Techniques
6 Modular, Grammatical and Developmental Tree-based GP
6.1 Evolving Modular and Hierarchical Structures
6.1.1 Automatically Defined Functions
6.1.2 Program Architecture and Architecture-Altering Operations
6.2 Constraining Structures
6.2.1 Enforcing Particular Structures
6.2.2 Strongly Typed GP
6.2.3 Grammar-based Constraints
6.2.4 Constraints and Bias
6.3 Developmental Genetic Programming
6.4 Strongly Typed Autoconstructive GP with PushGP
7 Linear and Graph Genetic Programming
7.1 Linear Genetic Programming
7.1.1 Motivations
7.1.2 Linear GP Representations
7.1.3 Linear GP Operators
7.2 Graph-Based Genetic Programming
7.2.1 Parallel Distributed GP (PDGP)
7.2.2 PADO
7.2.3 Cartesian GP
7.2.4 Evolving Parallel Programs using Indirect Encodings
8 Probabilistic Genetic Programming
8.1 Estimation of Distribution Algorithms
8.2 Pure EDA GP
8.3 Mixing Grammars and Probabilities
9 Multi-objective Genetic Programming
9.1 Combining Multiple Objectives into a Scalar Fitness Function
9.2 Keeping the Objectives Separate
9.2.1 Multi-objective Bloat and Complexity Control
9.2.2 Other Objectives
9.2.3 Non-Pareto Criteria
9.3 Multiple Objectives via Dynamic and Staged Fitness Functions
9.4 Multi-objective Optimisation via Operator Bias
10 Fast and Distributed Genetic Programming
10.1 Reducing Fitness Evaluations/Increasing their Effectiveness
10.2 Reducing Cost of Fitness with Caches
10.3 Parallel and Distributed GP are Not Equivalent
10.4 Running GP on Parallel Hardware
10.4.1 Master-slave GP
10.4.2 GP Running on GPUs
10.4.3 GP on FPGAs
10.4.4 Sub-machine-code GP
10.5 Geographically Distributed GP
11 GP Theory and its Applications
11.1 Mathematical Models
11.2 Search Spaces
11.3 Bloat
11.3.1 Bloat in Theory
11.3.2 Bloat Control in Practice
III Practical Genetic Programming
12 Applications
12.1 Where GP has Done Well
12.2 Curve Fitting, Data Modelling and Symbolic Regression
12.3 Human Competitive Results - the Humies
12.4 Image and Signal Processing
12.5 Financial Trading, Time Series, and Economic Modelling
12.6 Industrial Process Control
12.7 Medicine, Biology and Bioinformatics
12.8 GP to Create Searchers and Solvers - Hyper-heuristics
12.9 Entertainment and Computer Games
12.10 The Arts
12.11 Compression
13 Troubleshooting GP
13.1 Is there a Bug in the Code?
13.2 Can you Trust your Results?
13.3 There are No Silver Bullets
13.4 Small Changes can have Big Effects
13.5 Big Changes can have No Effect
13.6 Study your Populations
13.7 Encourage Diversity
13.8 Embrace Approximation
13.9 Control Bloat
13.10 Checkpoint Results
13.11 Report Well
13.12 Convince your Customers
14 Conclusions
IV Tricks of the Trade
A Resources
A.1 Key Books
A.2 Key Journals
A.3 Key International Meetings
A.4 GP Implementations
A.5 On-Line Resources
B TinyGP
B.1 Overview of TinyGP
B.2 Input Data Files for TinyGP
B.3 Source Code
B.4 Compiling and Running TinyGP
Bibliography
Index
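The evolutionary cycle the book describes (an initial "ooze" of random programs refined by selection, crossover, and mutation) can be compressed into a toy symbolic-regression run in the spirit of the book's TinyGP appendix. This sketch is not the book's code; the function set, parameters, and target function are chosen arbitrarily for illustration:

```python
import random, operator

random.seed(42)

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
TERMINALS = ['x', 1.0, 2.0]

def target(x):
    return x * x + x + 1.0                 # the function GP tries to rediscover

CASES = [i / 4.0 for i in range(-8, 9)]    # fitness cases

def rand_tree(depth=3):
    """Grow a random expression tree: (op, left, right) tuples or terminals."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return (random.choice(list(OPS)), rand_tree(depth - 1), rand_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def error(tree):
    return sum(abs(evaluate(tree, x) - target(x)) for x in CASES)

def subtree(t):
    """Pick a random subtree (biased toward shallow pieces)."""
    while isinstance(t, tuple) and random.random() < 0.5:
        t = random.choice(t[1:])
    return t

def crossover(a, b):
    """Replace a random point of a with a piece of b."""
    if not isinstance(a, tuple) or random.random() < 0.3:
        return subtree(b)
    op, left, right = a
    if random.random() < 0.5:
        return (op, crossover(left, b), right)
    return (op, left, crossover(right, b))

def mutate(tree):
    if not isinstance(tree, tuple):
        return random.choice(TERMINALS) if random.random() < 0.2 else tree
    if random.random() < 0.1:
        return rand_tree(2)                # subtree mutation
    op, left, right = tree
    return (op, mutate(left), mutate(right))

pop = [rand_tree() for _ in range(60)]
for _ in range(30):
    pop.sort(key=error)
    parents = pop[:30]                     # truncation selection
    pop = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                     for _ in range(30)]
best = min(pop, key=error)
```

A perfect individual here would be ('+', ('*', 'x', 'x'), ('+', 'x', 1.0)), whose error over the fitness cases is zero.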
Validation of artificial neural networks as a methodology for donor-recipient matching in liver transplantation
1. Introduction and motivation. Liver transplantation is the best therapeutic option for a large number of end-stage liver diseases. Unfortunately, there is an imbalance between the number of candidates and the number of available donors, which leads to deaths and removals from the waiting list. In recent years, numerous efforts have been made to increase the donor pool and to optimize the prioritization of potential recipients on the waiting list, most notably the use of so-called extended criteria donors (ECD) and the adoption of a prioritization system based on a candidate-severity score (MELD, Model for End-Stage Liver Disease). Donor-recipient matching is a determining factor in liver transplant outcomes, and multiple scores have been proposed in the literature for this purpose; however, none of them is considered optimal. In 2014, our group reported the utility of artificial neural networks (ANNs) as an optimal tool for donor-recipient matching in liver transplantation, in a national multicenter study that demonstrated the superiority of this model for predicting post-transplant survival. The objective of our study is to analyze whether neural networks perform, in a different health system, similarly to what was demonstrated in Spain, and whether they are superior to the models currently used for donor-recipient matching.
2. Content of the research. We collected 822 donor-recipient (D-R) pairs from consecutive liver transplants performed at King's College Hospital, London, between 2002 and 2010, considering donor, recipient, and transplant variables. For each pair, two probabilities were computed: the probability of graft survival (CCR) and the probability of graft loss (MS) at 3 months post-transplant. Two different, non-complementary artificial neural network models were built for this purpose: the acceptance model and the rejection model. Several configurations were constructed: 1) training and generalization with the British hospital's D-R pairs (at 3 and 12 months post-transplant); 2) training with Spanish D-R pairs and generalization with the British pairs; and 3) a combined model, training and generalizing with both Spanish and British pairs. In addition, a rule-based system was built to support decision-making based on the neural network's outputs. The models designed for King's College Hospital showed excellent predictive capacity at both 3 months (CCR-AUC = 0.9375; MS-AUC = 0.9374) and 12 months (CCR-AUC = 0.7833; MS-AUC = 0.8153), almost 15% higher than the best predictive capacity obtained by other scores such as MELD or BAR (Balance of Risk), and improving on the results previously published for the Spanish multicenter model. However, predictive capacity is not as good when the model trains and generalizes on D-R pairs from different health systems, nor in the combined model.
3. Conclusions. 1. Artificial neural networks for donor-recipient matching in liver transplantation showed excellent capacity to predict graft survival and non-survival when validated in a different health system in another country; the artificial-intelligence methodology is therefore clearly validated as an optimal tool for D-R matching. 2. Our results support liver transplant teams considering artificial neural networks as the most comprehensive and objective method described to date for managing the liver transplant waiting list, avoiding subjective and arbitrary criteria and maximizing the principles of equity, utility, and efficiency. 3. Our validation model, i.e., the ANN generated with D-R pairs from King's College Hospital, London, achieved the highest predictive capacity, outperforming the other models and supporting the view that each ANN must be trained, tested, and optimized for a specific purpose in a single population. Each liver transplant program should therefore have its own model, built with its own data, to support the D-R matching decision. 4. The D-R allocation model generated by the ANNs combines the best of the MELD system with overall survival benefit through a rule-based system, maximizing the utility of the available grafts. This makes them complementary, rather than competing, systems for the same goal.
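The thesis combines the two network outputs (CCR and MS) through a rule-based system, but the abstract does not publish the rules themselves. The sketch below therefore uses an invented filter-and-rank rule, with made-up scores, purely to illustrate the shape of such a decision-support layer:

```python
def rank_pairs(pairs):
    """Rank donor-recipient pairs by combining the two network outputs.

    Each pair carries ccr (predicted probability of 3-month graft survival,
    from the acceptance model) and ms (predicted probability of 3-month graft
    loss, from the rejection model). The threshold and scoring rule here are
    invented for illustration; the thesis's actual rule base is not public.
    """
    def score(p):
        return p['ccr'] - p['ms']              # favour high survival, low loss
    # Rule: discard pairs whose predicted loss probability dominates.
    viable = [p for p in pairs if p['ms'] < 0.5]
    return sorted(viable, key=score, reverse=True)

# Hypothetical candidate pairings for a single donor.
candidates = [
    {'id': 'D1-R1', 'ccr': 0.91, 'ms': 0.12},
    {'id': 'D1-R2', 'ccr': 0.78, 'ms': 0.55},  # filtered out by the rule
    {'id': 'D1-R3', 'ccr': 0.85, 'ms': 0.20},
]
ranking = rank_pairs(candidates)
```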
Tracking Foodborne Pathogens from Farm to Table: Data Needs to Evaluate Control Options
Food safety policymakers and scientists came together at a conference in January 1995 to evaluate the data available for analyzing control of foodborne microbial pathogens. These proceedings start with data on human illnesses associated with foodborne pathogens and move backwards through the food chain to examine pathogen data in the processing sector and at the farm level. Of special concern is the inability to link pathogen data throughout the food chain. Analytical tools for evaluating the impact of changing production and consumption practices on foodborne disease risks and their economic consequences are presented, and the available data are examined to see how well they meet current analytical needs for policy analysis. The policymaker roundtable highlights the tradeoffs involved in funding databases, the economic evaluation of USDA's Hazard Analysis Critical Control Point (HACCP) proposal and other food safety policy issues, and the necessity of a multidisciplinary approach to improving food safety databases.
Keywords: food safety, cost-benefit analysis, foodborne disease risk, foodborne pathogens, Hazard Analysis Critical Control Point (HACCP), probabilistic scenario analysis, fault-tree analysis, Food Consumption/Nutrition/Food Safety
Proceedings of the 19th Workshop on Computational Intelligence, Dortmund, 2-4 December 2009
These proceedings contain the contributions to the 19th Workshop "Computational Intelligence" of Technical Committee 5.14 of the VDI/VDE Society for Measurement and Automatic Control (GMA) and of the Special Interest Group "Fuzzy Systems and Soft Computing" of the German Informatics Society (GI), held on 2-4 December 2009 at Haus Bommerholz near Dortmund.
Multi-Fidelity Bayesian Optimization for Efficient Materials Design
Materials design is a process of identifying compositions and structures that achieve desirable properties. Costly experiments or simulations are usually required to evaluate the objective function for a design solution, so one of the major challenges is reducing the cost of sampling and evaluating the objective. Bayesian optimization is a global optimization method that increases sampling efficiency with the guidance of a surrogate of the objective. In this work, a new acquisition function, called consequential improvement, is proposed for the simultaneous selection of the solution and the fidelity level of sampling. With the new acquisition function, the subsequent iteration is considered for potential selections at low fidelity levels, because evaluations at the highest fidelity level are usually required to provide reliable objective values. To reduce the number of samples required to train the surrogate for molecular design, a new recursive hierarchical similarity metric is proposed. The metric quantifies the differences between molecules at multiple levels of hierarchy simultaneously, based on the connections between multiscale descriptions of the structures. The new methodologies are demonstrated with simulation-based design of materials and structures using fully atomistic and coarse-grained molecular dynamics simulations and finite-element analysis. The new similarity metric is demonstrated in the design of tactile sensors and biodegradable oligomers. The multi-fidelity Bayesian optimization method is also illustrated with the multiscale design of a piezoelectric transducer, concurrently optimizing the atomic composition of the aluminum titanium nitride ceramic and the device's porous microstructure at the micrometer scale.
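The multi-fidelity loop the abstract describes (a surrogate guiding where to sample and at which fidelity, under a cost budget) can be illustrated with a deliberately simplified sketch. The objective, the inverse-distance surrogate, the fidelity costs, and the fidelity-selection rule are all invented; the thesis's consequential-improvement acquisition is not reproduced here:

```python
import random, math

random.seed(3)

def objective(x, fidelity):
    """Toy two-fidelity objective: the low-fidelity model is a cheap,
    biased approximation of the high-fidelity one (both invented)."""
    true = -(x - 0.6) ** 2
    bias = 0.1 * math.sin(8.0 * x) if fidelity == 'low' else 0.0
    return true + bias

COST = {'low': 1.0, 'high': 10.0}

def predict(samples, x):
    """Inverse-distance-weighted surrogate built from past samples."""
    num = den = 0.0
    for xi, yi, _ in samples:
        w = 1.0 / (abs(x - xi) + 1e-6)
        num += w * yi
        den += w
    return num / den

def bo_loop(budget=60.0):
    x0 = random.random()
    samples = [(x0, objective(x0, 'low'), 'low')]
    spent = COST['low']
    while spent < budget:
        # Acquisition: best predicted value plus a little random exploration
        # (a crude stand-in for an improvement-based acquisition function).
        cands = [random.random() for _ in range(50)]
        x = max(cands, key=lambda c: predict(samples, c) + 0.05 * random.random())
        # Fidelity rule (illustrative): confirm promising points at high fidelity.
        fid = 'high' if predict(samples, x) > -0.05 else 'low'
        samples.append((x, objective(x, fid), fid))
        spent += COST[fid]
    return max(samples, key=lambda s: s[1])

best_x, best_y, best_fid = bo_loop()
```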
Clustering Methods for Requirements Selection and Optimisation
Decisions about which features to include in a new system, or in the next release of an existing one, are critical to the success of software products. Such decisions should be informed by the needs of users and stakeholders. But how can we make them when the number of potential features and the number of individual stakeholders are very large? This problem is particularly important when stakeholders' needs are gathered online through discussion forums and web-based feature request management systems. Existing requirements decision-making techniques are not adequate in this context because they do not scale to such large numbers of feature requests or stakeholders. This thesis addresses the problem by presenting and evaluating clustering methods that facilitate requirements selection and optimisation when requirements preferences are elicited from a very large number of stakeholders. First, it presents a novel method for identifying groups of stakeholders with similar preferences for requirements. The method computes the representative preferences of the resulting groups and provides additional insight into trends and divergences in stakeholders' preferences, which may be used to aid the decision-making process. Second, it presents a method to help decision-makers identify key similarities and differences among large sets of optimal design decisions. The benefits of these techniques are demonstrated on two real-life projects: one concerned with selecting features for mobile phones, and the other with selecting requirements for a rights and access management system.
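The first contribution (grouping stakeholders by similarity of their requirements preferences) is, at its core, a clustering of preference vectors. A minimal k-means sketch over invented rating data illustrates the idea; the thesis's actual algorithms and data sets are richer:

```python
def dist2(a, b):
    """Squared Euclidean distance between two preference vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k=2, iters=10):
    """Plain k-means over stakeholder preference vectors
    (one rating per feature request per stakeholder)."""
    # Deterministic farthest-point seeding keeps the sketch reproducible.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors,
                             key=lambda v: min(dist2(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Ratings (0-5) of four feature requests by six stakeholders: two clear camps.
prefs = [
    [5, 5, 0, 1], [4, 5, 1, 0], [5, 4, 0, 0],   # stakeholders favouring features 1-2
    [0, 1, 5, 4], [1, 0, 4, 5], [0, 0, 5, 5],   # stakeholders favouring features 3-4
]
labels = kmeans(prefs)
```

Averaging each cluster's member vectors (the final centroids) would give the "representative preferences" per group that the thesis uses to support decision-making.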