8 research outputs found

    High-Performance Modelling and Simulation for Big Data Applications

    Get PDF
    This open access book was prepared as a Final Publication of the COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)” project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. As their level of abstraction is raised to give a better discernment of the domain at hand, their representation becomes increasingly demanding of computational and data resources. High Performance Computing, on the other hand, typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. A seamless interaction of High Performance Computing with Modelling and Simulation is therefore arguably required in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.

    Automatic refinement of large-scale cross-domain knowledge graphs

    Get PDF
    Knowledge graphs are a way to represent complex structured and unstructured information integrated into an ontology, over which one can reason about the existing information to deduce new information or highlight inconsistencies. Knowledge graphs are divided into the terminology box (TBox), also known as the ontology, and the assertion box (ABox). The former consists of a set of schema axioms defining the classes and properties that describe the data domain, whereas the ABox consists of a set of facts describing instances in terms of the TBox vocabulary. In recent years, there have been several initiatives for creating large-scale cross-domain knowledge graphs, both free and commercial, with DBpedia, YAGO, and Wikidata being amongst the most successful free datasets. These graphs are often constructed by extracting information from semi-structured knowledge sources, such as Wikipedia, or from unstructured text on the web using NLP methods. It is unlikely, in particular when heuristic methods are applied and unreliable sources are used, that the knowledge graph is fully correct or complete. There is a tradeoff between completeness and correctness, which is addressed differently in each knowledge graph’s construction approach. Knowledge graphs have a wide variety of applications, e.g. semantic search and discovery, question answering, recommender systems, expert systems and personal assistants, and the quality of a knowledge graph is crucial for these applications. In order to further increase the quality of such large-scale knowledge graphs, various automatic refinement methods have been proposed. These methods try to infer and add missing knowledge to the graph, or to detect erroneous pieces of information. In this thesis, we investigate the problem of automatic knowledge graph refinement and propose methods that address it from two directions: automatic refinement of the TBox and of the ABox. In Part I we address the ABox refinement problem. We propose a method for predicting missing type assertions using hierarchical multilabel classifiers with ingoing/outgoing links as features. We also present an approach to detecting relation assertion errors which exploits type and path patterns in the graph, and we propose an approach to correcting relation errors originating from confusions between entities. Also in the ABox refinement direction, we propose a knowledge graph model and a process for synthesizing knowledge graphs for benchmarking ABox completion methods. In Part II we address the TBox refinement problem. We propose methods for inducing flexible relation constraints from the ABox, expressed using SHACL. We introduce an ILP refinement step which exploits correlations between numerical attributes and relations in order to efficiently learn Horn rules with numerical attributes. Finally, we investigate the introduction of lexical information from textual corpora into the ILP algorithm in order to improve the quality of induced class expressions.
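    The type-prediction idea from Part I can be sketched in a few lines: each entity is represented by the properties it carries as ingoing and outgoing links, and a multilabel classifier predicts which types it should have. The sketch below is a minimal illustration of that feature/label setup using scikit-learn with made-up entities, properties and types; it is not the hierarchical multilabel models developed in the thesis.

```python
# Minimal sketch: predict missing type assertions from ingoing/outgoing link
# features with a multilabel classifier. All entities, properties and types
# below are toy data invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# For each entity, counts of outgoing ("out:p") and ingoing ("in:p") properties.
entities = {
    "Berlin":   {"out:country": 1, "out:population": 1, "in:birthPlace": 3},
    "Paris":    {"out:country": 1, "out:population": 1, "in:birthPlace": 5},
    "Einstein": {"out:birthPlace": 1, "out:field": 1, "in:doctoralAdvisor": 2},
    "Curie":    {"out:birthPlace": 1, "out:field": 2, "in:influenced": 1},
}
types = {
    "Berlin": ["City", "PopulatedPlace"],
    "Paris": ["City", "PopulatedPlace"],
    "Einstein": ["Person", "Scientist"],
    "Curie": ["Person", "Scientist"],
}

names = list(entities)
vec = DictVectorizer(sparse=True)
X = vec.fit_transform([entities[n] for n in names])
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([types[n] for n in names])

# One binary classifier per type label; a hierarchical multilabel classifier,
# as in the thesis, would additionally exploit the class hierarchy.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Predict types for an unseen entity whose link profile resembles a city.
new_entity = {"out:country": 1, "in:birthPlace": 2}
pred = clf.predict(vec.transform([new_entity]))
print(dict(zip(mlb.classes_, pred[0])))
```

    A hierarchical variant would also propagate predictions along the class hierarchy, so that, for example, every predicted Scientist is also typed as a Person.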

    Multi-Criteria Inventory Classification and Root Cause Analysis Based on Logical Analysis of Data

    Get PDF
    Spare parts inventory management plays a vital role in maintaining competitive advantages in many industries, from capital-intensive companies to service networks. Due to the massive quantity of distinct Stock Keeping Units (SKUs), it is almost impossible to control inventory item by item or to pay the same attention to all items. Spare parts inventory management involves all parties, from the Original Equipment Manufacturer (OEM) to distributors and end customers, which makes this management even more challenging. Wrongly classified critical spare parts and the unavailability of those critical items can have severe consequences. Therefore, it is crucial to classify inventory items into classes and employ appropriate control policies conforming to the respective classes. An ABC classification and certain inventory control techniques are often applied to facilitate SKU management. Spare parts inventory management intends to provide the right spare parts at the right time. The classification of spare parts into priority or critical classes is the foundation for managing a large-scale and highly diverse assortment of parts. The purpose of classification is to consistently classify spare parts into different classes based on the similarity of items with respect to their characteristics, which are exhibited as attributes. The traditional ABC analysis, based on Pareto's Principle, is one of the most widely used techniques for classification; it concentrates exclusively on annual dollar usage and overlooks other important factors such as reliability, lead time, and criticality. Therefore, multi-criteria inventory classification (MCIC) methods are required to meet these demands. We propose a pattern-based machine learning technique, the Logical Analysis of Data (LAD), for spare parts inventory classification. The purpose of this study is to extend the classical ABC classification method with an MCIC approach. Benefiting from the superiority of LAD in pattern transparency and robustness, we use two numerical examples to investigate LAD's potential for detecting inconsistencies in inventory classification and its capability for MCIC. The two numerical experiments demonstrate that LAD is not only capable of classifying inventory, but also of detecting and correcting inconsistent observations when combined with a Root Cause Analysis (RCA) procedure. Test accuracy improves potentially not only with the LAD technique, but also with other major machine learning classification techniques, namely artificial neural networks (ANN), support vector machines (SVM), k-nearest neighbours (KNN) and Naïve Bayes (NB). Finally, we conduct a statistical analysis to confirm the significant improvement in test accuracy for the new datasets (corrected by LAD) compared to the original datasets; this holds for all five classification techniques. The statistical tests also show that there is no significant difference in test accuracy among the five machine learning techniques, on either the original or the new datasets of both inventories.
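    To make the MCIC-as-classification setup concrete, the sketch below treats each SKU as a feature vector of criteria (annual dollar usage, lead time, criticality) with an A/B/C label and compares several of the machine learning techniques named above on a synthetic dataset. The data and the labelling rule are invented for illustration, and LAD itself is not implemented here; scikit-learn models merely stand in for ANN, SVM, KNN and NB.

```python
# Minimal sketch: multi-criteria inventory classification as supervised
# learning on synthetic SKU data (toy example, not the study's datasets).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 300
# Criteria per SKU: annual dollar usage, lead time (days), criticality score.
X = np.column_stack([
    rng.lognormal(mean=8, sigma=1.2, size=n),   # annual dollar usage
    rng.integers(1, 120, size=n),               # lead time in days
    rng.integers(1, 4, size=n),                 # criticality (1-3)
])
# Toy labelling rule standing in for expert ABC judgments:
# 0 = C (bottom 50%), 1 = B, 2 = A (top 20%).
score = 0.6 * np.log(X[:, 0]) + 0.02 * X[:, 1] + 1.5 * X[:, 2]
y = np.digitize(score, np.quantile(score, [0.5, 0.8]))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "NB":  GaussianNB(),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    acc = pipe.fit(X_tr, y_tr).score(X_te, y_te)
    print(name, "test accuracy:", round(acc, 3))
```

    In the study's workflow, LAD's patterns additionally expose which attribute combinations drive each class, which is what enables the RCA step to flag and correct inconsistently labelled SKUs before retraining.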

    Classification of Oncologic Data with Genetic Programming

    Get PDF
    Discovering the models explaining the hidden relationship between genetic material and tumor pathologies is one of the most important open challenges in biology and medicine. Given the large amount of data made available by the DNA microarray technique, Machine Learning is becoming a popular tool for this kind of investigation. In the last few years, we have been particularly involved in the study of Genetic Programming for mining large sets of biomedical data. In this paper, we present a comparison between four variants of Genetic Programming for the classification of two different oncologic datasets: the first contains data from healthy colon tissues and colon tissues affected by cancer; the second contains data from patients affected by two kinds of leukemia (acute myeloid leukemia and acute lymphoblastic leukemia). We report experimental results obtained using two different fitness criteria: the receiver operating characteristic and the percentage of correctly classified instances. These results, and their comparison with the ones obtained by three non-evolutionary Machine Learning methods (Support Vector Machines, MultiBoosting, and Random Forests) on the same data, suggest that Genetic Programming is a promising technique for this kind of classification.
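    As a rough illustration of this kind of setup, the sketch below trains a GP classifier and evaluates it with the two criteria mentioned in the abstract, ROC AUC and classification accuracy, on a held-out split. It relies on the third-party gplearn library and a scikit-learn toy dataset as stand-ins; it is not the authors' GP variants nor their colon/leukemia microarray data, and all parameter values are arbitrary.

```python
# Hedged sketch: GP-based binary classification evaluated with ROC AUC and
# accuracy, using gplearn (assumed third-party library) and a toy dataset.
from gplearn.genetic import SymbolicClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Evolve a population of symbolic expressions used as a classifier.
gp = SymbolicClassifier(population_size=500, generations=20,
                        parsimony_coefficient=0.001, random_state=42)
gp.fit(X_tr, y_tr)

y_pred = gp.predict(X_te)
y_score = gp.predict_proba(X_te)[:, 1]
print("accuracy:", round(accuracy_score(y_te, y_pred), 3))
print("ROC AUC :", round(roc_auc_score(y_te, y_score), 3))
```

    In the paper's terms, either metric can serve directly as the fitness criterion driving the evolutionary search, which is exactly the comparison the authors report.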

    A Field Guide to Genetic Programming

    Get PDF
    xiv, 233 p. : ill. ; 23 cm. Electronic book. A Field Guide to Genetic Programming (ISBN 978-1-4092-0073-4) is an introduction to genetic programming (GP). GP is a systematic, domain-independent method for getting computers to solve problems automatically, starting from a high-level statement of what needs to be done. Using ideas from natural evolution, GP starts from an ooze of random computer programs and progressively refines them through processes of mutation and sexual recombination, until solutions emerge. All this without the user having to know or specify the form or structure of solutions in advance. GP has generated a plethora of human-competitive results and applications, including novel scientific discoveries and patentable inventions.
    Contents:
    1 Introduction: 1.1 Genetic Programming in a Nutshell; 1.2 Getting Started; 1.3 Prerequisites; 1.4 Overview of this Field Guide
    Part I, Basics. 2 Representation, Initialisation and Operators in Tree-based GP: 2.1 Representation; 2.2 Initialising the Population; 2.3 Selection; 2.4 Recombination and Mutation
    3 Getting Ready to Run Genetic Programming: 3.1 Step 1: Terminal Set; 3.2 Step 2: Function Set (3.2.1 Closure; 3.2.2 Sufficiency; 3.2.3 Evolving Structures other than Programs); 3.3 Step 3: Fitness Function; 3.4 Step 4: GP Parameters; 3.5 Step 5: Termination and Solution Designation
    4 Example Genetic Programming Run: 4.1 Preparatory Steps; 4.2 Step-by-Step Sample Run (4.2.1 Initialisation; 4.2.2 Fitness Evaluation; Selection, Crossover and Mutation; Termination and Solution Designation)
    Part II, Advanced Genetic Programming. 5 Alternative Initialisations and Operators in Tree-based GP: 5.1 Constructing the Initial Population (5.1.1 Uniform Initialisation; 5.1.2 Initialisation may Affect Bloat; 5.1.3 Seeding); 5.2 GP Mutation (5.2.1 Is Mutation Necessary?; 5.2.2 Mutation Cookbook); 5.3 GP Crossover; 5.4 Other Techniques
    6 Modular, Grammatical and Developmental Tree-based GP: 6.1 Evolving Modular and Hierarchical Structures (6.1.1 Automatically Defined Functions; 6.1.2 Program Architecture and Architecture-Altering); 6.2 Constraining Structures (6.2.1 Enforcing Particular Structures; 6.2.2 Strongly Typed GP; 6.2.3 Grammar-based Constraints; 6.2.4 Constraints and Bias); 6.3 Developmental Genetic Programming; 6.4 Strongly Typed Autoconstructive GP with PushGP
    7 Linear and Graph Genetic Programming: 7.1 Linear Genetic Programming (7.1.1 Motivations; 7.1.2 Linear GP Representations; 7.1.3 Linear GP Operators); 7.2 Graph-Based Genetic Programming (7.2.1 Parallel Distributed GP (PDGP); 7.2.2 PADO; 7.2.3 Cartesian GP; 7.2.4 Evolving Parallel Programs using Indirect Encodings)
    8 Probabilistic Genetic Programming: 8.1 Estimation of Distribution Algorithms; 8.2 Pure EDA GP; 8.3 Mixing Grammars and Probabilities
    9 Multi-objective Genetic Programming: 9.1 Combining Multiple Objectives into a Scalar Fitness Function; 9.2 Keeping the Objectives Separate (9.2.1 Multi-objective Bloat and Complexity Control; 9.2.2 Other Objectives; 9.2.3 Non-Pareto Criteria); 9.3 Multiple Objectives via Dynamic and Staged Fitness Functions; 9.4 Multi-objective Optimisation via Operator Bias
    10 Fast and Distributed Genetic Programming: 10.1 Reducing Fitness Evaluations/Increasing their Effectiveness; 10.2 Reducing Cost of Fitness with Caches; 10.3 Parallel and Distributed GP are Not Equivalent; 10.4 Running GP on Parallel Hardware (10.4.1 Master-slave GP; 10.4.2 GP Running on GPUs; 10.4.3 GP on FPGAs; 10.4.4 Sub-machine-code GP); 10.5 Geographically Distributed GP
    11 GP Theory and its Applications: 11.1 Mathematical Models; 11.2 Search Spaces; 11.3 Bloat (11.3.1 Bloat in Theory; 11.3.2 Bloat Control in Practice)
    Part III, Practical Genetic Programming. 12 Applications: 12.1 Where GP has Done Well; 12.2 Curve Fitting, Data Modelling and Symbolic Regression; 12.3 Human Competitive Results – the Humies; 12.4 Image and Signal Processing; 12.5 Financial Trading, Time Series, and Economic Modelling; 12.6 Industrial Process Control; 12.7 Medicine, Biology and Bioinformatics; 12.8 GP to Create Searchers and Solvers – Hyper-heuristics; 12.9 Entertainment and Computer Games; 12.10 The Arts; 12.11 Compression
    13 Troubleshooting GP: 13.1 Is there a Bug in the Code?; 13.2 Can you Trust your Results?; 13.3 There are No Silver Bullets; 13.4 Small Changes can have Big Effects; 13.5 Big Changes can have No Effect; 13.6 Study your Populations; 13.7 Encourage Diversity; 13.8 Embrace Approximation; 13.9 Control Bloat; 13.10 Checkpoint Results; 13.11 Report Well; 13.12 Convince your Customers
    14 Conclusions: Tricks of the Trade
    A Resources: A.1 Key Books; A.2 Key Journals; A.3 Key International Meetings; A.4 GP Implementations; A.5 On-Line Resources
    B TinyGP: B.1 Overview of TinyGP; B.2 Input Data Files for TinyGP; B.3 Source Code; B.4 Compiling and Running TinyGP
    Bibliography; Index
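    The evolutionary loop the book describes (a population of random programs refined by selection, crossover and mutation) can be illustrated with a toy tree-based GP for symbolic regression. This is a minimal sketch in the spirit of the book's TinyGP appendix, not the book's own code; the function set, operators and parameter values are simplified choices made here for illustration.

```python
# Toy tree-based GP for symbolic regression: random expression trees over
# {+, -, *}, tournament selection, subtree crossover and subtree mutation.
import operator
import random

FUNCS = [(operator.add, '+'), (operator.sub, '-'), (operator.mul, '*')]
TERMS = ['x', -1.0, 0.5, 2.0]  # the variable plus a few constants

def random_tree(depth=3):
    """Grow a random expression: a terminal or a tuple (fn, symbol, left, right)."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    fn, sym = random.choice(FUNCS)
    return (fn, sym, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if not isinstance(tree, tuple):
        return x if tree == 'x' else tree
    fn, _, left, right = tree
    return fn(evaluate(left, x), evaluate(right, x))

def fitness(tree, cases):
    """Negated sum of absolute errors over the fitness cases (higher is better)."""
    return -sum(abs(evaluate(tree, x) - y) for x, y in cases)

def nodes(tree, path=()):
    """All (path, subtree) pairs, used to pick crossover and mutation points."""
    found = [(path, tree)]
    if isinstance(tree, tuple):
        found += nodes(tree[2], path + (2,)) + nodes(tree[3], path + (3,))
    return found

def replace(tree, path, new_subtree):
    if not path:
        return new_subtree
    parts = list(tree)
    parts[path[0]] = replace(parts[path[0]], path[1:], new_subtree)
    return tuple(parts)

def crossover(a, b):
    """Replace a random subtree of a with a random subtree of b."""
    path, _ = random.choice(nodes(a))
    _, donor = random.choice(nodes(b))
    return replace(a, path, donor)

def mutate(tree):
    """Replace a random subtree with a freshly grown one."""
    path, _ = random.choice(nodes(tree))
    return replace(tree, path, random_tree(depth=2))

def tournament(pop, cases, k=3):
    return max(random.sample(pop, k), key=lambda t: fitness(t, cases))

# Target function f(x) = x*x + x, sampled as fitness cases.
cases = [(x, x * x + x) for x in range(-5, 6)]
pop = [random_tree() for _ in range(200)]
for generation in range(30):
    pop = [mutate(crossover(tournament(pop, cases), tournament(pop, cases)))
           if random.random() < 0.9 else tournament(pop, cases)
           for _ in range(len(pop))]
best = max(pop, key=lambda t: fitness(t, cases))
print('best fitness (0 is perfect):', fitness(best, cases))
```

    The same skeleton maps onto the book's preparatory steps: choosing the terminal set, the function set, the fitness function, the run parameters, and the termination criterion.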

    Field Guide to Genetic Programming

    Get PDF

    Optimisation algorithms inspired from modelling of bacterial foraging patterns and their applications

    Get PDF
    Research in biologically-inspired optimisation has been flourishing over the past decades. This approach adopts a bottom-up viewpoint to understand and mimic certain features of a biological system, and it has proved useful in developing nondeterministic algorithms, such as Evolutionary Algorithms (EAs) and Swarm Intelligence (SI). Bacteria, as the simplest creatures in nature, are of particular interest in recent studies. Over thousands of millions of years, bacteria have exhibited a self-organising behaviour to cope with natural selection. For example, they have developed a number of strategies to search for food sources in a very efficient manner. This thesis explores the potential of understanding a biological system by modelling the underlying mechanisms of bacterial foraging patterns, and investigates their applicability to engineering optimisation problems. Modelling plays a significant role in understanding bacterial foraging behaviour. Mathematical expressions and experimental observations have been utilised to represent biological systems. However, difficulties arise from the lack of systematic analysis of the developed models and experimental data. Recently, Systems Biology has been proposed to overcome this barrier, with effort from a number of research fields, including Computer Science and Systems Engineering. At the same time, Individual-based Modelling (IbM) has emerged to assist the modelling of biological systems. Starting from a basic model of foraging and proliferation of bacteria, the development of an IbM of bacterial systems in this thesis focuses on a Varying Environment BActerial Model (VEBAM). Simulation results demonstrate that VEBAM is able to provide a new perspective for describing interactions between the bacteria and their food environment. Knowledge transfer from modelling of bacterial systems to solving optimisation problems also composes an important part of this study. Three Bacteria-inspired Algorithms (BaIAs) have been developed to bridge the gap between modelling and optimisation. These algorithms make use of the self-adaptability of individual bacteria in the group searching activities described in VEBAM, while incorporating a variety of additional features. In particular, the new bacterial foraging algorithm with varying population (BFAVP) takes bacterial metabolism into consideration. The group behaviour of the Particle Swarm Optimiser (PSO) is adopted in the Bacterial Swarming Algorithm (BSA) to enhance searching ability. To reduce computational time, another algorithm, the Paired-bacteria Optimiser (PBO), is designed specifically to further explore the capability of BaIAs. Simulation studies undertaken against a wide range of benchmark functions demonstrate a satisfying performance with a reasonable convergence speed. To explore the potential of bacterial searching ability in optimisation undertaken in a varying environment, a dynamic bacterial foraging algorithm (DBFA) is developed with the aim of solving optimisation problems in a time-varying environment. In this case, the balance between its convergence and exploration abilities is investigated, and a new scheme of reproduction is developed which differs from that used for static optimisation problems. The simulation studies undertaken show that the DBFA can adapt to various environmental changes rapidly. One of the challenging large-scale complex optimisation problems is optimal power flow (OPF) computation. BFAVP shows its advantage in solving this problem. A simulation study has been performed on an IEEE 30-bus system, and the results are compared with the PSO algorithm and the Fast Evolutionary Programming (FEP) algorithm, respectively. Furthermore, the OPF problem is extended for consideration in varying environments, on which DBFA has been evaluated. A simulation study has been undertaken on both the IEEE 30-bus system and the IEEE 118-bus system, in comparison with a number of existing algorithms. The dynamic OPF problem has been tackled for the first time in the area of power systems, and the results obtained are encouraging, with a significant amount of energy potentially being saved. Another application of BaIAs in this thesis is concerned with estimating optimal parameters of a power transformer winding model using BSA. Compared with the Genetic Algorithm (GA), BSA is able to obtain a more satisfactory result in modelling the transformer winding, which could not be achieved using a theoretical transfer function model.
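    The core chemotactic idea behind bacteria-inspired optimisers (tumble to a random direction, then keep swimming while the nutrient gradient improves) can be sketched as follows. This is a generic bacterial-foraging-style loop on a benchmark function, not the thesis's BFAVP, BSA, PBO or DBFA, whose operators (varying population, metabolism, PSO-style swarming, dynamic reproduction) are more elaborate; all names and parameters here are illustrative.

```python
# Minimal sketch of a bacterial-foraging-style optimiser: a population of
# "bacteria" repeatedly tumbles (random direction) and swims (keeps stepping
# while the cost keeps improving) on a benchmark cost function.
import numpy as np

def sphere(x):
    """Benchmark cost function to minimise."""
    return float(np.sum(x ** 2))

def bfo_sketch(cost, dim=5, n_bacteria=20, chem_steps=50, swim_len=4,
               step_size=0.1, bounds=(-5.0, 5.0), seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(*bounds, size=(n_bacteria, dim))
    fit = np.array([cost(b) for b in pop])
    for _ in range(chem_steps):
        for i in range(n_bacteria):
            # Tumble: pick a random unit direction.
            d = rng.normal(size=dim)
            d /= np.linalg.norm(d)
            # Swim: keep moving in that direction while the cost improves.
            for _ in range(swim_len):
                trial = np.clip(pop[i] + step_size * d, *bounds)
                f = cost(trial)
                if f < fit[i]:
                    pop[i], fit[i] = trial, f
                else:
                    break
    best = int(np.argmin(fit))
    return pop[best], fit[best]

x_best, f_best = bfo_sketch(sphere)
print("best cost found:", f_best)
```

    A dynamic variant in the spirit of DBFA would re-evaluate the cost of every bacterium whenever the environment changes and adjust reproduction so the population does not collapse onto an outdated optimum.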
