19 research outputs found

    Compilation and Automatic Parallelisation of Functional Code for Data-Parallel Architectures

    GPU Computing for Cognitive Robotics

    This thesis presents the first investigation of the impact of GPU computing on cognitive robotics by providing a series of novel experiments in the area of action and language acquisition in humanoid robots and computer vision. Cognitive robotics is concerned with endowing robots with high-level cognitive capabilities to enable the achievement of complex goals in complex environments. Reaching the ultimate goal of developing cognitive robots will require tremendous amounts of computational power, which was until recently provided mostly by standard CPUs. CPU cores are optimised for serial code execution at the expense of parallel execution, which renders them relatively inefficient when it comes to high-performance computing applications. The ever-increasing market demand for high-performance, real-time 3D graphics has evolved the GPU into a highly parallel, multithreaded, many-core processor with extraordinary computational power and very high memory bandwidth. These vast computational resources of modern GPUs can now be used by most cognitive robotics models, as they tend to be inherently parallel. Various interesting and insightful cognitive models have been developed and have addressed important scientific questions concerning action-language acquisition and computer vision. While they have provided us with important scientific insights, their complexity and application have not improved much over recent years. The experimental tasks as well as the scale of these models are often minimised to avoid excessive training times that grow exponentially with the number of neurons and the training data. This impedes further progress and development of complex neurocontrollers that would be able to take cognitive robotics research a step closer to reaching the ultimate goal of creating intelligent machines. This thesis presents several cases where the application of GPU computing to cognitive robotics algorithms resulted in the development of large-scale neurocontrollers of previously unseen complexity, enabling the novel experiments described herein.
    European Commission Seventh Framework Programme

    SIMD@OpenMP : a programming model approach to leverage SIMD features

    SIMD instruction sets are a key feature in current general purpose and high performance architectures. SIMD instructions apply in parallel the same operation to a group of data, commonly known as vector. A single SIMD/vector instruction can, thus, replace a sequence of scalar instructions. Consequently, the number of instructions can be greatly reduced leading to improved execution times. However, SIMD instructions are not widely exploited by the vast majority of programmers. In many cases, taking advantage of these instructions relies on the compiler. Nevertheless, compilers struggle with the automatic vectorization of codes. Advanced programmers are then compelled to exploit SIMD units by hand, using low-level hardware-specific intrinsics. This approach is cumbersome, error prone and not portable across SIMD architectures. This thesis targets OpenMP to tackle the underuse of SIMD instructions from three main areas of the programming model: language constructions, compiler code optimizations and runtime algorithms. We choose the Intel Xeon Phi coprocessor (Knights Corner) and its 512-bit SIMD instruction set for our evaluation process. We make four contributions aimed at improving the exploitation of SIMD instructions in this scope. Our first contribution describes a compiler vectorization infrastructure suitable for OpenMP. This infrastructure targets for-loops and whole functions. We define a set of attributes for expressions that determine how the code is vectorized. Our vectorization infrastructure also implements support for several advanced vector features. This infrastructure is proven to be effective in the vectorization of complex codes and it is the basis upon which we build the following two contributions. The second contribution introduces a proposal to extend OpenMP 3.1 with SIMD parallelism. Essential parts of this work have become key features of the SIMD proposal included in OpenMP 4.0. 
We define the "simd" and "simd for" directives that allow programmers to describe SIMD parallelism of loops and whole functions. Furthermore, we propose a set of optional clauses that leads the compiler to generate a more efficient vector code. These SIMD extensions improve the programming efficiency when exploiting SIMD resources. Our evaluation on the Intel Xeon Phi coprocessor shows that our SIMD proposal allows the compiler to efficiently vectorize codes poorly or not vectorized automatically with the Intel C/C++ compiler. In the third contribution, we propose a vector code optimization that enhances overlapped vector loads. These vector loads redundantly read from memory scalar elements already loaded by other vector loads. Our vector code optimization improves the memory usage of these accesses by means of building a vector register cache and exploiting register-to-register instructions. Our proposal also includes a new clause (overlap) in the context of the SIMD extensions for OpenMP of our first contribution. This new clause allows enabling, disabling and tuning this optimization on demand. The last contribution tackles the exploitation of SIMD instructions in the OpenMP barrier and reduction primitives. We propose a new combined barrier and reduction tree scheme specifically designed to make the most of SIMD instructions. Our barrier algorithm takes advantage of simultaneous multi-threading technology (SMT) and it utilizes SIMD memory instructions in the synchronization process. The four contributions of this thesis are an important step in the direction of a more common and generalized use of SIMD instructions. Our work is having an outstanding impact on the whole OpenMP community, ranging from users of the programming model to compiler and runtime implementations. 
Our proposals in the context of OpenMP improve the programmability of the programming model, the overhead of runtime services and the execution time of applications by means of a better use of SIMD.
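The instruction-count argument in the SIMD@OpenMP abstract can be illustrated with a small conceptual model. The sketch below is plain Python, not OpenMP code and not the thesis's compiler infrastructure: it only counts how many "instructions" a loop needs when a 512-bit vector unit processes several elements at once. The 32-bit element size, the function name `simd_add` and the scalar remainder loop are assumptions made for illustration.

```python
# Conceptual model only: Python list slices stand in for vector registers.
VECTOR_BITS = 512    # SIMD width named in the abstract (Knights Corner)
ELEMENT_BITS = 32    # assumption: single-precision elements
LANES = VECTOR_BITS // ELEMENT_BITS   # elements handled per vector instruction


def simd_add(a, b):
    """Add two equal-length sequences and count modelled instructions.

    Returns (result, vector_ops, scalar_ops). The name and the counting
    scheme are hypothetical; they only illustrate strip-mining.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    out = [0.0] * n
    vector_ops = 0
    scalar_ops = 0

    # Main loop: each iteration models ONE vector instruction on LANES elements.
    full = n - n % LANES
    for start in range(0, full, LANES):
        stop = start + LANES
        out[start:stop] = [x + y for x, y in zip(a[start:stop], b[start:stop])]
        vector_ops += 1

    # Remainder loop: leftover elements are processed one at a time.
    for i in range(full, n):
        out[i] = a[i] + b[i]
        scalar_ops += 1

    return out, vector_ops, scalar_ops


a = [float(i) for i in range(100)]
b = [2.0 * i for i in range(100)]
result, vector_ops, scalar_ops = simd_add(a, b)
print(LANES, vector_ops, scalar_ops)  # prints: 16 6 4
```

Under these assumptions, 100 scalar additions shrink to 6 vector instructions plus 4 scalar ones. Real hardware may handle the remainder differently, so the counts are indicative only.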

    High performance computing for 3D image segmentation

    Digital image processing is a very popular and still very promising field of science, which has been successfully applied to numerous areas and problems, reaching fields like forensic analysis, security systems, multimedia processing, aerospace, automotive, and many more. A very important part of the image processing area is image segmentation. This refers to the task of partitioning a given image into multiple regions and is typically used to locate and mark objects and boundaries in input scenes. After segmentation the image represents a set of data far more suitable for further algorithmic processing and decision making. Image segmentation algorithms are a very broad field and they have received a significant amount of research interest. A good example of an area in which image processing plays a constantly growing role is the field of medical solutions. The expectations and demands that are presented in this branch of science are very high and difficult to meet for the applied technology. The problems are challenging and the potential benefits are significant and clearly visible. For over thirty years image processing has been applied to different problems and questions in medicine and the practitioners have exploited the rich possibilities that it offered. As a result, the field of medicine has seen significant improvements in the interpretation of examined medical data. Clearly, the medical knowledge has also evolved significantly over these years, as well as the medical equipment that serves doctors and researchers. Also the common computer hardware, which is present at homes, offices and laboratories, is constantly evolving and changing. All of these factors have sculptured the shape of modern image processing techniques and established in which ways it is currently used and developed. Modern medical image processing is centered around 3D images with high spatial and temporal resolution, which can bring a tremendous amount of data for medical practitioners. 
Processing of such large sets of data is not an easy task, requiring high computational power. Furthermore, in present times the computational power is not as easily available as in recent years, as the growth of possibilities of a single processing unit is very limited, and a trend towards multi-unit processing and parallelization of the workload is clearly visible. Therefore, in order to continue the development of more complex and more advanced image processing techniques, a new direction is necessary. A very interesting family of image segmentation algorithms, which has been gaining a lot of focus in the last three decades, is called Deformable Models. They are based on the concept of placing a geometrical object in the scene of interest and deforming it until it assumes the shape of objects of interest. This process is usually guided by several forces, which originate in mathematical functions, features of the input images and other constraints of the deformation process, like object curvature or continuity. A range of very desired features of Deformable Models include their high capability for customization and specialization for different tasks and also extensibility with various approaches for prior knowledge incorporation. This set of characteristics makes Deformable Models a very efficient approach, which is capable of delivering results in competitive times and with very good quality of segmentation, robust to noisy and incomplete data. However, despite the large amount of work carried out in this area, Deformable Models still suffer from a number of drawbacks. Those that have received the most attention include sensitivity to the initial position and shape of the model, sensitivity to noise in the input images and to flawed input data, and the need for user supervision over the process. 
The work described in this thesis aims at addressing the problems of modern image segmentation, which have arisen from the combination of the above-mentioned factors: the significant growth of image volume sizes, the growth of complexity of image processing algorithms, coupled with the change in processor development and the turn towards multi-processing units instead of growing bus speeds and the number of operations per second of a single processing unit. We present our innovative model for 3D image segmentation, called the Whole Mesh Deformation model, which holds a set of very desired features that successfully address the above-mentioned requirements. Our model has been designed specifically for execution on parallel architectures and with the purpose of working well with very large 3D images that are created by modern medical acquisition devices. Our solution is based on Deformable Models and is characterized by a very effective and precise segmentation capability. The proposed Whole Mesh Deformation (WMD) model uses a 3D mesh instead of a contour or a surface to represent the segmented shapes of interest, which allows exploiting more information in the image and obtaining results in shorter times. The model offers a very good ability for topology changes and allows effective parallelization of workflow, which makes it a very good choice for large data-sets. In this thesis we present a precise model description, followed by experiments on artificial images and real medical data.
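The general Deformable Models idea described in this abstract (a geometric object deformed by internal and external forces until it takes the shape of the object of interest) can be sketched in a few lines. The following is a minimal 2D illustration, not the Whole Mesh Deformation model: the closed contour, the synthetic circular "object", the force weights `alpha` and `beta`, and the function name `deform_step` are all assumptions chosen for the example.

```python
import math


def deform_step(points, target_radius, alpha=0.2, beta=0.5):
    """Move every contour point once under two illustrative forces.

    Internal force: pulls a point towards the midpoint of its two
    neighbours (a simple continuity/smoothness constraint).
    External force: pulls a point radially towards a synthetic object
    boundary, here a circle of radius target_radius centred at the origin.
    """
    n = len(points)
    moved = []
    for i, (x, y) in enumerate(points):
        prev_x, prev_y = points[i - 1]          # index -1 wraps to the last point
        next_x, next_y = points[(i + 1) % n]
        internal_x = (prev_x + next_x) / 2.0 - x
        internal_y = (prev_y + next_y) / 2.0 - y

        r = math.hypot(x, y)
        if r == 0.0:
            external_x = external_y = 0.0       # no radial direction at the centre
        else:
            external_x = (target_radius - r) * x / r
            external_y = (target_radius - r) * y / r

        moved.append((x + alpha * internal_x + beta * external_x,
                      y + alpha * internal_y + beta * external_y))
    return moved


# Initial contour: an ellipse well outside the object to be segmented.
contour = [(10.0 * math.cos(2 * math.pi * k / 32),
            6.0 * math.sin(2 * math.pi * k / 32)) for k in range(32)]

for _ in range(50):
    contour = deform_step(contour, target_radius=3.0)

max_deviation = max(abs(math.hypot(x, y) - 3.0) for x, y in contour)
print(f"largest distance from the object boundary: {max_deviation:.3f}")
```

Each point is updated from the previous positions of its two neighbours only, so all points can be moved independently within one step. That locality is the kind of property that makes such models candidates for parallel execution.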

    High-level compiler analysis for OpenMP

    Nowadays, applications from dissimilar domains, such as high-performance computing and high-integrity systems, require levels of performance that can only be achieved by means of sophisticated heterogeneous architectures. However, the complex nature of such architectures hinders the production of efficient code at acceptable levels of time and cost. Moreover, the need for exploiting parallelism adds complications of its own (e.g., deadlocks, race conditions,...). In this context, compiler analysis is fundamental for optimizing parallel programs. There is however a trade-off between complexity and profit: low complexity analyses (e.g., reaching definitions) provide information that may be insufficient for many relevant transformations, and complex analyses based on mathematical representations (e.g., polyhedral model) give accurate results at a high computational cost. A range of parallel programming models providing different levels of programmability, performance and portability enable the exploitation of current architectures. However, OpenMP has proved many advantages over its competitors: 1) it delivers levels of performance comparable to highly tunable models such as CUDA and MPI, and better robustness than low level libraries such as Pthreads; 2) the extensions included in the latest specification meet the characteristics of current heterogeneous architectures (i.e., the coupling of a host processor to one or more accelerators, and the capability of expressing fine-grained, both structured and unstructured, and highly-dynamic task parallelism); 3) OpenMP is widely implemented by several chip (e.g., Kalray MPPA, Intel) and compiler (e.g., GNU, Intel) vendors; and 4) although currently the model lacks resiliency and reliability mechanisms, many works, including this thesis, pursue their introduction in the specification. 
This thesis addresses the study of compiler analysis techniques for OpenMP with two main purposes: 1) enhance the programmability and reliability of OpenMP, and 2) prove OpenMP as a suitable model to exploit parallelism in safety-critical domains. Particularly, the thesis focuses on the tasking model because it offers the flexibility to tackle the parallelization of algorithms with load imbalance, recursiveness and uncountable loop based kernels. Additionally, current works have proved the time-predictability of this model, shortening the distance towards its introduction in safety-critical domains. To enable the analysis of applications using the OpenMP tasking model, the first contribution of this thesis is the extension of a set of classic compiler techniques with support for OpenMP. As a basis for including reliability mechanisms, the second contribution consists of the development of a series of algorithms to statically detect situations involving OpenMP tasks, which may lead to a loss of performance, non-deterministic results or run-time failures. A well-known problem of parallel processing related to compilers is the static scheduling of a program represented by a directed graph. Although the literature is extensive in static scheduling techniques, the work related to the generation of the task graph at compile-time is very scant. Compilers are limited by the knowledge they can extract, which depends on the application and the programming model. The third contribution of this thesis is the generation of a predicated task dependency graph for OpenMP that can be interpreted by the runtime in such a way that the cost of solving dependences is reduced to the minimum. 
With the previous contributions as a basis for determining the functional safety of OpenMP, the final contribution of this thesis is the adaptation of OpenMP to the safety-critical domain considering two directions: 1) indicating how OpenMP can be safely used in such a domain, and 2) integrating OpenMP into Ada, a language widely used in the safety-critical domain.
    Postprint (published version)
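The notion of a task dependency graph mentioned in this abstract can be illustrated with a short sketch. It is a simplified model under stated assumptions, not the predicated task dependency graph of the thesis and not an OpenMP runtime: tasks are described only by the data they read and write (loosely in the spirit of OpenMP `depend` clauses), and the names `Task`, `build_dependency_graph` and `schedule_levels` are hypothetical.

```python
from collections import namedtuple

# A task is modelled only by its name and the data items it reads and writes.
Task = namedtuple("Task", ["name", "reads", "writes"])


def build_dependency_graph(tasks):
    """Return {task name: set of earlier task names it must wait for}.

    Tasks are given in program (creation) order. A later task depends on an
    earlier one if they conflict on some data item: read-after-write,
    write-after-read or write-after-write.
    """
    deps = {task.name: set() for task in tasks}
    for j, later in enumerate(tasks):
        for earlier in tasks[:j]:
            read_after_write = earlier.writes & later.reads
            write_after_read = earlier.reads & later.writes
            write_after_write = earlier.writes & later.writes
            if read_after_write or write_after_read or write_after_write:
                deps[later.name].add(earlier.name)
    return deps


def schedule_levels(deps):
    """Group tasks into waves; tasks in the same wave can run concurrently."""
    remaining = {name: set(waits_for) for name, waits_for in deps.items()}
    levels = []
    while remaining:
        ready = sorted(name for name, waits_for in remaining.items() if not waits_for)
        if not ready:
            raise ValueError("dependency cycle detected")
        levels.append(ready)
        for name in ready:
            del remaining[name]
        for waits_for in remaining.values():
            waits_for.difference_update(ready)
    return levels


tasks = [
    Task("load_a", reads=set(), writes={"a"}),
    Task("load_b", reads=set(), writes={"b"}),
    Task("scale_a", reads={"a"}, writes={"c"}),
    Task("combine", reads={"b", "c"}, writes={"d"}),
]
graph = build_dependency_graph(tasks)
levels = schedule_levels(graph)
print(levels)  # prints: [['load_a', 'load_b'], ['scale_a'], ['combine']]
```

Computing such a graph ahead of time, as the abstract proposes for compile time, means the dependences do not have to be rediscovered while the program runs.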

    Multi-criteria optimization algorithms for high dose rate brachytherapy

    The overall purpose of this thesis is to use the knowledge of radiation physics, computer programming and computing hardware to improve cancer treatments. In particular, designing a treatment plan in radiation therapy can be complex and user-dependent, and this thesis aims to simplify current treatment planning in high dose rate (HDR) prostate brachytherapy. This project started from a widely used inverse planning algorithm, Inverse Planning Simulated Annealing (IPSA). In order to eventually lead to an ultra-fast and automatic inverse planning algorithm, three multi-criteria optimization (MCO) algorithms were implemented. With MCO algorithms, a desirable plan was selected after computing a set of treatment plans with various trade-offs. In the first study, an MCO algorithm was introduced to explore the Pareto surfaces in HDR brachytherapy. The algorithm was inspired by the MCO feature integrated in the Raystation system (RaySearch Laboratories, Stockholm, Sweden). For each case, 300 treatment plans were serially generated to obtain a uniform approximation of the Pareto surface. Each Pareto optimal plan was computed with IPSA, and each new plan was added to the Pareto surface portion where the distance between its upper boundary and its lower boundary was the largest. In a companion study, or the second study, a knowledge-based MCO (kMCO) algorithm was implemented to shorten the computation time of the MCO algorithm. To achieve this, two strategies were implemented: a prediction of the clinically relevant solution space with previous knowledge, and a parallel computation of treatment plans with two six-core CPUs. As a result, a small plan dataset (14 plans) was created, and one plan was selected as the kMCO plan. The planning efficiency and the dosimetric performance were compared between the physician-approved plans and the kMCO plans for 236 cases. The third and final study of this thesis was conducted in cooperation with Cédric Bélanger. A graphics processing unit (GPU) based MCO (gMCO) algorithm was implemented to further speed up the computation. Furthermore, a quasi-Newton optimization engine was implemented to replace the simulated annealing used in the first and the second study. In this way, one thousand IPSA-equivalent treatment plans with various trade-offs were computed in parallel. One plan was selected as the gMCO plan from the calculated plan dataset. The planning time and the dosimetric results were compared between the physician-approved plans and the gMCO plans for 457 cases. A large-scale comparison against the physician-approved plans shows that our latest MCO algorithm (gMCO) can result in an improved treatment planning efficiency (from minutes to 9.4 s) as well as an improved treatment plan dosimetric quality (Radiation Therapy Oncology Group (RTOG) acceptance rate from 92.6% to 99.8%). With three implemented MCO algorithms, this thesis represents a sustained effort to develop an ultra-fast, automatic and robust inverse planning algorithm in HDR brachytherapy.
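The idea of keeping only plans that represent genuine trade-offs, which underlies the Pareto surfaces mentioned in this abstract, can be shown with a generic Pareto filter. This sketch is not the MCO, kMCO or gMCO algorithm: the plan names, the two objective values per plan and the convention that lower is better are invented for illustration, and the functions `dominates` and `pareto_front` are hypothetical helpers.

```python
def dominates(p, q):
    """True if objective vector p is no worse than q everywhere and strictly
    better somewhere (all objectives are assumed to be minimised)."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def pareto_front(plans):
    """Keep the plans that no other plan dominates."""
    return {
        name: objectives
        for name, objectives in plans.items()
        if not any(dominates(other, objectives)
                   for other_name, other in plans.items()
                   if other_name != name)
    }


# Made-up example data: (target under-coverage penalty, organ-at-risk penalty).
plans = {
    "A": (1.0, 9.0),
    "B": (2.0, 4.0),
    "C": (3.0, 5.0),   # worse than B in both objectives
    "D": (6.0, 1.0),
    "E": (6.0, 2.0),   # tied with D on the first objective, worse on the second
}
front = pareto_front(plans)
print(sorted(front))  # prints: ['A', 'B', 'D']
```

A final plan would then be chosen from the remaining trade-offs. The selection rule used in the thesis is not given in the abstract, so none is sketched here.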

    Recognition of Japanese handwritten characters with Machine learning techniques

    The recognition of Japanese handwritten characters has always been a challenge for researchers. A large number of classes, their graphic complexity, and the existence of three different writing systems make this problem particularly difficult compared to Western writing. For decades, attempts have been made to address the problem using traditional OCR (Optical Character Recognition) techniques, with mixed results. With the recent popularization of machine learning techniques through neural networks, this research has been revitalized, bringing new approaches to the problem. These new results achieve performance levels comparable to human recognition. Furthermore, these new techniques have allowed collaboration with very different disciplines, such as the Humanities or East Asian studies, achieving advances in them that would not have been possible without this interdisciplinary work. In this thesis, these techniques are explored until reaching a sufficient level of understanding that allows us to carry out our own experiments, training neural network models with public datasets of Japanese characters. However, the scarcity of public datasets makes the task of researchers remarkably difficult. Our proposal to minimize this problem is the development of a web application that allows researchers to easily collect samples of Japanese characters through the collaboration of any user. Once the application is fully operational, the examples collected until that point will be used to create a new dataset in a specific format. Finally, we can use the new data to carry out comparative experiments with the previous neural network models