70 research outputs found

    Abstract Machine Models and Proxy Architectures for Exascale Computing

    GPU Computing for Cognitive Robotics

    This thesis presents the first investigation of the impact of GPU computing on cognitive robotics by providing a series of novel experiments in the area of action and language acquisition in humanoid robots and computer vision. Cognitive robotics is concerned with endowing robots with high-level cognitive capabilities to enable the achievement of complex goals in complex environments. Reaching the ultimate goal of developing cognitive robots will require tremendous amounts of computational power, which was until recently provided mostly by standard CPUs. CPU cores are optimised for serial code execution at the expense of parallel execution, which renders them relatively inefficient for high-performance computing applications. The ever-increasing market demand for high-performance, real-time 3D graphics has evolved the GPU into a highly parallel, multithreaded, many-core processor with extraordinary computational power and very high memory bandwidth. These vast computational resources of modern GPUs can now be used by most cognitive robotics models, as they tend to be inherently parallel. Various interesting and insightful cognitive models have been developed to address important scientific questions concerning action-language acquisition and computer vision. While they have provided important scientific insights, their complexity and range of application have not improved much in recent years. The experimental tasks as well as the scale of these models are often minimised to avoid excessive training times that grow exponentially with the number of neurons and the amount of training data. This impedes further progress and the development of complex neurocontrollers that would take cognitive robotics research a step closer to the ultimate goal of creating intelligent machines.
This thesis presents several cases where the application of GPU computing to cognitive robotics algorithms resulted in the development of large-scale neurocontrollers of previously unseen complexity, enabling the novel experiments described herein.
European Commission Seventh Framework Programme

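    Etude de l'adéquation des machines Exascale pour les algorithmes implémentant la méthode du Reverse Time Migration [Study of the suitability of Exascale machines for algorithms implementing the Reverse Time Migration method]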

    As we are expecting Exascale systems for the 2018-2020 time frame, performance analysis and characterization of applications for new processor architectures and large-scale systems are important tasks that make it possible to anticipate the changes required to exploit future HPC systems efficiently. This thesis focuses on seismic imaging applications used for modeling complex physical phenomena, in particular the depth imaging application called Reverse Time Migration (RTM). My first contribution consists in characterizing and modeling the performance of the computational core of RTM, which is based on finite-difference time-domain (FDTD) computations. I identify and explore the major tuning parameters influencing performance and the interaction between the architecture and the application. The second contribution is an analysis to identify the challenges for a hybrid and heterogeneous implementation of FDTD for manycore architectures. We target Intel’s first Xeon Phi co-processor, the Knights Corner. This architecture is an interesting proxy for our study since it contains some of the expected features of an Exascale system: concurrency and heterogeneity. My third contribution is an extension of the performance analysis and modeling to the full RTM. This adds communications and I/O to the computation part. RTM is a data-intensive application and requires the storage of intermediate values of the computational field, resulting in expensive I/O accesses. My fourth contribution is the final measurement and model validation of my hybrid RTM implementation on a large system.
This has been done on Stampede, a machine of the Texas Advanced Computing Center (TACC), which allows us to test the scalability up to 64 nodes, each containing one 61-core Xeon Phi and two 8-core CPUs, for a total close to 5000 heterogeneous cores.
Characterizing applications in order to prepare them for new architectures and port them to very large systems is an important step in anticipating the necessary modifications. As Exascale machines are expected in the 2018-2020 period, studying applications and preparing them for these machines is therefore essential. We focus on seismic imaging applications, and in particular on Reverse Time Migration (RTM), because it is widely used by oil companies for seismic exploration. The first part of our work studied the computational core of RTM, which consists of a finite-difference time-domain (FDTD) computation. We characterized this part of the application by highlighting the architectural aspects of current machines that strongly affect performance, notably caches, memory bandwidth, and prefetching. This study led to a performance model that predicts the DRAM traffic of FDTD. The second part of the thesis focuses on the impact of heterogeneity and parallelism on FDTD and on RTM. We chose Intel’s manycore architecture, Xeon Phi, and studied a "native" implementation and a heterogeneous, hybrid implementation, the "symmetric" version. Finally, we ported RTM to a heterogeneous cluster, Stampede at the Texas Advanced Computing Center (TACC), where we ran scalability tests on up to 64 nodes containing Xeon Phi coprocessors and Sandy Bridge processors, which corresponds to almost 5000 cores.
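As a rough illustration of the kind of finite-difference time-domain kernel the abstract refers to, the sketch below advances a 1D scalar wave field by one time step with a 3-point stencil. It is a minimal teaching example, not the thesis code: the function name `fdtd_step`, the 1D second-order scheme, and the fixed zero boundaries are all assumptions made here.

```python
def fdtd_step(u_prev, u_curr, courant2):
    """Advance a 1D wave field by one time step (illustrative sketch).

    u_prev, u_curr: field values at time steps n-1 and n.
    courant2: (c * dt / dx) ** 2, assumed <= 1 for stability.
    The two end points are held at zero (a simplifying assumption).
    """
    n = len(u_curr)
    u_next = [0.0] * n
    for i in range(1, n - 1):
        # Second-order central difference in space (3-point stencil).
        lap = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
        # Second-order central difference in time (leapfrog update).
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + courant2 * lap
    return u_next


# A unit pulse in the middle of a 5-point grid, initially at rest.
field = [0.0, 0.0, 1.0, 0.0, 0.0]
print(fdtd_step(field, field, 1.0))
```

Each step reads two full arrays and writes a third while doing only a handful of arithmetic operations per point, which gives some intuition for why caches and memory bandwidth figure so prominently in the performance characterization.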

    Towards scalable adaptive mesh refinement on future parallel architectures

    In the march towards exascale, supercomputer architectures are undergoing a significant change. Limited by power consumption and heat dissipation, future supercomputers are likely to be built around a lower-power many-core model. This shift in supercomputer design will require sweeping code changes in order to take advantage of the highly parallel architectures. Evolving or rewriting legacy applications to perform well on these machines is a significant challenge. Mini-applications, small computer programs that represent the performance characteristics of some larger application, can be used to investigate new programming models and improve the performance of the legacy application by proxy. These applications, being both easy to modify and representative, are essential for establishing a path to move legacy applications into the exascale era. The focus of the work presented in this thesis is the design, development and employment of a new mini-application, CleverLeaf, for shock hydrodynamics with block-structured adaptive mesh refinement (AMR). We report on the development of CleverLeaf, and show how the fresh start provided by a mini-application can be used to develop an application that is flexible, accurate, and easy to employ in the investigation of exascale architectures. We also detail the development of the first reported resident parallel block-structured AMR library for Graphics Processing Units (GPUs). Extending the SAMRAI library using the CUDA programming model, we develop datatypes that store data only in GPU memory, as well as the necessary operators for moving and interpolating data on an adaptive mesh. We show that executing AMR simulations on a GPU is up to 4.8× faster than on a CPU, and demonstrate scalability on over 4,000 nodes using a combination of CUDA and MPI.
Finally, we show how mini-applications can be employed to improve the performance of production applications on existing parallel architectures by selecting the optimal application configuration. Using CleverLeaf, we identify the most appropriate configurations on three contemporary supercomputer architectures. Selecting the best parameters for our application can reduce run-time by up to 82% and reduce memory usage by up to 32%.

    A Scalable Approach to Modeling on Accelerated Neuromorphic Hardware.

    Neuromorphic systems open up opportunities to enlarge the explorative space for computational research. However, it is often challenging to unite efficiency and usability. This work presents the software aspects of this endeavor for the BrainScaleS-2 system, a hybrid accelerated neuromorphic hardware architecture based on physical modeling. We introduce key aspects of the BrainScaleS-2 Operating System: experiment workflow, API layering, software design, and platform operation. We present use cases to discuss and derive requirements for the software and showcase the implementation. The focus lies on novel system and software features such as multi-compartmental neurons, fast re-configuration for hardware-in-the-loop training, applications for the embedded processors, the non-spiking operation mode, interactive platform access, and sustainable hardware/software co-development. Finally, we discuss further developments in terms of hardware scale-up, system usability, and efficiency

    Parallel optimization algorithms for high performance computing : application to thermal systems

    The need for optimization is present in every field of engineering. Moreover, applications requiring a multidisciplinary approach in order to make a step forward are increasing. This leads to the need to solve complex optimization problems that exceed the capacity of the human brain or intuition. A standard way of proceeding is to use evolutionary algorithms, among which genetic algorithms hold a prominent place. These are characterized by their robustness and versatility, as well as by their high computational cost and low convergence speed. Many optimization packages are available under free software licenses and are representative of the current state of the art in optimization technology. However, the ability of optimization algorithms to adapt to massively parallel computers while reaching satisfactory efficiency levels is still an open issue. Even packages suited for multilevel parallelism encounter difficulties when dealing with objective functions involving long and variable simulation times. This variability is common in Computational Fluid Dynamics and Heat Transfer (CFD & HT), nonlinear mechanics, etc., and is nowadays a dominant concern for large-scale applications. Current research on improving the performance of evolutionary algorithms is mainly focused on developing new search algorithms. Nevertheless, there is a vast body of well-performing sequential algorithms suitable for implementation on parallel computers. The gap to be covered is efficient parallelization. Moreover, advances in the research of both new search algorithms and efficient parallelization are additive, so that the enhancement of current state-of-the-art optimization software can be accelerated if both fronts are tackled simultaneously.
The motivation of this Doctoral Thesis is to make a step forward towards the successful integration of Optimization and High Performance Computing capabilities, which has the potential to boost technological development by providing better designs, shortening product development times and minimizing the required resources. After conducting a thorough state-of-the-art study of the mathematical optimization techniques available to date, a generic mathematical optimization tool has been developed, with a special focus on the application of the library to the field of Computational Fluid Dynamics and Heat Transfer (CFD & HT). Then the main shortcomings of the standard parallelization strategies available for genetic algorithms and similar population-based optimization methods have been analyzed. Computational load imbalance has been identified as the key factor degrading the optimization algorithm's scalability (i.e. parallel efficiency) when the average makespan of the batch of individuals is greater than the average time required by the optimizer to perform inter-processor communications. It occurs because processors are often unable to finish the evaluation of their queue of individuals simultaneously and need to be synchronized before the next batch of individuals is created. Consequently, the computational load imbalance is translated into idle time on some processors. Several load balancing algorithms have been proposed and exhaustively tested; they are extendable to any other population-based optimization method that needs to synchronize all processors after the evaluation of each batch of individuals.
Finally, a real-world engineering application that consists of optimizing the refrigeration system of a power electronic device has been presented as an illustrative example in which the use of the proposed load balancing algorithms is able to reduce the simulation time required by the optimization tool.
The growth in applications that require a multidisciplinary approach in order to advance is evident in every field of engineering, which entails the need to solve complex optimization problems that exceed the capacity of the human brain or intuition. In these cases it is common to use evolutionary algorithms, chiefly genetic algorithms, which are characterized by their robustness and versatility as well as by their high computational cost and low convergence speed. The many optimization packages available under free software licenses represent the current state of the art in optimization technology. However, the ability of optimization algorithms to adapt to massively parallel computers while reaching satisfactory efficiency levels remains an open task. Even packages adapted to multilevel parallelism have difficulty handling objective functions that require long and variable simulation times. This variability is common in Computational Fluid Dynamics and Heat Transfer (CFD & HT), nonlinear mechanics, etc., and is one of the main concerns in large-scale applications today. Current research aimed at improving the performance of evolutionary algorithms focuses mainly on developing new search algorithms. However, a wide variety of sequential algorithms suitable for implementation on parallel computers is already known. The pending task is to achieve efficient parallelization.
Moreover, advances in research on new search algorithms and on parallelization are additive, so the improvement of current optimization software will be accelerated if both fronts are tackled simultaneously. The motivation of this Doctoral Thesis is to move towards a complete integration of Optimization and High Performance Computing capabilities in order to boost technological development by providing better designs, shortening product development times and minimizing the required resources. After an exhaustive study of the state of the art in the mathematical optimization techniques available today, an optimization library oriented to the field of Computational Fluid Dynamics and Heat Transfer (CFD & HT) has been designed. Next, the main limitations of the parallelization strategies available for genetic algorithms and other population-based optimization methods have been analyzed. When the average evaluation time of the batch of individuals is greater than the average time the optimizer needs for inter-processor communications, the main cause of the degradation of the scalability, or parallel efficiency, of the optimization algorithm was found to be computational load imbalance. The reason is that processors often do not finish evaluating their queue of individuals simultaneously and must synchronize before the next batch of individuals is created. Consequently, the computational load imbalance turns into idle time on some processors. Several load balancing algorithms, applicable to any population-based optimization method that needs to synchronize the processors after each batch of evaluations, have been proposed and exhaustively tested.
Finally, a real engineering case that consists of optimizing the cooling system of a power electronics device has been presented as an illustrative example. It demonstrates that the use of the proposed load balancing algorithms can reduce the simulation time required by the optimization tool.
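To make the idle-time argument above concrete, the sketch below compares two ways of distributing a batch of individuals with variable evaluation times across processors that must synchronize at the end of the batch. It is only an illustration under stated assumptions: the greedy "next free processor" rule is a generic dynamic scheduling heuristic chosen here, not one of the load balancing algorithms proposed in the thesis, and the function names and sample timings are hypothetical.

```python
import heapq


def static_makespan(eval_times, n_procs):
    """Round-robin assignment fixed before any evaluation starts."""
    loads = [0.0] * n_procs
    for i, t in enumerate(eval_times):
        loads[i % n_procs] += t
    return max(loads), loads


def dynamic_makespan(eval_times, n_procs):
    """Give each individual to whichever processor becomes free first."""
    loads = [0.0] * n_procs
    free_at = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(free_at)
    for t in eval_times:
        busy_until, p = heapq.heappop(free_at)
        loads[p] = busy_until + t
        heapq.heappush(free_at, (loads[p], p))
    return max(loads), loads


def idle_time(loads):
    """Total time processors wait at the synchronization point."""
    makespan = max(loads)
    return sum(makespan - load for load in loads)


# Hypothetical batch: two long simulations among many short ones.
batch = [8.0, 1.0, 1.0, 1.0, 7.0, 1.0, 1.0, 1.0]

static_span, static_loads = static_makespan(batch, 4)
dynamic_span, dynamic_loads = dynamic_makespan(batch, 4)
print(static_span, idle_time(static_loads))    # 15.0 39.0
print(dynamic_span, idle_time(dynamic_loads))  # 8.0 11.0
```

With this batch, the fixed assignment happens to place both long evaluations on the same processor, so the other three wait at the barrier; handing out work on demand shortens the batch and cuts the accumulated idle time.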

    Tuning Parallel Applications in Parallel

    Auto-tuning has recently received significant attention from the High Performance Computing community. Most auto-tuning approaches are specialized to work either on specific domains such as dense linear algebra and stencil computations, or only at certain stages of program execution such as compile time and runtime. Real scientific applications, however, demand a cohesive environment that can efficiently provide auto-tuning solutions at all stages of application development and deployment. Towards that end, we describe a unified end-to-end approach to auto-tuning scientific applications. Our system, Active Harmony, takes a search-based collaborative approach to auto-tuning. Application programmers, library writers and compilers collaborate to describe and export a set of performance related tunable parameters to the Active Harmony system. These parameters define a tuning search-space. The auto-tuner monitors the program performance and suggests adaptation decisions. The decisions are made by a central controller using a parallel search algorithm. The algorithm leverages parallel architectures to search across a set of optimization parameter values. Different nodes of a parallel system evaluate different configurations at each timestep. Active Harmony supports runtime adaptive code-generation and tuning for parameters that require new code (e.g. unroll factors). Effectively, we merge traditional feedback directed optimization and just-in-time compilation. This feature also enables application developers to write applications once and have the auto-tuner adjust the application behavior automatically when run on new systems. We evaluated our system on multiple large-scale parallel applications and showed that our system can improve the execution time by up to 46% compared to the original version of the program. 
Finally, we believe that the success of any auto-tuning research depends on how effectively application developers, domain-experts and auto-tuners communicate and work together. To that end, we have developed and released a simple and extensible language that standardizes the parameter space representation. Using this language, developers and researchers can collaborate to export tunable parameters to the tuning frameworks. Relationships (e.g. ordering, dependencies, constraints, ranking) between tunable parameters and search-hints can also be expressed
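The search-based idea in this abstract, where several candidate configurations are evaluated in each round and the tuner moves to the best one, can be sketched as follows. This is not the Active Harmony API or its search algorithm: the neighbourhood search, the function names `neighbours` and `tune`, the parameter names, and the synthetic cost function are all assumptions made for illustration. The candidates are evaluated in a plain loop here, whereas in a parallel setting each one would be timed on a different node during the same timestep.

```python
def neighbours(cfg, space):
    """Configurations that differ from cfg by one step in one parameter."""
    out = []
    for name, values in space.items():
        i = values.index(cfg[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                out.append({**cfg, name: values[j]})
    return out


def tune(space, measure, start, max_rounds=20):
    """Move to the best neighbouring configuration until none improves."""
    best, best_cost = start, measure(start)
    for _ in range(max_rounds):
        candidates = neighbours(best, space)
        # Stand-in for a parallel evaluation: one candidate per node.
        costs = [measure(c) for c in candidates]
        cost, cfg = min(zip(costs, candidates), key=lambda pair: pair[0])
        if cost >= best_cost:
            break  # no neighbour is better: local optimum reached
        best, best_cost = cfg, cost
    return best, best_cost


# Hypothetical tunable parameters and a synthetic cost standing in for
# a measured execution time (minimum at unroll=4, tile=64).
space = {"unroll": [1, 2, 4, 8], "tile": [16, 32, 64, 128]}


def synthetic_cost(cfg):
    return abs(cfg["unroll"] - 4) + abs(cfg["tile"] - 64) / 16


print(tune(space, synthetic_cost, {"unroll": 1, "tile": 16}))
```

A local search of this kind can stall in a local minimum on a real performance surface, which is one reason practical auto-tuners use more robust search strategies.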

    Proceedings, MSVSCC 2015

    The Virginia Modeling, Analysis and Simulation Center (VMASC) of Old Dominion University hosted the 2015 Modeling, Simulation, & Visualization Student Capstone Conference on April 16th. The Capstone Conference features students in Modeling and Simulation, undergraduate and graduate degree programs, and fields from many colleges and/or universities. Students present their research to an audience of fellow students, faculty, judges, and other distinguished guests. For the students, these presentations afford them the opportunity to impart their innovative research to members of the M&S community from academic, industry, and government backgrounds. Also participating in the conference are faculty and judges who have volunteered their time to impart direct support to their students’ research, facilitate the various conference tracks, serve as judges for each of the tracks, and provide overall assistance to this conference. 2015 marks the ninth year of the VMASC Capstone Conference for Modeling, Simulation and Visualization. This year our conference attracted a number of fine student-written papers and presentations, resulting in a total of 51 research works that were presented. This year’s conference had record attendance thanks to the support from the various departments at Old Dominion University, other local universities, and the United States Military Academy at West Point. We greatly appreciate all of the work and energy that has gone into this year’s conference; it truly was a highly collaborative effort that has resulted in a very successful symposium for the M&S community and all of those involved. Below you will find a brief summary of the best papers and best presentations with some simple statistics of the overall conference contribution. Following that is a table of contents broken down by conference track category, with a copy of each included body of work.
Thank you again for your time and your contribution, as this conference is designed to continuously evolve and adapt to better suit the authors and M&S supporters. Dr. Yuzhong Shen, Graduate Program Director, MSVE, Capstone Conference Chair; John Shull, Graduate Student, MSVE, Capstone Conference Student Chair

    Simulation Intelligence: Towards a New Generation of Scientific Methods

    The original "Seven Motifs" set forth a roadmap of essential methods for the field of scientific computing, where a motif is an algorithmic method that captures a pattern of computation and data movement. We present the "Nine Motifs of Simulation Intelligence", a roadmap for the development and integration of the essential algorithms necessary for a merger of scientific computing, scientific simulation, and artificial intelligence. We call this merger simulation intelligence (SI), for short. We argue the motifs of simulation intelligence are interconnected and interdependent, much like the components within the layers of an operating system. Using this metaphor, we explore the nature of each layer of the simulation intelligence operating system stack (SI-stack) and the motifs therein: (1) Multi-physics and multi-scale modeling; (2) Surrogate modeling and emulation; (3) Simulation-based inference; (4) Causal modeling and inference; (5) Agent-based modeling; (6) Probabilistic programming; (7) Differentiable programming; (8) Open-ended optimization; (9) Machine programming. We believe coordinated efforts between motifs offer immense opportunity to accelerate scientific discovery, from solving inverse problems in synthetic biology and climate science, to directing nuclear energy experiments and predicting emergent behavior in socioeconomic settings. We elaborate on each layer of the SI-stack, detailing the state-of-the-art methods, presenting examples to highlight challenges and opportunities, and advocating for specific ways to advance the motifs and the synergies from their combinations. Advancing and integrating these technologies can enable a robust and efficient hypothesis-simulation-analysis type of scientific method, which we introduce with several use-cases for human-machine teaming and automated science.

    Emergence of Intelligent Navigation Behavior in Embodied Agents from Massive-Scale Simulation

    The goal of Artificial Intelligence is to build ‘thinking machines’ that ‘use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.’ In this dissertation, we will argue that the intelligence required for this goal emerges from massive-scale simulation. We will show a specific case: that intelligent navigation behavior emerges from massive-scale simulation and deep reinforcement learning. Towards this end, we introduce Decentralized Distributed PPO (DD-PPO), a method that scales reinforcement learning to multiple GPUs and machines. We use DD-PPO to train agents for PointGoal navigation (e.g. ‘Go 5 meters north and 10 meters east relative to start’) for the equivalent of 80 years of human experience. This massive-scale training results in near-perfect autonomous navigation in an unseen environment without access to a map. We then examine the inner workings of a special case of PointGoalNav agents. We find that (1) their memory enables shortcuts, i.e. efficient travel through previously unexplored parts of the environment; (2) there is emergence of maps in their memory, i.e. a detailed occupancy grid of the environment can be decoded from it. We then introduce Variable Experience Rollout (VER), a method that efficiently scales reinforcement learning on a single GPU or machine. We use VER to train chained skills for mobile manipulation. We find a surprising emergence of navigation in skills that do not ostensibly require any navigation. Specifically, the pick skill involves a robot picking an object from a table. During training, the robot was always spawned close to the table and never needed to navigate. However, we find that if navigation actions are part of the action space, the robot learns to navigate and then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.
Ph.D.