
    Integrating an N-Body Problem with SDC and PFASST

    Vortex methods for the Navier–Stokes equations are based on a Lagrangian particle discretization, which reduces the governing equations to a first-order initial-value system of ordinary differential equations for the positions and vorticities of N particles. In this paper, the accuracy of solving this system by time-serial spectral deferred corrections (SDC) as well as by the time-parallel Parallel Full Approximation Scheme in Space and Time (PFASST) is investigated. PFASST intertwines SDC iterations of differing resolution in a manner similar to the Parareal algorithm and uses a Full Approximation Scheme (FAS) correction to improve the accuracy of the coarser SDC iterations. It is demonstrated that SDC and PFASST can generate highly accurate solutions, and their performance, in terms of the number of function evaluations required for a given accuracy, is analyzed and compared to a standard Runge–Kutta method.
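
    To make the SDC building block concrete, the following is a minimal sketch in Python/NumPy of one explicit SDC time step for a generic ODE y' = f(t, y): a forward-Euler predictor over a set of quadrature nodes, followed by correction sweeps that add a spectral quadrature of the previous iterate. The node choice (Chebyshev), sweep count, and scalar test problem are illustrative assumptions, not details taken from the paper, whose implementation also covers the time-parallel PFASST variant.

```python
import numpy as np

def sdc_step(f, t0, y0, dt, M=5, sweeps=4):
    """One explicit SDC step for y' = f(t, y) on [t0, t0 + dt]:
    forward-Euler predictor plus `sweeps` correction sweeps."""
    # Chebyshev-Gauss-Lobatto quadrature nodes mapped to [0, 1]
    tau = 0.5 * (1.0 - np.cos(np.pi * np.arange(M) / (M - 1)))
    t = t0 + dt * tau
    # S[m, j]: integral of the j-th Lagrange basis polynomial over
    # [tau[m], tau[m+1]]; S @ F is a node-to-node quadrature of f.
    S = np.zeros((M - 1, M))
    for j in range(M):
        e = np.zeros(M); e[j] = 1.0
        antideriv = np.polyint(np.polyfit(tau, e, M - 1))
        S[:, j] = np.diff(np.polyval(antideriv, tau))
    # Predictor: forward Euler between nodes
    y = np.empty(M); y[0] = y0
    for m in range(M - 1):
        y[m + 1] = y[m] + dt * (tau[m + 1] - tau[m]) * f(t[m], y[m])
    # Deferred-correction sweeps
    for _ in range(sweeps):
        F = np.array([f(t[m], y[m]) for m in range(M)])
        ynew = np.empty(M); ynew[0] = y0
        for m in range(M - 1):
            ynew[m + 1] = (ynew[m]
                           + dt * (tau[m + 1] - tau[m]) * (f(t[m], ynew[m]) - F[m])
                           + dt * (S[m] @ F))   # high-order quadrature of old f
        y = ynew
    return y[-1]

# Usage: y' = -y, y(0) = 1; one step of size 1 approximates exp(-1).
print(sdc_step(lambda t, y: -y, 0.0, 1.0, 1.0))
```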

    A Scalable Parallel Algorithm for the Simulation of Structural Plasticity in the Brain

    The neural network in the brain is not hard-wired. Even in the mature brain, new connections between neurons are formed and existing ones are deleted, a phenomenon called structural plasticity. The dynamics of the connectome is key to understanding how learning, memory, and healing after lesions such as stroke work. However, with current experimental techniques, even the creation of an exact static connectivity map, which is required for various brain simulations, is very difficult. One alternative is to use simulations based on network models to predict the evolution of synapses between neurons from their specified activity targets. This is particularly useful because experimental measurements of the spiking frequency of neurons are more easily accessible and reliable than biological connectivity data. The Model of Structural Plasticity (MSP) by Butz and van Ooyen is an example of this approach. In traditional models, connectivity between neurons is fixed, and plasticity merely arises from changes in the strength of existing synapses, typically modeled as weight factors. MSP, in contrast, models a synapse as a connection between an "axonal" plug and a "dendritic" socket. These synaptic elements grow and shrink independently on each neuron. When an axonal element of one neuron connects to the dendritic element of another neuron, a new synapse is formed. Conversely, when a synaptic element bound in a synapse retracts, the corresponding synapse is removed. The governing idea of the model is that plasticity in cortical networks is driven by the need of individual neurons to homeostatically maintain their average electrical activity. However, to predict which neurons connect to each other, the current MSP model computes probabilities for all pairs of neurons, resulting in O(n^2) complexity. For large-scale simulations with millions of neurons and beyond, this quadratic term is prohibitive. Inspired by hierarchical methods for solving n-body problems in particle physics, this dissertation presents a scalable approximation algorithm for simulating structural plasticity based on MSP. To scale MSP to millions of neurons, we adapt the Barnes-Hut algorithm used in gravitational particle simulations into a scalable solution for simulating structural plasticity in the brain, reducing the time complexity from O(n^2) to O(n log^2 n). We then show through experimental validation that the approximation underlying the algorithm does not adversely affect the quality of the results; for this purpose, we compare neural networks created by the original MSP with those created by our approximation using graph metrics. Finally, we demonstrate that our approximation algorithm can simulate the dynamics of the connectome with 10^9 neurons - four orders of magnitude more than the naive O(n^2) version, and two orders of magnitude fewer than the human brain. We present a scalable MPI-based implementation of the algorithm, and our performance extrapolations predict that, given sufficient compute resources, even with today's technology a full-scale simulation of the human brain with 10^11 neurons is possible in principle. Until now, the largest structural plasticity simulations with MSP corresponded, in number of neurons, to the scale of a fruit fly. Our approximation algorithm goes a significant step further, reaching a scale similar to that of a galago primate.
Additionally, large-scale brain connectivity maps can now be grown from scratch, and their evolution after destructive events such as stroke can be simulated.
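
    The core of the speedup is the Barnes-Hut acceptance criterion: a distant group of neurons is treated as a single pseudo-neuron at the group's centroid, weighted by the number of free synaptic elements it contains. The sketch below illustrates this idea in Python for a two-dimensional toy setting; the Gaussian kernel, the θ threshold, and all names are illustrative assumptions, not the dissertation's actual MSP kernel or MPI code.

```python
import numpy as np

THETA = 0.5   # Barnes-Hut opening criterion: cell size / distance (assumed)
SIGMA = 10.0  # width of the distance-dependent attraction kernel (assumed)

def kernel(d2):
    """MSP-style attraction between an axonal and a dendritic element,
    modeled here as a Gaussian of the squared distance (a simplification)."""
    return np.exp(-d2 / SIGMA**2)

class Cell:
    """Quadtree cell storing the count of free dendritic elements in its
    subtree and their position centroid."""
    def __init__(self, pts, lo, size):
        self.size = size
        self.count = len(pts)
        self.centroid = pts.mean(axis=0)
        self.children = []
        if self.count > 1:
            mid = lo + size / 2
            for qx in (False, True):
                for qy in (False, True):
                    m = ((pts[:, 0] >= mid[0]) == qx) & ((pts[:, 1] >= mid[1]) == qy)
                    if m.any():
                        off = np.array([qx, qy]) * size / 2
                        self.children.append(Cell(pts[m], lo + off, size / 2))

def attraction(cell, x):
    """Total attraction felt at axon position x. A cell that is small
    relative to its distance is approximated by count * kernel(centroid)."""
    d2 = np.sum((cell.centroid - x) ** 2)
    if not cell.children or cell.size**2 < THETA**2 * d2:
        return cell.count * kernel(d2)
    return sum(attraction(c, x) for c in cell.children)

# Usage: connection probabilities scale with this aggregate attraction,
# normalized over all candidate targets.
rng = np.random.default_rng(0)
pts = rng.uniform(0, 100, size=(10_000, 2))
root = Cell(pts, lo=np.zeros(2), size=100.0)
print(attraction(root, np.array([50.0, 50.0])))
```

    Because each query descends the tree and opens only cells that are large relative to their distance, one axon's search costs O(log n) rather than O(n), which is the source of the near-linear overall complexity cited above.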

    Compiler and Runtime Optimization Techniques for Implementing Scalable Parallel Applications

    A compiler can detect the data dependencies in an application and analyze specific sections of code for parallelization potential. However, these compiler techniques are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. Compiler techniques should consider both the static information gathered at compile time and the dynamic information captured at runtime to generate a safe parallel application. On the other hand, runtime information is often speculative, and relying on it alone does not guarantee maximal parallel performance, so information collected at compile time can significantly improve the performance of runtime techniques. This research achieves that goal by introducing new techniques for both the compiler and the runtime system that enable them to cooperate and to exploit both static and dynamic analysis to maximize parallel application performance. In the proposed framework, the compiler can incorporate dynamic runtime methods into its parallelization optimizations, and the runtime system can apply static information in its parallelization methods. The proposed techniques use high-level programming abstractions and machine learning to relieve the programmer of difficult and tedious decisions that can significantly affect program behavior and performance.
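
    As one concrete illustration of this interplay (a generic inspector-executor pattern, not the specific techniques of this thesis): static analysis often cannot prove that an indexed update is dependence-free, because the index values are only known at runtime, so the compiler can emit a cheap runtime inspection that selects between a parallel and a conservative serial version. All names in this Python sketch are assumed for illustration.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def inspector(idx):
    """Runtime analysis: the update y[idx[i]] += x[i] is dependence-free
    iff no two iterations write the same target."""
    return len(np.unique(idx)) == len(idx)

def executor(y, x, idx, parallel_safe, workers=4):
    """Executor: dispatch to the version the inspection proved safe."""
    if not parallel_safe:                  # conservative serial fallback
        for i in range(len(idx)):
            y[idx[i]] += x[i]
        return
    def chunk(lo):                         # targets are globally disjoint,
        hi = lo + step                     # so chunks may run concurrently
        y[idx[lo:hi]] += x[lo:hi]          # (threads here are illustrative)
    step = (len(idx) + workers - 1) // workers
    with ThreadPoolExecutor(workers) as pool:
        list(pool.map(chunk, range(0, len(idx), step)))

# Usage: a permutation has no duplicate targets, so the runtime check
# enables the parallel path that static analysis alone could not prove safe.
idx = np.random.permutation(1_000_000)
y = np.zeros(1_000_000); x = np.ones(1_000_000)
executor(y, x, idx, inspector(idx))
```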

    Afterlive: A performant code for Vlasov-Hybrid simulations

    A parallelized implementation of the Vlasov-Hybrid method [Nunn, 1993] is presented. This method is a hybrid between a gridded Eulerian description and Lagrangian meta-particles. Unlike the Particle-in-Cell method [Dawson, 1983], which simply adds up the contributions of meta-particles, this method reconstructs the distribution function f in every time step for each species. The interpolation combines meta-particles with different weights in such a way that particles with large weight do not drown out particles that represent small contributions to the phase space density. These core properties allow the use of a much larger range of macro factors and can thus represent a much larger dynamic range in phase space density. The reconstructed phase space density f is used to calculate moments of the distribution function such as the charge density ρ. The charge density ρ is in turn used as input to a spectral solver that calculates the self-consistent electrostatic field, which is used to update the particles for the next time step. Afterlive (A Fourier-based Tool in the Electrostatic limit for the Rapid Low-noise Integration of the Vlasov Equation) is fully parallelized using MPI and writes output using parallel HDF5. The input to the simulation is read from a JSON description that sets the initial particle distributions as well as the domain size and discretization constraints. The implementation presented here is intentionally limited to one spatial dimension and resolves one or three dimensions in velocity space. Additional spatial dimensions can be added in a straightforward way, but would make runs computationally even more costly.
    Comment: Accepted for publication in Computer Physics Communications
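
    As a sketch of the spectral field solve described above (assumed normalizations; this is not Afterlive's code): on a periodic 1D grid, Gauss's law dE/dx = ρ/ε0 becomes the algebraic relation ik·E_k = ρ_k/ε0 in Fourier space, where ρ is itself the zeroth velocity moment of the reconstructed f, so the electrostatic field follows from one FFT pair.

```python
import numpy as np

def electric_field(rho, L, eps0=1.0):
    """Solve dE/dx = rho/eps0 on a periodic grid of length L by dividing
    each Fourier mode of rho by i*k."""
    n = len(rho)
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)   # wavenumbers
    rho_k = np.fft.fft(rho)
    E_k = np.zeros_like(rho_k)
    nz = k != 0
    E_k[nz] = rho_k[nz] / (1j * k[nz] * eps0)    # ik E_k = rho_k / eps0
    return np.fft.ifft(E_k).real                  # k = 0 dropped: net-neutral plasma

# Usage: a sinusoidal charge perturbation on a 64-cell grid yields
# E(x) = 0.1 sin(x), as expected analytically.
L, n = 2 * np.pi, 64
x = np.arange(n) * L / n
E = electric_field(0.1 * np.cos(x), L)
```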

    Enabling the use of embedded and mobile technologies for high-performance computing

    In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by the x86 architecture. In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chip (SoCs). This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through the development of real system prototypes and their performance analysis, we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of future mobile SoCs, we identify missing features and suggest improvements that would enable their use in an HPC environment. Thus, we present design guidelines for future generations of mobile SoCs, and for HPC systems built around them, enabling a new class of inexpensive supercomputers.

    Parallel computing 2011, ParCo 2011: book of abstracts

    This book contains the abstracts of the presentations at the conference Parallel Computing 2011, held in Ghent, Belgium, from 30 August to 2 September 2011.

    Towards instantaneous performance analysis using coarse-grain sampled and instrumented data

    Nowadays, supercomputers deliver an enormous amount of computational power; however, it is well known that applications reach only a fraction of it. One limiting factor is single-processor performance, because it ultimately dictates the overall achieved performance. Performance analysis tools help locate performance inefficiencies and identify their nature, with the aim of ultimately improving application performance. Performance tools rely on two collection techniques to invoke their performance monitors: instrumentation and sampling. Instrumentation refers to injecting performance monitors at specific application locations, whereas sampling invokes the installed monitors in response to external events. Each technique has its advantages: measurements obtained through instrumentation are directly associated with the application structure, while sampling offers a simple way to control the volume of measurements captured. However, the measurement granularity that provides valuable insight cannot be determined a priori. When analysts study the performance of an application for the first time, they may instrument every routine or use high-frequency sampling rates to obtain the most detailed results. These approaches frequently incur large overheads that perturb the application, alter the gathered measurements, and therefore mislead the analyst. This thesis introduces the folding mechanism, which takes advantage of the repetitiveness found in many applications. The mechanism combines metrics captured through coarse-grain sampling and instrumentation to provide instantaneous metric reports within instrumented regions without perturbing the application execution. To produce these reports, the folding mechanism processes metrics from different types of sources: performance and energy counters, source code, and memory references. The processing depends on their nature: while performance and energy counters represent continuous metrics, source code and memory references are discrete values that point to locations within the application code or address space. This thesis evaluates and validates two fitting algorithms used in different areas to report continuous metrics: a Gaussian-process interpolation method known as Kriging, and piece-wise linear regressions. The folding mechanism also benefits from analytical performance models to focus on a small set of performance metrics instead of exploring a myriad of performance counters, and it correlates the metrics with the source code in two alternative ways: through the outcome of the piece-wise linear regressions and through a mechanism inspired by multiple-sequence-alignment techniques. Finally, this thesis explores the applicability of the folding mechanism to captured memory references to detail which data objects are accessed and how. This thesis proposes an analysis methodology for parallel applications that focuses on describing the most time-consuming computing regions. It is implemented on top of a framework that relies on a previously existing clustering tool and the folding mechanism. To show the usefulness of the methodology and the framework, this thesis discusses multiple in-production applications analyzed for the first time. The discussions provide a high level of detail regarding the applications' performance bottlenecks and the code responsible for them.
Although many of the analyzed applications were compiled with aggressive compiler optimization flags, the insight obtained from the folding mechanism has led to small code transformations, based on widely known optimization techniques, that improved performance in some cases. Additionally, this work describes the power-monitoring capabilities of recent processors and discusses the combined performance and energy behavior of a selection of benchmarks and in-production applications.
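
    A minimal sketch of the folding idea for one continuous counter (the details - one coarse sample per region instance, normalized time, fixed breakpoints - are assumptions for illustration; the thesis tool works on real instrumented traces and also supports Kriging): samples from many instances of the same region are mapped onto a common normalized time axis, and a piece-wise linear fit of the accumulated counts recovers the instantaneous rate inside the region.

```python
import numpy as np

def fold(samples):
    """samples: (t_rel, duration, count_rel) per sample, where t_rel is
    the time since region entry and count_rel the counter value
    accumulated since entry. Returns (phase, count) sorted by phase."""
    pts = np.array([(t / d, c) for t, d, c in samples])
    return pts[np.argsort(pts[:, 0])]

def piecewise_rate(pts, breaks):
    """Fit the accumulated counts on each segment; the slope is the
    instantaneous rate (counts per unit of normalized time)."""
    rates = []
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        m = (pts[:, 0] >= lo) & (pts[:, 0] < hi)
        if m.sum() >= 2:
            slope, _ = np.polyfit(pts[m, 0], pts[m, 1], 1)
            rates.append((lo, hi, slope))
    return rates

# Usage: synthetic samples from 200 instances of one region whose
# instruction rate doubles halfway through; the fit recovers ~1e9
# in the first half and ~2e9 in the second.
rng = np.random.default_rng(1)
samples = []
for _ in range(200):
    d = rng.uniform(0.9, 1.1)       # per-instance region duration
    t = rng.uniform(0, d)           # one coarse sample per instance
    phase = t / d
    c = (phase if phase < 0.5 else 0.5 + 2 * (phase - 0.5)) * 1e9
    samples.append((t, d, c))
print(piecewise_rate(fold(samples), np.linspace(0, 1, 5)))
```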