
    Performance and portability of accelerated lattice Boltzmann applications with OpenACC

    An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only with accelerator-specific programming languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives that mark regions of existing C, C++, or Fortran code to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
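    To make the directive-based idea concrete, here is a minimal sketch (not taken from the paper; the kernel, function name, and relaxation constant are illustrative assumptions) of how a C++ compute loop can be annotated with OpenACC so that the same source runs on a plain CPU or is offloaded to an accelerator by a supporting compiler:

```cpp
#include <vector>

// Toy relaxation kernel standing in for a lattice Boltzmann collision step.
// With an OpenACC compiler (e.g. nvc++ -acc) the loop is offloaded; other
// compilers simply ignore the pragmas and run it serially on the CPU.
void relax(std::vector<float>& f, float omega) {
    float* p = f.data();
    const long n = static_cast<long>(f.size());
    #pragma acc data copy(p[0:n])        // keep the array resident on the device
    {
        #pragma acc parallel loop        // compiler generates accelerator code here
        for (long i = 0; i < n; ++i)
            p[i] = (1.0f - omega) * p[i] + omega * 0.5f;
    }
}
```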

    STAPL-RTS: A Runtime System for Massive Parallelism

    Modern High Performance Computing (HPC) systems are complex, with deep memory hierarchies and increasing use of computational heterogeneity via accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On the one hand, they can explicitly manage machine resources, writing programs using low-level primitives from multiple APIs (e.g., MPI+OpenMP), creating implementations that are efficient but rigid, difficult to extend, and non-portable. Alternatively, users can adopt higher-level programming environments, often at the cost of performance. Our approach is to maintain the high-level nature of the application without sacrificing performance, by relying on the transfer of high-level application semantic knowledge between layers of the software stack at an appropriate level of abstraction and performing optimizations on a per-layer basis. In this dissertation, we present the STAPL Runtime System (STAPL-RTS), a runtime system built for portable performance, suitable for massively parallel machines. While the STAPL-RTS abstracts and virtualizes the underlying platform for portability, it uses information from the upper layers to perform the appropriate low-level optimizations that restore the performance characteristics. We outline the fundamental ideas behind the design of the STAPL-RTS, such as the always-distributed communication model and its asynchronous operations. Through appropriate code examples and benchmarks, we show that high-level information allows applications written on top of the STAPL-RTS to attain the performance of optimized but ad hoc solutions. Using the STAPL library, we demonstrate how this information guides important decisions in the STAPL-RTS, such as multi-protocol communication coordination and request aggregation using established C++ programming idioms. Recognizing that nested parallelism is of increasing interest for both expressivity and performance, we present a parallel model that combines asynchronous, one-sided operations with isolated nested parallel sections. Previous approaches to nested parallelism targeted either static applications, through the use of blocking, isolated sections, or dynamic applications, by using asynchronous mechanisms (i.e., recursive task spawning) that come at the expense of isolation. We combine the flexibility of dynamic task creation with the isolation guarantees of the static models by allowing the creation of asynchronous, one-sided nested parallel sections that work in tandem with the more traditional, synchronous, collective nested parallelism. This allows selective, run-time customizable use of parallelism in an application, based on the input and the algorithm.
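    As a purely conceptual sketch of the asynchronous, one-sided style discussed above (standard C++ futures are used as a stand-in; this is not the STAPL-RTS API, and async_partial_sum is a made-up name), a caller can fire work at data it does not own and synchronize only when the results are needed:

```cpp
#include <future>
#include <numeric>
#include <vector>

// Fire-and-continue request: the caller does not block; the work runs
// elsewhere and could itself open a nested parallel section.
std::future<double> async_partial_sum(const std::vector<double>& chunk) {
    return std::async(std::launch::async, [&chunk] {
        return std::accumulate(chunk.begin(), chunk.end(), 0.0);
    });
}

int main() {
    std::vector<std::vector<double>> partitions(4, std::vector<double>(1000, 1.0));
    std::vector<std::future<double>> pending;
    for (auto& p : partitions)
        pending.push_back(async_partial_sum(p)); // asynchronous, one request per partition
    double total = 0.0;
    for (auto& f : pending)
        total += f.get();                        // synchronize only at the end
    return total == 4000.0 ? 0 : 1;
}
```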

    Efficient Multiprogramming for Multicores with SCAF

    As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed run-time environments provide neither an interface nor a strategy for intelligently allocating hardware threads, or even for preventing oversubscription. Prior research methods either depend upon profiling applications ahead of time in order to make good decisions about allocations, or do not account for process efficiency at all, leading to poor performance. None of these prior methods has been adopted widely in practice. This paper presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution which supports existing malleable applications in making intelligent allocation decisions based on observed efficiency, without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also be supported easily with small modifications, without requiring application modification or recompilation. In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters, and demonstrate its effectiveness in aiding allocation decisions. We evaluated SCAF using the NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of improvement in the sum of speedups (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently, compared to equipartitioning, the best existing competing scheme in the literature. If the sum of speedups with SCAF is within 5% of that with equipartitioning (i.e., 0.95X < improvement factor in sum of speedups < 1.05X), we deem SCAF to break even; less than 0.95X is considered a slowdown, and greater than 1.05X an improvement. We found that SCAF improves on equipartitioning on 4 out of 5 machines, breaking even or improving in 80-89% of pairs and showing a mean improvement of 1.11-1.22X for the benchmark pairs on which it improves, depending on the machine. Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end users today. SCAF improves on or breaks even with the unmodified OpenMP runtimes on all 5 machines in 72-100% of pairs, with a mean improvement of 1.27-1.7X, depending on the machine.
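    The following sketch shows what malleability means in practice for an OpenMP code: the thread count is re-read before each parallel region rather than fixed at startup. The current_allocation() helper and the SCAF_THREADS variable are hypothetical stand-ins for whatever channel the SCAF daemon actually uses; they are not part of SCAF's published interface.

```cpp
#include <omp.h>
#include <cstdlib>

// Hypothetical query of the allocation decided by an external daemon.
int current_allocation(int fallback) {
    const char* s = std::getenv("SCAF_THREADS");
    int v = s ? std::atoi(s) : 0;
    return v > 0 ? v : fallback;
}

void step(double* x, int n) {
    // A malleable application can change its width between parallel regions.
    omp_set_num_threads(current_allocation(omp_get_max_threads()));
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] *= 0.5;   // each timestep may run with a different thread count
}
```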

    Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos

    Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications. Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that adds co-execution support to the oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.

    Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant) and the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union's Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The Integration II: Hybrid programming models work of Chapter 4 has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS).
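    A minimal SYCL sketch of the co-execution pattern follows (illustrative only: a fixed 50/50 split between a CPU queue and a GPU queue, whereas EngineCL and CoexecutorRuntime distribute the work dynamically through their own APIs):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    const std::size_t split = data.size() / 2;   // static split; real runtimes adapt it

    sycl::queue cpu_q{sycl::cpu_selector_v};
    sycl::queue gpu_q{sycl::gpu_selector_v};

    {   // two independent buffers over disjoint halves, so both devices run concurrently
        sycl::buffer<float> lo(data.data(), sycl::range<1>(split));
        sycl::buffer<float> hi(data.data() + split, sycl::range<1>(data.size() - split));
        auto launch = [](sycl::queue& q, sycl::buffer<float>& b) {
            q.submit([&](sycl::handler& h) {
                sycl::accessor a(b, h, sycl::read_write);
                h.parallel_for(b.get_range(), [=](sycl::id<1> i) { a[i] *= 2.0f; });
            });
        };
        launch(cpu_q, lo);   // same kernel on the CPU half ...
        launch(gpu_q, hi);   // ... and on the GPU half
    }   // buffer destructors wait for completion and copy results back to data
    return 0;
}
```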

    A Roadmap for HEP Software and Computing R&D for the 2020s

    Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the sheer amounts of data to be recorded. In planning for the HL-LHC in particular, it is critical that all of the collaborating stakeholders agree on the software goals and priorities, and that the efforts complement each other. In this spirit, this white paper describes the R&D activities required to prepare for this software upgrade. Peer reviewed.

    Techniques of design optimisation for algorithms implemented in software

    The overarching objective of this thesis was to develop tools for parallelising, optimising, and implementing algorithms on parallel architectures, in particular General-Purpose Graphics Processing Units (GPGPUs). Two projects were chosen from different application areas in which GPGPUs are used: a defence application involving image compression, and a modelling application in bioinformatics (computational immunology). Each project had its own specific objectives, as well as supporting the overall research goal. The defence / image compression project was carried out in collaboration with the Jet Propulsion Laboratory. The specific questions were: to what extent an algorithm designed for bit-serial hardware implementation, for the lossless compression of hyperspectral images on board unmanned aerial vehicles (UAVs), could be parallelised; whether GPGPUs could be used to implement that algorithm; and whether a software implementation, with or without GPGPU acceleration, could match the throughput of a dedicated hardware (FPGA) implementation. The dependencies within the algorithm were analysed, and the algorithm parallelised. The algorithm was implemented in software for GPGPU and optimised. During the optimisation process, profiling revealed less than optimal device utilisation, but no further optimisations resulted in an improvement in speed; the design had hit a local maximum of performance. Analysis of the arithmetic intensity and data flow exposed flaws in kernel occupancy, the standard metric used for GPU optimisation. Redesigning the implementation with revised criteria (fused kernels, lower occupancy, and greater data locality) led to a new implementation with 10x higher throughput. GPGPUs were shown to be viable for on-board implementation of the CCSDS lossless hyperspectral image compression algorithm, exceeding the performance of the hardware reference implementation and providing sufficient throughput for the next generation of image sensors as well. The second project was carried out in collaboration with biologists at the University of Arizona and involved modelling a complex biological system: VDJ recombination, which is involved in the formation of T-cell receptors (TCRs). Generation of immune receptors (T-cell receptors and antibodies) by VDJ recombination is an enormously complex process, which can theoretically synthesize more than 10^18 variants. Although originally thought to be random, the underlying mechanisms clearly have a non-random nature that preferentially creates a small subset of immune receptors in many individuals. Understanding this bias is a longstanding problem in the field of immunology. Modelling the process of VDJ recombination to determine the number of ways each immune receptor can be synthesized, previously thought to be untenable, is a key first step in determining how this special population is made. The computational tools developed in this thesis have allowed immunologists for the first time to comprehensively test and invalidate a longstanding theory (convergent recombination) for how this special population is created, while generating the data needed to develop novel hypotheses.
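    As a toy illustration of the fused-kernel / data-locality redesign described above (the arithmetic is made up; only the transformation pattern is the point), compare two separate passes over an array with a single fused pass that does the same work with half the memory traffic and roughly twice the arithmetic intensity:

```cpp
#include <vector>

// Unfused version: two "kernels", each streaming all of x through memory.
void two_passes(std::vector<float>& x) {
    for (auto& v : x) v = v * 2.0f;          // pass 1
    for (auto& v : x) v = v + 1.0f;          // pass 2
}

// Fused version: one pass, better locality, higher arithmetic intensity.
void fused(std::vector<float>& x) {
    for (auto& v : x) v = v * 2.0f + 1.0f;
}
```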

    The Bulletin, Undergraduate Catalog 2015-2016 (2015)

    https://red.mnstate.edu/bulletins/1098/thumbnail.jp

    Undergraduate Bulletin, 2016-2017

    https://red.mnstate.edu/bulletins/1100/thumbnail.jp