140 research outputs found
Investigation of Parallel Data Processing Using Hybrid High Performance CPU + GPU Systems and CUDA Streams
The paper investigates parallel data processing in a hybrid CPU+GPU(s) system using multiple CUDA streams to overlap communication and computation. This is crucial for efficient data processing, in particular for incoming data streams that would naturally be forwarded to GPUs over multiple CUDA streams. Performance is evaluated for various ratios of compute time to host-device communication time, various numbers of CUDA streams, and various numbers of threads managing computations on the GPUs. Tests also reveal the benefits of using CUDA MPS to overlap communication and computation when multiple processes are used. Furthermore, versions using standard GPU memory allocation and Unified Memory are compared, the latter including programmer-added prefetching. The performance of a hybrid CPU+GPU version, as well as scaling across multiple GPUs, is demonstrated, showing good speed-ups for the approach. Finally, the performance per unit of power consumption of selected configurations is presented for various numbers of streams and various relative performances of GPUs and CPUs.
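The benefit of overlapping copies and kernels that the abstract evaluates can be illustrated with a back-of-envelope pipeline model (a sketch under simplifying assumptions, not the paper's measurement methodology; the function name and two-stage copy-then-compute pipeline are this sketch's own):

```python
def pipeline_time(n_chunks, t_copy, t_compute):
    """Compare serial vs. stream-overlapped time for n_chunks of work,
    each needing a host-to-device copy (t_copy) and a kernel (t_compute).
    With enough streams, copies and kernels from different chunks overlap,
    so the steady state is paced by the slower of the two stages."""
    serial = n_chunks * (t_copy + t_compute)
    bottleneck = max(t_copy, t_compute)
    # Fill the pipeline once, then emit one chunk per bottleneck period.
    overlapped = t_copy + t_compute + (n_chunks - 1) * bottleneck
    return serial, overlapped

serial, overlapped = pipeline_time(8, t_copy=1.0, t_compute=3.0)
```

For a 1:3 communication-to-compute ratio and 8 chunks this gives 32 vs. 25 time units; as the ratio approaches 1:1 the overlapped version approaches a 2x speed-up, which is why the compute/communication ratio is a natural axis in such an evaluation.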
GPU devices for safety-critical systems: a survey
Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel, and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources gives rise to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices' random hardware failures, systematic failures, and independence of execution. This work has been partially supported by the European Research Council under Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence, and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020-045931-I).
Energy models in data parallel CPU/GPU computations
The thesis addresses the problem of modelling energy consumption in data-parallel (map) computations on architectures with GPUs. The developed models are used to evaluate the trade-off between performance and energy consumption. The thesis includes experimental results that validate both the developed models and the methodology for finding a trade-off between performance and consumption.
A framework for the efficient execution of applications on GPU and CPU+GPU
Technological limitations faced by semiconductor manufacturers in the early 2000s restricted the increase in performance of sequential computation units. The current trend is to increase the number of processor cores per socket and to progressively use GPU cards for highly parallel computations. The complexity of recent architectures makes it difficult to statically predict the performance of a program. We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources available on a system. The first technique consists in jointly using the CPU and GPU to execute a code. To achieve high performance, it is mandatory to consider load balancing, in particular by predicting execution times. The runtime uses the profiling results, and the scheduler computes the execution times and adjusts the load distributed to the processors. The second technique puts the CPU and GPU in competition: instances of the considered code are executed simultaneously on the CPU and GPU. The winner of the competition notifies its completion to the other instance, implying the termination of the latter.
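The load-balancing idea behind the first technique can be sketched as follows (a minimal illustration, assuming per-item execution-time predictions are already available from profiling; the function name and signature are this sketch's own, not the framework's API):

```python
def split_work(total_items, t_cpu_per_item, t_gpu_per_item):
    """Split a data-parallel workload between CPU and GPU so that both
    devices are predicted to finish at the same time:
        n_cpu * t_cpu = n_gpu * t_gpu,  n_cpu + n_gpu = total_items.
    Faster devices therefore receive proportionally more items."""
    share_cpu = t_gpu_per_item / (t_cpu_per_item + t_gpu_per_item)
    n_cpu = round(total_items * share_cpu)
    return n_cpu, total_items - n_cpu

n_cpu, n_gpu = split_work(1000, t_cpu_per_item=4.0, t_gpu_per_item=1.0)
```

With the GPU predicted to be 4x faster per item, the CPU gets 200 items and the GPU 800, so both finish in the same 800 predicted time units; the accuracy of the execution-time prediction directly determines how well the load is balanced.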
GPU optimizations for a production molecular docking code
Thesis (M.Sc.Eng.) -- Boston University
Scientists have always felt the desire to perform computationally intensive tasks that surpass the capabilities of conventional single-core computers. As a result of this trend, Graphics Processing Units (GPUs) have come to be increasingly used for general computation in scientific research. This field of GPU acceleration is now a vast and mature discipline.
Molecular docking, the modeling of the interactions between two molecules, is a particularly computationally intensive task that has been the subject of research for many years. It is a critical simulation tool used for the screening of protein compounds for drug design and in research of the nature of life itself. The PIPER molecular docking program was previously accelerated using GPUs, achieving a notable speedup over conventional single core implementation. Since its original release the development of the CPU based PIPER has not ceased, and it is now a mature and fast parallel code. The GPU version, however, still contains many potential points for optimization. In the current work, we present a new version of GPU PIPER that attains a 3.3x speedup over a parallel MPI version of PIPER running on an 8 core machine and using the optimized Intel Math Kernel Library. We achieve this speedup by optimizing existing kernels for modern GPU architectures and migrating critical code segments to the GPU. In particular, we both improve the runtime of the filtering and scoring stages by more than an order of magnitude, and move all molecular data permanently to the GPU to improve data locality. This new speedup is obtained while retaining a computational accuracy virtually identical to the CPU based version. We also demonstrate that, due to the algorithmic dependencies of the PIPER algorithm on the 3D Fast Fourier Transform, our GPU PIPER will likely remain proportionally faster than equivalent CPU based implementations, and with little room for further optimizations.
This new GPU-accelerated version of PIPER is integrated as part of the ClusPro molecular docking and analysis server at Boston University. ClusPro has over 4000 registered users, and more than 50000 jobs have been run over the past 4 years.
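The abstract's claim that PIPER's performance is tied to the 3D FFT follows from how FFT-based docking scores every rigid translation of the ligand at once via the convolution theorem. A minimal sketch of that core step (using NumPy on the host purely for illustration; PIPER's actual grids, energy terms, and GPU kernels are not reproduced here):

```python
import numpy as np

def fft_correlation_scores(receptor_grid, ligand_grid):
    """Score every translation of the ligand over the receptor in one shot:
    correlation via the convolution theorem (forward 3D FFTs, pointwise
    multiply with the conjugate, inverse 3D FFT)."""
    R = np.fft.fftn(receptor_grid)
    L = np.fft.fftn(ligand_grid, s=receptor_grid.shape)  # zero-pad ligand
    return np.fft.ifftn(R * np.conj(L)).real

# Toy example: a 2x2x2 ligand best overlaps the receptor block at (2,2,2).
receptor = np.zeros((8, 8, 8))
receptor[2:4, 2:4, 2:4] = 1.0
scores = fft_correlation_scores(receptor, np.ones((2, 2, 2)))
best_shift = np.unravel_index(np.argmax(scores), scores.shape)
```

Because the FFTs dominate the runtime, the speed of any PIPER implementation is ultimately bounded by its 3D FFT library, which is the basis of the abstract's argument about the remaining optimization headroom.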
G-PUF: a software-only PUF for GPUs
Physical Unclonable Functions (PUFs) are security primitives which allow the generation of unique IDs and security keys. Their security stems from the inherent process variations of silicon chip manufacturing and the minute random effects introduced in integrated circuits. PUFs are usually manufactured specifically for this purpose, but in the last few years several proposals have developed PUFs from off-the-shelf components. These Intrinsic PUFs avoid modifications to the hardware and exploit the low cost of adapting existing technologies. Graphics Processing Units (GPUs) present themselves as promising candidates for an Intrinsic PUF. GPUs are massively multi-processed systems originally built for graphics computing and more recently re-designed for general computing. These devices are distributed across a variety of systems and application environments, from computer vision platforms to server clusters and home computers. Building PUFs with software-only strategies is a challenging problem, since a PUF must evaluate process variations without degrading system performance, something that is easily done in hardware. In this work we present G-PUF, an Intrinsic PUF technology running entirely on CUDA. The proposed solution maps the distribution of soft errors in matrix multiplications when the GPU is running under the adversarial conditions of overclocking and undervolting. The resulting error map is unique to each GPU, and using a novel Challenge-Response Pair extraction algorithm, G-PUF is able to retrieve secure keys or a device ID without disclosing information about the PUF's randomness. The system was tested in real setups and requires no modifications whatsoever to an already operational GPU.
G-PUF was capable of achieving upwards of 94.73% reliability without any error correction code and can provide up to 253 unique Challenge-Response Pairs.
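The shape of such a scheme can be sketched as follows. This is a generic illustration only: the paper's actual Challenge-Response Pair extraction algorithm is not given here, and the function name, hashing scheme, and error-map encoding are all this sketch's own assumptions.

```python
import hashlib

def response_from_error_map(error_positions, challenge, n_bits=64):
    """Derive a response from the device-unique set of faulty output
    positions observed in matrix multiplications under stress: bind the
    challenge to the error map and hash, so the response reveals the
    map's entropy only indirectly."""
    material = repr(sorted(error_positions)).encode() + challenge
    digest = hashlib.sha256(material).digest()
    return digest[: n_bits // 8].hex()

# Two devices with different error maps answer the same challenge differently.
map_a = [(0, 1), (3, 7), (5, 2)]
map_b = [(0, 1), (4, 4)]
r_a = response_from_error_map(map_a, b"challenge-1")
r_b = response_from_error_map(map_b, b"challenge-1")
```

The essential PUF properties visible even in this toy version are determinism (the same device and challenge always reproduce the same response) and uniqueness (different error maps diverge); the 94.73% reliability figure reflects how stable the real error map is across repeated stressed runs.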
Prediction Models for Estimating the Efficiency of Distributed Multi-Core Systems
The efficiency of a multi-core architecture is directly related to the mechanisms that map the threads (processes in execution) to the cores. Determining the resource availability (CPU and main memory) of the multi-core architecture based on the characteristics of the threads that are in execution is the art of system performance prediction. In this dissertation we develop several prediction models for multi-core architectures and perform empirical evaluations to demonstrate the accuracy of these models.
Prediction of resource availability is important in the context of making process assignment, load balancing, and scheduling decisions. In a distributed infrastructure, resources are allocated on demand on a chosen set of compute nodes. The nodes chosen to perform the computations dictate the efficiency with which the jobs assigned to them will be executed. The prediction models allow us to estimate the resource availability without explicitly querying the individual nodes. With the model in hand and knowledge of the jobs (such as peak memory requirement and CPU execution profile), we can determine the appropriate compute nodes for each of the jobs in such a way that resource utilization is improved and job execution is sped up.
More specifically, we have accomplished the following as part of this dissertation:
(a) Developed mathematical models to estimate the upper and lower limits of CPU and memory availability for single- and multi-core architectures.
(b) Performed an empirical evaluation in a heterogeneous environment to validate the accuracy of the models.
(c) Introduced two task assignment policies that are capable of dispatching tasks to distributed compute nodes intelligently by utilizing composite prediction and CPU usage models.
(d) Proposed a technique and introduced models to identify combinations of parameters for efficient usage of GPU devices to obtain optimal performance.
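The way predicted availability can drive task assignment without querying nodes can be sketched as follows (a simplified illustration only; the dissertation's actual composite prediction and CPU usage models are not reproduced, and the data layout and function name are this sketch's assumptions):

```python
def pick_node(nodes, job_cpu, job_peak_mem):
    """nodes: list of (name, predicted_free_cpu_fraction, predicted_free_mem_mb),
    as estimated by a prediction model rather than by querying each node.
    Among nodes whose predicted memory headroom fits the job's peak
    requirement, choose the one with the most CPU left after assignment."""
    feasible = [(name, cpu - job_cpu)
                for name, cpu, mem in nodes
                if mem >= job_peak_mem and cpu >= job_cpu]
    if not feasible:
        return None  # no node can host the job; defer or queue it
    return max(feasible, key=lambda t: t[1])[0]

cluster = [("node-a", 0.5, 4096), ("node-b", 0.9, 1024), ("node-c", 0.8, 8192)]
chosen = pick_node(cluster, job_cpu=0.3, job_peak_mem=2048)
```

Here "node-b" is rejected despite its idle CPU because its predicted free memory cannot hold the job's peak requirement, illustrating why the models estimate both CPU and memory availability.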
An Architecture Evaluation and Implementation of a Soft GPGPU for FPGAs
Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now come with a simple processor and an FPGA, but not with more energy-hungry components such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture which allows GPGPU code to be executed on an FPGA without recompiling the FPGA design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU code. It supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU. Our architecture is customizable, providing the FPGA designer with a selection of GPGPU cores that exhibit performance versus area trade-offs.
This dissertation describes the FlexGrip architecture in detail and showcases the benefits by evaluating the design for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 23x, on average, versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip. We also show FlexGrip can achieve an 80% average reduction of dynamic energy versus the MicroBlaze microprocessor.
The dissertation furthers the discussion by exploring application-customized versions of the soft GPGPU, thus exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimize away features which are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a 2-GPGPU design, we show speedups of 44x on average versus a MicroBlaze microprocessor. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%.
To complete this thesis, we augmented a cycle-accurate GPGPU simulator to emulate FlexGrip and evaluated different levels of the cache design space. We show performance increases for select benchmarks; however, we also show that 64% and 45% of benchmarks exhibited performance decreases when the L1D cache was enabled for the 1 SMP and 2 SMP configurations, respectively, and only one benchmark showed performance improvement when the L2 cache was enabled.
A task-based parallelism and vectorized approach to 3D Method of Characteristics (MOC) reactor simulation for high performance computing architectures
In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next-generation HPC architectures.
Keywords: Method of Characteristics; Neutron transport; Reactor simulation; High performance computing
United States. Department of Energy (Contract DE-AC02-06CH11357)
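The "wide vectorizable inner loop" the abstract refers to can be illustrated with the standard MOC segment update, applied to all energy groups of a track segment at once (a NumPy sketch of the textbook attenuation formula under the flat-source approximation; the function name and argument layout are this sketch's own, not the proxy applications'):

```python
import numpy as np

def sweep_segment(psi_in, sigma_t, q, length):
    """Attenuate the angular flux across one track segment for every
    energy group simultaneously (the vectorizable inner loop):
        psi_out = psi_in * exp(-sigma_t * s) + (q / sigma_t) * (1 - exp(-sigma_t * s))
    where s is the segment length, sigma_t the total cross section,
    and q the flat source in the segment. All arguments except length
    are per-group arrays, so the update maps directly onto SIMD lanes."""
    atten = np.exp(-sigma_t * length)
    return psi_in * atten + (q / sigma_t) * (1.0 - atten)

out = sweep_segment(psi_in=np.array([1.0, 1.0]),
                    sigma_t=np.array([1.0, 2.0]),
                    q=np.array([0.0, 0.0]),
                    length=1.0)
```

Because the same exponential attenuation is applied independently per energy group, the loop over groups has no cross-iteration dependencies, which is exactly what makes it wide and vectorizable on CPU SIMD units, GPUs, and Xeon Phi alike.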
Using precision reduction to efficiently improve mixed-precision GPUs reliability
Duplication With Comparison (DWC) is a traditional and accepted method for improving systems' reliability. DWC consists of duplicating critical regions at the software or hardware level, creating redundant operations in order to decrease the probability of an unwanted event. However, this technique introduces an expensive overhead in power consumption, processing time, and resource allocation, because the critical operations are computed at least twice. Reduced Precision Duplication With Comparison (RP-DWC) is an effective software-level solution that improves on conventional DWC. RP-DWC aims to mitigate these overheads by enabling parallel processing in underused Floating Point Units (FPUs) in mixed-precision Graphics Processing Units (GPUs). By using precision reduction to efficiently improve the reliability of mixed-precision GPUs, RP-DWC extends the DWC technique, introducing proper ways to handle redundancy with operations of different precisions. Improving GPU reliability is an extremely valuable challenge in the fault tolerance field, since GPUs are adopted both in High-Performance Computing (HPC) and in automotive real-time applications. When GPUs operate in a natural environment, such as the surface of the Earth at sea level, they are also exposed to the Earth's surface radiation. This exposure can be critical, given that radiation particles may strike the GPU's internal circuitry, corrupt sensitive data, and consequently generate undesired outputs. Introducing duplication with reduced precision in a trustworthy manner, so as to maintain reliability in safety-critical systems, is an arduous task that we propose to investigate further in this work.
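The core idea of duplicating in reduced precision and comparing with a tolerance can be sketched as follows. This NumPy host-side sketch only illustrates the compare step; RP-DWC's actual contribution, running the reduced-precision copy concurrently on the GPU's otherwise idle lower-precision FPUs, is not reproduced here, and the function name and tolerance are this sketch's own assumptions.

```python
import numpy as np

def rp_dwc_check(a, b, rel_tol=1e-3):
    """Run the critical operation (here, a matrix multiply) in full
    precision and duplicate it in reduced precision; flag a fault when
    the two results disagree beyond the divergence expected from
    precision loss alone. A hard fault (e.g. a radiation-induced bit
    flip) perturbs a result far more than float32 rounding does."""
    full = a.astype(np.float64) @ b.astype(np.float64)
    reduced = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float64)
    fault = bool((~np.isclose(full, reduced, rtol=rel_tol)).any())
    return full, fault

rng = np.random.default_rng(0)
a, b = rng.random((8, 8)), rng.random((8, 8))
result, fault = rp_dwc_check(a, b)
```

The tolerance is the delicate part the abstract alludes to: set it too tight and normal precision loss triggers false alarms; too loose and small corruptions slip through, which is why handling redundancy across precisions "in a trustworthy manner" is nontrivial.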