531 research outputs found
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie
Hyperswitch communication network
The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed
The PARSE Programming Paradigm. Part I: Software Development Methodology. Part II: Software Development Support Tools
The programming methodology of PARSE (parallel software environment), a software environment being developed for reconfigurable non-shared memory parallel computers, is described. This environment will consist of an integrated collection of language interfaces, automatic and semi-automatic debugging and analysis tools, and operating system —all of which are made more flexible by the use of a knowledge-based implementation for the tools that make up PARSE. The programming paradigm supports the user freely choosing among three basic approaches /abstractions for programming a parallel machine: logic-based descriptive, sequential-control procedural, and parallel-control procedural programming. All of these result in efficient parallel execution. The current work discusses the methodology underlying PARSE, whereas the companion paper, “The PARSE Programming Paradigm — II: Software Development Support Tools,” details each of the component tools
A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS
Accelerator-based -or heterogeneous- computing has become increasingly
important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes
custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while
ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited
power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can
play here a key role, as they enable unprecedented levels of power-efficiency
compared to CPUs/GPUs. However, such paradigms are still immature and
deeper exploration is indispensable.
This dissertation investigates customizability and proposes novel solutions
for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent
scratchpad memory with a configurable bank remapping system to reduce
bank conflicts. The experimental results show the benefits of both using a
customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed
synchronization master better suits many-cores than standard centralized
solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory
transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated
the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based
on the sparse directory approach, with a selective coherence maintenance
system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and
non-coherent architectural mechanism along with an extended coherence
protocol can enhance performance.
The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration
of power-efficient high-performance computing architectures. The system is
based on a NoC and a customizable GPU-like accelerator core, as well as
a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological
results as part of the contribution in this dissertation. In fact, as a key
benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms
do not always support a comprehensive heterogeneous architecture exploration
Developing a support for FPGAs in the Controller parallel programming model
La computación heterogénea se presenta como la solución para conseguir supercomputadores cada vez
más rápidos capaces de resolver problemas más grandes y complejos en diferentes áreas de conocimiento.
Para ello, integra aceleradores con distintas arquitecturas capaces de explotar las características de los
problemas desde distintos enfoques obteniendo, de este modo, un mayor rendimiento.
Las FPGAs son hardware reconfigurable, i.e., es posible modificarlas después de su fabricación. Esto
permite una gran flexibilidad y una máxima adaptación al problema en cuestión. Además, tienen un
consumo energético muy bajo. Todas estas ventajas tienen el gran inconveniente de una más difícil programaci
ón mediante los propensos a errores HDLs (Hardware Description Language), tales como Verilog o
VHDL, y requisitos de conocimientos avanzados de electrónica digital. En los últimos años los principales
fabricantes de FPGAs han enfocado sus esfuerzos en desarrollar herramientas HLS (High Level Synthesis)
que permiten programarlas a través de lenguajes de programación de alto nivel estilo C. Esto ha favorecido
su adopción por la comunidad HPC y su integración en los nuevos supercomputadores. Sin embargo, el
programador aún tiene que ocuparse de aspectos como la gestión de colas de comandos, parámetros de
lanzamiento o transferencias de datos.
El modelo Controller es una librería que facilita la gestión de la coordinación, comunicación y los
detalles de lanzamiento de los kernels en aceleradores hardware. Explota de forma transparente sus modelos
de programación nativos, en concreto OpenCL y CUDA, y, por tanto, consigue un alto rendimiento
independientemente del compilador. Permite al programador utilizar los distintos recursos hardware
disponibles de forma combinada en entornos heterogéneos.
Este trabajo extiende el modelo Controller mediante el desarrollo de un backend que permite la
integración de FPGAs, manteniendo los cambios sobre la interfaz de usuario al mínimo. A través de los
resultados experimentales se comprueba que se consigue una disminución del esfuerzo de programación
significativa en comparación con la implementación nativa en OpenCL. Del mismo modo, se consigue
un elevado solapamiento entre computación y comunicación y un sobrecoste por el uso de la librería
despreciable.Heterogeneous computing appears to be the solution to achieve ever faster computers capable of solving
bigger and more complex problems in difierent fields of knowledge. To that end, it integrates accelerators
with difierent architectures capable of exploiting the features of problems from difierent perspectives thus
achieving higher performance.
FPGAs are reconfigurable hardware, i.e., it is possible to modify them after manufacture. This allows
great flexibility and maximum adaptability to the given problem. In addition, they have low power
consumption. All these advantages have the great objection of more dificult programming with the errorprone
HDLs (Hardware Description Language), such as Verilog or VHDL, and the requirement of advanced
knowledge of digital electronics. The main FPGA vendors have concentrated on developing HLS (High
Level Synthesis) tools that allow to program them with C-like high level programming languages. This
favoured their adoption by the HPC community and their integration in new supercomputers. However,
the programmer still has to take care of aspects such as management of command queues, launching
parameters or data transfers.
The Controller model is a library to easily manage the coordination, communication and kernel launching
details on hardware accelerators. It transparently exploits their native or vendor specific programming
models, namely OpenCL and CUDA, thus enabling the potential performance obtained by using them in
a compiler agnostic way. It is intended to enable the programmer to make use of the diferent available
hardware resources in combination in heterogeneous environments.
This work extends the Controller model through the development of a backend that allows the integration
of FPGAs, keeping the changes over the user-facing interface to the minimum. The experimental
results validate that a significant decrease in programming effort compared to the native OpenCL implementation
is achieved. Similarly, high overlap of computation and communication and a negligible
overhead due to the use of the library are attained.Grado en Ingeniería Informátic
- …