155 research outputs found
A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
We present a portable platform, called PIC_ENGINE, for accelerating
Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as
Graphics Processing Units (GPUs). The aim of this development is to enable efficient
simulations on future exascale systems by allowing different parallelization
strategies depending on the application problem and the specific architecture.
To this end, the platform contains the basic steps of the PIC algorithm and
has been designed as a test bed for different algorithmic options and data
structures. Among the architectures that this engine can explore, particular
attention is given here to systems equipped with GPUs. The study demonstrates
that our portable PIC implementation based on the OpenACC programming model can
achieve performance closely matching theoretical predictions. Using the Cray
XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we
show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the
same code on an Intel Sandy Bridge 8-core CPU by a factor of 3.4.
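The basic PIC cycle that the platform encapsulates (gather fields to particles, push particles, scatter charge back to the grid) can be sketched as follows. This is an illustrative 1D electrostatic sketch with nearest-grid-point weighting, not the PIC_ENGINE code; all names are hypothetical, and the Poisson solve for the field is assumed to happen elsewhere.

```python
# Minimal 1D electrostatic PIC step (illustrative sketch, not PIC_ENGINE).
# Nearest-grid-point weighting on a periodic domain of length L with nx
# cells of width dx; the field solve producing E is assumed done elsewhere.

def pic_step(x, v, E, dt, dx, qm, L):
    """Advance particle positions x and velocities v by one time step."""
    nx = len(E)
    for i in range(len(x)):
        # 1) Gather: interpolate the grid field to each particle (NGP).
        cell = int(x[i] / dx) % nx
        # 2) Push: update velocity, then position; wrap periodically.
        v[i] += qm * E[cell] * dt
        x[i] = (x[i] + v[i] * dt) % L
    # 3) Scatter: deposit particle charge back onto the grid (NGP).
    rho = [0.0] * nx
    for xi in x:
        rho[int(xi / dx) % nx] += 1.0 / dx
    return rho
```

With a zero field, particles simply drift and the deposited charge follows them; the GPU question the abstract addresses is how to parallelize these gather/scatter loops without write conflicts in the deposition step.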
Parallel prefix operations on heterogeneous platforms
Official Doctoral Programme in Research in Information Technologies. 524V01
[Abstract]
Graphics Processing Units (GPUs) have shown remarkable advantages in computing
performance and energy efficiency, representing one of the most promising
trends for the near future of high performance computing. However, these devices
also bring some programming complexities, and many efforts are required to provide
portability between different generations. Additionally, parallel prefix algorithms
are a set of regular and highly used parallel algorithms, whose efficiency is crucial
in many computer science applications. Although GPUs can accelerate the computation
of such algorithms, they can also be a limitation when they do not map
correctly to the GPU architecture or do not exploit the GPU parallelism properly.
This dissertation presents two different perspectives. On the one hand, new
parallel prefix algorithms have been designed for any parallel programming
paradigm. On the other hand, a general GPU tuning methodology is
proposed to provide an easy and portable mechanism to efficiently implement parallel
prefix algorithms on any CUDA GPU architecture, rather than focusing on a
particular algorithm or a GPU model. To accomplish this goal, the methodology
identifies the GPU parameters which influence the performance and, following a
set of performance premises, obtains the convenient values of these parameters depending
on the algorithm, the problem size and the GPU architecture. Additionally,
the provided GPU functions are composed of modular and reusable CUDA blocks
of code, which allow the easy implementation of any parallel prefix algorithm. Depending
on the size of the dataset, three different approaches are proposed. The first
two approaches solve small and medium-large datasets on a single GPU, whereas the
third approach deals with extremely large datasets in a multi-GPU environment.
Our proposals provide very competitive performance, outperforming the state-of-the-art for many parallel prefix operations, such as the scan primitive, sorting and solving tridiagonal systems.
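The scan primitive named above is the core parallel prefix operation. A minimal sketch of the work-efficient (Blelloch-style) exclusive scan, written serially here to show the up-sweep/down-sweep tree structure that CUDA implementations parallelize with one thread per pair, might look like this (a generic sketch, not the thesis code; input length is assumed to be a power of two):

```python
# Blelloch-style exclusive scan (serial sketch of the two-phase algorithm
# that GPU implementations run in parallel). Simplified: len(a) must be a
# power of two.

def exclusive_scan(a, op=lambda x, y: x + y, identity=0):
    a = list(a)
    n = len(a)
    # Up-sweep (reduce): build partial sums up a binary tree in place.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] = op(a[i - d], a[i])
        d *= 2
    # Down-sweep: replace the root with the identity, then push
    # prefixes back down the tree.
    a[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t = a[i - d]
            a[i - d] = a[i]
            a[i] = op(t, a[i])
        d //= 2
    return a
```

For example, `exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])` yields `[0, 3, 4, 11, 11, 15, 16, 22]`. Each inner loop touches independent pairs, which is what makes the scheme map well to GPU threads; the tuning parameters the methodology optimizes (block size, elements per thread, etc.) govern how these loops are tiled onto the hardware.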
High Performance Computing via High Level Synthesis
As more and more powerful integrated circuits are appearing on the market, more and more applications, with very different requirements and workloads, are making use of the available computing power. This thesis is in particular devoted to High Performance Computing applications, where those trends are carried to the extreme. In this domain, the primary aspects to be taken into consideration are (1) performance (by definition) and (2) energy consumption (since operational costs dominate over procurement costs).
These requirements can be satisfied more easily by deploying heterogeneous platforms, which include CPUs, GPUs and FPGAs to provide a broad range of performance and energy-per-operation choices. In particular, as we will see, FPGAs clearly dominate both CPUs and GPUs in terms of energy, and can provide comparable performance.
An important aspect of this trend is of course design technology, because these applications were traditionally programmed in high-level languages, while FPGAs required low-level RTL design. OpenCL (Open Computing Language), developed by the Khronos Group, enables developers to program CPUs, GPUs and, more recently, FPGAs using functionally portable (but sadly not performance-portable) source code, which creates new possibilities and challenges both for research and industry.
FPGAs have always been used for mid-size designs and ASIC prototyping thanks to their energy-efficient and flexible hardware architecture, but their usage requires hardware design knowledge and laborious design cycles. Several approaches have been developed and deployed to address this issue and narrow the gap between software and hardware in the FPGA design flow, in order to enable FPGAs to capture a larger portion of the hardware acceleration market in data centers. Moreover, FPGA usage in data centers is already growing, regardless of and in addition to their use as computational accelerators, because they can be used as high-performance, low-power and secure switches inside data centers.
High-Level Synthesis (HLS) is the methodology that enables designers to map their applications onto FPGAs (and ASICs). It synthesizes parallel hardware from a model originally written in C-based programming languages, e.g. C/C++, SystemC and OpenCL. Design space exploration of the variety of implementations that can be obtained from this C model is possible through a wide range of optimization techniques and directives, e.g. to pipeline loops and partition memories into multiple banks, which guide RTL generation toward application-dependent hardware and let designers benefit from the flexible parallel architecture of FPGAs.
Model-Based Design (MBD) is a high-level, visual process used to generate implementations that solve mathematical problems through a varied set of IP blocks. MBD enables developers with different expertise, e.g. control theory, embedded software development and hardware design, to share a common design framework and contribute to a shared design using the same tool. Simulink, developed by MathWorks, is a model-based design tool for the simulation and development of complex dynamical systems. Moreover, Simulink's embedded code generators can produce verified C/C++ and HDL code from the graphical model. This code can be used to program microcontrollers and FPGAs. This PhD thesis presents a study using Simulink's automatic code generators to target Xilinx FPGAs with both HDL and C/C++ code, in order to demonstrate the capabilities and challenges of the high-level synthesis process. To do so, first, the digital signal processing unit of a real-time radar application is developed using Simulink blocks. Second, the generated C-based model is used in the high-level synthesis process and, finally, the implementation cost of HLS is compared to traditional HDL synthesis using the Xilinx tool chain.
As an alternative to the model-based design approach, this work also presents an analysis of FPGA programming via high-level synthesis techniques for computationally intensive algorithms, and demonstrates the importance of HLS by comparing the performance-per-watt of GPUs (NVIDIA) and FPGAs (Xilinx) manufactured in the same node running standard OpenCL benchmarks. We conclude that the generation of high-quality RTL from an OpenCL model requires a stronger hardware background than the MBD approach; however, the availability of fast and broad design space exploration and the portability of OpenCL code, e.g. to CPUs and GPUs, motivates FPGA industry leaders to provide users with an OpenCL software development environment that promises FPGA programming in a CPU/GPU-like fashion.
Our experiments, through extensive design space exploration (DSE), suggest that FPGAs have higher performance-per-watt than two high-end GPUs manufactured in the same technology (28 nm). Moreover, FPGAs with more available resources and a more modern process (20 nm) can outperform the tested GPUs while consuming much less power, at the cost of more expensive devices.
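Design space exploration over HLS directives, as used above, amounts to searching a configuration space for the fastest design that still fits the device's resources. A toy sketch of that search over a single unroll factor, with an entirely hypothetical cost model (real DSE uses vendor latency/area estimates):

```python
# Toy design space exploration over an HLS-style loop unroll factor:
# pick the configuration with the lowest estimated latency whose
# estimated area fits the device. The cost model is purely illustrative.

def explore(trip_count, dsp_per_copy, dsp_budget, factors=(1, 2, 4, 8, 16)):
    best = None
    for f in factors:
        area = f * dsp_per_copy          # f parallel copies of the loop body
        if area > dsp_budget:
            continue                     # violates the resource constraint
        latency = -(-trip_count // f)    # ceil(trip_count / f) iterations
        if best is None or latency < best[1]:
            best = (f, latency, area)
    return best  # (unroll factor, est. latency, est. DSPs used)
```

The trade-off the abstract measures in performance-per-watt shows up here in miniature: a larger DSP budget admits a higher unroll factor, and hence lower latency, at the cost of a bigger (more expensive) device.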
GPU optimization of bounding volume hierarchies for ray tracing
Advisor: Hélio Pedrini. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Ray tracing methods are well known for producing very realistic images at the expense of a high computational effort. Most of the cost associated with those methods comes from finding the intersection between the massive number of rays that need to be traced and the scene geometry. Special data structures were proposed to speed up those calculations by indexing and organizing the geometry so that only a subset of it has to be effectively checked for intersections. One such construct is the Bounding Volume Hierarchy (BVH), a tree-like structure used to group 3D objects hierarchically. Recently, a significant amount of effort has been put into accelerating the construction of those structures and increasing their quality. We present a new method for building high-quality BVHs on manycore systems. Our method is an extension of the current state of the art in GPU BVH construction, Treelet Restructuring Bounding Volume Hierarchy (TRBVH), and consists of optimizing an already existing tree by rearranging subsets of its nodes using an agglomerative clustering approach. We implemented our solution for the NVIDIA Kepler architecture using CUDA and tested it on sixteen distinct scenes that are commonly used to evaluate the performance of acceleration structures. We show that our implementation is capable of producing trees whose quality is equivalent to the ones generated by TRBVH for those scenes, while being about 30% faster to do so.
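The agglomerative clustering idea used when restructuring a treelet can be illustrated with a small bottom-up sketch: repeatedly merge the pair of bounding boxes whose union has the smallest surface area, the greedy surface-area criterion behind such builders. This is an O(n^2) illustrative sketch with hypothetical names, not the TRBVH or thesis implementation.

```python
# Greedy agglomerative clustering of axis-aligned bounding boxes (AABBs):
# repeatedly merge the pair whose union has the smallest surface area.
# Boxes are ((xmin, ymin, zmin), (xmax, ymax, zmax)) tuples.

def union(a, b):
    (alo, ahi), (blo, bhi) = a, b
    lo = tuple(min(x, y) for x, y in zip(alo, blo))
    hi = tuple(max(x, y) for x, y in zip(ahi, bhi))
    return (lo, hi)

def surface_area(box):
    lo, hi = box
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def build_tree(boxes):
    # Leaves are (box, index); internal nodes are (box, (left, right)).
    nodes = [(b, i) for i, b in enumerate(boxes)]
    while len(nodes) > 1:
        # Find the cheapest pair to merge (quadratic scan, for clarity).
        i, j = min(((i, j) for i in range(len(nodes))
                    for j in range(i + 1, len(nodes))),
                   key=lambda p: surface_area(union(nodes[p[0]][0],
                                                    nodes[p[1]][0])))
        merged = (union(nodes[i][0], nodes[j][0]), (nodes[i], nodes[j]))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]
```

Given two adjacent unit boxes and one far away, the two adjacent ones are merged first, exactly the behavior that keeps ray/box intersection tests cheap; the GPU contribution of the thesis lies in running many such local reorganizations in parallel over treelets of a full BVH.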
Grid and high performance computing applied to bioinformatics
Recent advances in genome sequencing technologies and modern biological data
analysis technologies used in bioinformatics have led to a fast and continuous increase
in biological data. The difficulty of managing the huge amounts of data currently
available to researchers and the need to obtain results within a reasonable time have
led to the use of distributed and parallel computing infrastructures for their analysis.
In this context, Grid computing has been used successfully. Grid computing is based
on a distributed system which interconnects several computers and/or clusters to
access global-scale resources. This infrastructure is flexible, highly scalable and can
achieve high performance with data- and compute-intensive algorithms.
Recently, bioinformatics has been exploring new approaches based on the use of hardware
accelerators, such as Graphics Processing Units (GPUs). Initially developed as
graphics cards, GPUs have recently been introduced for scientific purposes by reason
of their performance per watt and the better cost/performance ratio achieved in
terms of throughput and response time compared to other high-performance computing
solutions.
Although developers must have an in-depth knowledge of GPU programming and
hardware to be effective, GPU accelerators have produced many impressive results.
The use of high-performance computing infrastructures raises the question of finding
a way to parallelize the algorithms while limiting data-dependency issues, in order
to accelerate computations on massively parallel hardware.
In this context, the research activity in this dissertation focused on assessing
and testing the impact of these innovative high-performance computing technologies
on computational biology. In order to achieve high levels of parallelism and,
ultimately, high performance, some of the bioinformatics algorithms
applicable to genome data analysis were selected, analyzed and implemented. These
algorithms have been highly parallelized and optimized, thus maximizing the use of the GPU
hardware resources. The overall results show that the proposed parallel algorithms
are highly performant, thus justifying the use of such technology.
In addition, a software infrastructure for workflow management has been devised to
provide support for CPU and GPU computation on a distributed GPU-based
infrastructure. Moreover, this software infrastructure allows a further coarse-grained
data-parallel parallelization across multiple GPUs. Results show that the proposed
application's speed-up increases with the number of GPUs.
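The coarse-grained data-parallel scheme described above, splitting the input across devices and merging partial results, can be sketched as follows. Threads stand in for GPU workers here, and all names are hypothetical rather than taken from the thesis software.

```python
# Coarse-grained data parallelism: split a dataset into one chunk per
# device, process the chunks concurrently, then concatenate the partial
# results in order. Threads simulate per-GPU workers in this sketch.
from concurrent.futures import ThreadPoolExecutor

def split(data, n_devices):
    """Split data into n_devices nearly equal contiguous chunks."""
    k, r = divmod(len(data), n_devices)
    chunks, start = [], 0
    for i in range(n_devices):
        end = start + k + (1 if i < r else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def run_on_devices(data, kernel, n_devices):
    """Map `kernel` over per-device chunks and merge the results."""
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        parts = pool.map(kernel, split(data, n_devices))  # order preserved
    return [y for part in parts for y in part]
```

Because `kernel` runs independently on each chunk, the wall-clock time shrinks roughly with the number of devices as long as the per-chunk work dominates the split/merge overhead, which matches the speed-up behavior reported in the abstract.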