Movement and placement of non-contiguous data in distributed GPU computing
A steady increase in accelerator performance has driven demand for faster interconnects to avert the memory bandwidth wall. This has resulted in the wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. Data transfer performance on these systems is now impacted by many factors, including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement.
This work finds that empirical communication measurements can be used to automatically schedule and execute intra- and inter-node communication in a modern heterogeneous system, providing "hand-tuned" performance without the need for complex or error-prone communication development at the application level.
Empirical measurements are provided by a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. These benchmarks are the first comprehensive evaluation of all GPU communication primitives. For communication-heavy applications, optimally using communication capabilities is challenging and essential for performance. Two different approaches are examined.
The first is a high-level 3D stencil communication library, which can automatically create a static communication plan based on the stencil and system parameters. This library reduces the iteration time of a state-of-the-art stencil code by 1.45x at 3072 GPUs and 512 nodes.
The second is a more general MPI interposer library, with novel non-contiguous data handling and runtime implementation selection for MPI communication primitives. It brings a portable pure-MPI halo exchange to within a factor of two of the stencil-specific library's performance, supported by a five-order-of-magnitude improvement in MPI communication latency for non-contiguous data.
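The cost of non-contiguous data is easiest to see in the pack step a halo exchange performs before sending: boundary faces of a 3D grid are strided in memory and must be gathered into a contiguous buffer. The sketch below is an illustrative NumPy model of that pack step, not code from the library described above; the function name and layout are assumptions.

```python
import numpy as np

def pack_halo_face(grid, axis, side, width=1):
    """Copy one (possibly strided) halo face of a 3D grid into a
    contiguous send buffer, as a halo exchange does before each send."""
    # Slice out the face: e.g. axis=0, side="low" -> grid[0:width, :, :]
    sl = [slice(None)] * grid.ndim
    sl[axis] = slice(0, width) if side == "low" else slice(grid.shape[axis] - width, None)
    face = grid[tuple(sl)]
    # np.ascontiguousarray performs the strided-to-contiguous copy that
    # an MPI derived datatype or a GPU pack kernel would otherwise do.
    return np.ascontiguousarray(face).ravel()

grid = np.arange(4 * 4 * 4, dtype=np.float64).reshape(4, 4, 4)
buf = pack_halo_face(grid, axis=2, side="low")  # x-faces are the worst case: stride 4
print(buf[:4])
```

The same strided gather is what an MPI derived datatype (e.g. `MPI_Type_vector`) or a GPU packing kernel performs internally, and its efficiency is precisely what non-contiguous data handling in an interposer library would target.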
Parallel prefix operations on heterogeneous platforms
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
[Abstract]
Graphics Processing Units (GPUs) have shown remarkable advantages in computing performance and energy efficiency, representing one of the most promising trends for the near future of high performance computing. However, these devices also bring some programming complexities, and many efforts are required to provide portability between different generations. Additionally, parallel prefix algorithms are a set of regular and highly used parallel algorithms, whose efficiency is crucial in many computer science applications. Although GPUs can accelerate the computation of such algorithms, they can also become a limitation when an implementation does not map correctly to the GPU architecture or does not exploit the GPU parallelism properly.
This dissertation presents two different perspectives. On the one hand, new parallel prefix algorithms have been designed for any parallel programming paradigm. On the other hand, a general GPU tuning methodology is proposed to provide an easy and portable mechanism to efficiently implement parallel prefix algorithms on any CUDA GPU architecture, rather than focusing on a particular algorithm or a particular GPU model. To accomplish this goal, the methodology identifies the GPU parameters which influence performance and, following a set of performance premises, obtains convenient values for these parameters depending on the algorithm, the problem size and the GPU architecture. Additionally, the provided GPU functions are composed of modular and reusable CUDA blocks of code, which allow the easy implementation of any parallel prefix algorithm. Depending on the size of the dataset, three different approaches are proposed. The first two approaches solve small and medium-large datasets on a single GPU, whereas the third approach deals with extremely large datasets in a multi-GPU environment.
Our proposals provide very competitive performance, outperforming the state-of-the-art for many parallel prefix operations, such as the scan primitive, sorting and solving tridiagonal systems.
Evaluating technologies and techniques for transitioning hydrodynamics applications to future generations of supercomputers
Current supercomputer development trends present severe challenges for scientific codebases. Moore's law continues to hold; however, power constraints have brought an end to Dennard scaling, forcing significant increases in overall concurrency. The performance imbalance between the processor and memory sub-systems is also increasing, and architectures are becoming significantly more complex. Scientific computing centres need to harness more computational resources in order to facilitate new scientific insights, and maintaining their codebases requires significant investment. Centres therefore have to decide how best to develop their applications to take advantage of future architectures. To prevent vendor "lock-in" and maximise investments, achieving portable performance across multiple architectures is also a significant concern.
Efficiently scaling applications will be essential for achieving improvements in science, and the MPI-only (Message Passing Interface) model is reaching its scalability limits. Hybrid approaches which utilise shared-memory programming models are a promising route to improved scalability. Additionally, PGAS (Partitioned Global Address Space) models have the potential to address productivity and scalability concerns. Furthermore, OpenCL has been developed with the aim of enabling applications to achieve portable performance across a range of heterogeneous architectures.
This research examines approaches for achieving greater levels of performance for hydrodynamics applications on future supercomputer architectures. The development of a Lagrangian-Eulerian hydrodynamics application is presented together with its utility for conducting such research. Strategies for improving application performance, including PGAS- and hybrid-based approaches are evaluated at large node-counts on several state-of-the-art architectures. Techniques to maximise the performance and scalability of OpenMP-based hybrid implementations are presented together with an assessment of how these constructs should be combined with existing approaches. OpenCL is evaluated as an additional technology for implementing a hybrid programming model and improving performance-portability. To enhance productivity several tools for automatically hybridising applications and improving process-to-topology mappings are evaluated.
Power constraints are starting to limit supercomputer deployments, potentially necessitating the use of more energy-efficient technologies. Advanced processor architectures are therefore evaluated as future candidate technologies, together with several application optimisations which will likely be necessary. An FPGA-based solution is examined, including an analysis of how effectively it can be utilised via a high-level programming model, as an alternative to the specialist approaches which currently limit the applicability of this technology.
Design and Evaluation of Low-Latency Communication Middleware on High Performance Computing Systems
[Abstract]
The use of Java for parallel computing is becoming more promising owing to
its appealing features, particularly its multithreading support, portability, easy-to-learn properties, high programming productivity and the noticeable improvement in its computational performance. However, parallel Java applications generally suffer from inefficient communication middleware, most of which use socket-based protocols that are unable to take full advantage of high-speed networks, hindering the adoption of Java in the High Performance Computing (HPC) area. This PhD Thesis presents the design, development and evaluation of scalable Java communication solutions that overcome these constraints. Hence, we have implemented several low-level message-passing devices that fully exploit the underlying network hardware while taking advantage of Remote Direct Memory Access (RDMA) operations to provide low-latency communications. Moreover, we have developed a production-quality Java message-passing middleware, FastMPJ, in which the devices have been integrated seamlessly, thus allowing the productive development of Message-Passing in Java (MPJ) applications. The performance evaluation has shown that FastMPJ communication primitives are competitive with native message-passing libraries, significantly improving the scalability of MPJ applications. Furthermore, this Thesis has analyzed the potential of cloud computing towards spreading the outreach of HPC, where Infrastructure as a Service (IaaS) offerings have emerged as a feasible alternative to traditional HPC systems. Several cloud resources from the leading IaaS provider, Amazon EC2, which specifically target HPC workloads, have been thoroughly assessed. The experimental results have shown the significant impact that virtualized environments still have on network performance, which hampers porting communication-intensive codes to the cloud. The key is the availability of proper virtualization support, such as direct access to the network hardware, along with the guidelines for performance optimization suggested in this Thesis.
Leveraging performance of 3D finite difference schemes in large scientific computing simulations
Gone are the days when engineers and scientists conducted most of their experiments empirically. For decades, physical tests were carried out in order to assess the robustness and reliability of forthcoming product designs and to validate theoretical models. With the advent of the computational era, scientific computing has definitely become a feasible alternative to empirical methods in terms of effort, cost and reliability. Large, massively parallel computational resources have reduced simulation execution times and improved numerical results thanks to the refinement of the sampled domain. Several numerical methods coexist for solving Partial Differential Equations (PDEs). Methods such as Finite Elements (FE) and Finite Volumes (FV) are especially well suited to problems where unstructured meshes are frequent. Unfortunately, this flexibility is not bestowed for free: these schemes entail higher memory latencies due to the handling of irregular data accesses. Conversely, the Finite Difference (FD) scheme has proven to be an efficient solution for problems where structured meshes suit the domain requirements, and many scientific areas use it for its higher performance.
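As a minimal illustration of why FD schemes map so well to structured meshes (a generic textbook operator, not an excerpt from the thesis): every point reads a fixed, regular neighbourhood, so all accesses are unit-stride.

```python
import numpy as np

def laplacian_1d(u, dx):
    """Second-order central finite difference for u'' on a structured 1D
    grid. Each interior point touches only its immediate neighbours, so
    the memory access pattern is perfectly regular."""
    return (u[:-2] - 2.0 * u[1:-1] + u[2:]) / (dx * dx)

x = np.linspace(0.0, 1.0, 101)
u = x ** 2                      # u'' = 2 everywhere, and the central
                                # difference is exact for quadratics
approx = laplacian_1d(u, dx=x[1] - x[0])
print(np.allclose(approx, 2.0))  # → True
```

Higher-order stencils simply widen the neighbourhood (more coefficients per point), which raises the arithmetic per memory access; that ratio is exactly what the techniques discussed below manipulate.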
This thesis focuses on improving FD schemes to leverage the performance of large scientific computing simulations. Different techniques are proposed, such as the Semi-stencil, a novel algorithm that increases the FLOP/byte ratio for medium- and high-order stencil operators by reducing memory accesses and promoting data reuse. The algorithm is orthogonal and can be combined with techniques such as spatial or temporal blocking, adding further improvement. New trends in Symmetric Multi-Processing (SMP) systems, where tens of cores are replicated on the same die, pose new challenges due to the exacerbation of the memory wall problem. To alleviate this issue, our research focuses on different strategies to reduce pressure on the cache hierarchy, particularly when different threads share resources due to Simultaneous Multi-Threading (SMT). Several domain decomposition schedulers for load balancing are introduced, ensuring quasi-optimal results without jeopardizing overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in the last-level cache.
As an alternative to the brute-force methods used in auto-tuning, where a huge parametric space must be traversed to find a near-optimal candidate, performance models are a feasible solution. Performance models can predict performance on different architectures, selecting near-optimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The proposed model is capable of supporting multi- and many-core architectures, including complex features such as hardware prefetchers, SMT contexts and algorithmic optimizations. Our model can be used not only to forecast execution time, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration based on the execution environment.
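A toy version of the idea, assuming a plain roofline bound rather than the richer stencil model actually devised in the thesis (the peak-compute and bandwidth figures below are made up):

```python
def roofline_time(flops, bytes_moved, peak_gflops, bandwidth_gbs):
    """Generic roofline estimate: execution time is bounded by whichever
    of compute throughput or memory traffic is the bottleneck."""
    compute_s = flops / (peak_gflops * 1e9)
    memory_s = bytes_moved / (bandwidth_gbs * 1e9)
    return max(compute_s, memory_s)

# A hypothetical high-order stencil sweep over a 512^3 grid:
n = 512 ** 3
flops = 2 * 25 * n        # ~25-point stencil, one multiply-add per coefficient
bytes_moved = 2 * 8 * n   # one 8-byte read + one write per point, ideal caching
t = roofline_time(flops, bytes_moved, peak_gflops=1000.0, bandwidth_gbs=100.0)
print(round(t, 4))        # memory-bound on these assumed figures
```

Evaluating such a closed-form bound for every candidate blocking configuration takes microseconds, which is what lets a model replace an exhaustive auto-tuning sweep.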
Some industries rely heavily on FD-based techniques for their codes. Nevertheless, many cumbersome aspects arising in industry are still scarcely considered in academic research. In this regard, we have collaborated in the implementation of an FD framework which covers the most important features that an HPC industrial application must include. Some of the node-level optimization techniques devised in this thesis have been incorporated into the framework to contribute to overall application performance. We show results for two strategic industrial applications: an atmospheric transport model that simulates the dispersal of volcanic ash, and a seismic imaging model used in the Oil & Gas industry to identify hydrocarbon-rich reservoirs.