180 research outputs found
CampProf: A Visual Performance Analysis Tool for Memory Bound GPU Kernels
Current GPU tools and performance models provide some common architectural insights that guide programmers in writing optimal code. We challenge these performance models by modeling and analyzing a lesser-known but very severe performance pitfall, called 'Partition Camping', in NVIDIA GPUs. Partition camping is caused by memory accesses that are skewed towards a subset of the available memory partitions, which may degrade the performance of memory-bound CUDA kernels by up to a factor of seven. No existing tool can detect the partition camping effect in CUDA kernels.
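To make the skew described above concrete, the following is a minimal sketch, not CampProf's method. It assumes a simple round-robin interleaving of addresses across partitions; the constants `NUM_PARTITIONS` and `PARTITION_WIDTH` and all function names are illustrative assumptions that do not come from the abstract.

```python
from collections import Counter

# Assumed, illustrative hardware parameters (not taken from the abstract).
NUM_PARTITIONS = 8      # number of memory partitions
PARTITION_WIDTH = 256   # bytes mapped to one partition before moving to the next


def partition_of(address: int) -> int:
    """Map a byte address to a partition under round-robin interleaving."""
    return (address // PARTITION_WIDTH) % NUM_PARTITIONS


def partition_histogram(addresses):
    """Count how many of the given accesses land on each partition."""
    counts = Counter(partition_of(a) for a in addresses)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def camping_degree(addresses) -> float:
    """Busiest partition's load divided by the ideal even load (1.0 = balanced)."""
    hist = partition_histogram(addresses)
    return max(hist) / (sum(hist) / NUM_PARTITIONS)


# 64 concurrent accesses, one per consecutive 256-byte segment: evenly spread.
balanced = [i * PARTITION_WIDTH for i in range(64)]
# 64 concurrent accesses strided by a full sweep of all partitions: all hit partition 0.
camped = [i * PARTITION_WIDTH * NUM_PARTITIONS for i in range(64)]
```

Under these assumptions, `camping_degree(balanced)` is 1.0 and `camping_degree(camped)` is 8.0, meaning a single partition serves every request while the others sit idle.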
We complement the existing tools by developing 'CampProf', a spreadsheet-based visual analysis tool that detects the degree to which any memory-bound kernel suffers from partition camping. In addition, CampProf predicts the kernel's performance at all execution configurations if its performance parameters are known at any one of them. To demonstrate the utility of CampProf, we analyze three different applications using our tool and show how it can be used to discover partition camping. We also demonstrate how CampProf can be used to monitor the performance improvements in the kernels as the partition camping effect is removed.
The performance model that drives CampProf was developed by applying multiple linear regression techniques over a set of specific micro-benchmarks that simulated the partition camping behavior. Our results show that the geometric mean of errors in our prediction model is within 12% of the actual execution times. In summary, CampProf is a new, accurate, and easy-to-use tool that can be used in conjunction with the existing tools to analyze and improve the overall performance of memory-bound CUDA kernels.
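The accuracy figure above is stated as a geometric mean of prediction errors. As a minimal sketch of how such a figure can be computed, assuming the errors are relative errors between predicted and measured execution times (the abstract does not specify the exact definition), with hypothetical function names:

```python
import math


def relative_errors(predicted, actual):
    """Per-sample relative error |predicted - actual| / actual."""
    return [abs(p - a) / a for p, a in zip(predicted, actual)]


def geometric_mean(values):
    """Geometric mean computed via logarithms; every value must be positive."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical execution times in milliseconds, for illustration only.
measured = [100.0, 200.0, 400.0]
predicted = [104.0, 182.0, 424.0]
summary_error = geometric_mean(relative_errors(predicted, measured))
```

The geometric mean damps the influence of a single large outlier compared with an arithmetic mean, which is one common reason to prefer it when errors span different magnitudes.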
Improving the programmability of heterogeneous systems by means of libraries
Official Doctoral Programme in Information Technology Research. 524V01
[Resumo]
The use of heterogeneous devices as co-processors in high-performance
computing (HPC) environments has grown steadily in recent years thanks to
their excellent properties in terms of performance and energy consumption.
The greater availability of hybrid HPC systems naturally brought with it
the need to develop suitable programming tools for them, with CUDA and
OpenCL being the most widely used today. Unfortunately, these tools are
relatively low level, which, coupled with the larger number of details that
must be controlled when programming accelerators, makes programming these
systems with them much more complex than traditional CPU programming. This
has led to proposals for higher-level alternatives that ease the programming
of heterogeneous devices. This thesis contributes to this field by presenting
two libraries that greatly improve the programmability of heterogeneous
systems in C++, allowing users to focus on what needs to be done rather than
on low-level tasks. Our proposals, the Heterogeneous Programming Library
(HPL) and the Heterogeneous Hierarchically Tiled Arrays (H2TA) library, are
designed for nodes with one or more accelerators and for heterogeneous
clusters, respectively. Both libraries have proven capable of increasing
user productivity by improving the programmability of their codes while
achieving performance similar to that of lower-level solutions.
[Abstract]
The usage of heterogeneous devices as co-processors in high-performance
computing (HPC) environments has steadily grown in recent years due to their
excellent properties in terms of performance and energy consumption. The
larger availability of hybrid HPC systems naturally led to the need to
develop suitable programming tools for them, with CUDA and OpenCL being the
most widely used nowadays. Unfortunately, these tools are relatively low
level, which, coupled with the large number of details that must be managed
when programming accelerators, makes the programming of these systems using
them much more complex than that of traditional CPUs. This has led to the
proposal of higher-level alternatives that facilitate the programming of
heterogeneous devices. This thesis contributes to this field by presenting
two libraries that largely improve the programmability of heterogeneous
systems in C++, helping users to focus on what to do rather than on low-level
tasks. These two libraries, the Heterogeneous Programming Library (HPL) and
the Heterogeneous Hierarchically Tiled Arrays (H2TA), are well suited to
nodes with one or more accelerators and to heterogeneous clusters,
respectively. Both libraries have proven able to increase the productivity
of users by improving the programmability of their codes while achieving
performance similar to that of lower-level solutions.
[Resumen]
The use of heterogeneous devices as co-processors in high-performance
computing (HPC) environments has grown steadily over recent years thanks to
their excellent properties in terms of performance and energy consumption.
The greater availability of hybrid HPC systems naturally brought with it the
need to develop suitable programming tools for them, with CUDA and OpenCL
being the most widely used today. Unfortunately, these tools are relatively
low level, which, coupled with the larger number of details that must be
controlled when programming accelerators, makes programming these systems
with them much more complex than traditional CPU programming. This has led
to proposals for higher-level alternatives that ease the programming of
heterogeneous devices. This thesis contributes to this field by presenting
two libraries that greatly improve the programmability of heterogeneous
systems in C++, allowing users to focus on what needs to be done rather than
on low-level tasks. Our proposals, the Heterogeneous Programming Library
(HPL) and the Heterogeneous Hierarchically Tiled Arrays (H2TA) library, are
designed for nodes with one or more accelerators and for heterogeneous
clusters, respectively. Both libraries have proven capable of increasing
user productivity by improving the programmability of their codes while
achieving performance similar to that of lower-level solutions.
GPGPU Reliability Analysis: From Applications to Large Scale Systems
Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Since applications running on HPC systems are usually long-running, any error or failure could result in significant loss in scientific productivity and system resources. Even worse, since HPC systems face severe resilience challenges as they progress towards exascale computing, it is imperative to develop a better understanding of the reliability of GPUs. This dissertation fills this gap by providing an understanding of the effects of soft errors on the entire system and on specific applications. To understand system-level reliability, a large-scale study on GPU soft errors in the field is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes so as to proactively and dynamically turn the costly error protection mechanisms on or off based on prediction results. To understand the effects of soft errors at the application level, an effective fault-injection framework is designed to characterize the reliability and resilience of GPGPU applications. This framework reduces the tremendous number of fault-injection locations to a manageable size while still preserving remarkable accuracy. It is validated with both single-bit and multi-bit fault models for various GPGPU benchmarks. Lastly, taking advantage of the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at kernel, CTA, and warp levels.
In addition, given that some corrupted application outputs due to soft errors may be acceptable, we present a use case to show how to enable low-overhead yet reliable GPU computing for GPGPU applications.
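The single-bit and multi-bit fault models mentioned above can be illustrated in a few lines. This is a minimal sketch of the general idea of flipping bits in a stored 32-bit floating-point value, not the dissertation's framework; the function name and the choice of float32 are assumptions made for illustration.

```python
import struct


def flip_bits_float32(value: float, bits) -> float:
    """Return value with the given bit positions (0 = LSB, 31 = sign) flipped
    in its IEEE 754 single-precision encoding."""
    mask = 0
    for bit in bits:
        if not 0 <= bit < 32:
            raise ValueError("bit index must be in [0, 31]")
        mask ^= 1 << bit
    # Reinterpret the float as a 32-bit unsigned integer, flip, and convert back.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw ^ mask))
    return corrupted


sign_flip = flip_bits_float32(1.0, [31])         # single-bit fault in the sign
exponent_flip = flip_bits_float32(1.0, [23])     # single-bit fault in the exponent
double_flip = flip_bits_float32(1.0, [31, 23])   # multi-bit fault
```

The example shows why the fault location matters: a flip in the sign or exponent changes the value drastically, whereas a flip in a low mantissa bit may leave the output within an acceptable tolerance.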
Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework
The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs.
This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable a portable data acquisition and a subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specification.
The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization.
The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.
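The notion of a wait state caused by load imbalance, as used in this abstract, can be sketched for the simplest case of a single barrier. This is an illustrative simplification and not the thesis's trace-based analysis; the function name and the stream labels are hypothetical.

```python
def barrier_wait_states(arrival_times: dict):
    """Given the time at which each execution stream reaches a barrier, return
    the stream that arrives last (it determines the critical path through the
    barrier) and the waiting time of every stream."""
    last_stream = max(arrival_times, key=arrival_times.get)
    release_time = arrival_times[last_stream]
    waits = {s: release_time - t for s, t in arrival_times.items()}
    return last_stream, waits


# Hypothetical arrival times in seconds for one host thread and two GPU streams.
critical, waits = barrier_wait_states({"host": 4.0, "gpu0": 9.0, "gpu1": 7.0})
total_wait = sum(waits.values())
```

In this toy case the last-arriving stream is charged with the waiting time of the others, which mirrors the idea of ranking program regions by the waiting time they cause.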
AN ARCHITECTURE EVALUATION AND IMPLEMENTATION OF A SOFT GPGPU FOR FPGAs
Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now come with a simple processor and an FPGA, but not more energy-hungry components, such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU execution without FPGA design recompilation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs.
This dissertation describes the FlexGrip architecture in detail and showcases the benefits by evaluating the design for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 23x, on average, versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip. We also show FlexGrip can achieve an 80% average reduction of dynamic energy versus the MicroBlaze microprocessor.
The dissertation furthers the discussion by exploring application-customized versions of the soft GPGPU, thus exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimize away features which are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a 2 GPGPU design, we show speedups of 44x on average versus a MicroBlaze microprocessor. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%.
To complete this thesis, we augmented a cycle-accurate GPGPU simulator to emulate FlexGrip and evaluate different levels of cache design spaces. We show performance increases for select benchmarks; however, we also show that 64% and 45% of benchmarks exhibited performance decreases when the L1D cache was enabled for the 1 SMP and 2 SMP configurations, respectively, and only one benchmark showed performance improvement when the L2 cache was enabled.
Heterogeneous computing with an algorithmic skeleton framework
The Graphics Processing Unit (GPU) is present in almost every modern-day personal
computer. Despite their special-purpose design, GPUs have been increasingly used for general
computations with very good results. Hence, there is a growing effort from the community
to seamlessly integrate this kind of device in everyday computing. However, to
fully exploit the potential of a system comprising GPUs and CPUs, these devices should
be presented to the programmer as a single platform.
The efficient combination of the power of CPU and GPU devices is highly dependent
on each device’s characteristics, resulting in platform specific applications that cannot
be ported to different systems. Also, the most efficient work balance among devices is
highly dependent on the computations to be performed and the respective data sizes.
In this work, we propose a solution for heterogeneous environments based on the
abstraction level provided by algorithmic skeletons. Our goal is to take full advantage of
the power of all CPU and GPU devices present in a system, without the need for different
kernel implementations or explicit work distribution. To that end, we extended Marrow,
an algorithmic skeleton framework for multi-GPUs, to support CPU computations and
efficiently balance the workload between devices. Our approach is based on an offline
training execution that identifies the ideal work balance and platform configurations for
a given application and input data size.
The evaluation of this work shows that the combination of CPU and GPU devices can
significantly boost the performance of our benchmarks in the tested environments, when
compared to GPU-only executions.
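One simple way to turn measurements from a training run into a work balance is to split the input in proportion to each device's measured throughput. The sketch below illustrates that generic heuristic only; it is not Marrow's algorithm, and the function name, device labels, and throughput figures are hypothetical.

```python
def split_work(total_items: int, throughputs: dict) -> dict:
    """Split total_items among devices in proportion to measured throughput
    (items per second). Any rounding remainder goes to the fastest device."""
    total_rate = sum(throughputs.values())
    if total_rate <= 0:
        raise ValueError("at least one device must have positive throughput")
    shares = {
        dev: int(total_items * rate / total_rate)
        for dev, rate in throughputs.items()
    }
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += total_items - sum(shares.values())
    return shares


# Hypothetical throughputs measured during an offline training execution.
plan = split_work(1000, {"cpu": 1.0, "gpu": 3.0})
```

With these figures the GPU receives three quarters of the items. A real system would also need to account for transfer costs and for how throughput varies with input size, which is why the abstract ties the balance to a given application and data size.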
A Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools and Challenges for the Community
In recent years, deep learning (DL), a re-branding of neural networks (NNs),
has risen to the top in numerous areas, namely computer vision (CV), speech
recognition, natural language processing, etc. Whereas remote sensing (RS)
possesses a number of unique challenges, primarily related to sensors and
applications, inevitably RS draws from many of the same theories as CV; e.g.,
statistics, fusion, and machine learning, to name a few. This means that the RS
community should be aware of, if not at the leading edge of, advancements
like DL. Herein, we provide the most comprehensive survey of state-of-the-art
RS DL research. We also review recent new developments in the DL field that can
be used in DL for RS. Namely, we focus on theories, tools and challenges for
the RS community. Specifically, we focus on unsolved challenges and
opportunities as they relate to (i) inadequate data sets, (ii)
human-understandable solutions for modelling physical phenomena, (iii) Big
Data, (iv) non-traditional heterogeneous data sources, (v) DL architectures and
learning algorithms for spectral, spatial and temporal data, (vi) transfer
learning, (vii) an improved theoretical understanding of DL systems, (viii)
high barriers to entry, and (ix) training and optimizing the DL.
Comment: 64 pages, 411 references. To appear in Journal of Applied Remote
Sensing
Matching non-uniformity for program optimizations on heterogeneous many-core systems
As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Besides traditional non-uniform memory accesses, much deeper non-uniformity shows in a processor, runtime, and application, exemplified by asymmetric cache sharing, memory coalescing, and thread divergence on multicore and many-core processors. Being oblivious to this non-uniformity, current applications fail to tap into the full potential of modern computing devices.
My research presents a systematic exploration into this emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges for establishing a non-uniformity-aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement, and controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise of these techniques in maximizing computing throughput, especially for programs with complex data access patterns.
Computational methods and software for the design of inertial microfluidic flow sculpting devices
The ability to sculpt inertially flowing fluid via bluff body obstacles has enormous promise for applications in bioengineering, chemistry, and manufacturing within microfluidic devices. However, the computational difficulty inherent to full-scale 3-dimensional fluid flow simulations makes designing and optimizing such systems tedious, costly, and generally tasked to computational experts with access to high-performance resources. The goal of this work is to construct efficient models for the design of inertial microfluidic flow sculpting devices, and to implement these models in freely available, user-friendly software for the broader microfluidics community. Two software packages were developed to accomplish this: uFlow and FlowSculpt. uFlow solves the forward problem in flow sculpting, that of predicting the net deformation from an arbitrary sequence of obstacles (pillars), and includes estimations of transverse mass diffusion and particles formed by optical lithography. FlowSculpt solves the more difficult inverse problem in flow sculpting, which is to design a flow sculpting device that produces a target flow shape. Each piece of software uses efficient, experimentally validated forward models developed within this work, which are applied to deep learning techniques to explore other routes to solving the inverse problem. The models are also highly modular, capable of incorporating new microfluidic components and flow physics into the design process. It is anticipated that the microfluidics community will integrate the tools developed here into their own research, and bring new designs, components, and applications to the inertial flow sculpting platform.