182 research outputs found
A framework for efficient execution of data parallel irregular applications on heterogeneous systems
Exploiting the computing power of the diversity of resources available on heterogeneous
systems is mandatory but a very challenging task. The diversity of architectures, execution
models and programming tools, together with disjoint address spaces and di erent
computing capabilities, raise a number of challenges that severely impact on application
performance and programming productivity. This problem is further compounded in the
presence of data parallel irregular applications.
This paper presents a framework that addresses development and execution of data
parallel irregular applications in heterogeneous systems. A uni ed task-based programming
and execution model is proposed, together with inter and intra-device scheduling,
which, coupled with a data management system, aim to achieve performance scalability
across multiple devices, while maintaining high programming productivity. Intradevice
scheduling on wide SIMD/SIMT architectures resorts to consumer-producer kernels,
which, by allowing dynamic generation and rescheduling of new work units, enable
balancing irregular workloads and increase resource utilization.
Results show that regular and irregular applications scale well with the number of
devices, while requiring minimal programming e ort. Consumer-producer kernels are
able to sustain signi cant performance gains as long as the workload per basic work
unit is enough to compensate overheads associated with intra-device scheduling. This
not being the case, consumer kernels can still be used for the irregular application.
Comparisons with an alternative framework, StarPU, which targets regular workloads,
consistently demonstrate signi cant speedups. This is, to the best of our knowledge, the
rst published integrated approach that successfully handles irregular workloads over
heterogeneous systems.This work is funded by National Funds through the FCT - Fundação para a Ciência
e a Tecnologia (Portuguese Foundation for Science and Technology) and by ERDF -
European Regional Development Fund through the COMPETE Programme (operational
programme for competitiveness) within projects PEst-OE/EEI/UI0752/2014
and FCOMP-01-0124-FEDER-010067. Also by the School of Engineering, Universidade
do Minho within project P2SHOCS - Performance Portability on Scalable
Heterogeneous Computing Systems
GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP
Full detector simulation was among the largest CPU consumer in all CERN
experiment software stacks for the first two runs of the Large Hadron Collider
(LHC). In the early 2010's, the projections were that simulation demands would
scale linearly with luminosity increase, compensated only partially by an
increase of computing resources. The extension of fast simulation approaches to
more use cases, covering a larger fraction of the simulation budget, is only
part of the solution due to intrinsic precision limitations. The remainder
corresponds to speeding-up the simulation software by several factors, which is
out of reach using simple optimizations on the current code base. In this
context, the GeantV R&D project was launched, aiming to redesign the legacy
particle transport codes in order to make them benefit from fine-grained
parallelism features such as vectorization, but also from increased code and
data locality. This paper presents extensively the results and achievements of
this R&D, as well as the conclusions and lessons learnt from the beta
prototype.Comment: 34 pages, 26 figures, 24 table
Dynamic task scheduling and binding for many-core systems through stream rewriting
This thesis proposes a novel model of computation, called stream rewriting, for the specification and implementation of highly concurrent applications. Basically, the active tasks of an application and their dependencies are encoded as a token stream, which is iteratively modified by a set of rewriting rules at runtime. In order to estimate the performance and scalability of stream rewriting, a large number of experiments have been evaluated on many-core systems and the task management has been implemented in software and hardware.In dieser Dissertation wurde Stream Rewriting als eine neue Methode entwickelt, um Anwendungen mit einer großen Anzahl von dynamischen Tasks zu beschreiben und effizient zur Laufzeit verwalten zu können. Dabei werden die aktiven Tasks in einem Datenstrom verpackt, der zur Laufzeit durch wiederholtes Suchen und Ersetzen umgeschrieben wird. Um die Performance und Skalierbarkeit zu bestimmen, wurde eine Vielzahl von Experimenten mit Many-Core-Systemen durchgeführt und die Verwaltung von Tasks über Stream Rewriting in Software und Hardware implementiert
A metadata-enhanced framework for high performance visual effects
This thesis is devoted to reducing the interactive latency of image processing computations in
visual effects. Film and television graphic artists depend upon low-latency feedback to receive
a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising
compiler which leverages high-level program metadata to guide key computational and
memory hierarchy optimisations. This metadata encodes static and dynamic information about
data dependence and patterns of memory access in the algorithms constituting a visual effect –
features that are typically difficult to extract through program analysis – and presents it to the
compiler in an explicit form. By using domain-specific information as a substitute for program
analysis, our compiler is able to target a set of complex source-level optimisations that a vendor
compiler does not attempt, before passing the optimised source to the vendor compiler for
lower-level optimisation.
Three key metadata-supported optimisations are presented. The first is an adaptation of
space and schedule optimisation – based upon well-known compositions of the loop fusion and
array contraction transformations – to the dynamic working sets and schedules of a runtimeparameterised
visual effect. This adaptation sidesteps the costly solution of runtime code generation
by specialising static parameters in an offline process and exploiting dynamic metadata to
adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second
optimisation comprises a set of transformations to generate SIMD ISA-augmented source code.
Our approach differs from autovectorisation by using static metadata to identify parallelism, in
place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable
parameters for optimal aligned memory access. The third optimisation comprises a related set
of transformations to generate code for SIMT architectures, such as GPUs. Static dependence
metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads.
Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying
inter-thread and intra-core data sharing opportunities in memory access metadata.
A detailed performance analysis of these optimisations is presented for two industrially developed
visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD
multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations
of these two effects. Programmability is enhanced by automating the generation of
SIMD and SIMT implementations from a single programmer-managed scalar representation
Python Programmers Have GPUs Too: Automatic Python Loop Parallelization with Staged Dependence Analysis
Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively, by exploiting commodity manycore technology including GPUs. However, existing approaches to parallelism in Python are esoteric, and generally seem too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations or restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers.
Despite being a dynamic language, we show that Python is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance. We apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve additional loop bounds and variable types in a just-in-time manner. The parallel loop nest code is then converted to CUDA kernels for GPU execution. We achieve orders of magnitude speedup over baseline interpreted execution and some speedup (up to 50x, although not consistently) over CPU JIT-compiled execution, across 12 loop-intensive standard benchmarks
Python Programmers Have GPUs Too: Automatic Python Loop Parallelization with Staged Dependence Analysis
Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively, by exploiting commodity manycore technology including GPUs. However, existing approaches to parallelism in Python are esoteric, and generally seem too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations or restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers.
Despite being a dynamic language, we show that Python is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance. We apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve additional loop bounds and variable types in a just-in-time manner. The parallel loop nest code is then converted to CUDA kernels for GPU execution. We achieve orders of magnitude speedup over baseline interpreted execution and some speedup (up to 50x, although not consistently) over CPU JIT-compiled execution, across 12 loop-intensive standard benchmarks
On the MC/DC code coverage of Vulkan SC GPU code
Next generation avionics systems require high performance, which can be provided by graphics processing units (GPUs). The newly introduced API Vulkan SC, enables the development of safety critical GPU software with complex control flow, whose certification is subject to DO-178C certifiability objectives, such as MC/DC code coverage.In this paper we explain for the first time how MC/DC coverage can be applied in Vulkan SC code as well as the type of potential development errors which can arise in GPU programming. We show how GPU code can be converted in equivalent sequential CPU code and how both versions can achieve 100% MC/DC code coverage.This work was performed within the Airbus TANIAGPU Project ADS (E/200). It was also partially supported by the European Space Agency (ESA) through the GPU4S (GPU for Space) activity, the Spanish Ministry of Economy and Competitiveness under grants PID2019- 107255GB-C21 and IJC-2020-045931-I ( Spanish State Research Agency / Agencia Espanola de Investigación (AEI) / http://dx.doi.org/10.13039/501100011033 ) and the HiPEAC Network of ExcellencePeer ReviewedPostprint (author's final draft
SCALABLE TECHNIQUES FOR SCHEDULING AND MAPPING DSP APPLICATIONS ONTO EMBEDDED MULTIPROCESSOR PLATFORMS
A variety of multiprocessor architectures has proliferated even for off-the-shelf computing platforms. To make use of these platforms, traditional implementation frameworks focus on implementing Digital Signal Processing (DSP) applications using special platform features to achieve high performance. However, due to the fast evolution of the underlying architectures, solution redevelopment is error prone and re-usability of existing solutions and libraries is limited. In this thesis, we facilitate an efficient migration of DSP systems to multiprocessor platforms while systematically leveraging previous investment in optimized library kernels using dataflow design frameworks. We make these library elements, which are typically tailored to specialized architectures, more amenable to extensive analysis and optimization using an efficient and systematic process. In this thesis we provide techniques to allow such migration through four basic contributions:
1. We propose and develop a framework to explore efficient utilization of Single Instruction Multiple Data (SIMD) cores and accelerators available in heterogeneous multiprocessor platforms consisting of General Purpose Processors (GPPs) and Graphics Processing Units (GPUs). We also propose new scheduling techniques by applying extensive block processing in conjunction with appropriate task mapping and task ordering methods that match efficiently with the underlying architecture. The approach gives the developer the ability to prototype a GPU-accelerated application and explore its design space efficiently and effectively.
2. We introduce the concept of Partial Expansion Graphs (PEGs) as an implementation model and associated class of scheduling strategies. PEGs are designed to help realize DSP systems in terms of forms and granularities of parallelism that are well matched to the given applications and targeted platforms. PEGs also facilitate derivation of both static and dynamic scheduling techniques, depending on the amount of variability in task execution times and other operating conditions. We show how to implement efficient PEG-based scheduling methods using real time operating systems, and to re-use pre-optimized libraries of DSP components within such implementations.
3. We develop new algorithms for scheduling and mapping systems implemented using PEGs. Collectively, these algorithms operate in three steps. First, the amount of data parallelism in the application graph is tuned systematically over many iterations to profit from the available cores in the target platform. Then a mapping algorithm that uses graph analysis is developed to distribute data and task parallel instances over different cores while trying to balance the load of all processing units to make use of pipeline parallelism. Finally, we use a novel technique for performance evaluation by implementing the scheduler and a customizable solution on the programmable platform. This allows accurate fitness functions to be measured and used to drive runtime adaptation of schedules.
4. In addition to providing scheduling techniques for the mentioned applications and platforms, we also show how to integrate the resulting solution in the underlying environment. This is achieved by leveraging existing libraries and applying the GPP-GPU scheduling framework to augment a popular existing Software Defined Radio (SDR) development environment -- GNU Radio -- with a dataflow foundation and a stand-alone GPU-accelerated library. We also show how to realize the PEG model on real time operating system libraries, such as the Texas Instruments DSP/BIOS. A code generator that accepts a manual system designer solution as well as automatically configured solutions is provided to complete the design flow starting from application model to running system
MR-CUDASW - GPU accelerated Smith-Waterman algorithm for medium-length (meta)genomic data
The idea of using a graphics processing unit (GPU) for more than simply graphic output purposes has been around for quite some time in scientific communities. However, it is only recently that its benefits for a range
of bioinformatics and life sciences compute-intensive tasks has been recognized. This thesis investigates the possibility of improving the performance of the overlap determination stage of an Overlap Layout Consensus
(OLC)-based assembler by using a GPU-based implementation of the Smith-Waterman algorithm.
In this thesis an existing GPU-accelerated sequence alignment algorithm is adapted and expanded to reduce its completion time. A number of improvements and changes are made to the original software. Workload distribution, query profile construction, and thread scheduling techniques implemented by the original program are replaced by custom methods specifically designed to handle medium-length reads.
Accordingly, this algorithm is the first highly parallel solution that has been specifically optimized to process medium-length nucleotide reads (DNA/RNA) from modern sequencing machines (i.e. Ion Torrent).
Results show that the software reaches up to 82 GCUPS (Giga Cell Updates Per Second) on a single-GPU graphic card running on a commodity desktop hardware. As a result it is the fastest GPU-based implemen-
tation of the Smith-Waterman algorithm tailored for processing medium-length nucleotide reads. Despite being designed for performing the Smith-Waterman algorithm on medium-length nucleotide sequences, this
program also presents great potential for improving heterogeneous computing with CUDA-enabled GPUs in general and is expected to make contributions to other research problems that require sensitive pairwise alignment to be applied to a large number of reads. Our results show that it is possible to improve the performance of bioinformatics algorithms by taking full advantage of the compute resources of the underlying commodity hardware and further, these results are especially encouraging since GPU performance grows faster than multi-core CPUs
- …