Cooperative auto-tuning of parallel skeletons
Improving program performance through the use of multiple homogeneous processing
elements, or cores, is commonplace. However, these architectures increase the
complexity required at the software level. Existing work is focused on optimising
programs that run in isolation on these systems, but ignores the fact that, in reality,
these systems run multiple parallel programs concurrently with programs competing
for system resources. In order to improve performance in this shared environment,
cooperative tuning of multiple, concurrently running parallel programs is required.
Moreover, the set of programs running on the system – the system workload – is dynamic
and rapidly changing. This makes cooperative tuning a challenge, as it must
react rapidly to changes in the system workload.
This thesis explores the scope for performance improvement from cooperatively
tuning skeleton parallel programs, and techniques that can be used to cooperatively
auto-tune parallel programs. Parallel skeletons provide a clear separation between
algorithm description and implementation, and provide tuning knobs that the system
can use to make high-level changes to a program's implementation. This work
is in three parts: (i) how many threads should be allocated to each program running
on the system, (ii) on which cores a program's threads should be executed and
(iii) what values should be chosen for high-level parameters of the parallel skeletons.
We demonstrate that significant performance improvements are available in each of
these areas, compared to the current state of the art.
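As an illustration of question (i) above, the following is a minimal sketch of dividing cores among concurrently running programs. It is not the method developed in the thesis: the function name `allocate_threads`, the proportional-share policy, and the assumption that every program requests a positive number of threads are all hypothetical choices made for this example.

```python
def allocate_threads(demands, total_cores):
    """Split total_cores among programs in proportion to requested threads.

    demands: dict mapping program name -> threads it would like (assumed > 0).
    Returns a dict mapping program name -> threads granted (at least 1 each,
    never more than the program asked for).
    """
    if total_cores < len(demands):
        raise ValueError("need at least one core per program")
    total_demand = sum(demands.values())
    # Start every program at one thread so that none is starved.
    grant = {name: 1 for name in demands}
    spare = total_cores - len(demands)
    # Ideal (fractional) share of the machine for each program.
    shares = {n: d / total_demand * total_cores for n, d in demands.items()}
    for _ in range(spare):
        # Only programs that still want more threads are candidates.
        candidates = [n for n in demands if grant[n] < demands[n]]
        if not candidates:
            break
        # Give the next core to the program furthest below its ideal share.
        neediest = max(candidates, key=lambda n: shares[n] - grant[n])
        grant[neediest] += 1
    return grant


if __name__ == "__main__":
    print(allocate_threads({"a": 8, "b": 4, "c": 4}, 8))
```

With requests of 8, 4 and 4 threads on an 8-core machine, this sketch grants 4, 2 and 2 threads. A cooperative tuner would need to rerun such a decision whenever the workload changes, which is the dynamic aspect the abstract emphasises.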
Multi-GPU support on the marrow algorithmic skeleton framework
Dissertation submitted to obtain the degree of Master in Informatics Engineering
With the proliferation of general-purpose GPUs, workload parallelization and data transfer optimization have become an increasing concern. The natural evolution from using a single GPU is to multiply the number of available processors, which presents new challenges, such as tuning the workload decomposition and load balancing when dealing with heterogeneous systems.
Higher-level programming is a very important asset in a multi-GPU environment, because the currently used GPGPU APIs (OpenCL and CUDA) are complex, low-level, and carry considerable code overhead. This can be obtained by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations
such as transparent load-balancing mechanisms and reduced explicit code overhead.
Algorithmic Skeletons, previously used in cluster environments, have recently been
adapted to the GPGPU context. Skeletons abstract most sources of code overhead, by
defining computation patterns of commonly used algorithms. The Marrow algorithmic
skeleton library is one of these, taking advantage of the abstractions to automate the
orchestration needed for an efficient GPU execution.
This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons
in the modular and efficient programming of multiple heterogeneous GPUs, within a single machine.
We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single GPU.
Projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200
Challenging the abstraction penalty in parallel patterns libraries
In recent years, pattern-based programming has been recognized as a good practice for efficiently exploiting parallel hardware resources. Following this approach, multiple libraries have been designed to provide such high-level abstractions and ease parallel programming. However, those libraries do not share a common interface. To pave the way, GrPPI has been designed to provide an intermediate abstraction layer between application developers and existing parallel programming frameworks such as OpenMP, Intel TBB or ISO C++ threads. On the other hand, FastFlow has been adopted as an efficient object-based programming framework that may benefit from being supported as an additional GrPPI backend. However, the object-based approach presents some major challenges to being incorporated under the GrPPI type-safe functional programming style. In this paper, we present the integration of FastFlow as a new GrPPI backend to demonstrate that structured parallel programming frameworks perfectly fit the GrPPI design. Additionally, we demonstrate that GrPPI does not incur additional overhead for providing its abstraction layer, and we study the programmability in terms of lines of code and cyclomatic complexity. In general, the presented work acts as a reciprocal validation of both FastFlow (as an efficient, native structured parallel programming framework) and GrPPI (as an efficient abstraction layer on top of existing parallel programming frameworks).
This work has been partially supported by the European Commission EU H2020-ICT-2014-1 Project RePhrase (No. 644235) and by the Spanish Ministry of Economy and Competitiveness through TIN2016-79637-P “Towards Unification of HPC and Big Data Paradigms”.
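The idea of an intermediate abstraction layer between application code and interchangeable execution backends can be sketched as follows. This is not GrPPI's or FastFlow's API (both are C++ libraries); the class names `SequentialBackend` and `ThreadBackend` and the function `map_reduce` are hypothetical, chosen only to show that the pattern is written once and the backend is passed in as a parameter.

```python
from concurrent.futures import ThreadPoolExecutor


class SequentialBackend:
    """Hypothetical backend that runs the map step in the calling thread."""

    def map(self, func, items):
        return [func(x) for x in items]


class ThreadBackend:
    """Hypothetical backend that runs the map step on a thread pool."""

    def __init__(self, workers=4):
        self.workers = workers

    def map(self, func, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(func, items))


def map_reduce(backend, items, transform, combine, identity):
    """Map-reduce pattern: the backend decides how the map step executes."""
    partials = backend.map(transform, items)
    result = identity
    for value in partials:
        result = combine(result, value)
    return result


if __name__ == "__main__":
    for backend in (SequentialBackend(), ThreadBackend(workers=2)):
        total = map_reduce(backend, range(5), lambda x: x * x,
                           lambda a, b: a + b, 0)
        print(type(backend).__name__, total)
```

The application-level call is identical for both backends, which is the property the abstract describes; adding a further backend in this sketch means adding one class with a `map` method.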
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
We study the problem of human action recognition using motion capture (MoCap)
sequences. Unlike existing techniques that take multiple manual steps to derive
standardized skeleton representations as model input, we propose a novel
Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The model uses a hierarchical transformer with intra-frame off-set attention
and inter-frame self-attention. The attention mechanism allows the model to
freely attend between any two vertex patches to learn non-local relationships
in the spatial-temporal domain. Masked vertex modeling and future frame
prediction are used as two self-supervised tasks to fully activate the
bi-directional and auto-regressive attention in our hierarchical transformer.
The proposed method achieves state-of-the-art performance compared to
skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is
available at https://github.com/zgzxy001/STMT.
Comment: CVPR 202
Autonomic behavioural framework for structural parallelism over heterogeneous multi-core systems.
With the continuous advancement in hardware technologies, significant research has been devoted to designing and developing high-level parallel programming models that allow programmers to exploit the latest developments in heterogeneous multi-core/many-core architectures. Structural programming paradigms propose a viable solution for efficiently programming modern heterogeneous multi-core architectures equipped with one or more programmable Graphics Processing Units (GPUs). Applying structured programming paradigms, it is possible to subdivide a system into building blocks (modules, skids or components) that can be independently created and then used in different systems to derive multiple functionalities. Exploiting such systematic divisions, it is possible to address extra-functional features such as application performance, portability and resource utilisation from the component level in heterogeneous multi-core architectures. While the computing function of a building block can vary for different applications, the behaviour (semantics) of the block remains intact. Therefore, by understanding the behaviour of building blocks and their structural compositions in parallel patterns, the process of constructing and coordinating a structured application can be automated. In this thesis we have proposed the Structural Composition and Interaction Protocol (SKIP) as a systematic methodology to exploit the structural programming paradigm (the building block approach in this case) for constructing a structured application and extracting/injecting information from/to the structured application. Using the SKIP methodology, we have designed and developed the Performance Enhancement Infrastructure (PEI) as a SKIP-compliant autonomic behavioural framework to automatically coordinate structured parallel applications based on the extracted extra-functional properties related to the parallel computation patterns.
We have used 15 different PEI-based applications (from large-scale applications with heavy input workloads that take hours to execute to small-scale applications that take seconds to execute) to evaluate PEI in terms of overhead and performance improvements. The experiments have been carried out on 3 different heterogeneous (CPU/GPU) multi-core architectures (including one cluster machine with 4 symmetric nodes with one GPU per node and 2 single machines with one GPU per machine). Our results demonstrate that with less than 3% overhead, we can achieve up to one order of magnitude speed-up when using PEI for enhancing application performance.
FastFlow: Efficient Parallel Streaming Applications on Multi-core
Shared-memory multiprocessors have come back to popularity thanks to the rapid
spread of commodity multi-core architectures. As ever, shared-memory
programs are fairly easy to write and quite hard to optimise; providing
multi-core programmers with optimising tools and programming frameworks is a
current challenge. Few efforts have been made to support effective streaming
applications on these architectures. In this paper we introduce FastFlow, a
low-level programming framework based on lock-free queues explicitly designed
to support high-level languages for streaming applications. We compare FastFlow
with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel
TBB. We experimentally demonstrate that FastFlow is always more efficient than
all of them in a set of micro-benchmarks and on a real world application; the
speedup edge of FastFlow over other solutions can be large for fine-grained
tasks: for example, +35% over OpenMP, +226% over Cilk, and +96% over TBB for the
alignment of protein P01111 against the UniProt DB using the Smith-Waterman algorithm.
Comment: 23 pages + cove
Autotuning wavefront patterns for heterogeneous architectures
Manual tuning of applications for heterogeneous parallel systems is tedious and complex.
Optimizations are often not portable, and the whole process must be repeated when moving
to a new system, or sometimes even to a different problem size.
Pattern-based parallel programming models were originally designed to provide programmers
with an abstract layer, hiding tedious parallel boilerplate code, and allowing a focus on
only application specific issues. However, the constrained algorithmic model associated with
each pattern also enables the creation of pattern-specific optimization strategies. These can
capture more complex variations than would be accessible by analysis of equivalent unstructured
source code. These variations create complex optimization spaces. Machine learning
offers well established techniques for exploring such spaces.
In this thesis we use machine learning to create autotuning strategies for heterogeneous
parallel implementations of applications which follow the wavefront pattern. In a wavefront,
computation starts from one corner of the problem grid and proceeds diagonally like a wave
to the opposite corner in either two or three dimensions. Our framework partitions and
optimizes the work created by these applications across systems comprising multicore CPUs
and multiple GPU accelerators. The tuning opportunities for a wavefront include controlling
the amount of computation to be offloaded onto GPU accelerators, choosing the number of
CPU and GPU threads to process tasks, tiling for both CPU and GPU memory structures,
and trading redundant halo computation against communication for multiple GPUs.
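The two-dimensional wavefront traversal described above can be sketched as follows. This is only an illustration of the pattern, not the thesis framework: the helper names `wavefront_order` and `lcs_length` are hypothetical, and longest common subsequence is used here simply as a familiar dynamic-programming problem with wavefront dependencies. Cells on the same anti-diagonal do not depend on each other, so each yielded list is a batch that could in principle be processed in parallel; the sketch itself runs sequentially.

```python
def wavefront_order(rows, cols):
    """Yield one list of (i, j) cells per anti-diagonal of a rows x cols grid.

    Diagonal d holds every cell with i + j == d, so the sweep starts at the
    corner (0, 0) and ends at the opposite corner (rows - 1, cols - 1).
    """
    for d in range(rows + cols - 1):
        i_lo = max(0, d - cols + 1)
        i_hi = min(rows - 1, d)
        yield [(i, d - i) for i in range(i_lo, i_hi + 1)]


def lcs_length(a, b):
    """Longest common subsequence length, filled diagonal by diagonal."""
    rows, cols = len(a), len(b)
    # One extra row and column of zeros act as the boundary condition.
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for diagonal in wavefront_order(rows, cols):
        # Every cell here reads only from the two previous diagonals.
        for i, j in diagonal:
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[rows][cols]


if __name__ == "__main__":
    for cells in wavefront_order(2, 3):
        print(cells)
    print(lcs_length("AGGTAB", "GXTXAYB"))
```

The length of each diagonal grows and then shrinks during the sweep, so the available parallelism varies over time; this is one reason why choices such as tile size and the CPU/GPU split are sensitive to the problem size.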
Our exhaustive search of the problem space shows that these parameters are very sensitive
to the combination of architecture, wavefront instance and problem size. We design and
investigate a family of autotuning strategies, targeting single and multiple CPU + GPU
systems, and both two and three dimensional wavefront instances. These yield an average
of 87% of the performance found by offline exhaustive search, with up to 99% in some cases
- …