Indexed dependence metadata and its applications in software performance optimisation
To achieve continued performance improvements, modern microprocessor design is tending to concentrate
an increasing proportion of hardware on computation units with less automatic management
of data movement and extraction of parallelism. As a result, architectures increasingly include multiple
computation cores and complicated, software-managed memory hierarchies. Compilers have
difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic
generation of efficient code in any but the most straightforward of cases.
We propose the concept of indexed dependence metadata to improve application development and
mapping onto such architectures. The metadata represent both the iteration space of a kernel and the
mapping of that iteration space from a given index to the set of data elements that iteration might
use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping
allows the compiler or runtime to optimise the program more efficiently, and improves the program
structure for the developer. We argue that this form of explicit interface specification reduces the need
for premature, architecture-specific optimisation. It improves program portability, supports inter-component
optimisation and enables generation of efficient data movement code.
We offer the following contributions: an introduction to the concept of indexed dependence metadata
as a generalisation of stream programming, a demonstration of its advantages in a component
programming system, the decoupled access/execute model for C++ programs, and how indexed dependence
metadata might be used to improve the programming model for GPU-based designs. Our
experimental results with prototype implementations show that indexed dependence metadata supports
automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive
loop fusion optimisations in image processing, linear algebra and multigrid application case
studies.
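As a minimal sketch of the idea described in this abstract, the snippet below pairs an iteration space with a function that maps each iteration index to the data elements it might read. The names `stencil_reads` and `footprint` and the 5-point stencil are illustrative assumptions, not the authors' interface.

```python
from typing import Callable, Iterable, Set, Tuple

Index = Tuple[int, int]

def stencil_reads(i: int, j: int) -> Set[Index]:
    """Hypothetical access descriptor: elements that iteration (i, j)
    of a 5-point stencil kernel might read."""
    return {(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)}

def footprint(iterations: Iterable[Index],
              reads: Callable[[int, int], Set[Index]]) -> Set[Index]:
    """Union of the elements needed by a set of iterations.
    A runtime could use this set to plan data movement for a tile."""
    needed: Set[Index] = set()
    for i, j in iterations:
        needed |= reads(i, j)
    return needed

# A 2x2 tile of the iteration space and the data it depends on.
tile = [(i, j) for i in range(1, 3) for j in range(1, 3)]
needed = footprint(tile, stencil_reads)
```

Because the metadata is indexed by the iteration space, the same descriptor answers the question for any tile shape without inspecting the kernel body.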
Scratchpad Management in Software Managed Manycore Architectures
Caches have long been used to reduce memory access latency. However, the increased complexity of cache coherence brings significant challenges in processor design as the number of cores increases. While making caches scalable is still an important research problem, some researchers are exploring the possibility of a more power-efficient SRAM called scratchpad memory (SPM). SPMs consume significantly less area and are more energy-efficient per access than caches, and therefore make the design of on-chip memories much simpler. Unlike caches, which fetch data from memory automatically, an SPM requires explicit instructions for data transfers. SPM-only architectures are thus called software managed manycore (SMM) architectures, since their data movements rely on software. SMM processors have been widely used in different areas, such as embedded computing, network processing, and even high performance computing. While SMM processors provide a low-power platform, the hardware alone does not guarantee power efficiency if applications on such processors deliver low performance. Efficient software techniques are therefore required. A large body of management techniques for SMM architectures is compiler-directed, as inserting data movement operations by hand forces programmers to trace the flow of data, which can be error-prone and sometimes difficult if not impossible. This thesis develops compiler-directed techniques to manage data transfers for embedded applications on SMMs efficiently. The techniques identify the appropriate program points and insert data movement instructions accordingly. The techniques manage code, stack and heap data of applications, and reduce execution time by 14%, 52% and 80% respectively compared to their predecessors on typical embedded applications. On top of managing local data, a technique is also developed for shared data in SMM architectures.
Experimental results show it achieves a speedup of more than 2X over the previous technique on average.
Dissertation/Thesis. Doctoral Dissertation, Computer Science 201
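To illustrate what "explicit instructions for data transfers" means in practice, here is a small sketch that stages data through a fixed-size buffer standing in for an SPM. The function name, the list-based memory model and the transfer counter are assumptions made for illustration; they are not taken from the thesis.

```python
def scale_with_scratchpad(main_memory, spm_words, factor):
    """Scale every element in place, staging data through a buffer of
    at most `spm_words` elements that stands in for a scratchpad."""
    transfers = 0
    for start in range(0, len(main_memory), spm_words):
        # Explicit copy-in: on real hardware this would be a DMA "get".
        spm = main_memory[start:start + spm_words]
        # Computation only touches the scratchpad copy.
        spm = [x * factor for x in spm]
        # Explicit copy-out: on real hardware this would be a DMA "put".
        main_memory[start:start + len(spm)] = spm
        transfers += 2
    return transfers

data = list(range(10))
transfer_count = scale_with_scratchpad(data, spm_words=4, factor=3)
```

A compiler-directed scheme would insert the copy-in and copy-out steps automatically at the program points it selects, rather than leaving them to the programmer.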
Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++
Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++, which is based on a concurrent aggregate parallel model.
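The kind of user-guided data distribution mentioned for HPF can be sketched as an owner computation. The snippet assumes the conventional BLOCK and CYCLIC mappings of a one-dimensional array onto `p` processors with 0-based indices; the function names are hypothetical and the abstract itself does not spell out these formulas.

```python
def block_owner(i, n, p):
    """Processor that owns element i of an n-element array under a
    BLOCK distribution: contiguous chunks of ceil(n / p) elements."""
    block = -(-n // p)  # ceiling division
    return i // block

def cyclic_owner(i, p):
    """Processor that owns element i under a CYCLIC distribution:
    elements are dealt out round-robin."""
    return i % p

block_map = [block_owner(i, 10, 4) for i in range(10)]
cyclic_map = [cyclic_owner(i, 4) for i in range(10)]
```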
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.
Comment: Submitted to Parallel Computing, Elsevie
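The core of run-time code generation can be sketched without a GPU: the host script builds kernel source text from parameters that are only known at run time. The template, the helper `make_elementwise_kernel` and the example expression are assumptions for illustration, not code from the article.

```python
from string import Template

# CUDA C source with placeholders filled in at run time.
KERNEL_TEMPLATE = Template("""
__global__ void ${name}(${dtype} *out, const ${dtype} *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = ${expr};
}
""")

def make_elementwise_kernel(name, dtype, expr):
    """Return CUDA C source for an elementwise kernel specialised to
    the given element type and expression."""
    return KERNEL_TEMPLATE.substitute(name=name, dtype=dtype, expr=expr)

source = make_elementwise_kernel("scale", "float", "2.0f * in[i]")
```

With PyCUDA installed and a CUDA device available, a string like `source` would typically be passed to `pycuda.compiler.SourceModule` and the kernel retrieved with `get_function("scale")`; that step is omitted here so the sketch runs anywhere.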
DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs
The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme
edge of the Internet-of-Things is a critical enabler to support pervasive Deep
Learning-enhanced applications. Low-Cost MCU-based end-nodes have limited
on-chip memory and often replace caches with scratchpads, to reduce area
overheads and increase energy efficiency -- requiring explicit DMA-based memory
transfers between different levels of the memory hierarchy. Mapping modern DNNs
on these systems requires aggressive topology-dependent tiling and
double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY)
- an automatic tool to deploy DNNs on low cost MCUs with typically less than
1MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint Programming
(CP) problem: it maximizes L1 memory utilization under the topological
constraints imposed by each DNN layer. Then, it generates ANSI C code to
orchestrate off- and on-chip transfers and computation phases. Furthermore, to
maximize speed, DORY augments the CP formulation with heuristics promoting
performance-effective tile sizes. As a case study for DORY, we target
GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power
MCU-class devices on the market. On this device, DORY achieves up to 2.5x
better MAC/cycle than the GreenWaves proprietary software solution and 18.1x
better than the state-of-the-art result on an STM32-F746 MCU on single layers.
Using our tool, GAP-8 can perform end-to-end inference of a 1.0-MobileNet-128
network consuming just 63 pJ/MAC on average @ 4.3 fps - 15.4x better than an
STM32-F746. We release all our developments - the DORY framework, the optimized
backend kernels, and the related heuristics - as open-source software.
Comment: 14 pages, 12 figures, 4 tables, 2 listings. Accepted for publication
in IEEE Transactions on Computers
(https://ieeexplore.ieee.org/document/9381618
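The tiling problem described in this abstract can be illustrated with a brute-force search that picks the tile using the most L1 memory without exceeding the budget. This is a simplified stand-in, not DORY's Constraint Programming formulation or its heuristics: the memory model (stride-1 convolution, one byte per element, every buffer double-buffered) and the names `tile_bytes` and `best_tile` are assumptions.

```python
from itertools import product

def tile_bytes(h_out, c_out, *, w, c_in, k):
    """Bytes needed for one tile of a k x k, stride-1 convolution,
    assuming 1 byte per element and double-buffering of all buffers."""
    h_in = h_out + k - 1           # input rows needed for h_out output rows
    w_in = w + k - 1               # full-width tile plus halo columns
    inputs = h_in * w_in * c_in
    outputs = h_out * w * c_out
    weights = k * k * c_in * c_out
    return 2 * (inputs + outputs + weights)

def best_tile(l1_bytes, *, h, w, c_in, c_out, k):
    """Exhaustively search tile height and output-channel count,
    returning (bytes, tile_h, tile_c) with the highest L1 usage that
    fits, or None if no tile fits."""
    best = None
    for th, tc in product(range(1, h + 1), range(1, c_out + 1)):
        size = tile_bytes(th, tc, w=w, c_in=c_in, k=k)
        if size <= l1_bytes and (best is None or size > best[0]):
            best = (size, th, tc)
    return best

result = best_tile(64 * 1024, h=32, w=32, c_in=16, c_out=32, k=3)
```

A constraint solver replaces the exhaustive loop when the search space grows, and additional objective terms can favour tile shapes that run faster on the target kernels.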