3 research outputs found
Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II
The High Performance Computing (HPC) community recognizes energy
consumption as a major problem. Extensive research is underway to
identify means to increase energy efficiency of HPC systems
including consideration of alternative
building blocks for future systems. This thesis considers one
such system, the Texas Instruments Keystone II, a heterogeneous
Low-Power System-on-Chip (LPSoC) processor that combines a quad
core ARM CPU with an octa-core Digital Signal Processor (DSP). It
was first released in 2012.
Four issues are considered: i) maximizing the Keystone II ARM CPU
performance; ii) implementation and extension of the OpenMP
programming model for the Keystone II; iii) simultaneous use of
ARM and DSP cores across multiple Keystone SoCs; and iv) an
energy model for applications running on LPSoCs like the Keystone
II and heterogeneous systems in general.
Maximizing the performance of the ARM CPU on the Keystone II
system is fundamental to adoption of this system by the HPC
community and, more broadly, of the ARM architecture. Key to
achieving good performance is exploitation of the ARM vector
instructions. This thesis presents the first detailed comparison
of the use of ARM compiler intrinsic functions with automatic
compiler vectorization across four generations of ARM processors.
Comparisons are also made with x86-based platforms and the use of
equivalent Intel vector instructions.
Implementation of the OpenMP programming model on the Keystone II
system presents both challenges and opportunities. The challenges
arise because the OpenMP model was originally developed for a
homogeneous programming environment with a common instruction set
architecture, and in 2012 work had only just begun to consider
how OpenMP might work with accelerators. The opportunities arise
because shared memory is accessible to all processing elements on the
LPSoC, offering performance advantages over what typically exists
with attached accelerators. This thesis presents an analysis of a
prototype version of OpenMP implemented as a bare-metal runtime
on the DSP of a Keystone I system. An implementation for the
Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL
runtime library operations is presented and evaluated.
Exploitation of some of the underlying hardware features of the
Keystone II is also discussed.
Simultaneous use of the ARM and DSP cores across multiple
Keystone II boards is fundamental to the creation of commercially
viable HPC offerings based on Keystone technology. The nCore
BrownDwarf and HPE Moonshot systems represent two such systems.
This thesis presents a proof-of-concept implementation of matrix
multiplication (GEMM) for the BrownDwarf system. The BrownDwarf
uses both Keystone II and Keystone I SoCs, connected through a
point-to-point interconnect called Hyperlink. Details of how a
novel message passing communication framework across Hyperlink
was implemented to support this complex environment are
provided.
An energy model that can be used to predict energy usage as a
function of what fraction of a particular computation is
performed on each of the available compute devices offers the
opportunity for making runtime decisions on how best to minimize
energy usage. This thesis presents a basic energy usage model
that considers rates of executions on each device and their
active and idle power usages. Using this model, it is shown that
only under certain conditions does there exist an energy-optimal
work partition that uses multiple compute devices. To validate
the model, a high-resolution energy measurement environment is
developed and used to gather energy measurements for a matrix
multiplication benchmark running on a variety of systems. Results
presented support the model.
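
The abstract does not give the model's equations, so the following is
only a minimal sketch of one way such a model could look. It assumes two
devices that start together, each with a fixed execution rate (work
units per second), an active power and an idle power (watts), and it
assumes a device that finishes early idles until the other is done. The
function names, parameters and numbers are hypothetical and are not
taken from the thesis.

```python
def energy(f, work, rates, active, idle):
    """Energy in joules when fraction f of `work` runs on device 0
    and the remaining 1 - f runs on device 1 (assumed model)."""
    shares = (f, 1.0 - f)
    # Time each device spends computing its share.
    busy = [s * work / r for s, r in zip(shares, rates)]
    # The run ends when the slower device finishes.
    total = max(busy)
    # Active power while busy, idle power while waiting for the other device.
    return sum(a * t + i * (total - t) for a, i, t in zip(active, idle, busy))


def best_split(work, rates, active, idle):
    """Return the fraction f that minimises energy under this model.

    The energy is piecewise linear in f with a single kink where both
    devices finish together, so only three candidates need checking:
    device 1 alone, the balanced split, and device 0 alone.
    """
    balanced = rates[0] / (rates[0] + rates[1])
    candidates = (0.0, balanced, 1.0)
    return min(candidates, key=lambda f: energy(f, work, rates, active, idle))


# Hypothetical devices: low idle power, device 1 twice as fast.
low_idle = best_split(100, rates=(10, 20), active=(10, 8), idle=(2, 1))

# Hypothetical devices: identical, with idle power close to active power.
high_idle = best_split(100, rates=(10, 10), active=(10, 10), idle=(8, 8))
```

With these made-up numbers, the first case is cheapest when all work
goes to the faster device, while the second is cheapest when the work
is split evenly. Under this assumed model, that is one way a
multi-device partition can be energy-optimal only under certain
conditions.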
Drawing on the four issues noted above and other developments
that have occurred since the Keystone II system was first
announced, the thesis concludes by making comments regarding the
future of LPSoCs as building blocks for HPC systems.
An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor
Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration