Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II
The High Performance Computing (HPC) community recognizes energy
consumption as a major problem. Extensive research is underway to
identify means to increase energy efficiency of HPC systems
including consideration of alternative
building blocks for future systems. This thesis considers one
such system, the Texas Instruments Keystone II, a heterogeneous
Low-Power System-on-Chip (LPSoC) processor that combines a quad
core ARM CPU with an octa-core Digital Signal Processor (DSP). It
was first released in 2012.
Four issues are considered: i) maximizing the Keystone II ARM CPU
performance; ii) implementation and extension of the OpenMP
programming model for the Keystone II; iii) simultaneous use of
ARM and DSP cores across multiple Keystone SoCs; and iv) an
energy model for applications running on LPSoCs like the Keystone
II and heterogeneous systems in general.
Maximizing the performance of the ARM CPU on the Keystone II
system is fundamental to adoption of this system by the HPC
community and, more broadly, of the ARM architecture. Key to
achieving good performance is exploitation of the ARM vector
instructions. This thesis presents the first detailed comparison
of the use of ARM compiler intrinsic functions with automatic
compiler vectorization across four generations of ARM processors.
Comparisons are also made with x86 based platforms and the use of
equivalent Intel vector instructions.
Implementation of the OpenMP programming model on the Keystone II
system presents both challenges and opportunities. Challenges in
that the OpenMP model was originally developed for a homogeneous
programming environment with a common instruction set
architecture, and in 2012 work had only just begun to consider
how OpenMP might work with accelerators. Opportunities in that
shared memory is accessible to all processing elements on the
LPSoC, offering performance advantages over what typically exists
with attached accelerators. This thesis presents an analysis of a
prototype version of OpenMP implemented as a bare-metal runtime
on the DSP of a Keystone I system. An implementation for the
Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL
runtime library operations is presented and evaluated.
Exploitation of some of the underlying hardware features of the
Keystone II is also discussed.
Simultaneous use of the ARM and DSP cores across multiple
Keystone II boards is fundamental to the creation of commercially
viable HPC offerings based on Keystone technology. The nCore
BrownDwarf and HPE Moonshot systems represent two such systems.
This thesis presents a proof-of-concept implementation of matrix
multiplication (GEMM) for the BrownDwarf system. The BrownDwarf
utilizes both Keystone II and Keystone I SoCs through a
point-to-point interconnect called Hyperlink. Details of how a
novel message passing communication framework across Hyperlink
was implemented to support this complex environment are
provided.
An energy model that can be used to predict energy usage as a
function of what fraction of a particular computation is
performed on each of the available compute devices offers the
opportunity for making runtime decisions on how best to minimize
energy usage. This thesis presents a basic energy usage model
that considers rates of executions on each device and their
active and idle power usages. Using this model, it is shown that
only under certain conditions does there exist an energy-optimal
work partition that uses multiple compute devices. To validate
the model a high resolution energy measurement environment is
developed and used to gather energy measurements for a matrix
multiplication benchmark running on a variety of systems. Results
presented support the model.
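The following Python fragment is a minimal sketch of a two-device energy model of the kind described above. It is not the thesis's own formulation: the `Device` fields, the function names, and the numbers in the usage lines are hypothetical, and the model assumes that each device draws its active power while busy and its idle power while waiting for the other device to finish. Under those assumptions the energy is piecewise linear in the work fraction, so the minimum lies at one of three candidates: all work on one device, all work on the other, or the balanced split at which both devices finish together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    rate: float          # work units per second (hypothetical figure)
    active_power: float  # watts drawn while computing
    idle_power: float    # watts drawn while waiting


def energy(fraction, work, dev_a, dev_b):
    """Joules used when `fraction` of `work` runs on dev_a and the rest on dev_b."""
    busy_a = fraction * work / dev_a.rate
    busy_b = (1.0 - fraction) * work / dev_b.rate
    makespan = max(busy_a, busy_b)  # both devices stay powered until the job ends
    total = 0.0
    for dev, busy in ((dev_a, busy_a), (dev_b, busy_b)):
        total += dev.active_power * busy + dev.idle_power * (makespan - busy)
    return total


def best_partition(work, dev_a, dev_b):
    """Fraction of work for dev_a that minimizes energy under this sketch's assumptions."""
    # Energy is linear on each side of the balanced split, so only the two
    # endpoints and the balance point need to be compared.
    balanced = dev_a.rate / (dev_a.rate + dev_b.rate)
    return min((0.0, balanced, 1.0), key=lambda f: energy(f, work, dev_a, dev_b))


# Hypothetical devices: a slower CPU and a faster, lower-power DSP.
cpu = Device(rate=10.0, active_power=20.0, idle_power=5.0)
dsp = Device(rate=20.0, active_power=15.0, idle_power=4.0)
print(best_partition(100.0, cpu, dsp))         # 0.0: the DSP alone is cheapest

# Raising the CPU idle power makes leaving it idle expensive, so sharing wins.
hungry_cpu = Device(rate=10.0, active_power=20.0, idle_power=15.0)
print(best_partition(100.0, hungry_cpu, dsp))  # about 0.333: balanced split
```

With these illustrative numbers, using both devices pays off only once the idle power of the otherwise unused device is high enough, which is consistent with the claim that a multi-device optimum exists only under certain conditions.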
Drawing on the four issues noted above and other developments
that have occurred since the Keystone II system was first
announced, the thesis concludes by making comments regarding the
future of LPSoCs as building blocks for HPC systems.
Mixed-data-model heterogeneous compilation and OpenMP offloading
Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever-more application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and the accelerator. Today 64-bit host processors are commonplace, but few accelerators exceed 32-bit addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption. In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7% compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.
Response-time analysis of DAG tasks supporting heterogeneous computing
Hardware platforms are evolving towards parallel and heterogeneous architectures to meet the growing demand for performance in the real-time domain. Parallel programming models are fundamental to exploiting the performance capabilities of these architectures. This paper proposes a novel response time analysis (RTA) for verifying the schedulability of DAG tasks supporting heterogeneous computing. It analyzes the impact of executing part of the DAG in the accelerator device. As a result, the response time upper bound of the system is more precise than the one provided by currently existing RTA targeting homogeneous architectures. This work is supported by the Spanish Ministry of Science and Innovation under contract TIN2015-65316-P.
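For context, the sketch below computes the classic homogeneous response-time bound for a single DAG task running alone on `cores` identical cores under a work-conserving scheduler: critical-path length plus the remaining workload divided by the number of cores. This is the kind of homogeneous baseline the abstract compares against, not the heterogeneous analysis the paper proposes. The function names and the example graph are hypothetical.

```python
from collections import defaultdict, deque


def critical_path_length(wcet, edges):
    """Longest path through the DAG, where wcet maps node -> worst-case execution time."""
    successors = defaultdict(list)
    indegree = {node: 0 for node in wcet}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Earliest finish time of each node if cores were unlimited.
    finish = dict(wcet)
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while ready:  # Kahn's algorithm: visit nodes in topological order
        node = ready.popleft()
        visited += 1
        for nxt in successors[node]:
            finish[nxt] = max(finish[nxt], finish[node] + wcet[nxt])
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if visited != len(wcet):
        raise ValueError("graph contains a cycle")
    return max(finish.values())


def homogeneous_response_bound(wcet, edges, cores):
    """Upper bound: len + (vol - len) / cores, for one DAG task on identical cores."""
    length = critical_path_length(wcet, edges)
    volume = sum(wcet.values())
    return length + (volume - length) / cores


# Hypothetical fork-join task: a -> {b, c} -> d
wcet = {"a": 2, "b": 3, "c": 4, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(homogeneous_response_bound(wcet, edges, cores=2))  # 7 + (10 - 7) / 2 = 8.5
```

A heterogeneous analysis would additionally have to account for the nodes that run on the accelerator rather than on the host cores; how the paper does so is not stated in the abstract and is not modelled here.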
Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures
For reasons of both performance and energy efficiency, high-performance
computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL
framework supports portable programming across a wide range of computing
devices and is gaining influence in programming next-generation accelerators.
To characterize the performance of these devices across a range of applications
requires a diverse, portable and configurable benchmark suite, and OpenCL is an
attractive programming model for this purpose. We present an extended and
enhanced version of the OpenDwarfs OpenCL benchmark suite, with a strong focus
placed on the robustness of applications, curation of additional benchmarks
with an increased emphasis on correctness of results and choice of problem
size. Preliminary results and analysis are reported for eight benchmark codes
on a diverse set of architectures -- three Intel CPUs, five Nvidia GPUs, six
AMD GPUs and a Xeon Phi.
Comment: 10 pages, 5 figures
Modelli e strumenti di programmazione parallela per piattaforme many-core (Parallel programming models and tools for many-core platforms)
The trade-off among power consumption, performance, programmability, and portability drives all computing industry designs, particularly in the mobile and embedded systems domains.
Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors.
Parallel programming models are key to effectively harnessing the computational power of heterogeneous many-core SoCs.
This thesis presents a set of techniques and HW/SW extensions that enable performance improvements and that simplify programmability for heterogeneous many-core platforms.
The thesis contributions cover vertically the entire software stack for many-core platforms, from hardware abstraction layers running on top of bare-metal, to programming models; from hardware extensions for efficient parallelism support to middleware that enables optimized resource management within many-core platforms.
First, we present mechanisms to decrease parallelism overheads on parallel programming runtimes for many-core platforms, targeting fine-grain parallelism.
Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems.
Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently.
All these contributions were validated using STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system. Hardware extensions and architectural explorations were evaluated using VirtualSoC, a SystemC-based cycle-accurate simulator of many-core platforms.
High Performance Embedded Computing
Computing systems are now so prevalent in our lives that we live in a cyber-physical world dominated by computer systems, from pacemakers to cars and airplanes. These systems demand more computational performance to process large amounts of data from multiple data sources with guaranteed processing times. Acting outside the required timing bounds may cause the system to fail, which is critical for systems such as planes, cars, business monitoring, and e-trading. High-Performance and Time-Predictable Embedded Computing presents recent advances in software architecture and tools to support such complex systems, enabling the design of embedded computing devices that deliver high performance while guaranteeing the timing bounds the application requires. Technical topics discussed in the book include: parallel embedded platforms; programming models; mapping and scheduling of parallel computations; timing and schedulability analysis; runtimes and operating systems. The work reflected in this book was done in the scope of the European project P-SOCRATES, funded under the FP7 framework program of the European Commission. High-Performance and Time-Predictable Embedded Computing is ideal for personnel in the computer, communication, and embedded industries, as well as academic staff and master's and research students in computer science, embedded systems, cyber-physical systems, and the internet of things.
High-Performance and Time-Predictable Embedded Computing
Computing systems are now so prevalent in our lives that we live in a cyber-physical world dominated by computer systems, from pacemakers to cars and airplanes. These systems demand more computational performance to process large amounts of data from multiple data sources with guaranteed processing times. Acting outside the required timing bounds may cause the system to fail, which is critical for systems such as planes, cars, business monitoring, and e-trading.
High-Performance and Time-Predictable Embedded Computing presents recent advances in software architecture and tools to support such complex systems, enabling the design of embedded computing devices that deliver high performance while guaranteeing the timing bounds the application requires.
Technical topics discussed in the book include: parallel embedded platforms; programming models; mapping and scheduling of parallel computations; timing and schedulability analysis; runtimes and operating systems.
The work reflected in this book was done in the scope of the European project P-SOCRATES, funded under the FP7 framework program of the European Commission. High-Performance and Time-Predictable Embedded Computing is ideal for personnel in the computer, communication, and embedded industries, as well as academic staff and master's and research students in computer science, embedded systems, cyber-physical systems, and the internet of things.
Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking
In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the need for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workloads and real programs, and we compare it to several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12× for real benchmarks, which is 60% higher than what we observe with the original Kalray OpenMP implementation.
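As a quick reading aid, the arithmetic below derives the baseline speedup implied by the last sentence. It assumes that "60% higher" means the reported average speedup is 1.6 times the average obtained with the original implementation; the abstract does not state the baseline figure, so the result is an inference, and the function name is hypothetical.

```python
def implied_baseline_speedup(improved_speedup, relative_gain):
    """Baseline speedup if improved = baseline * (1 + relative_gain)."""
    return improved_speedup / (1.0 + relative_gain)


# 12x reported, assumed to be 60% above the original runtime's average speedup.
baseline = implied_baseline_speedup(12.0, 0.60)
print(round(baseline, 2))  # 7.5
```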