4 research outputs found
Programming the Adapteva Epiphany 64-core Network-on-chip Coprocessor
Energy efficiency and power consumption are two of the major challenges in the
construction of exascale computing systems. Low-power, high-performance
embedded systems are of increasing interest as building blocks for large-scale
high-performance systems. However, extracting maximum performance out of such
systems presents many challenges. Various aspects from the hardware
architecture to the programming models used need to be explored. The Epiphany
architecture integrates low-power RISC cores on a 2D mesh network and promises
up to 70 GFLOPS/Watt of processing efficiency. However, with just 32 KB of
memory per eCore for storing both data and code, and only low-level inter-core
communication support, programming the Epiphany system presents several
challenges. In this paper we evaluate the performance of the Epiphany system
for a variety of basic compute and communication operations. Guided by this
data we explore strategies for implementing scientific applications on memory
constrained, low-powered devices such as the Epiphany. With future systems
expected to house thousands of cores on a single chip, the merits of such
architectures as a path to exascale are compared to those of competing systems.
Comment: 14 pages, submitted to IJHPCA Journal special edition
A DAQM-Based Load Balancing Scheme for High Performance Computing Platforms
This paper addresses the load balancing problem, which is one of the key issues in high-performance computing (HPC) platforms. A novel method, called decentralized active queue management (DAQM), is proposed to provide a fair task distribution in a heterogeneous computing environment for HPC platforms. An implementation of the DAQM is presented, which consists of an ON-OFF queue control and a utility maximization-based coordination scheme. The stability of the queue control scheme and the convergence of the algorithm for utility maximization have been assessed by rigorous analysis. To demonstrate the performance of the developed queueing control system, numerical simulations are carried out, and the results confirm the efficiency and viability of the developed scheme.
Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II
The High Performance Computing (HPC) community recognizes energy
consumption as a major problem. Extensive research is underway to
identify means to increase the energy efficiency of HPC systems,
including consideration of alternative
building blocks for future systems. This thesis considers one
such system, the Texas Instruments Keystone II, a heterogeneous
Low-Power System-on-Chip (LPSoC) processor that combines a quad
core ARM CPU with an octa-core Digital Signal Processor (DSP). It
was first released in 2012.
Four issues are considered: i) maximizing the Keystone II ARM CPU
performance; ii) implementation and extension of the OpenMP
programming model for the Keystone II; iii) simultaneous use of
ARM and DSP cores across multiple Keystone SoCs; and iv) an
energy model for applications running on LPSoCs like the Keystone
II and heterogeneous systems in general.
Maximizing the performance of the ARM CPU on the Keystone II
system is fundamental to adoption of this system by the HPC
community and, of the ARM architecture more broadly. Key to
achieving good performance is exploitation of the ARM vector
instructions. This thesis presents the first detailed comparison
of the use of ARM compiler intrinsic functions with automatic
compiler vectorization across four generations of ARM processors.
Comparisons are also made with x86 based platforms and the use of
equivalent Intel vector instructions.
Implementation of the OpenMP programming model on the Keystone II
system presents both challenges and opportunities. Challenges arise
because the OpenMP model was originally developed for a homogeneous
programming environment with a common instruction set
architecture, and in 2012 work had only just begun to consider
how OpenMP might support accelerators. Opportunities arise because
shared memory is accessible to all processing elements on the
LPSoC, offering performance advantages over what is typically
available with attached accelerators. This thesis presents an analysis of a
prototype version of OpenMP implemented as a bare-metal runtime
on the DSP of a Keystone I system. An implementation for the
Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL
runtime library operations is presented and evaluated.
Exploitation of some of the underlying hardware features of the
Keystone II is also discussed.
Simultaneous use of the ARM and DSP cores across multiple
Keystone II boards is fundamental to the creation of commercially
viable HPC offerings based on Keystone technology. The nCore
BrownDwarf and HPE Moonshot are two such systems.
This thesis presents a proof-of-concept implementation of matrix
multiplication (GEMM) for the BrownDwarf system. The BrownDwarf
utilizes both Keystone II and Keystone I SoCs through a
point-to-point interconnect called Hyperlink. Details of how a
novel message passing communication framework across Hyperlink
was implemented to support this complex environment are
provided.
An energy model that can be used to predict energy usage as a
function of what fraction of a particular computation is
performed on each of the available compute devices offers the
opportunity for making runtime decisions on how best to minimize
energy usage. This thesis presents a basic energy usage model
that considers rates of executions on each device and their
active and idle power usages. Using this model, it is shown that
only under certain conditions does there exist an energy-optimal
work partition that uses multiple compute devices. To validate
the model a high resolution energy measurement environment is
developed and used to gather energy measurements for a matrix
multiplication benchmark running on a variety of systems. Results
presented support the model.
Drawing on the four issues noted above and other developments
that have occurred since the Keystone II system was first
announced, the thesis concludes by making comments regarding the
future of LPSoCs as building blocks for HPC systems.
Efficient implementation of resource-constrained cyber-physical systems using multi-core parallelism
The quest for higher performance of applications and systems has become more challenging in recent years. Especially in the cyber-physical and mobile domains, performance requirements have increased significantly. Applications previously found in the high-performance domain are emerging in the resource-constrained domain. Modern heterogeneous high-performance MPSoCs provide a solid foundation to satisfy this demand. Such systems combine general-purpose processors with specialized accelerators ranging from GPUs to machine learning chips. On the other side of the performance spectrum, the demand for small, energy-efficient systems driven by modern IoT applications has grown vastly. Developing efficient software for such resource-constrained multi-core systems is an error-prone, time-consuming and challenging task. This thesis provides, with PA4RES, a holistic semi-automatic approach to parallelizing and implementing applications efficiently for such platforms. Our solution supports the developer in finding good trade-offs to meet the requirements imposed by modern applications and systems. With PICO, we propose a comprehensive approach to expressing parallelism in sequential applications. PICO detects data dependencies and implements the required synchronization automatically. Using a genetic algorithm, PICO optimizes the data synchronization. The evolutionary algorithm considers channel capacity, memory mapping, channel merging and the flexibility offered by the channel implementation with respect to execution time, energy consumption and memory footprint. PICO's communication optimization phase was able to achieve a speedup of almost 2, or an energy improvement of 30%, for certain benchmarks.
The PAMONO sensor approach enables fast detection of biological viruses using optical methods. With sophisticated virus detection software, real-time virus detection running on stationary computers was achieved.
Within this thesis, we were able to derive a soft real-time capable virus detection running on a high-performance embedded system commonly found in today's smartphones. This was accomplished with a smart design space exploration (DSE) algorithm which optimizes for execution time, energy consumption and detection quality. Compared to a baseline implementation, our solution achieved a speedup of 4.1 and 87% energy savings and satisfied the soft real-time requirements. Accepting a degradation of the detection quality, which is still usable in a medical context, led to a speedup of 11.1. This work provides the fundamentals for a truly mobile real-time virus detection solution. The growing demand for processing power can no longer be satisfied by well-known approaches such as higher clock frequencies. The resulting so-called performance walls pose a serious challenge to the growing performance demand. Approximate computing is a promising approach to overcome, or at least shift, the performance walls by accepting a degradation in output quality in exchange for improvements in other objectives. Especially for a safe integration of approximation into existing applications, or during the development of new approximation techniques, a method to assess the impact on the output quality is essential.
With QCAPES, we provide a multi-metric assessment framework to analyze the impact of approximation.
Furthermore, QCAPES provides useful insights into the impact of approximation on execution time and energy consumption. With ApproxPICO, we propose an extension to PICO to consider approximate computing during the parallelization of sequential applications.