Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
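As a concrete illustration of the kind of HLS transformation the abstract describes (not code from the paper itself), the sketch below shows a 1D three-point stencil rewritten with a shift-register buffer, a common transformation that increases on-chip data reuse so the loop can pipeline with an initiation interval of 1. The pragma syntax follows Vitis/Vivado HLS conventions; software compilers simply ignore it.

```cpp
#include <cassert>

// Hypothetical example: a 1D stencil after a typical HLS buffering
// transformation. The shift register replaces repeated external-array
// reads with on-chip register accesses, enabling full loop pipelining.
constexpr int N = 16;

void stencil(const float in[N], float out[N]) {
    float window[3] = {0.f, 0.f, 0.f}; // on-chip shift register
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1 // honored by HLS tools; ignored in software builds
        window[0] = window[1];
        window[1] = window[2];
        window[2] = in[i];
        // Emit a valid 3-point average once the window has filled.
        out[i] = (i >= 2) ? (window[0] + window[1] + window[2]) / 3.f : 0.f;
    }
}
```

Each loop iteration now reads exactly one new input element, which resolves the interface contention that three reads per iteration would otherwise cause on a single memory port.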
OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture
There is interest in exploring hybrid OpenSHMEM + X programming models to
extend the applicability of the OpenSHMEM interface to more hardware
architectures. We present a hybrid OpenCL + OpenSHMEM programming model for
device-level programming for architectures like the Adapteva Epiphany many-core
RISC array processor. The Epiphany architecture comprises a 2D array of
low-power RISC cores with minimal uncore functionality connected by a 2D mesh
Network-on-Chip (NoC). The Epiphany architecture offers high computational
energy efficiency for integer and floating point calculations as well as
parallel scalability. The Epiphany-III is available as a coprocessor in
platforms that also utilize an ARM CPU host. OpenCL provides good functionality
for supporting a co-design programming model in which the host CPU offloads
parallel work to a coprocessor. However, the OpenCL memory model is
inconsistent with the Epiphany memory architecture and lacks support for
inter-core communication. We propose a hybrid programming model in which
OpenSHMEM replaces the non-standard OpenCL extensions previously introduced to
achieve high performance on the Epiphany architecture. We demonstrate the
proposed programming model with a matrix-matrix multiplication based on
Cannon's algorithm, showing that the hybrid model addresses the deficiencies of
using OpenCL alone and achieves good benchmark performance.
Comment: 12 pages, 5 figures, OpenSHMEM 2016: Third Workshop on OpenSHMEM and
Related Technologies
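To make the communication pattern concrete, here is an illustrative serial simulation of Cannon's algorithm (not the paper's OpenSHMEM code) on a logical P x P core grid holding one matrix element per core; the cyclic shifts model the neighbor-to-neighbor NoC transfers that OpenSHMEM put/get operations would perform on the Epiphany mesh.

```cpp
#include <array>
#include <cassert>

// Illustrative sketch: Cannon's algorithm on a P x P logical grid with one
// element per "core". Shifts stand in for NoC neighbor communication.
constexpr int P = 3;
using Mat = std::array<std::array<double, P>, P>;

Mat cannon(Mat a, Mat b) {
    Mat c{};
    // Initial alignment: skew row i of A left by i, column j of B up by j.
    Mat a2{}, b2{};
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j) {
            a2[i][j] = a[i][(j + i) % P];
            b2[i][j] = b[(i + j) % P][j];
        }
    a = a2; b = b2;
    // P steps: multiply co-resident blocks, then shift A left and B up by one.
    for (int step = 0; step < P; ++step) {
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < P; ++j)
                c[i][j] += a[i][j] * b[i][j];
        Mat as{}, bs{};
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < P; ++j) {
                as[i][j] = a[i][(j + 1) % P]; // shift row blocks of A left
                bs[i][j] = b[(i + 1) % P][j]; // shift column blocks of B up
            }
        a = as; b = bs;
    }
    return c;
}
```

After the initial skew, every core multiplies the A and B blocks it currently holds at each step, so each block travels only one hop per step, which matches the 2D mesh topology of the Epiphany NoC.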
Neuro-memristive Circuits for Edge Computing: A Review
The volume, veracity, variability, and velocity of data produced from the
ever-increasing network of sensors connected to the Internet pose challenges
power management, scalability, and sustainability of cloud computing
infrastructure. Increasing the data processing capability of edge computing
devices at lower power requirements can reduce several overheads for cloud
computing solutions. This paper provides a review of neuromorphic
CMOS-memristive architectures that can be integrated into edge computing
devices. We discuss why neuromorphic architectures are useful for edge
devices and present the advantages, drawbacks, and open problems in the field
of neuro-memristive circuits for edge computing.
Event-based Backpropagation for Analog Neuromorphic Hardware
Neuromorphic computing aims to incorporate lessons from studying biological
nervous systems in the design of computer architectures. While existing
approaches have successfully implemented aspects of those computational
principles, such as sparse spike-based computation, event-based scalable
learning has remained an elusive goal in large-scale systems. However, only
then can the potential energy-efficiency advantages of neuromorphic systems
relative to other hardware architectures be realized during learning. We
present our progress implementing the EventProp algorithm using the example of
the BrainScaleS-2 analog neuromorphic hardware. Previous gradient-based
approaches to learning used "surrogate gradients" and dense sampling of
observables or were limited by assumptions on the underlying dynamics and loss
functions. In contrast, our approach only needs spike time observations from
the system while being able to incorporate other system observables, such as
membrane voltage measurements, in a principled way. This leads to a
one-order-of-magnitude improvement in the information efficiency of the
gradient estimate, which would directly translate to corresponding energy
efficiency improvements in an optimized hardware implementation. We present the
theoretical framework for estimating gradients and results verifying the
correctness of the estimation, as well as results on a low-dimensional
classification task using the BrainScaleS-2 system. Building on this work has
the potential to enable scalable gradient estimation in large-scale
neuromorphic hardware as a continuous measurement of the system state would be
prohibitive and energy-inefficient in such instances. It also suggests the
feasibility of a full on-device implementation of the algorithm that would
enable scalable, energy-efficient, event-based learning in large-scale analog
neuromorphic hardware.