LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing
LEGaTO is a three-year EU H2020 project that started in December 2017. The project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs, and dataflow engines. The aim is to attain energy savings of one order of magnitude from the edge to the converged cloud/HPC.
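The technical core here is the task-based model: the programmer expresses independent units of work and a runtime schedules them across heterogeneous devices. A minimal sketch of the style using standard OpenMP tasks rather than any LEGaTO-specific toolchain (the function and sizes below are illustrative placeholders, not LEGaTO APIs):

#include <stdio.h>

#define NBLOCKS  4
#define BLOCKLEN 1024

/* Stand-in for real work that a heterogeneous runtime could map to a CPU,
 * GPU, or FPGA worker. */
static void process_block(double *block, int len) {
    for (int i = 0; i < len; i++)
        block[i] = block[i] * 2.0 + 1.0;
}

int main(void) {
    static double data[NBLOCKS][BLOCKLEN];

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < NBLOCKS; b++) {
            /* Each block becomes an independently schedulable task. */
            #pragma omp task firstprivate(b)
            process_block(data[b], BLOCKLEN);
        }
        #pragma omp taskwait   /* join before using the results */
    }

    printf("processed %d blocks\n", NBLOCKS);
    return 0;
}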
Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application
New challenges in Astronomy and Astrophysics (AA) are driving the need for a large number of exceptionally computationally intensive simulations. "Exascale" (and beyond) computational facilities are mandatory to address the size of theoretical problems and of the data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing a profound phase of innovation, in which the primary challenge on the way to "Exascale" is power consumption. The goal of this work is to give insights into the performance and energy footprint of contemporary architectures for a real astrophysical application in an HPC context. We use a state-of-the-art N-body application that we re-engineered and optimized to fully exploit the underlying heterogeneous hardware. We quantitatively evaluate the impact of computation on energy consumption when running on four different platforms: two representative of current HPC systems (Intel-based and equipped with NVIDIA GPUs), a micro-cluster based on ARM MPSoCs, and a "prototype towards Exascale" equipped with ARM MPSoCs tightly coupled with FPGAs. Investigating the behavior of the different devices, we find that the high-end GPUs excel in terms of time-to-solution, while the MPSoC-FPGA systems outperform the GPUs in power consumption. Our experience suggests that FPGAs are very promising for computationally intensive applications, as their performance is improving to meet the requirements of scientific codes. This work can serve as a reference for the development of future platforms for astrophysics applications that require computationally intensive calculations.
15 pages, 4 figures, 3 tables; preprint (v2) submitted to MDPI (Special Issue: Energy-Efficient Computing on Parallel Architectures).
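The trade-off reported above (GPUs fastest, MPSoC-FPGA systems most frugal) is usually resolved by comparing energy-to-solution, E = P_avg x T_sol. A back-of-the-envelope sketch in C, with purely hypothetical numbers rather than the paper's measurements:

#include <stdio.h>

int main(void) {
    /* platform              avg power (W)   time-to-solution (s) */
    double gpu_power  = 250.0, gpu_time  = 100.0;
    double fpga_power =  20.0, fpga_time = 900.0;

    double gpu_energy  = gpu_power  * gpu_time;   /* 25,000 J */
    double fpga_energy = fpga_power * fpga_time;  /* 18,000 J */

    printf("GPU : %.0f J\n", gpu_energy);
    printf("FPGA: %.0f J\n", fpga_energy);
    /* The slower platform can still win on energy when its power draw is
     * low enough, which is the effect the paper quantifies. */
    return 0;
}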
Agile SoC Development with Open ESP
ESP is an open-source research platform for heterogeneous SoC design. The
platform combines a modular tile-based architecture with a variety of
application-oriented flows for the design and optimization of accelerators. The
ESP architecture is highly scalable and strikes a balance between regularity
and specialization. The companion methodology raises the level of abstraction
to system-level design and enables an automated flow from software and hardware
development to full-system prototyping on FPGA. For application developers, ESP
offers domain-specific automated solutions to synthesize new accelerators for
their software and to map complex workloads onto the SoC architecture. For
hardware engineers, ESP offers automated solutions to integrate their
accelerator designs into the complete SoC. Conceived as a heterogeneous
integration platform and tested through years of teaching at Columbia
University, ESP supports the open-source hardware community by providing a
flexible platform for agile SoC development.
Invited paper at the 2020 International Conference on Computer-Aided Design (ICCAD), Special Session on Open-Source Tools and Platforms for Agile Development of Specialized Architectures.
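From the application developer's point of view, what such a platform ultimately automates is the handoff between software and a memory-mapped accelerator. A generic sketch of that pattern follows; the device node, ioctl command, and descriptor layout are hypothetical placeholders, not ESP's actual driver interface:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct acc_desc {            /* hypothetical accelerator job descriptor */
    unsigned long src;       /* DMA address of input buffer             */
    unsigned long dst;       /* DMA address of output buffer            */
    unsigned long len;       /* bytes to process                        */
};

#define ACC_RUN _IOW('a', 1, struct acc_desc)   /* hypothetical ioctl */

int main(void) {
    int fd = open("/dev/my_accel.0", O_RDWR);   /* hypothetical node */
    if (fd < 0) { perror("open"); return 1; }

    struct acc_desc d = { .src = 0x80000000UL, .dst = 0x80100000UL, .len = 4096 };
    if (ioctl(fd, ACC_RUN, &d) < 0)             /* block until done  */
        perror("ioctl");

    close(fd);
    return 0;
}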
Quantifying the latency benefits of near-edge and in-network FPGA acceleration
Transmitting data to cloud datacenters in distributed IoT applications introduces significant communication latency, but is often the only feasible solution when source nodes are computationally limited. To address latency concerns, cloudlets, in-network computing, and more capable edge nodes are all being explored as ways of moving processing capability towards the edge of the network. Hardware acceleration using Field Programmable Gate Arrays (FPGAs) is also seeing increased interest due to its reduced computation latency and improved efficiency. This paper evaluates the implications of these offloading approaches using a neural-network-based image classification application as a case study, quantifying both the computation and communication latency resulting from different platform choices. We consider communication latency including the ingestion of packets for processing on the target platform, showing that this varies significantly with the choice of platform. We demonstrate that emerging in-network accelerator approaches offer much improved and predictable performance, as well as better scaling to support multiple data sources.
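The comparison rests on a simple decomposition: end-to-end latency is network transfer plus packet ingestion plus computation, and the platform choice moves each term. A small sketch with hypothetical numbers (not the paper's measurements) showing how in-network placement shifts the total:

#include <stdio.h>

struct platform {
    const char *name;
    double transfer_ms;   /* network path to the platform     */
    double ingest_ms;     /* getting packets into the device  */
    double compute_ms;    /* e.g. NN image classification     */
};

int main(void) {
    struct platform p[] = {
        { "cloud GPU",       40.0, 2.0, 3.0 },
        { "near-edge FPGA",   5.0, 1.0, 6.0 },
        { "in-network FPGA",  1.0, 0.1, 6.0 },
    };
    for (int i = 0; i < 3; i++)
        printf("%-16s total %6.1f ms\n", p[i].name,
               p[i].transfer_ms + p[i].ingest_ms + p[i].compute_ms);
    return 0;
}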
EIE: Efficient Inference Engine on Compressed Deep Neural Network
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps with the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power.
Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression, respectively. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and a GPU, respectively. Compared with DaDianNao, EIE has 2.9x better throughput, 19x better energy efficiency, and 3x better area efficiency.
Published as a conference paper at ISCA 2016.
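The computation EIE accelerates, sparse matrix-vector multiplication with weight sharing, can be sketched in a few lines: weights sit in CSC form as small indices into a shared codebook, and columns whose input activation is zero are skipped outright. The toy data below is illustrative, not EIE's actual 4-bit layout:

#include <stdio.h>
#include <stdint.h>

#define ROWS 4
#define COLS 4

int main(void) {
    /* Shared codebook: a 16-entry table would match EIE's 4-bit indices;
     * 4 entries suffice for this toy example. */
    const float codebook[4] = { -0.5f, 0.25f, 0.75f, 1.0f };

    /* CSC: nonzeros of column j live in [col_ptr[j], col_ptr[j+1]). */
    const int     col_ptr[COLS + 1] = { 0, 2, 2, 3, 5 };
    const int     row_idx[5]        = { 0, 3, 1, 0, 2 };
    const uint8_t w_idx[5]          = { 2, 1, 3, 0, 2 };  /* codebook indices */

    const float act[COLS] = { 1.0f, 0.0f, 2.0f, -1.0f };  /* post-ReLU input */
    float out[ROWS] = { 0 };

    for (int j = 0; j < COLS; j++) {
        if (act[j] == 0.0f) continue;          /* skip zero activations */
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++)
            out[row_idx[k]] += codebook[w_idx[k]] * act[j];
    }
    for (int i = 0; i < ROWS; i++)
        printf("out[%d] = %.2f\n", i, out[i]);
    return 0;
}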