Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing
Hierarchical ring networks, which hierarchically connect multiple levels of
rings, have been proposed in the past to improve the scalability of ring
interconnects, but past hierarchical ring designs sacrifice some of the key
benefits of rings by introducing more complex in-ring buffering and buffered
flow control. Our goal in this paper is to design a new hierarchical ring
interconnect that can maintain most of the simplicity of traditional ring
designs (no in-ring buffering or buffered flow control) while achieving scalability as high as that of more complex buffered hierarchical ring designs. Our design,
called HiRD (Hierarchical Rings with Deflection), includes features that allow
us to mostly maintain the simplicity of traditional simple ring topologies
while providing higher energy efficiency and scalability. First, HiRD does not
have any buffering or buffered flow control within individual rings, and
requires only a small amount of buffering between the ring hierarchy levels.
When inter-ring buffers are full, our design simply deflects flits so that they
circle the ring and try again, which eliminates the need for in-ring buffering.
Second, we introduce two simple mechanisms that provide an end-to-end delivery
guarantee within the entire network without impacting the critical path or
latency of the vast majority of network traffic. HiRD attains equal or better
performance at better energy efficiency than multiple versions of both a
previous hierarchical ring design and a traditional single ring design. We also
analyze our design's characteristics and injection and delivery guarantees. We
conclude that HiRD can be a compelling design point that allows higher energy
efficiency and scalability while retaining the simplicity and appeal of
conventional ring-based designs.
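To make the deflection mechanism concrete, here is a minimal Python sketch of an inter-ring transfer point; the buffer depth, class names, and flit representation are illustrative assumptions, not details from the paper:

```python
from collections import deque

class RingTransferPoint:
    """Minimal sketch of HiRD-style inter-ring transfer with deflection.

    A flit moving between hierarchy levels is enqueued in a small
    transfer buffer; if the buffer is full, the flit is deflected and
    keeps circling its current ring to retry on the next pass.
    """

    def __init__(self, buffer_depth=4):  # depth is illustrative
        self.transfer_buffer = deque(maxlen=buffer_depth)

    def try_transfer(self, flit):
        """Return True if the flit entered the transfer buffer,
        False if it was deflected back onto its ring."""
        if len(self.transfer_buffer) < self.transfer_buffer.maxlen:
            self.transfer_buffer.append(flit)
            return True
        return False  # deflected: flit circles the ring and retries

# Usage: with a depth-2 buffer, the third flit is deflected.
tp = RingTransferPoint(buffer_depth=2)
for i in range(3):
    accepted = tp.try_transfer(f"flit-{i}")
    print(f"flit-{i}:", "buffered" if accepted else "deflected")
```

Because a deflected flit simply keeps circling its ring, no in-ring buffering or buffered flow control is needed at the transfer point.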
Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Deep neural networks (DNNs) are currently widely used for many artificial
intelligence (AI) applications including computer vision, speech recognition,
and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Accordingly, techniques
that enable efficient processing of DNNs to improve energy efficiency and
throughput without sacrificing application accuracy or increasing hardware cost
are critical to the wide deployment of DNNs in AI systems.
This article aims to provide a comprehensive tutorial and survey about the
recent advances towards the goal of enabling efficient processing of DNNs.
Specifically, it will provide an overview of DNNs, discuss various hardware
platforms and architectures that support DNNs, and highlight key trends in
reducing the computation cost of DNNs either solely via hardware design changes
or via joint hardware design and DNN algorithm changes. It will also summarize
various development resources that enable researchers and practitioners to
quickly get started in this field, and highlight important benchmarking metrics
and design considerations that should be used for evaluating the rapidly
growing number of DNN hardware designs, optionally including algorithmic
co-designs, being proposed in academia and industry.
The reader will take away the following concepts from this article:
understand the key design considerations for DNNs; be able to evaluate
different DNN hardware implementations with benchmarks and comparison metrics;
understand the trade-offs between various hardware architectures and platforms;
be able to evaluate the utility of various DNN design techniques for efficient
processing; and understand recent implementation trends and opportunities.
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
Convolutional Neural Networks have rapidly become the most successful machine
learning algorithm, enabling ubiquitous machine vision and intelligent
decisions on even embedded computing systems. While the underlying arithmetic
is structurally simple, compute and memory requirements are challenging. One of
the promising opportunities is leveraging reduced-precision representations for
inputs, activations and model parameters. The resulting scalability in
performance, power efficiency and storage footprint provides interesting design
compromises in exchange for a small reduction in accuracy. FPGAs are ideal for
exploiting low-precision inference engines leveraging custom precisions to
achieve the required numerical accuracy for a given application. In this
article, we describe the second generation of the FINN framework, an end-to-end
tool which enables design space exploration and automates the creation of fully
customized inference engines on FPGAs. Given a neural network description, the
tool optimizes for given platforms, design targets and a specific precision. We
introduce formalizations of resource cost functions and performance
predictions, and elaborate on the optimization algorithms. Finally, we evaluate
a selection of reduced precision neural networks ranging from CIFAR-10
classifiers to YOLO-based object detection on a range of platforms including
PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.
In-DRAM Bulk Bitwise Execution Engine
Many applications heavily use bitwise operations on large bitvectors as part
of their computation. In existing systems, performing such bulk bitwise
operations requires the processor to transfer a large amount of data on the
memory channel, thereby consuming high latency, memory bandwidth, and energy.
In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk
bitwise operations completely inside main memory. Ambit exploits the internal
organization and analog operation of DRAM-based memory to achieve low cost,
high performance, and low energy. Ambit exposes a new bulk bitwise execution
model to the host processor. Evaluations show that Ambit significantly improves
the performance of several applications that use bulk bitwise operations,
including databases.
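Functionally, Ambit's triple-row activation computes a bitwise three-input majority in the DRAM subarray, from which AND and OR follow by presetting the third row to all zeros or all ones. The Python sketch below reproduces that behavior on plain integers; in real hardware each operand is an entire multi-kilobyte DRAM row:

```python
def majority3(a, b, c):
    """Bitwise 3-input majority over integer bitvectors, the primitive
    that Ambit's triple-row activation computes inside the subarray."""
    return (a & b) | (b & c) | (a & c)

def bulk_and(a, b):
    # AND(A, B) = MAJ(A, B, 0): third row preset to all zeros
    return majority3(a, b, 0)

def bulk_or(a, b, width):
    # OR(A, B) = MAJ(A, B, 1): third row preset to all ones
    ones = (1 << width) - 1
    return majority3(a, b, ones)

# 8-bit demo; in Ambit each "row" is an entire DRAM row.
a, b = 0b1100_1010, 0b1010_0110
assert bulk_and(a, b) == a & b
assert bulk_or(a, b, 8) == a | b
print(f"AND={bulk_and(a, b):08b}  OR={bulk_or(a, b, 8):08b}")
```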
CODA: Enabling Co-location of Computation and Data for Near-Data Processing
Recent studies have demonstrated that near-data processing (NDP) is an
effective technique for improving performance and energy efficiency of
data-intensive workloads. However, leveraging NDP in realistic systems with
multiple memory modules introduces a new challenge. In today's systems, where
no computation occurs in memory modules, the physical address space is
interleaved at a fine granularity among all memory modules to help improve the
utilization of processor-memory interfaces by distributing the memory traffic.
However, this is at odds with efficient use of NDP, which requires careful
placement of data in memory modules such that near-data computations and their
exclusively used data can be localized in individual memory modules, while
distributing shared data among memory modules to reduce hotspots. In order to
address this new challenge, we propose a set of techniques that (1) enable
collections of OS pages to either be fine-grain interleaved among memory
modules (as is done today) or to be placed contiguously on individual memory
modules (as is desirable for NDP private data), and (2) decide whether to
localize or distribute each memory object based on its anticipated access
pattern and steer computations to the memory where the data they access is
located. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote data accesses by 38% over a baseline system that cannot exploit compute-data affinity characteristics.
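A toy version of the localize-versus-distribute decision might look like the following Python sketch; the data structures, names, and threshold are illustrative, not CODA's exact policy:

```python
def placement_policy(obj_accessors, sharing_threshold=1):
    """Toy localize-vs-distribute decision: an object touched by
    computations on a single memory module is placed contiguously
    there; shared objects stay fine-grain interleaved across modules.
    The threshold and representation are illustrative assumptions."""
    modules = set(obj_accessors)
    if len(modules) <= sharing_threshold:
        return ("localize", modules.pop())
    return ("interleave", None)

# object -> memory modules whose near-data computations access it
objects = {
    "private_buf": [2, 2, 2],      # only module 2 touches it
    "shared_table": [0, 1, 2, 3],  # hot data shared by all modules
}
for name, accessors in objects.items():
    decision, module = placement_policy(accessors)
    print(name, "->", decision,
          "" if module is None else f"on module {module}")
```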
Ring-Mesh: A Scalable and High-Performance Approach for Manycore Accelerators
A growing number of works address the design challenges of fast, scalable solutions for new types of applications. Recently, many of these solutions have aimed at improving processing element capabilities to speed up the execution of machine learning applications. However, only a few works have focused on the interconnection subsystem as a potential source of performance improvement. Wrapping many cores together offers excellent parallelism, but it brings other challenges (e.g., adequate interconnection). Scalable, power-aware interconnects are required to support such a growing number of processing elements, as well as modern applications.
In this paper, we propose a scalable and energy-efficient Network-on-Chip architecture that fuses the advantages of rings and the 2D mesh, without using any bridge router, to provide high performance. A dynamic adaptation mechanism allows the network to better adapt to application requirements. Simulation results show efficient power consumption (up to 141.3% saving when connecting 1024 cores) and 2x throughput growth on average, with better scalability (up to 1024 processing elements) compared to the popular 2D mesh, tested across multiple statistical traffic pattern scenarios.
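As a purely illustrative reading of such a fusion (not necessarily the paper's exact design), the Python sketch below builds a topology in which every router keeps its 2D-mesh links and each row is additionally closed into a ring by a wrap-around link, so no dedicated bridge router connects the two fabrics:

```python
import itertools

def ring_mesh_links(n):
    """Illustrative hybrid topology for an n x n grid: standard mesh
    neighbor links plus a wrap-around link that closes each row into
    a ring. Every router belongs to both fabrics directly, so no
    separate bridge router is required."""
    links = set()
    for x, y in itertools.product(range(n), repeat=2):
        if x + 1 < n:
            links.add(((x, y), (x + 1, y)))  # mesh link, X dimension
        if y + 1 < n:
            links.add(((x, y), (x, y + 1)))  # mesh link, Y dimension
        links.add(((n - 1, y), (0, y)))      # wrap-around closes row ring
    return links

print(len(ring_mesh_links(4)), "links in a 4x4 ring-mesh")
```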
Evolutionary Optimisation of Real-Time Systems and Networks
The design space of networked embedded systems is very large, posing
challenges to the optimisation of such platforms when it comes to support
applications with real-time guarantees. Recent research has shown that a number
of inter-related optimisation problems have a critical influence over the
schedulability of a system, i.e. whether all its application components can
execute and communicate by their respective deadlines. Examples of such optimisation problems include task allocation and scheduling, communication routing and arbitration, memory allocation, and voltage and frequency scaling. In this paper, we advocate the use of evolutionary approaches to address such optimisation problems, aiming to evolve individuals of increased fitness over
multiple generations of potential solutions. We refer to plentiful evidence
that existing real-time schedulability tests can be used effectively to guide
evolutionary optimisation, either by themselves or in combination with other
metrics such as energy dissipation or hardware overheads. We then push that
concept one step further and consider the possibility of using evolutionary
techniques to evolve the schedulability tests themselves, aiming to support the
verification and optimisation of systems which are too complex for
state-of-the-art (manual) derivation of schedulability tests.
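As one concrete example of a schedulability test serving as a fitness oracle, the Python sketch below scores a priority ordering with classic fixed-priority response-time analysis; the task set and the fitness definition are illustrative, not taken from the paper:

```python
import math, random

def response_time(task, higher_prio):
    """Classic fixed-priority response-time analysis: iterate
    R = C + sum over higher-priority tasks of ceil(R/T_j) * C_j."""
    c, d = task
    r = c
    while True:
        r_next = c + sum(math.ceil(r / t_j) * c_j
                         for c_j, t_j in higher_prio)
        if r_next == r or r_next > d:
            return r_next
        r = r_next

def fitness(priority_order, tasks):
    """Fraction of tasks meeting their deadlines under this ordering;
    an evolutionary algorithm maximizes this over generations.
    Implicit-deadline tasks (C, T) with D = T are assumed."""
    met = 0
    for i, idx in enumerate(priority_order):
        c, t = tasks[idx]
        hp = [tasks[j] for j in priority_order[:i]]
        if response_time((c, t), hp) <= t:
            met += 1
    return met / len(tasks)

tasks = [(2, 10), (3, 15), (5, 30)]  # illustrative (C, T) pairs
order = list(range(len(tasks)))
random.shuffle(order)  # one individual in the evolving population
print("fitness of", order, "=", fitness(order, tasks))
```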
Simulation Environment for Link Energy Estimation in Networks-on-Chip with Virtual Channels
Network-on-chip (NoC) is the most promising design paradigm for the
interconnect architecture of a multiprocessor system-on-chip (MPSoC). On the
downside, a NoC has a significant impact on the overall energy consumption of
the system. NoC simulators are highly relevant for design space exploration
even at an early stage. Since links in a NoC consume up to 50% of its energy, realistic modeling of link energy consumption in NoC simulators is important. This work
presents a simulation environment which implements a technique to precisely
estimate the data-dependent link energy consumption in NoCs with virtual
channels for the first time. Our model works at a high level of abstraction,
making it feasible to estimate the energy requirements at an early design
stage. Additionally, it enables the fast evaluation and early exploration of
low-power coding techniques. The presented model is applicable for 2D and 3D
NoCs. A case study of an image processing application shows that the current link model underestimates the link energy consumption by up to a factor of four. In contrast, the technique presented in this paper estimates the energy quantities precisely, with an error below 1% compared to results obtained by precise but computationally expensive bit-level simulation.
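The core of any data-dependent link energy model is counting switching activity between consecutive flits. The Python sketch below counts per-wire self-transitions plus a generic coupling term for adjacent wires switching in opposite directions; the per-event energies are placeholders, and the model in the paper is more detailed:

```python
def link_energy(flits, width, e_self=1.0, e_coupling=2.2):
    """Toy data-dependent link energy: self-transitions per wire plus
    coupling events when adjacent wires toggle to opposite values.
    e_self and e_coupling are placeholder per-event energies; real
    values depend on technology and wire geometry."""
    energy = 0.0
    for prev, cur in zip(flits, flits[1:]):
        toggled = prev ^ cur
        # self-switching: one capacitance charge per flipped wire
        energy += e_self * bin(toggled).count("1")
        # coupling: adjacent wires toggling in opposite directions
        for i in range(width - 1):
            a, b = (cur >> i) & 1, (cur >> (i + 1)) & 1
            ta, tb = (toggled >> i) & 1, (toggled >> (i + 1)) & 1
            if ta and tb and a != b:
                energy += e_coupling
    return energy

flits = [0b00000000, 0b11111111, 0b10101010]  # 8-bit flit stream
print("relative link energy:", link_energy(flits, width=8))
```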
Estimation of Optimized Energy and Latency Constraints for Task Allocation in 3D Network on Chip
In Network on Chip (NoC) based systems, energy consumption is affected by the task scheduling and allocation schemes, which in turn affect the performance of the system. In this paper, we evaluate existing algorithms and introduce a new energy-efficient algorithm for 3D NoC architectures. Efficient dynamic and cluster-based approaches are proposed, along with optimization using a bio-inspired algorithm. The proposed algorithm has been implemented and
evaluated on randomly generated benchmarks and real-life applications such as MMS, Telecom, and VOPD. The algorithm has also been tested with the E3S benchmark and compared with the existing Spiral and Crinkle mapping algorithms, showing better reduction in communication energy consumption and improved system performance. Experimental analysis of the proposed algorithm shows that the average reduction in energy consumption is 49%, the reduction in communication cost is 48%, and the reduction in average latency is 34%. The cluster-based approach is mapped onto the NoC using the Dynamic Diagonal Mapping (DDMap), Crinkle, and Spiral algorithms; DDMap provides the best results. Comparing cluster mappings, the DDMap approach reduces average energy by 14% and 9% relative to Crinkle and Spiral, respectively.
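Mapping algorithms like these are typically scored with a hop-based bit-energy model. The Python sketch below uses the classic formulation in which a bit crossing h links traverses h+1 routers; the per-bit energies, mapping, and traffic values are placeholders, not numbers from the paper:

```python
def hops_3d(src, dst):
    """Manhattan hop count between two tiles of a 3D mesh NoC."""
    return sum(abs(a - b) for a, b in zip(src, dst))

def comm_energy(mapping, traffic, e_router=1.0, e_link=0.5):
    """Classic bit-energy model: a bit crossing h links traverses
    h + 1 routers, so E_bit = (h + 1) * e_router + h * e_link.
    Per-bit energies here are placeholders."""
    total = 0.0
    for (t_src, t_dst), volume in traffic.items():
        h = hops_3d(mapping[t_src], mapping[t_dst])
        total += volume * ((h + 1) * e_router + h * e_link)
    return total

# tasks mapped to (x, y, z) tiles of a 2x2x2 NoC; traffic in bits
mapping = {"A": (0, 0, 0), "B": (1, 0, 0), "C": (1, 1, 1)}
traffic = {("A", "B"): 1_000, ("A", "C"): 400, ("B", "C"): 250}
print("communication energy:", comm_energy(mapping, traffic))
```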
Energy and Latency Aware Application Mapping Algorithm & Optimization for Homogeneous 3D Network on Chip
Energy efficiency is one of the most critical issues in the design of a System on Chip. In Network on Chip (NoC) based systems, energy consumption is influenced dramatically by the mapping of Intellectual Property (IP) cores, which affects the performance of the system. In this paper, we evaluate previously proposed algorithms and introduce a new energy-efficient algorithm for the 3D NoC architecture. In addition, a hybrid method has been implemented using a bio-inspired optimization (particle swarm optimization) technique. The
proposed algorithm has been implemented and evaluated on randomly generated
benchmarks and real-life applications such as MMS, Telecom, and VOPD. The
algorithm has also been tested with the E3S benchmark and compared with the existing algorithms (Spiral and Crinkle), showing better reduction in communication energy consumption and improvement in the
performance of the system. Comparing our work with Spiral and Crinkle, experimental results show that the average reduction in communication energy consumption is 19% with Spiral and 17% with Crinkle, while the reduction in communication cost is 24% and 21%, and the reduction in latency is 24% and 22%, respectively. Optimizing our work and the existing methods with the bio-inspired technique and comparing them, the average energy reduction is found to be 18% and 24%.
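For the bio-inspired step, a bare-bones particle swarm optimizer over mappings could look like the sketch below: each particle is a real-valued vector rounded to tile indices, one common way to adapt continuous PSO to the discrete mapping problem. The update rule is standard PSO; all parameter values and the toy cost function are illustrative:

```python
import random

def pso_map(num_tasks, num_tiles, cost, iters=100, swarm=20,
            w=0.7, c1=1.5, c2=1.5):
    """Standard global-best PSO over task-to-tile mappings. `cost`
    scores a decoded mapping (e.g., a communication-energy model);
    lower is better. Hyperparameters are typical textbook values."""
    def decode(x):  # round each coordinate to a valid tile index
        return [min(num_tiles - 1, max(0, round(v))) for v in x]

    pos = [[random.uniform(0, num_tiles - 1) for _ in range(num_tasks)]
           for _ in range(swarm)]
    vel = [[0.0] * num_tasks for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(decode(p)) for p in pos]
    g = min(range(swarm), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]

    for _ in range(iters):
        for i in range(swarm):
            for d in range(num_tasks):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(decode(pos[i]))
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return decode(gbest), gbest_cost

# Toy cost: spread four tasks across four tiles (penalize collisions).
best, best_cost = pso_map(4, 4, cost=lambda m: 4 - len(set(m)))
print("best mapping:", best, "cost:", best_cost)
```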