Learning-based Application-Agnostic 3D NoC Design for Heterogeneous Manycore Systems
The rising use of deep learning and other big-data algorithms has led to an
increasing demand for hardware platforms that are computationally powerful, yet
energy-efficient. Due to the amount of data parallelism in these algorithms,
high-performance 3D manycore platforms that incorporate both CPUs and GPUs
present a promising direction. However, as systems use heterogeneity (e.g., a
combination of CPUs, GPUs, and accelerators) to improve performance and
efficiency, it becomes more pertinent to address the distinct and likely
conflicting communication requirements (e.g., CPU memory access latency or GPU
network throughput) that arise from such heterogeneity. Unfortunately, it is
difficult to quickly explore the hardware design space and choose appropriate
tradeoffs between these heterogeneous requirements. To address these
challenges, we propose the design of a 3D Network-on-Chip (NoC) for
heterogeneous manycore platforms that considers the appropriate design
objectives for a 3D heterogeneous system and explores various tradeoffs using
an efficient ML-based multi-objective optimization technique. The proposed
design space exploration considers the various requirements of its
heterogeneous components and generates a set of 3D NoC architectures that
efficiently trade off these design objectives. Our findings show that by
jointly considering these requirements (latency, throughput, temperature, and
energy), we can achieve 9.6% better Energy-Delay Product on average at nearly
iso-temperature conditions when compared to a thermally-optimized design for 3D
heterogeneous NoCs. More importantly, our results suggest that our 3D NoCs
optimized for a few applications can be generalized for unknown applications as
well. Our results show that these generalized 3D NoCs only incur a 1.8%
(36-tile system) and 1.1% (64-tile system) average performance loss compared to
application-specific NoCs.
Comment: Published in IEEE Transactions on Computers
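As a rough illustration of the multi-objective exploration this abstract
describes, the Python sketch below filters randomly sampled candidate designs
down to a Pareto front over four objectives. All names and objective functions
are hypothetical placeholders, and the learned model that guides sampling in
the actual work is omitted.

    import random

    def evaluate(candidate):
        # Placeholder objectives: (latency, throughput deficit, temperature,
        # energy). A real flow would invoke simulation and thermal models here.
        random.seed(candidate)
        return tuple(random.random() for _ in range(4))

    def dominates(a, b):
        # a dominates b if it is no worse in every objective, better in one.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    def pareto_front(candidates):
        scored = [(c, evaluate(c)) for c in candidates]
        return [c for c, s in scored
                if not any(dominates(t, s) for _, t in scored if t != s)]

    # Explore a toy space of 3D NoC configurations (encoded as integers here).
    front = pareto_front(range(200))
    print(f"{len(front)} Pareto-optimal designs out of 200 sampled")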
Optimizing Routerless Network-on-Chip Designs: An Innovative Learning-Based Framework
Machine learning applied to architecture design presents a promising
opportunity with broad applications. Recent deep reinforcement learning (DRL)
techniques, in particular, enable efficient exploration in vast design spaces
where conventional design strategies may be inadequate. This paper proposes a
novel deep reinforcement learning framework, taking routerless networks-on-chip
(NoCs) as an evaluation case study. The new framework resolves the problems of
prior design approaches, which are either unreliable due to random searches or
inflexible due to severe design-space restrictions. The framework learns
(near-)optimal loop placement for routerless NoCs with various design
constraints. A deep neural network is developed using parallel threads that
efficiently explore the immense routerless NoC design space with Monte Carlo
tree search. Experimental results show that, compared with conventional mesh,
the proposed DRL routerless design achieves a
3.25x increase in throughput, 1.6x reduction in packet latency, and 5x
reduction in power. Compared with the state-of-the-art routerless NoC, DRL
achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and
1.14x reduction in average hop count, albeit with a slight power overhead.
Comment: 13 pages, 15 figures
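The loop-placement problem lends itself to a toy sketch: the Python below runs
plain Monte Carlo rollouts that place rectangular loops on a small grid under
an assumed per-node overlap cap. It omits the deep neural network and the tree
statistics of the full DRL framework; grid size, cap, and reward are
illustrative assumptions.

    import itertools, random

    K, CAP = 4, 3  # grid size and max loops crossing any node (assumed values)

    def loop_nodes(x0, y0, x1, y1):
        # Nodes on the perimeter of an axis-aligned rectangular loop.
        return {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)
                if x in (x0, x1) or y in (y0, y1)}

    ALL_LOOPS = [(x0, y0, x1, y1)
                 for x0, y0, x1, y1 in itertools.product(range(K), repeat=4)
                 if x0 < x1 and y0 < y1]

    def rollout():
        overlap, placed, covered = {}, [], set()
        for loop in random.sample(ALL_LOOPS, len(ALL_LOOPS)):
            nodes = loop_nodes(*loop)
            if all(overlap.get(n, 0) < CAP for n in nodes):
                for n in nodes:
                    overlap[n] = overlap.get(n, 0) + 1
                placed.append(loop)
                covered |= nodes
        return len(covered), placed

    covered, placed = max(rollout() for _ in range(500))  # keep the best rollout
    print(f"best rollout covers {covered} of {K*K} nodes with {len(placed)} loops")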
Exploiting Errors for Efficiency: A Survey from Circuits to Algorithms
When a computational task tolerates a relaxation of its specification or when
an algorithm tolerates the effects of noise in its execution, hardware,
programming languages, and system software can trade deviations from correct
behavior for lower resource usage. We present, for the first time, a synthesis
of research results on computing systems that only make as many errors as their
users can tolerate, from across the disciplines of computer aided design of
circuits, digital system design, computer architecture, programming languages,
operating systems, and information theory.
Rather than over-provisioning resources at each layer to avoid errors, it can
be more efficient to exploit the masking of errors at one layer, which prevents
them from propagating to higher layers. We survey tradeoffs for
individual layers of computing systems from the circuit level to the operating
system level and illustrate the potential benefits of end-to-end approaches
using two illustrative examples. To tie together the survey, we present a
consistent formalization of terminology, across the layers, which does not
significantly deviate from the terminology traditionally used by research
communities in their layer of focus.
Comment: 35 pages
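One concrete instance of such a tradeoff is loop perforation, where skipping
iterations saves work at the cost of a bounded output error. The Python sketch
below is illustrative only; the sampling pattern and data are invented.

    def perforated_mean(xs, skip=2):
        # Sample every `skip`-th element instead of processing all of them.
        sample = xs[::skip]
        return sum(sample) / len(sample)

    data = [float(i % 100) for i in range(1_000_000)]
    exact = sum(data) / len(data)
    approx = perforated_mean(data, skip=4)  # roughly 4x less work
    print(f"exact={exact:.3f} approx={approx:.3f} "
          f"relative error={abs(exact - approx) / exact:.2%}")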
FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge
While embedded FPGAs are attractive platforms for DNN acceleration on edge
devices due to their low latency and high energy efficiency, the scarce
resources of edge-scale FPGA devices also make DNN deployment challenging. In
this paper, we propose a simultaneous FPGA/DNN co-design
methodology with both bottom-up and top-down approaches: a bottom-up
hardware-oriented DNN model search for high accuracy, and a top-down FPGA
accelerator design considering DNN-specific characteristics. We also build an
automatic co-design flow, including an Auto-DNN engine to perform
hardware-oriented DNN model search, as well as an Auto-HLS engine to generate
synthesizable C code of the FPGA accelerator for explored DNNs. We demonstrate
our co-design approach on an object detection task using PYNQ-Z1 FPGA. Results
show that our proposed DNN model and accelerator outperform the
state-of-the-art FPGA designs in all aspects including Intersection-over-Union
(IoU) (6.2% higher), frames per second (FPS) (2.48X higher), power consumption
(40% lower), and energy efficiency (2.5X higher). Compared to GPU-based
solutions, our designs deliver similar accuracy but consume far less energy.
Comment: Accepted by Design Automation Conference (DAC'2019)
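The co-design loop can be sketched as a joint search in which candidate DNN
configurations are scored by an accuracy proxy while a resource model prunes
those exceeding the FPGA's DSP budget. The models and the candidate encoding
below are invented stand-ins, not Auto-DNN or Auto-HLS.

    CANDIDATES = [  # (depth, width multiplier) of a hypothetical backbone
        (d, w) for d in (8, 12, 16, 20) for w in (0.5, 0.75, 1.0)
    ]

    def est_accuracy(depth, width):   # proxy: larger models score higher
        return 0.5 + 0.02 * depth * width

    def est_dsp_usage(depth, width):  # proxy resource model for the accelerator
        return int(40 * depth * width ** 2)

    DSP_BUDGET = 220  # DSP slices on the PYNQ-Z1's Zynq-7020

    feasible = [(est_accuracy(d, w), d, w) for d, w in CANDIDATES
                if est_dsp_usage(d, w) <= DSP_BUDGET]
    acc, d, w = max(feasible)  # most accurate design that fits the device
    print(f"picked depth={d}, width={w} with estimated accuracy {acc:.2f}")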
Hardware-Aware Machine Learning: Modeling and Optimization
Recent breakthroughs in Deep Learning (DL) applications have made DL models a
key component in almost every modern computing system. The increased popularity
of DL applications deployed on a wide spectrum of platforms has resulted in a
plethora of design challenges related to the constraints introduced by the
hardware itself. What is the latency or energy cost for an inference made by a
Deep Neural Network (DNN)? Is it possible to predict this latency or energy
consumption before a model is trained? If yes, how can machine learners take
advantage of these models to design the hardware-optimal DNN for deployment?
From lengthening battery life of mobile devices to reducing the runtime
requirements of DL models executing in the cloud, the answers to these
questions have drawn significant attention.
One cannot optimize what isn't properly modeled. Therefore, it is important
to understand the hardware efficiency of DL models during inference serving,
before even training the model. This key observation has motivated
the use of predictive models to capture the hardware performance or energy
efficiency of DL applications. Furthermore, DL practitioners are challenged
with the task of designing the DNN model, i.e., of tuning the hyper-parameters
of the DNN architecture, while optimizing for both accuracy of the DL model and
its hardware efficiency. Therefore, state-of-the-art methodologies have
proposed hardware-aware hyper-parameter optimization techniques. In this paper,
we provide a comprehensive assessment of state-of-the-art work and selected
results on the hardware-aware modeling and optimization for DL applications. We
also highlight several open questions that are poised to give rise to novel
hardware-aware designs in the next few years, as DL applications continue to
significantly impact associated hardware systems and platforms.
Comment: ICCAD'18 Invited Paper
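The "predict before training" idea can be sketched with a least-squares
latency model fitted to a handful of measured configurations; the measurements
and the 15 ms budget in the Python below are fabricated for illustration.

    import numpy as np

    # (MACs in millions, parameters in millions) -> measured latency in ms
    X = np.array([[50, 2.0], [150, 5.0], [300, 11.0], [600, 25.0]])
    y = np.array([4.1, 9.8, 18.5, 37.0])

    # Least-squares linear predictor: latency ~ a*MACs + b*params + c
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predicted_latency(macs, params):
        return coef @ np.array([macs, params, 1.0])

    # Reject candidate architectures predicted to miss the latency budget
    # before spending any training time on them.
    candidates = [(100, 3.0), (250, 9.0), (500, 20.0)]
    ok = [c for c in candidates if predicted_latency(*c) <= 15.0]
    print("within budget:", ok)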
Scalability of broadcast performance in wireless network-on-chip
Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect
the cores of a chip multiprocessor. However, conventional NoCs may not suffice
to fulfill the on-chip communication requirements of processors with hundreds
or thousands of cores. The main reason is that the performance of such networks
drops as the number of cores grows, especially in the presence of multicast and
broadcast traffic. This not only limits the scalability of current
multiprocessor architectures, but also sets a performance wall that prevents
the development of architectures that generate moderate-to-high levels of
multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores
share a single broadband channel is presented. Such a design is conceived to
provide low latency and ordered delivery for multicast/broadcast traffic, in an
attempt to complement a wireline NoC that will transport the rest of the
communication flows. To assess the feasibility of this approach, the network
performance of WNoC is analyzed as a function of the system size and the
channel capacity, and then compared to that of wireline NoCs with embedded
multicast support. Based on this evaluation, preliminary results on the
potential performance of the proposed hybrid scheme are provided, together with
guidelines for the design of MAC protocols for WNoC.
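A back-of-the-envelope model makes the scaling contrast concrete: a broadcast
on a single shared wireless channel costs roughly one serialization delay
regardless of core count, while a mesh broadcast must at least cross the
network diameter. All constants in the Python sketch below are illustrative
assumptions, not measurements from the paper.

    def wireless_broadcast_ns(packet_bits, channel_gbps):
        # One transmission reaches every core: latency = serialization time.
        return packet_bits / channel_gbps  # bits / (Gbit/s) = ns

    def mesh_broadcast_ns(n_cores, hop_ns=1.0):
        # A broadcast in a k x k mesh must at least cover the diameter.
        k = int(n_cores ** 0.5)
        return 2 * (k - 1) * hop_ns

    for n in (64, 256, 1024):
        print(f"{n} cores: wireless ~{wireless_broadcast_ns(512, 10):.0f} ns "
              f"(size-independent), mesh >= {mesh_broadcast_ns(n):.0f} ns")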
Symmetry in Software Synthesis
With the surge of multi- and manycores, much research has focused on
algorithms for mapping and scheduling on these complex platforms. Large classes
of these algorithms face scalability problems. This is why diverse methods are
commonly used for reducing the search space. While most such approaches
leverage the inherent symmetry of architectures and applications, they do it in
a problem-specific and intuitive way. However, intuitive approaches become
impractical with growing hardware complexity, like Network-on-Chip interconnect
or heterogeneous cores. In this paper, we present a formal framework that can
determine the inherent symmetry of architectures and applications
algorithmically and leverage these for problems in software synthesis. Our
approach is based on the mathematical theory of groups and a generalization
called inverse semigroups. We evaluate our approach in two state-of-the-art
mapping frameworks. Even for today's platforms with a handful of cores and for
moderate-size benchmarks, our approach consistently reduces the overall
execution time of the algorithms, accelerating them by a factor of up to 10 in
our experiments, or improves the quality of the results.
Comment: 31 pages, 18 figures
Machine Learning and Manycore Systems Design: A Serendipitous Symbiosis
Tight collaboration between experts of machine learning and manycore system
design is necessary to create a data-driven manycore design framework that
integrates both learning and expert knowledge. Such a framework will be
necessary to address the rising complexity of designing large-scale manycore
systems and machine learning techniques.
Comment: To appear in a future publication of IEEE Computer
White Paper on Critical and Massive Machine Type Communication Towards 6G
Society as a whole, and many vertical sectors in particular, is becoming
increasingly digitalized. Machine Type Communication (MTC), encompassing its
massive and critical aspects, and ubiquitous wireless connectivity are among the
main enablers of such digitalization at large. The recently introduced 5G New
Radio is natively designed to support both aspects of MTC to promote the
digital transformation of the society. However, it is evident that some of the
more demanding requirements cannot be fully supported by 5G networks.
Alongside, further development of the society towards 2030 will give rise to
new and more stringent requirements on wireless connectivity in general, and
MTC in particular. Driven by the societal trends towards 2030, the next
generation (6G) will be an agile and efficient convergent network serving a set
of diverse service classes and a wide range of key performance indicators
(KPI). This white paper explores the main drivers and requirements of an
MTC-optimized 6G network, and discusses the following six key research
questions:
- Will the main KPIs of 5G continue to be the dominant KPIs in 6G, or will
new key metrics emerge?
- How to deliver different E2E service mandates with different KPI
requirements, considering joint optimization from the physical layer up to the
application layer?
- What are the key enablers towards designing ultra-low power receivers and
highly efficient sleep modes?
- How to tackle a disruptive rather than incremental joint design of a
massively scalable waveform and medium access policy for global MTC
connectivity?
- How to support new service classes characterizing mission-critical and
dependable MTC in 6G?
- What are the potential enablers of long term, lightweight and flexible
privacy and security schemes considering MTC device requirements?
Comment: White paper by http://www.6GFlagship.co
Distributed optimization in wireless sensor networks: an island-model framework
Wireless Sensor Networks (WSNs) are an emerging technology in several
application domains, ranging from urban surveillance to environmental and
structural monitoring. Computational Intelligence (CI) techniques are
particularly suitable for enhancing these systems. However, when embedding CI
into wireless sensors, severe hardware limitations must be taken into account.
In this paper we investigate the possibility to perform an online, distributed
optimization process within a WSN. Such a system might be used, for example, to
implement advanced network features like distributed modelling, self-optimizing
protocols, and anomaly detection, to name a few. The proposed approach, called
DOWSN (Distributed Optimization for WSN), is an island-model infrastructure in
which each node executes a simple, computationally cheap (both in terms of CPU
and memory) optimization algorithm, and shares promising solutions with its
neighbors. We perform extensive tests of different DOWSN configurations on a
benchmark made up of continuous optimization problems; we analyze the influence
of the network parameters (number of nodes, inter-node communication period and
probability of accepting incoming solutions) on the optimization performance.
Finally, we profile energy and memory consumption of DOWSN to show the
efficient usage of the limited hardware resources available on the sensor
nodes.
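The island model admits a compact sketch: each node runs a cheap (1+1)
hill-climber on a shared benchmark and periodically migrates its best solution
to a ring neighbor. The parameters, topology, and benchmark in the Python below
are illustrative choices, not the DOWSN configuration evaluated in the paper.

    import random

    def sphere(x):  # classic continuous benchmark: minimum 0 at the origin
        return sum(v * v for v in x)

    def mutate(x, step=0.1):
        return [v + random.gauss(0, step) for v in x]

    N_NODES, DIM, ROUNDS, MIGRATE_EVERY = 8, 5, 200, 20
    random.seed(1)
    islands = [[random.uniform(-5, 5) for _ in range(DIM)]
               for _ in range(N_NODES)]

    for t in range(ROUNDS):
        for i in range(N_NODES):
            cand = mutate(islands[i])
            if sphere(cand) < sphere(islands[i]):  # accept improvements only
                islands[i] = cand
        if t % MIGRATE_EVERY == 0:
            for i in range(N_NODES):  # pass good solutions around the ring
                j = (i + 1) % N_NODES
                if sphere(islands[i]) < sphere(islands[j]):
                    islands[j] = list(islands[i])

    print("best fitness:", min(sphere(x) for x in islands))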