Search CORE

6,417 research outputs found

Learning-based Application-Agnostic 3D NoC Design for Heterogeneous Manycore Systems

Author: Doppa Janardhan Rao
Joardar Biresh Kumar
Kim Ryan Gary
Marculescu Diana
Marculescu Radu
Pande Partha Pratim
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/10/2019
Field of study

The rising use of deep learning and other big-data algorithms has led to an increasing demand for hardware platforms that are computationally powerful, yet energy-efficient. Due to the amount of data parallelism in these algorithms, high-performance 3D manycore platforms that incorporate both CPUs and GPUs present a promising direction. However, as systems use heterogeneity (e.g., a combination of CPUs, GPUs, and accelerators) to improve performance and efficiency, it becomes more pertinent to address the distinct and likely conflicting communication requirements (e.g., CPU memory access latency or GPU network throughput) that arise from such heterogeneity. Unfortunately, it is difficult to quickly explore the hardware design space and choose appropriate tradeoffs between these heterogeneous requirements. To address these challenges, we propose the design of a 3D Network-on-Chip (NoC) for heterogeneous manycore platforms that considers the appropriate design objectives for a 3D heterogeneous system and explores various tradeoffs using an efficient ML-based multi-objective optimization technique. The proposed design space exploration considers the various requirements of its heterogeneous components and generates a set of 3D NoC architectures that efficiently trades off these design objectives. Our findings show that by jointly considering these requirements (latency, throughput, temperature, and energy), we can achieve 9.6% better Energy-Delay Product on average at nearly iso-temperature conditions when compared to a thermally-optimized design for 3D heterogeneous NoCs. More importantly, our results suggest that our 3D NoCs optimized for a few applications can be generalized for unknown applications as well. Our results show that these generalized 3D NoCs only incur a 1.8% (36-tile system) and 1.1% (64-tile system) average performance loss compared to application-specific NoCs.Comment: Published in IEEE Transactions on Computer

arXiv.org e-Print Archive

Optimizing Routerless Network-on-Chip Designs: An Innovative Learning-Based Framework

Author: Chen Lizhong
Lin Ting-Ru
Pedram Massoud
Penney Drew
Publication venue
Publication date: 10/05/2019
Field of study

Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with a Monte Carlo search tree. Experimental results show that, compared with conventional mesh, the proposed deep reinforcement learning (DRL) routerless design achieves a 3.25x increase in throughput, 1.6x reduction in packet latency, and 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and 1.14x reduction in average hop count albeit with slightly more power overhead.Comment: 13 pages, 15 figure

arXiv.org e-Print Archive

Exploiting Errors for Efficiency: A Survey from Circuits to Algorithms

Author: Alaghi Armin
Cacciotti Mattia
Carbin Michael
Daglis Alexandros
Darulova Eva
Dolecek Lara
Falsafi Babak
Gerstlauer Andreas
Gillani Ghayoor
Jerger Natalie Enright
Jevdjic Djordje
Misailovic Sasa
Moreau Thierry
Sampson Adrian
Stanley-Marbell Phillip
Zufferey Damien
Publication venue
Publication date: 16/09/2018
Field of study

When a computational task tolerates a relaxation of its specification or when an algorithm tolerates the effects of noise in its execution, hardware, programming languages, and system software can trade deviations from correct behavior for lower resource usage. We present, for the first time, a synthesis of research results on computing systems that only make as many errors as their users can tolerate, from across the disciplines of computer aided design of circuits, digital system design, computer architecture, programming languages, operating systems, and information theory. Rather than over-provisioning resources at each layer to avoid errors, it can be more efficient to exploit the masking of errors occurring at one layer which can prevent them from propagating to a higher layer. We survey tradeoffs for individual layers of computing systems from the circuit level to the operating system level and illustrate the potential benefits of end-to-end approaches using two illustrative examples. To tie together the survey, we present a consistent formalization of terminology, across the layers, which does not significantly deviate from the terminology traditionally used by research communities in their layer of focus.Comment: 35 page

arXiv.org e-Print Archive

FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge

Author: Chen Deming
Hao Cong
Huang Sitao
Hwu Wen-mei
Li Yuhong
Rupnow Kyle
Xiong Jinjun
Zhang Xiaofan
Publication venue
Publication date: 08/04/2019
Field of study

While embedded FPGAs are attractive platforms for DNN acceleration on edge-devices due to their low latency and high energy efficiency, the scarcity of resources of edge-scale FPGA devices also makes it challenging for DNN deployment. In this paper, we propose a simultaneous FPGA/DNN co-design methodology with both bottom-up and top-down approaches: a bottom-up hardware-oriented DNN model search for high accuracy, and a top-down FPGA accelerator design considering DNN-specific characteristics. We also build an automatic co-design flow, including an Auto-DNN engine to perform hardware-oriented DNN model search, as well as an Auto-HLS engine to generate synthesizable C code of the FPGA accelerator for explored DNNs. We demonstrate our co-design approach on an object detection task using PYNQ-Z1 FPGA. Results show that our proposed DNN model and accelerator outperform the state-of-the-art FPGA designs in all aspects including Intersection-over-Union (IoU) (6.2% higher), frames per second (FPS) (2.48X higher), power consumption (40% lower), and energy efficiency (2.5X higher). Compared to GPU-based solutions, our designs deliver similar accuracy but consume far less energy.Comment: Accepted by Design Automation Conference (DAC'2019

arXiv.org e-Print Archive

Hardware-Aware Machine Learning: Modeling and Optimization

Author: Cai Ermao
Marculescu Diana
Stamoulis Dimitrios
Publication venue
Publication date: 14/09/2018
Field of study

Recent breakthroughs in Deep Learning (DL) applications have made DL models a key component in almost every modern computing system. The increased popularity of DL applications deployed on a wide-spectrum of platforms have resulted in a plethora of design challenges related to the constraints introduced by the hardware itself. What is the latency or energy cost for an inference made by a Deep Neural Network (DNN)? Is it possible to predict this latency or energy consumption before a model is trained? If yes, how can machine learners take advantage of these models to design the hardware-optimal DNN for deployment? From lengthening battery life of mobile devices to reducing the runtime requirements of DL models executing in the cloud, the answers to these questions have drawn significant attention. One cannot optimize what isn't properly modeled. Therefore, it is important to understand the hardware efficiency of DL models during serving for making an inference, before even training the model. This key observation has motivated the use of predictive models to capture the hardware performance or energy efficiency of DL applications. Furthermore, DL practitioners are challenged with the task of designing the DNN model, i.e., of tuning the hyper-parameters of the DNN architecture, while optimizing for both accuracy of the DL model and its hardware efficiency. Therefore, state-of-the-art methodologies have proposed hardware-aware hyper-parameter optimization techniques. In this paper, we provide a comprehensive assessment of state-of-the-art work and selected results on the hardware-aware modeling and optimization for DL applications. We also highlight several open questions that are poised to give rise to novel hardware-aware designs in the next few years, as DL applications continue to significantly impact associated hardware systems and platforms.Comment: ICCAD'18 Invited Pape

arXiv.org e-Print Archive

Scalability of broadcast performance in wireless network-on-chip

Author: Abadal Cavallé Sergi
Alarcón Cot Eduardo José
Cabellos Aparicio Alberto
González Colás Antonio María
Lee Heekwan
Mestres Sugrañes Albert
Nemirovsky Mario
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.Peer ReviewedPostprint (published version

Symmetry in Software Synthesis

Author: Castrillon Jeronimo
Goens Andrés
Siccha Sergio
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 21/04/2017
Field of study

With the surge of multi- and manycores, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. While most such approaches leverage the inherent symmetry of architectures and applications, they do it in a problem-specific and intuitive way. However, intuitive approaches become impractical with growing hardware complexity, like Network-on-Chip interconnect or heterogeneous cores. In this paper, we present a formal framework that can determine the inherent symmetry of architectures and applications algorithmically and leverage these for problems in software synthesis. Our approach is based on the mathematical theory of groups and a generalization called inverse semigroups. We evaluate our approach in two state-of-the-art mapping frameworks. Even for the platforms with a handful of cores of today and moderate-size benchmarks, our approach consistently yields reductions of the overall execution time of algorithms, accelerating them by a factor up to 10 in our experiments, or improving the quality of the results.Comment: 31 pages, 18 figure

arXiv.org e-Print Archive

Machine Learning and Manycore Systems Design: A Serendipitous Symbiosis

Author: Doppa Janardhan Rao
Kim Ryan Gary
Marculescu Diana
Marculescu Radu
Pande Partha Pratim
Publication venue
Publication date: 30/11/2017
Field of study

Tight collaboration between experts of machine learning and manycore system design is necessary to create a data-driven manycore design framework that integrates both learning and expert knowledge. Such a framework will be necessary to address the rising complexity of designing large-scale manycore systems and machine learning techniques.Comment: To appear in a future publication of IEEE Compute

arXiv.org e-Print Archive

White Paper on Critical and Massive Machine Type Communication Towards 6G

The society as a whole, and many vertical sectors in particular, is becoming increasingly digitalized. Machine Type Communication (MTC), encompassing its massive and critical aspects, and ubiquitous wireless connectivity are among the main enablers of such digitization at large. The recently introduced 5G New Radio is natively designed to support both aspects of MTC to promote the digital transformation of the society. However, it is evident that some of the more demanding requirements cannot be fully supported by 5G networks. Alongside, further development of the society towards 2030 will give rise to new and more stringent requirements on wireless connectivity in general, and MTC in particular. Driven by the societal trends towards 2030, the next generation (6G) will be an agile and efficient convergent network serving a set of diverse service classes and a wide range of key performance indicators (KPI). This white paper explores the main drivers and requirements of an MTC-optimized 6G network, and discusses the following six key research questions: - Will the main KPIs of 5G continue to be the dominant KPIs in 6G; or will there emerge new key metrics? - How to deliver different E2E service mandates with different KPI requirements considering joint-optimization at the physical up to the application layer? - What are the key enablers towards designing ultra-low power receivers and highly efficient sleep modes? - How to tackle a disruptive rather than incremental joint design of a massively scalable waveform and medium access policy for global MTC connectivity? - How to support new service classes characterizing mission-critical and dependable MTC in 6G? - What are the potential enablers of long term, lightweight and flexible privacy and security schemes considering MTC device requirements?Comment: White paper by http://www.6GFlagship.co

arXiv.org e-Print Archive

Distributed optimization in wireless sensor networks: an island-model framework

Author: Iacca Giovanni
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 05/10/2018
Field of study

Wireless Sensor Networks (WSNs) is an emerging technology in several application domains, ranging from urban surveillance to environmental and structural monitoring. Computational Intelligence (CI) techniques are particularly suitable for enhancing these systems. However, when embedding CI into wireless sensors, severe hardware limitations must be taken into account. In this paper we investigate the possibility to perform an online, distributed optimization process within a WSN. Such a system might be used, for example, to implement advanced network features like distributed modelling, self-optimizing protocols, and anomaly detection, to name a few. The proposed approach, called DOWSN (Distributed Optimization for WSN) is an island-model infrastructure in which each node executes a simple, computationally cheap (both in terms of CPU and memory) optimization algorithm, and shares promising solutions with its neighbors. We perform extensive tests of different DOWSN configurations on a benchmark made up of continuous optimization problems; we analyze the influence of the network parameters (number of nodes, inter-node communication period and probability of accepting incoming solutions) on the optimization performance. Finally, we profile energy and memory consumption of DOWSN to show the efficient usage of the limited hardware resources available on the sensor nodes

arXiv.org e-Print Archive