112 research outputs found

    DDRNoC: Dual Data-Rate Network-on-Chip

    Networks-on-Chip (NoCs) are becoming increasingly important for the performance of modern multi-core systems-on-chip. In on-chip networks with virtual channel (VC) flow control, the slow control logic (VC and switch allocation) of the NoC routers limits the NoC clock period, while the datapath (switch and link) possesses significant slack. This slack wastes the performance potential of the datapath, limits the saturation throughput of the network and reduces its energy efficiency. The aim of this thesis is to improve NoC performance by eliminating this slack and removing control logic from the router critical path. To this end, the thesis presents a Dual Data-Rate (DDR) network architecture, the DDRNoC, which uses the NoC datapath twice within a clock cycle to forward flits at DDR. This not only exploits the slack present in the datapath but also requires a clock period equal to twice the datapath delay, thus removing the shorter control logic from the critical path. As a result, the DDRNoC achieves higher throughput than single data-rate networks. Moreover, the DDRNoC employs lookahead signalling to reduce end-to-end packet latency. FreewayNoC, an extension of the DDRNoC, supplements it with simplified pipeline-stage bypassing to reduce the zero-load latency of packets in the network. Implementing the DDRNoC and FreewayNoC architectures requires redesigning the switch allocation (SA) mechanism to resolve contention among competing flits by granting up to two flits access to each switch input and output port per clock cycle. It further requires separate paths for the propagation of lookahead control signals. FreewayNoC also requires multiple checks to guarantee conflict-free bypassing of the SA stage. Physical implementation results in a 28nm process technology show that DDRNoC and FreewayNoC have 5% and 15% area overhead, respectively, compared to a simple 3-stage network with VCs. Performance evaluation shows that, for a 16x16 mesh network, FreewayNoC supports 25% higher throughput than the current state-of-the-art NoC, ShortPath. Moreover, FreewayNoC achieves a zero-load latency that scales better than ShortPath's and as well as that of an ideal network with no control overheads. For application-driven traffic, FreewayNoC reduces average packet latency by 18% compared to ShortPath. Alternatively, low-voltage implementations of the DDRNoC and FreewayNoC can be used to conserve power and improve energy efficiency at the cost of higher packet latency.
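    To make the timing argument concrete, here is a minimal back-of-the-envelope sketch. The control and datapath delays are purely illustrative assumptions, not figures from the thesis; the point is only to show why forwarding at DDR lifts throughput once the control logic fits within a period of twice the datapath delay.

```python
# Illustrative sketch only: assumed delays, not measurements from the thesis.
control_delay_ns  = 0.50   # VC + switch allocation (slow control logic)
datapath_delay_ns = 0.30   # switch traversal + link (fast datapath)

# Single data-rate router: the clock must cover the slower stage, so the
# datapath idles for (control - datapath) nanoseconds every cycle.
sdr_period_ns = max(control_delay_ns, datapath_delay_ns)
sdr_flits_per_ns = 1.0 / sdr_period_ns

# DDR router: the datapath is used twice per cycle, so the period equals
# twice the datapath delay; the shorter control logic only has to fit
# within that period and drops off the critical path.
ddr_period_ns = 2.0 * datapath_delay_ns
assert control_delay_ns <= ddr_period_ns, "control logic must fit in one DDR period"
ddr_flits_per_ns = 2.0 / ddr_period_ns     # two flits forwarded per cycle

print(f"SDR: {sdr_period_ns:.2f} ns period, {sdr_flits_per_ns:.2f} flits/ns per port")
print(f"DDR: {ddr_period_ns:.2f} ns period, {ddr_flits_per_ns:.2f} flits/ns per port")
```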

    A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip

    Nilesh Karavadara, Simon Folie, Michael Zolda, Vu Thien Nga Nguyen, Raimund Kirner, 'A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip'. Paper presented at the Int'l Workshop on Performance, Power and Predictability of Many-Core Embedded Systems (3PMCES'14), Dresden, Germany, 24-28 March 2014. Software developers are discovering that practices which have successfully served single-core platforms for decades no longer work for multi-cores. Stream processing is a parallel execution model that is well suited to architectures with multiple computational elements connected by a network. We propose a power-aware streaming execution layer for network-on-chip architectures that addresses the energy constraints of embedded devices. Our proof-of-concept implementation targets the Intel SCC processor, which connects 48 cores via a network-on-chip. We motivate our design decisions and describe the status of our implementation.

    Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45nm SOI

    In this paper, we present a case study of our chip prototype of a 16-node 4x4 mesh NoC fabricated in 45nm SOI CMOS that aims to simultaneously optimize energy, latency and throughput for unicasts, multicasts and broadcasts. We first define and analyze the theoretical limits of a mesh NoC in latency, throughput and energy, then describe how we approach these limits through a combination of microarchitecture and circuit techniques. Our 1.1V 1GHz NoC chip achieves 1-cycle router-and-link latency at each hop and energy-efficient router-level multicast support, delivering 892Gb/s (87.1% of the theoretical bandwidth limit) at 531.4mW for mixed unicast and broadcast traffic. Through this fabrication, we derive insights that help guide our research and that, we believe, will also be useful to the NoC and multicore research community.
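    As a hedged consistency check of the utilization figure quoted above (the abstract states neither the link width nor the exact definition of the theoretical limit, so both are assumptions here), a bisection-bandwidth limit with 128-bit links at 1 GHz reproduces the reported 87.1%:

```python
# Rough check under assumptions NOT stated in the abstract:
# 128-bit links and a bisection-bandwidth definition of the limit.
freq_ghz        = 1.0        # reported clock frequency
link_width_bits = 128        # ASSUMPTION: flit/link width
bisection_links = 2 * 4      # 4x4 mesh: 4 bidirectional links cross the bisection

theoretical_gbps = bisection_links * link_width_bits * freq_ghz   # 1024 Gb/s
achieved_gbps    = 892.0                                          # reported by the paper
print(f"utilization = {achieved_gbps / theoretical_gbps:.1%}")    # ~87.1%
```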

    DVFS using clock scheduling for Multicore Systems-on-Chip and Networks-on-Chip

    A modern System-on-Chip (SoC) contains processor cores, application-specific processing elements, memory and peripherals, all connected by a high-bandwidth, low-latency Network-on-Chip (NoC). The downside of such a high level of integration and connectivity is high power consumption, which in CMOS technology has a dynamic and a static component. To reduce the dynamic component, Dynamic Voltage and Frequency Scaling (DVFS) has been adopted. Although DVFS is very effective chip-wide, the power optimization of complex SoCs calls for a finer-grain application of DVFS: ideally, every main component of an SoC should have its own DVFS controller. An SoC with a DVFS controller per component, with individual DC-DC converters and PLL/DLL circuits, cannot scale to the hundreds of components that are on the research agenda. We present an alternative that permits such scaling. Results close to an optimal DVFS can be achieved by hopping between a few voltage levels and by an innovative application of clock gating that we term clock scheduling. We obtain an effective clock frequency by periodically killing some clock cycles of a master clock, and we can apply voltage scaling for those periodic clock schedules that yield effective clock frequencies of 1/2, 1/3, ... of the master clock. By dithering between a few voltages we obtain results close to an ideal DVFS system, both in simple pipelined circuits and in a complex example, a NoC switch. Again in the context of a NoC, we show how clock scheduling and voltage scaling can be determined automatically by a proportional-integral loop controller that tracks the network load, and we describe its implementation and the circuit-level issues we encountered in detail. For a single switch, results show an advantage of up to 2X over simple frequency scaling without voltage scaling. By providing each NoC switch with our simple DVFS controller, the power saving at the network level can be significantly larger than what a global DVFS controller can achieve. In a realistic scenario represented by network traces generated by video applications (MPEG, PIP, MWD, VoPD), we obtain an average power saving of 33%. To reduce static power, the Power-Gating (PG) technique is used; it consists of switching off the power supply of unused blocks via pMOS headers or nMOS footers in series with those blocks. Even though research has been done in this field, the application of PG to NoCs has not been fully investigated. We show that it is possible to apply PG to the input buffers of a NoC switch, whose leakage power contributes about 40-50% of the total NoC power, so reducing this contribution is worthwhile. We partition the buffers into banks and apply PG only to inactive banks. With our technique, it is possible to save about 40% of the leakage power without any impact on performance.
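    The clock-scheduling idea lends itself to a small sketch. The following is an assumed, simplified model rather than the thesis implementation: a gating pattern that passes a fraction of master-clock pulses, a table of voltages assumed safe for the 1/1, 1/2 and 1/3 schedules, and dithering between the two nearest schedules to approximate an arbitrary target frequency.

```python
# Simplified model (assumptions, not the thesis implementation).
F_MASTER_HZ = 1.0e9                          # assumed 1 GHz master clock
SAFE_VDD = {1.0: 1.0, 0.5: 0.8, 1/3: 0.7}    # assumed safe Vdd per schedule ratio

def clock_schedule(target_ratio, window=12):
    """Gating pattern over one window: True = pulse passes, False = pulse killed."""
    passed, pattern = 0, []
    for i in range(1, window + 1):
        if passed + 1 <= target_ratio * i:   # pass a pulse while staying at/below target
            pattern.append(True); passed += 1
        else:
            pattern.append(False)
    return pattern

def dither(target_ratio):
    """Split time between the two supported schedules nearest the target ratio."""
    ratios = sorted(SAFE_VDD)
    lo = max(r for r in ratios if r <= target_ratio)
    hi = min(r for r in ratios if r >= target_ratio)
    frac_hi = 0.0 if hi == lo else (target_ratio - lo) / (hi - lo)
    return lo, hi, frac_hi

print("1/2-rate gating pattern:", clock_schedule(0.5))
lo, hi, f = dither(0.4)                      # target: 40% of the master clock (400 MHz)
print(f"dither: {1 - f:.0%} of time at ratio {lo:.2f} (Vdd {SAFE_VDD[lo]} V), "
      f"{f:.0%} at ratio {hi:.2f} (Vdd {SAFE_VDD[hi]} V), "
      f"average {(1 - f) * lo + f * hi:.2f} of {F_MASTER_HZ / 1e9:.0f} GHz")
```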

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    The increasing demand for processing a growing number of applications and their data on computing platforms has resulted in a reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms also need to be energy-efficient, reliable and capable of secure computation in the interest of the whole community. This book provides perspectives on these aspects from leading researchers, in terms of state-of-the-art contributions and upcoming trends.

    Control Techniques for Uncore Power Management in Chip Multiprocessor Designs

    In chip-multiprocessor (CMP) designs, as the number of cores increases, the size of the on-chip communication fabric and data storage grows accordingly, exacerbating the chip power challenge. This thesis considers power management for the network-on-chip (NoC) and the last-level cache, which together constitute the uncore in CMP designs. The NoC is regarded as a scalable approach to cope with the increasing demand for on-chip communication bandwidth, and the last-level cache is shared among all cores. The focus of this work is on control techniques for uncore dynamic voltage and frequency scaling. A realistic but not well-studied scenario is investigated: the entire uncore shares a single voltage/frequency domain, as opposed to the separate domains used in most previous works. One appealing advantage is that data packets no longer experience the interfacing overhead of crossing voltage/frequency domains. The classic PI (Proportional-Integral) control method is adopted for its simplicity, flexibility and low implementation overhead. The outcome of this research has three parts. First, the stability of the PI control is analyzed. Second, a model-assisted PI control scheme is proposed and studied; the model assist addresses the problem that no universally good reference point exists for the control. Third, the windup issue of the PI control is investigated. Full architecture simulations on public benchmark suites validate the proposed techniques. The results show a 76% energy reduction with less than 6% performance degradation, compared to running the uncore at a constantly high voltage/frequency.
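    The following is a hedged sketch of the flavor of controller described above: a discrete PI loop that tracks an uncore load metric and selects one shared voltage/frequency level per control epoch. The gains, V/F table, load metric and crude integral clamp are illustrative assumptions, not the thesis's tuned design.

```python
# Hedged sketch: gains, V/F table, load metric and the anti-windup clamp
# are illustrative assumptions, not the thesis's tuned values.
VF_LEVELS = [(0.7, 1.0), (0.8, 1.5), (0.9, 2.0), (1.0, 2.5)]   # assumed (Vdd, GHz) pairs

class UncorePIController:
    def __init__(self, kp=4.0, ki=1.0, target_load=0.5):
        self.kp, self.ki = kp, ki
        self.target = target_load      # reference point for uncore (NoC + LLC) load
        self.integral = 0.0

    def step(self, measured_load):
        error = measured_load - self.target
        # clamp the integral term; the windup issue studied in the thesis
        # is handled here only in this crude way
        self.integral = max(-2.0, min(2.0, self.integral + error))
        u = self.kp * error + self.ki * self.integral
        # map the continuous control output onto a discrete V/F level index
        idx = int(round(u + len(VF_LEVELS) / 2))
        idx = max(0, min(len(VF_LEVELS) - 1, idx))
        return VF_LEVELS[idx]

ctrl = UncorePIController()
for load in [0.2, 0.3, 0.6, 0.8, 0.8, 0.4]:    # per-epoch uncore load samples
    vdd, ghz = ctrl.step(load)
    print(f"load {load:.1f} -> {vdd:.1f} V @ {ghz:.1f} GHz")
```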

    Virtual Runtime Application Partitions for Resource Management in Massively Parallel Architectures

    This thesis presents a novel design paradigm, called Virtual Runtime Application Partitions (VRAP), to judiciously utilize on-chip resources. As the dark silicon era approaches, where power considerations will allow only a fraction of the chip to be powered on, judicious resource management will become a key consideration in future designs. Most works on resource management treat only the physical components (i.e. computation, communication, and memory blocks) as resources and manipulate the component-to-application mapping to optimize various parameters (e.g. energy efficiency). To further enhance the optimization potential, we propose to manipulate abstract resources (i.e. the voltage/frequency operating point, the fault-tolerance strength, the degree of parallelism, and the configuration architecture) in addition to the physical resources. The proposed framework (i.e. VRAP) encapsulates methods, algorithms, and hardware blocks to provide each application with the abstract resources tailored to its needs. To test the efficacy of this concept, we have developed three distinct self-adaptive environments: (i) Private Operating Environment (POE), (ii) Private Reliability Environment (PRE), and (iii) Private Configuration Environment (PCE), which collectively ensure that each application meets its deadlines using minimal platform resources. In this work several novel architectural enhancements, algorithms and policies are presented to realize the virtual runtime application partitions efficiently. Considering future design trends, we have chosen Coarse-Grained Reconfigurable Architectures (CGRAs) and Networks-on-Chip (NoCs) to test the feasibility of our approach; specifically, we have chosen the Dynamically Reconfigurable Resource Array (DRRA) and McNoC as the representative CGRA and NoC platforms. The proposed techniques are compared and evaluated using a variety of quantitative experiments. Synthesis and simulation results demonstrate that VRAP significantly enhances energy and power efficiency compared to the state of the art.

    Low-Power Embedded Design Solutions and Low-Latency On-Chip Interconnect Architecture for System-On-Chip Design

    This dissertation presents three design solutions addressing several key system-on-chip (SoC) issues to achieve low power and high performance: 1) joint source and channel decoding (JSCD) schemes for low-power SoCs used in portable multimedia systems, 2) an efficient on-chip interconnect architecture for massive multimedia data streaming on multiprocessor SoCs (MPSoCs), and 3) a data processing architecture for low-power SoCs in distributed sensor system (DSS) applications and its implementation. The first part includes a low-power embedded low-density parity-check (LDPC)-H.264 joint decoding architecture that lowers the baseband energy consumption of a channel decoder using joint source decoding and dynamic voltage and frequency scaling (DVFS). A low-power multiple-input multiple-output (MIMO) and H.264 video joint detector/decoder that minimizes energy for portable, wireless embedded systems is also designed. In the second part, a link-level quality-of-service (QoS) scheme using unequal error protection (UEP) for low-power networks-on-chip (NoC) and low-latency on-chip network designs for MPSoCs are proposed. This part contains WaveSync, a low-latency-focused network-on-chip architecture for globally-asynchronous locally-synchronous (GALS) designs, and a simultaneous dual-path routing (SDPR) scheme that exploits the path diversity present in typical mesh-topology networks-on-chip. SDPR is akin to having a higher link width but without the significant hardware overhead associated with simple bus-width scaling. The last part presents data processing unit designs for embedded SoCs. We propose a data processing and control logic design for a new radiation detection sensor system generating data at or above the petabit-per-second level. Implementation results show that the intended clock rate is achieved within the power target of less than 200mW. We also present a digital signal processing (DSP) accelerator supporting configurable MAC, FFT, FIR, and 3-D cross-product operations for embedded SoCs; it consumes 12.35mW and occupies 0.167mm2 at 333MHz.
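    As a short aside on the path-diversity idea behind SDPR (the routing details here are assumed for illustration, not taken from the dissertation), XY and YX minimal routes in a mesh are link-disjoint whenever source and destination differ in both coordinates, so two halves of a packet can travel concurrently:

```python
# Illustration of mesh path diversity (assumed details, not the dissertation's scheme).

def xy_route(src, dst):
    """Minimal route that resolves the X dimension first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    path, x, y = [src], sx, sy
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def yx_route(src, dst):
    """Minimal route that resolves the Y dimension first, then X."""
    (sx, sy), (dx, dy) = src, dst
    path, x, y = [src], sx, sy
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    return path

src, dst = (0, 0), (3, 2)
print("XY half:", xy_route(src, dst))   # link-disjoint from the YX route below
print("YX half:", yx_route(src, dst))
```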

    An Artificial Neural Networks based Temperature Prediction Framework for Network-on-Chip based Multicore Platform

    Continuous improvement in silicon process technologies has made possible the integration of hundreds of cores on a single chip. However, power and heat have become dominant constraints in designing these massive multicore chips, causing reliability issues, timing variations and reduced chip lifetime. Dynamic Thermal Management (DTM) is a solution to avoid high temperatures on the die. Typical DTM schemes address only core-level thermal issues; however, the Network-on-Chip (NoC) paradigm, which has emerged as an enabling methodology for integrating hundreds to thousands of cores on the same die, can also contribute significantly to the thermal problem. Moreover, typical DTM is triggered reactively, based on temperature measurements from on-chip thermal sensors, and therefore requires long reaction times, whereas predictive DTM methods estimate future temperature in advance, eliminating the chance of temperature overshoot. Artificial Neural Networks (ANNs) have been used in various domains for modeling and prediction with high accuracy due to their ability to learn and adapt. This thesis concentrates on designing an ANN prediction engine to predict the thermal profile of the cores and NoC elements of the chip. This thermal profile is then used by a predictive DTM that combines both core-level and network-level DTM techniques. An on-chip wireless interconnect, recently envisioned to enable energy-efficient data exchange between cores in a multicore environment, is used to provide a broadcast-capable medium for efficiently distributing the thermal control messages that trigger and manage the DTM schemes.
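    A minimal sketch of the kind of predictor described above, under assumptions of our own (window size, a single hidden layer, and a synthetic temperature trace standing in for sensor data), is a tiny feed-forward network trained by gradient descent to map recent per-node temperature samples to the next sample:

```python
# Tiny feed-forward temperature predictor (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
WINDOW, HIDDEN = 4, 8                          # assumed history window and layer width

# Synthetic temperature trace for one NoC node (degrees C), a stand-in for sensor data.
t = np.linspace(0, 20, 400)
trace = 55 + 10 * np.sin(0.6 * t) + rng.normal(0, 0.3, t.size)

X = np.stack([trace[i:i + WINDOW] for i in range(trace.size - WINDOW)])
y = trace[WINDOW:]

# Normalize, then train a one-hidden-layer MLP with plain full-batch gradient descent.
mu, sd = trace.mean(), trace.std()
Xn, yn = (X - mu) / sd, (y - mu) / sd
W1 = rng.normal(0, 0.5, (WINDOW, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.5, (HIDDEN, 1));      b2 = np.zeros(1)

for _ in range(2000):
    h = np.tanh(Xn @ W1 + b1)                  # hidden activations
    pred = (h @ W2 + b2).ravel()               # normalized next-temperature prediction
    g_pred = 2 * (pred - yn)[:, None] / len(yn)
    gW2, gb2 = h.T @ g_pred, g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (1 - h ** 2)       # backpropagate through tanh
    gW1, gb1 = Xn.T @ g_h, g_h.sum(0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.05 * g                          # gradient-descent update

h = np.tanh(Xn @ W1 + b1)
pred_c = (h @ W2 + b2).ravel() * sd + mu       # de-normalize back to degrees C
print(f"prediction RMSE over the trace: {np.sqrt(np.mean((pred_c - y) ** 2)):.2f} degC")
```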