46 research outputs found

    Fly-Over: A Light-Weight Distributed Router Power-Gating Mechanism for Energy-Efficient Interconnects

    Get PDF
    Scalable Networks-on-chip (NoCs) have become the de facto interconnection mechanism in large scale Chip Multiprocessors. Not only are NoCs devouring a large fraction of the on-chip power budget but static NoC power consumption is becoming the dominant component as technology scales down. Hence reducing static NoC power consumption is critical for energy-efficient computing. Previous research has proposed to power-gate routers attached to inactive cores so as to save static power, but they either required centralized decision making and global network knowledge or a non-scalable escape ring network. In this paper, we propose Fly-Over (FLOV), a light-weight distributed mechanism for power gating routers, which encompasses FLOV router microarchitecture and a partition-based dynamic routing algorithm to maintain network functionality. With simple modifications to the baseline router microarchitecture, FLOV can facilitate fly-over links over power-gated routers. The proposed routing algorithm provides best-effort minimal path routing without the necessity for global network information. We evaluate our scheme using both unicast and multicast synthetic workloads as well as real workloads from PARSEC 2.1 benchmark suite. The results show that FLOV can achieve 19.2% latency reduction and 16.9% total power savings

    Best Effort Minimal Routing for Fly-Over: A Light Weight Distributed Mechanism for Energy Efficient Network-On-Chip

    Get PDF
    Scalable Networks-on-Chip (NoCs) have become the de facto interconnection mechanism in large scale Chip Multiprocessors. NoCs devour a large fraction of the on-chip power budget of which static NoC power consumption is becoming the dominant component as technology scales down. Hence, reducing static NoC power consumption is critical for energy-efficient computing. Previous research suggests power-gating routers attached to inactive cores so as to save static power, but requires centralized control and global network knowledge. Moreover, packet deliveries in irregular power-gated network suffer from detour or waiting time overhead to either route around or wake up off routers. Fly-Over (FLOV) is a distributed power-gating mechanism to minimize static power consumption in NoCs without the need for global network information. However, the existing FLOV routing algorithm introduces unnecessary detours and pressurizes the routers in AON column resulting in high packet latencies and network congestion. This work proposes FLOV+, Best-Effort Minimal Routing Algorithm for Fly-Over (FLOV) to route the packets using the shortest path in an irregular power-gated network and also relieve the stress on the AON column. This routing algorithm aims to minimize the average packet latency and to sustain throughput in network with power-gated routers. Synthetic workload evaluations show that the proposed algorithm reduces average packet latency upto 9.84% in an 8-dimensional mesh network. Simulation results also show 50% and 40% improvement in the network throughput for restricted FLOV and generalized FLOV power gating mechanisms respectively

    BlackOut: Enabling fine-grained power gating of buffers in Network-on-Chip routers

    Get PDF
    The Network-on-Chip (NoC) router buffers play an instrumental role in the performance of both the interconnection fabric and the entire multi-/many-core system. Nevertheless, the buffers also constitute the major leakage power consumers in NoC implementations. Traditionally, they are designed to accommodate worst-case traffic scenarios, so they tend to remain idle, or under-utilized, for extended periods of time. The under-utilization of these valuable resources is exemplified when one profiles real application workloads; the generated traffic is bursty in nature, whereby high traffic periods are sporadic and infrequent, in general. The mitigation of the leakage power consumption of NoC buffers via power gating has been explored in the literature, both at coarse (router-level) and fine (buffer-level) granularities. However, power gating at the router granularity is suitable only for low and medium traffic conditions, where the routers have enough opportunities to be powered down. Under high traffic, the sleeping potential rapidly diminishes. Moreover, disabling an entire router greatly affects the NoC functionality and the network connectivity. This article presents BlackOut, a fine-grained power-gating methodology targeting individual router buffers. The goal is to minimize leakage power consumption, without adversely impacting the system performance. The proposed framework is agnostic of the routing algorithm and the network topology, and it is applicable to any router micro-architecture. Evaluation results obtained using both synthetic traffic patterns and real applications in 64-core systems indicate energy savings of up to 70%, as compared to a baseline NoC, with a near-negligible performance overhead of around 2%. BlackOut is also shown to significantly outperformby 35%, on averagetwo current state-of-the-art power-gating solutions, in terms of energy savings. Not tailored to any topology, routing algorithm and NoC router architecture.Router-to-router communication. No need for custom, region-based/global networks.Effective at low, medium and high traffic. Other solutions are more restrictive.+35% energy saving, on average, against two state-of-the-art power-gating solutions.Negligible performance overhead (+2%) compared to the baseline architecture

    A survey of system level power management schemes in the dark-silicon era for many-core architectures

    Get PDF
    Power consumption in Complementary Metal Oxide Semiconductor (CMOS) technology has escalated to a point that only a fractional part of many-core chips can be powered-on at a time. Fortunately, this fraction can be increased at the expense of performance through the dark-silicon solution. However, with many-core integration set to be heading towards its thousands, power consumption and temperature increases per time, meaning the number of active nodes must be reduced drastically. Therefore, optimized techniques are demanded for continuous advancement in technology. Existing eļ¬€orts try to overcome this challenge by activating nodes from diļ¬€erent parts of the chip at the expense of communication latency. Other eļ¬€orts on the other hand employ run-time power management techniques to manage the power performance of the cores trading-oļ¬€ performance for power. We found out that, for a signiļ¬cant amount of power to saved and high temperature to be avoided, focus should be on reducing the power consumption of all the on-chip components. Especially, the memory hierarchy and the interconnect. Power consumption can be minimized by, reducing the size of high leakage power dissipating elements, turning-oļ¬€ idle resources and integrating power saving materials

    Application Centric Networks-On-Chip Design Solutions for Future Multicore Systems

    Get PDF
    With advances in technology, future multicore systems scaled to 100s and 1000s of cores/accelerators are being touted as an effective solution for extracting huge performance gains using parallel programming paradigms. However with the failure of Dennard Scaling all the components on the chip cannot be run simultaneously without breaking the power and thermal constraints leading to strict chip power envelops. The scaling up of the number of on chip components has also brought upon Networks-On-Chip (NoC) based interconnect designs like 2D mesh. The contribution of NoC to the total on chip power and overall performance has been increasing steadily and hence high performance power-efficient NoC designs are becoming crucial. Future multicore paradigms can be broadly classified, based on the applications they are tailored to, into traditional Chip Multi processor(CMP) based application based systems, characterized by low core and NoC utilization, and emerging big data application based systems, characterized by large amounts of data movement necessitating high throughput requirements. To this order, we propose NoC design solutions for power-savings in future CMPs tailored to traditional applications and higher effective throughput gains in multicore systems tailored to bandwidth intensive applications. First, we propose Fly-over, a light-weight distributed mechanism for power-gating routers attached to switched off cores to reduce NoC power consumption in low load CMP environment. Secondly, we plan on utilizing a promising next generation memory technology, Spin-Transfer Torque Magnetic RAM(STT-MRAM), to achieve enhanced NoC performance to satisfy the high throughput demands in emerging bandwidth intensive applications, while reducing the power consumption simultaneously. Thirdly, we present a hardware data approximation framework for NoCs, APPROX-NoC, with an online data error control mechanism, which can leverage the approximate computing paradigm in the emerging data intensive big data applications to attain higher performance per watt

    Variable-width datapath for on-chip network static power reduction

    Full text link

    DDRNoC: Dual Data-Rate Network-on-Chip

    Get PDF
    Networks-on-Chip (NoCs) are becoming increasing important for the performance of modern multi-core system-on-chip. For various on-chip networks with virtual channel (VC) ow control, the slow control logic (VC and switch allocation logic) of the NoC routers limits the NoC clock period while their datapath (switch and link) possesses signifcant slack. This slack results in wasted performance potential of the datapath, limits the saturation throughput of the network and reduces its energy efficiency. The aim of this thesis is to improve NoC performance by eliminating this slack and removing control logic from the router critical path. To this end, this thesis presents the Dual Data-Rate (DDR) network architecture called the DDRNoC. It utilizes the NoC datapath twice with in a clock cycle to forward its at DDR. This not only exploits the slack present in the datapath but also requires a clock with period twice the datapath delay, thus removing the shorter control logic from the critical path. This enables the DDRNoC to achieve throughput higher than single data-rate networks. Moreover, the DDRNoC also employs lookahead signalling to reduce end-to-end packet latency. FreewayNoC, an extension to the DDRNoC supplements the DDRNoC with simplified pipeline stage bypassing to reduce the zero-load latency of packets in the network. Implementation of the DDRNoC and FreewayNoC architectures require redesign of the switch allocation (SA) mechanism to resolve contention among competing its by granting up to two its access to each switch input and output port per clock cycle. It further requires separate paths for the propagation of lookahead control signals. FreewayNoC also requires implementation of multiple checks to guarantee con ict-free bypassing of the SA stage. Physical implementation results using 28nm process technology show that DDRNoC and FreewayNoC have 5% and 15% area overhead, respectively, compared to a simple 3-stage network with VCs. Performance evaluation shows that for a 16X16 mesh network, FreewayNoC supports 25% higher throughput compared to current state-of-the-art NoC, ShortPath. Moreover, FreewayNoC achieves a zero-load latency which scales better than ShortPath and equally well with an ideal network that has no control overheads. For application driven traffic, FreewayNoC reduces average packet latency by 18% compared to ShortPath. Alternatively, low voltage implementation of the DDRNoC and FreewayNoC can be used to conserve power and improve energy efficiency at the cost of higher packet latency

    Efficient Interconnection Network Design for Heterogeneous Architectures

    Get PDF
    The onset of big data and deep learning applications, mixed with conventional general-purpose programs, have driven computer architecture to embrace heterogeneity with specialization. With the ever-increasing interconnected chip components, future architectures are required to operate under a stricter power budget and process emerging big data applications efficiently. Interconnection network as the communication backbone thus is facing the grand challenges of limited power envelope, data movement and performance scaling. This dissertation provides interconnect solutions that are specialized to application requirements towards power-/energy-efficient and high-performance computing for heterogeneous architectures. This dissertation examines the challenges of network-on-chip router power-gating techniques for general-purpose workloads to save static power. A voting approach is proposed as an adaptive power-gating policy that considers both local and global traffic status through router voting. In addition, low-latency routing algorithms are designed to guarantee performance in irregular power-gating networks. This holistic solution not only saves power but also avoids performance overhead. This research also introduces emerging computation paradigms to interconnects for big data applications to mitigate the pressure of data movement. Approximate network-on-chip is proposed to achieve high-throughput communication by means of lossy compression. Then, near-data processing is combined with in-network computing to further improve performance while reducing data movement. The two schemes are general to play as plug-ins for different network topologies and routing algorithms. To tackle the challenging computational requirements of deep learning workloads, this dissertation investigates the compelling opportunities of communication algorithm-architecture co-design to accelerate distributed deep learning. MultiTree allreduce algorithm is proposed to bond with message scheduling with network topology to achieve faster and contention-free communication. In addition, the interconnect hardware and flow control are also specialized to exploit deep learning communication characteristics and fulfill the algorithm needs, thereby effectively improving the performance and scalability. By considering application and algorithm characteristics, this research shows that interconnection network can be tailored accordingly to improve the power-/energy-efficiency and performance to satisfy heterogeneous computation and communication requirements

    Physical parameter-aware Networks-on-Chip design

    Get PDF
    PhD ThesisNetworks-on-Chip (NoCs) have been proposed as a scalable, reliable and power-efficient communication fabric for chip multiprocessors (CMPs) and multiprocessor systems-on-chip (MPSoCs). NoCs determine both the performance and the reliability of such systems, with a significant power demand that is expected to increase due to developments in both technology and architecture. In terms of architecture, an important trend in many-core systems architecture is to increase the number of cores on a chip while reducing their individual complexity. This trend increases communication power relative to computation power. Moreover, technology-wise, power-hungry wires are dominating logic as power consumers as technology scales down. For these reasons, the design of future very large scale integration (VLSI) systems is moving from being computation-centric to communication-centric. On the other hand, chipā€™s physical parameters integrity, especially power and thermal integrity, is crucial for reliable VLSI systems. However, guaranteeing this integrity is becoming increasingly difficult with the higher scale of integration due to increased power density and operating frequencies that result in continuously increasing temperature and voltage drops in the chip. This is a challenge that may prevent further shrinking of devices. Thus, tackling the challenge of power and thermal integrity of future many-core systems at only one level of abstraction, the chip and package design for example, is no longer sufficient to ensure the integrity of physical parameters. New designtime and run-time strategies may need to work together at different levels of abstraction, such as package, application, network, to provide the required physical parameter integrity for these large systems. This necessitates strategies that work at the level of the on-chip network with its rising power budget. This thesis proposes models, techniques and architectures to improve power and thermal integrity of Network-on-Chip (NoC)-based many-core systems. The thesis is composed of two major parts: i) minimization and modelling of power supply variations to improve power integrity; and ii) dynamic thermal adaptation to improve thermal integrity. This thesis makes four major contributions. The first is a computational model of on-chip power supply variations in NoCs. The proposed model embeds a power delivery model, an NoC activity simulator and a power model. The model is verified with SPICE simulation and employed to analyse power supply variations in synthetic and real NoC workloads. Novel observations regarding power supply noise correlation with different traffic patterns and routing algorithms are found. The second is a new application mapping strategy aiming vii to minimize power supply noise in NoCs. This is achieved by defining a new metric, switching activity density, and employing a force-based objective function that results in minimizing switching density. Significant reductions in power supply noise (PSN) are achieved with a low energy penalty. This reduction in PSN also results in a better link timing accuracy. The third contribution is a new dynamic thermal-adaptive routing strategy to effectively diffuse heat from the NoC-based threedimensional (3D) CMPs, using a dynamic programming (DP)-based distributed control architecture. Moreover, a new approach for efficient extension of two-dimensional (2D) partially-adaptive routing algorithms to 3D is presented. This approach improves three-dimensional networkon- chip (3D NoC) routing adaptivity while ensuring deadlock-freeness. Finally, the proposed thermal-adaptive routing is implemented in field-programmable gate array (FPGA), and implementation challenges, for both thermal sensing and the dynamic control architecture are addressed. The proposed routing implementation is evaluated in terms of both functionality and performance. The methodologies and architectures proposed in this thesis open a new direction for improving the power and thermal integrity of future NoC-based 2D and 3D many-core architectures
    corecore