66 research outputs found

    SWIFT: A Low-Power Network-On-Chip Implementing the Token Flow Control Router Architecture With Swing-Reduced Interconnects

    Get PDF
    A 64-bit, 8 × 8 mesh network-on-chip (NoC) is presented that uses both new architectural and circuit design techniques to improve on-chip network energy-efficiency, latency, and throughput. First, we propose token flow control, which enables bypassing of flit buffering in routers, thereby reducing buffer size and their power consumption. We also incorporate reduced-swing signaling in on-chip links and crossbars to minimize datapath interconnect energy. The 64-node NoC is experimentally validated with a 2 × 2 test chip in 90 nm, 1.2 V CMOS that incorporates traffic generators to emulate the traffic of the full network. Compared with a fully synthesized baseline 8 × 8 NoC architecture designed to meet the same peak throughput, the fabricated prototype reduces network latency by 20% under uniform random traffic, when both networks are run at their maximum operating frequencies. When operated at the same frequencies, the SWIFT NoC reduces network power by 38% and 25% at saturation and low loads, respectively

    Low-swing signaling for energy efficient on-chip networks

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.Cataloged from PDF version of thesis.Includes bibliographical references (p. 65-69).On-chip networks have emerged as a scalable and high-bandwidth communication fabric in many-core processor chips. However, the energy consumption of these networks is becoming comparable to that of computation cores, making further scaling of core counts difficult. This thesis makes several contributions to low-swing signaling circuit design for the energy efficient on-chip networks in two separate projects: on-chip networks optimized for one-to-many multicasts and broadcasts, and link designs that allow on-chip networks to approach an ideal interconnection fabric. A low-swing crossbar switch, which is based on tri-state Reduced-Swing Drivers (RSDs), is presented for the first project. Measurement results of its test chip fabricated in 45nm SOI CMOS show that the tri-state RSD-based crossbar enables 55% power savings as compared to an equivalent full-swing crossbar and link. Also, the measurement results show that the proposed crossbar allows the broadcast-optimized on-chip networks using a single pipeline stage for physical data transmission to operate at 21% higher data rate, when compared with the full-swing networks. For the second project, two clockless low-swing repeaters, a Self-Resetting Logic Repeater (SRLR) and a Voltage-Locked Repeater (VLR), have been proposed and analyzed in simulation only. They both require no reference clock, differential signaling, and bias current. Such digital-intensive properties enable them to approach energy and delay performance of a point-to-point interconnect of variable lengths. Simulated in 45nm SOI CMOS, the 10mm SRLR featured with high energy efficiency consumes 338fJ/b at 5.4Gb/s/ch while the 10mm VLR raises its data rate up to 16.OGb/s/ch with 427fJ/b.by Sunghyun Park.S.M

    Exploiting Properties of CMP Cache Traffic in Designing Hybrid Packet/Circuit Switched NoCs

    Get PDF
    Chip multiprocessors with few to tens of processing cores are already commercially available. Increased scaling of technology is making it feasible to integrate even more cores on a single chip. Providing the cores with fast access to data is vital to overall system performance. When a core requires access to a piece of data, the core's private cache memory is searched first. If a miss occurs, the data is looked up in the next level(s) of the memory hierarchy, where often one or more levels of cache are shared between two or more cores. Communication between the cores and the slices of the on-chip shared cache is carried through the network-on-chip(NoC). Interestingly, the cache and NoC mutually affect the operation of each other; communication over the NoC affects the access latency of cache data, while the cache organization generates the coherence and data messages, thus affecting the communication patterns and latency over the NoC. This thesis considers hybrid packet/circuit switched NoCs, i.e., packet switched NoCs enhanced with the ability to configure circuits. The communication and performance benefit that come from using circuits is predicated on amortizing the time cost incurred for configuring the circuits. To address this challenge, NoC designs are proposed that take advantage of properties of the cache traffic, namely temporal locality and predictability, to amortize or hide the circuit configuration time cost. First, a coarse-grained circuit configuration policy is proposed that exploits the temporal locality in the cache traffic to periodically configure circuits for the heavily communicating nodes. This allows the design of a locality-aware cache that promotes temporal communication locality through data placement, while designing suitable data replacement and migration policies. Next, a fine-grained configuration policy, called Déjà Vu switching, is proposed for leveraging predictability of data messages by initiating a circuit configuration as soon as a cache hit is detected and before the data becomes available. Its benefit is demonstrated for saving interconnect energy in multi-plane NoCs. Finally, a more proactive configuration policy is proposed for fast caches, where circuit reservations are initiated by request messages, which can greatly improve communication latency and system performance

    Architectural Support for Efficient Communication in Future Microprocessors

    Get PDF
    Traditionally, the microprocessor design has focused on the computational aspects of the problem at hand. However, as the number of components on a single chip continues to increase, the design of communication architecture has become a crucial and dominating factor in defining performance models of the overall system. On-chip networks, also known as Networks-on-Chip (NoC), emerged recently as a promising architecture to coordinate chip-wide communication. Although there are numerous interconnection network studies in an inter-chip environment, an intra-chip network design poses a number of substantial challenges to this well-established interconnection network field. This research investigates designs and applications of on-chip interconnection network in next-generation microprocessors for optimizing performance, power consumption, and area cost. First, we present domain-specific NoC designs targeted to large-scale and wire-delay dominated L2 cache systems. The domain-specifically designed interconnect shows 38% performance improvement and uses only 12% of the mesh-based interconnect. Then, we present a methodology of communication characterization in parallel programs and application of characterization results to long-channel reconfiguration. Reconfigured long channels suited to communication patterns enhance the latency of the mesh network by 16% and 14% in 16-core and 64-core systems, respectively. Finally, we discuss an adaptive data compression technique that builds a network-wide frequent value pattern map and reduces the packet size. In two examined multi-core systems, cache traffic has 69% compressibility and shows high value sharing among flows. Compression-enabled NoC improves the latency by up to 63% and saves energy consumption by up to 12%

    STT-MRAM Based NoC Buffer Design

    Get PDF
    As Chip Multiprocessor (CMP) design moves toward many-core architectures, communication delay in Network-on-Chip (NoC) is a major bottleneck in CMP design. An emerging non-volatile memory - STT MRAM (Spin-Torque Transfer Magnetic RAM) which provides substantial power and area savings, near zero leakage power, and displays higher memory density compared to conventional SRAM. But STT-MRAM suffers from inherit drawbacks like multi cycle write latency and high write power consumption. So, these problem have to addressed in order to have an efficient design to incorporate STT-MRAM for NoC input buffer instead of traditional SRAM based input buffer design. Motivated by short intra-router latency, previously proposed write latency reduction technique is explored by sacrificing retention time and a hybrid design of input buffers using both SRAM and STT-MRAM to "hide" the long write latency efficiently is proposed. Considering that simple data migration in the hybrid buffer consumes more dynamic power compared to SRAM, a lazy migration scheme that reduces the dynamic power consumption of the hybrid buffer is also proposed

    Global Congestion and Fault Aware Wireless Interconnection Framework for Multicore Systems

    Get PDF
    Multicore processors are getting more common in the implementation of all type of computing demands, starting from personal computers to the large server farms for high computational demanding applications. The network-on-chip provides a better alternative to the traditional bus based communication infrastructure for this multicore system. Conventional wire-based NoC interconnect faces constraints due to their long multi-hop latency and high power consumption. Furthermore high traffic generating applications sometimes creates congestion in such system further degrading the systems performance. In this thesis work, a novel two-state congestion aware wireless interconnection framework for network chip is presented. This WiNoC system was designed to able to dynamically redirect traffic to avoid congestion based on network condition information shared among all the core tiles in the system. Hence a novel routing scheme and a two-state MAC protocol is proposed based on a proposed two layer hybrid mesh-based NoC architecture. The underlying mesh network is connected via wired-based interconnect and on top of that a shared wireless interconnect framework is added for single-hop communication. The routing scheme is non-deterministic in nature and utilizes the principles from existing dynamic routing algorithms. The MAC protocol for the wireless interface works in two modes. The first is data mode where a token-based protocol is utilized to transfer core data. And the second mode is the control mode where a broadcast-based communication protocol is used to share the network congestion information. The work details the switching methodology between these two modes and also explain, how the routing scheme utilizes the congestion information (gathered during the control mode) to route data packets during normal operation mode. The proposed work was modeled in a cycle accurate network simulator and its performance were evaluated against traditional NoC and WiNoC designs

    Low-Power Embedded Design Solutions and Low-Latency On-Chip Interconnect Architecture for System-On-Chip Design

    Get PDF
    This dissertation presents three design solutions to support several key system-on-chip (SoC) issues to achieve low-power and high performance. These are: 1) joint source and channel decoding (JSCD) schemes for low-power SoCs used in portable multimedia systems, 2) efficient on-chip interconnect architecture for massive multimedia data streaming on multiprocessor SoCs (MPSoCs), and 3) data processing architecture for low-power SoCs in distributed sensor network (DSS) systems and its implementation. The first part includes a low-power embedded low density parity check code (LDPC) - H.264 joint decoding architecture to lower the baseband energy consumption of a channel decoder using joint source decoding and dynamic voltage and frequency scaling (DVFS). A low-power multiple-input multiple-output (MIMO) and H.264 video joint detector/decoder design that minimizes energy for portable, wireless embedded systems is also designed. In the second part, a link-level quality of service (QoS) scheme using unequal error protection (UEP) for low-power network-on-chip (NoC) and low latency on-chip network designs for MPSoCs is proposed. This part contains WaveSync, a low-latency focused network-on-chip architecture for globally-asynchronous locally-synchronous (GALS) designs and a simultaneous dual-path routing (SDPR) scheme utilizing path diversity present in typical mesh topology network-on-chips. SDPR is akin to having a higher link width but without the significant hardware overhead associated with simple bus width scaling. The last part shows data processing unit designs for embedded SoCs. We propose a data processing and control logic design for a new radiation detection sensor system generating data at or above Peta-bits-per-second level. Implementation results show that the intended clock rate is achieved within the power target of less than 200mW. We also present a digital signal processing (DSP) accelerator supporting configurable MAC, FFT, FIR, and 3-D cross product operations for embedded SoCs. It consumes 12.35mW along with 0.167mm2 area at 333MHz

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    Get PDF
    The sustained demand for faster, more powerful chips has been met by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called SYSTEM-ON-CHIP (SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With a number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how to best provide on-chip communication resources is clearly felt. NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions: • The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the onchip domain. • Simulation and verification infrastructure must be put in place to explore, validate and optimize the NoC performance. • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions. • Even more so given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to evaluate their suitability for next-generation designs and their area and power costs. This dissertation performs a design space exploration of network-on-chip architectures, in order to point-out the trade-offs associated with the design of each individual network building blocks and with the design of network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with themselves and with early networkon- chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures and to point out the challenges that lie ahead in order to make this new interconnect technology come true. Among these latter, technologyrelated challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy. In particular, leakage power dissipation, containment of process variations and of their effects. The achievement of the above objectives was enabled by means of a NoC simulation environment for cycleaccurate modelling and simulation and by means of a back-end facility for the study of NoC physical implementation effects. Overall, all the results provided by this work have been validated on actual silicon layout
    • …
    corecore