    Quantitative design space exploration of routing-switches for Network-on-Chip

    Future Systems-on-Chip (SoC) will consist of many embedded functional units, such as embedded processor cores, memories, or FPGA-like structures. These SoCs will have huge communication demands that cannot be fulfilled by bus-based communication systems. A possible solution to this problem is the so-called Network-on-Chip (NoC).

    These NoCs basically consist of network interfaces, which integrate functional units into the NoC, and routing-switches, which connect the network interfaces. Here, VLSI-based routing-switch implementations are presented. The characteristics of these NoCs, such as performance and costs (e.g. silicon area or logic elements, and power dissipation), depend on a variety of parameters. As the routing-switch is a key component of a NoC, the costs and performance of routing-switches are compared for different parameter combinations. Evaluated parameters include, for example, the data word length, the routing-switch architecture (parallel vs. centralized implementation), and the routing algorithm.

    The performance and costs of the routing-switches were evaluated using an FPGA-based NoC emulator. In addition, different routing-switches were implemented using a 90 nm standard-cell library to determine the maximum clock frequency, power dissipation, and area of a VLSI implementation. The power consumption was determined by simulating the extracted layout of the routing-switches. Finally, these results are benchmarked against other routing-switch implementations such as Æthereal and xpipes.
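
    The kind of design space exploration described above can be pictured as a sweep over routing-switch parameters, with each configuration evaluated for cost and performance. The sketch below is illustrative only; the parameter names, value ranges, and evaluation hook are assumptions, not taken from the paper.

from itertools import product

# Illustrative routing-switch parameters (assumed values, not the paper's exact ranges).
parameters = {
    "data_word_length": [16, 32, 64],             # flit width in bits
    "architecture": ["parallel", "centralized"],  # switch organization
    "routing_algorithm": ["xy", "west_first"],    # deterministic vs. partially adaptive
}

def evaluate(config):
    """Placeholder: hook up an FPGA emulator run or standard-cell synthesis here.
    Expected to return (area, power, max_clock_frequency) for one configuration."""
    raise NotImplementedError

design_points = [dict(zip(parameters, values)) for values in product(*parameters.values())]
print(f"{len(design_points)} routing-switch configurations to evaluate")
# for cfg in design_points:
#     area, power, fmax = evaluate(cfg)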

    CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips

    Manycore chips are emerging as the architecture of choice to provide power scalability and improve performance while riding Moore's law. On-chip interconnects play an increasingly pivotal role in the power and performance scalability of such microarchitectures. As supply voltages begin to level off in future technologies, chip designs in general, and interconnects in particular, are resorting to specialization to provide power and performance scalability. In this paper, we make the observation that cache-coherent manycore chips exhibit a duality in on-chip network traffic. Request traffic typically consists of control packets requiring narrow, low-power switches, while response traffic often carries cache-block-sized payloads that require wider and higher-power switches. We present Cache-Coherence Network-on-Chip (CCNoC), a design that capitalizes on this duality in traffic and provides a pair of asymmetric switches that optimize power and performance over conventional on-chip interconnects. Cycle-accurate simulation results for a 4x4 chip multiprocessor with a shared last-level cache running commercial server workloads indicate a 22% improvement in power over a torus and a 38% improvement in power over a mesh with larger channel width, while providing similar performance.
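
    The traffic duality CCNoC exploits can be summarized as a simple mapping from coherence message class to one of two asymmetric networks. The sketch below only illustrates that split; the message types and payload sizes are assumptions, not CCNoC's actual protocol.

# Assumed payload sizes for illustration: requests carry only control bits,
# responses carry a full cache block.
CONTROL_PAYLOAD_BYTES = 8
CACHE_BLOCK_BYTES = 64

REQUEST_TYPES = {"read_request", "write_request", "invalidate", "ack"}  # control-only traffic

def select_network(msg_type):
    """Send control-sized requests over the narrow network, block-sized responses over the wide one."""
    return "narrow_request_network" if msg_type in REQUEST_TYPES else "wide_response_network"

assert select_network("read_request") == "narrow_request_network"
assert select_network("data_response") == "wide_response_network"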

    High Performance and Low Power On-Die Interconnect Fabrics

    Increasing power density with technology scaling has caused stagnation in the operating frequency of modern-day microprocessors. This has led designers to prefer multicore architectures over complex monolithic processors to keep up with the demand for rising computing throughput. Although processing units are getting smaller and simpler, the dramatic rise in their count on a single die has made the fabric that connects these processing units increasingly complex. These interconnect fabrics have become a bottleneck in improving overall system efficiency. As a result, the design paradigm for multi-core chips is gradually shifting from a core-centric architecture towards an interconnect-centric architecture, where system efficiency is limited by the fabric rather than the processing ability of any individual core. This dissertation introduces three novel and synergistic circuit techniques to improve the scalability of switch fabrics and make on-die integration of hundreds to thousands of cores feasible.

    1) A matrix topology is proposed for designing a fully connected switch fabric that re-uses output buses for programming and stores shuffle configurations at cross points. This significantly reduces routing congestion, lowers area/power, and improves performance. Silicon measurements demonstrate 47% energy savings in a 64-lane SIMD processor fabricated in 65nm CMOS over a conventional implementation.

    2) A novel approach to handle high-radix arbitration along with data routing is proposed. It optimally uses existing crossbar interconnect resources without requiring any additional overhead. Bandwidth exceeding 2Tb/s is recorded in a test prototype fabricated in 65nm.

    3) Building on the latter, a new circuit topology to manage and update priority adaptively within the switch fabric without incurring additional delay or area is then proposed. Several assist circuit techniques, such as a thyristor-based sense amplifier and self-regenerating bi-directional repeaters, are proposed for high-speed, energy-efficient signaling to and from the switch fabric to improve overall routing efficiency. Using these techniques, a 64 x 64 switch fabric with a 128b data bus fabricated in 45nm achieves a throughput of 4.5Tb/s at single-cycle latency while operating at 559MHz.

    Ph.D., Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/91506/1/sudhirks_1.pd
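
    The adaptive priority update in the third technique can be described behaviorally as a least-recently-granted arbiter: the port that just won loses priority for the next round. The sketch below is a hypothetical software model of that behavior only; the dissertation's contribution is a circuit-level realization inside the switch fabric, which this does not capture.

class LeastRecentlyGrantedArbiter:
    """Behavioral model: earlier position in the priority list wins; winners are demoted."""

    def __init__(self, n_ports):
        self.priority = list(range(n_ports))  # index 0 = highest priority

    def grant(self, requests):
        for port in self.priority:
            if port in requests:
                # Adaptive update: demote the granted port to lowest priority.
                self.priority.remove(port)
                self.priority.append(port)
                return port
        return None  # no requests this cycle

arbiter = LeastRecentlyGrantedArbiter(4)
print(arbiter.grant({1, 3}))  # 1 wins first
print(arbiter.grant({1, 3}))  # 3 wins next, since 1 was demoted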

    Runtime network-on-chip thermal and power balancing

    In Networks-on-Chip (NoC), most thermal and peak-power balancing methods rely on centralized power/thermal managers. This increases inter-core communication latency and leads to an imbalanced thermal distribution. These factors directly affect hot-spot formation, which is caused by the high power densities that develop as per-core transistor counts increase; as a result, reliability decreases and static power dissipation grows. This proposal introduces hierarchical agents that balance power and thermal distribution by keeping system parameters such as power, temperature, and voltage below optimal values. Beyond this control, the proposal also targets network scalability by introducing a degree of independence: self-organizing and self-optimizing distributed agents that overcome core-level thermal and power variations of homogeneous processing elements (PEs) at runtime. The aims of this work are to contribute significantly to runtime thermal and power balancing, power and thermal management, and the reduction of thermal hot-spot formation in NoC.
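
    A distributed agent of the kind proposed above can be pictured as a small control loop per tile that throttles its own voltage/frequency level when local temperature or power crosses a limit and otherwise tracks its neighbours. The sketch below is purely illustrative; the thresholds, DVFS levels, and control rule are assumptions, not the proposal's actual algorithm.

class TileAgent:
    """Illustrative per-tile agent; all limits and the policy are assumed values."""

    def __init__(self, tile_id, temp_limit_c=85.0, power_limit_w=1.5):
        self.tile_id = tile_id
        self.temp_limit_c = temp_limit_c
        self.power_limit_w = power_limit_w
        self.vf_level = 3  # discrete DVFS level: 0 (lowest) .. 3 (highest)

    def step(self, temp_c, power_w, neighbour_levels):
        """One control iteration using only local sensors and neighbour state."""
        if temp_c > self.temp_limit_c or power_w > self.power_limit_w:
            self.vf_level = max(0, self.vf_level - 1)   # back off to cool the hot spot
        elif neighbour_levels and self.vf_level < min(neighbour_levels):
            self.vf_level += 1                          # rebalance towards neighbours
        return self.vf_level

agent = TileAgent(tile_id=0)
print(agent.step(temp_c=90.0, power_w=1.2, neighbour_levels=[3, 3]))  # throttles to 2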

    Quantitative comparison of performance analysis techniques for modular and generic network-on-chip

    NoC-specific parameters have a major impact on the performance and implementation costs of a NoC. Hence, performance and cost evaluation of these parameter-dependent NoCs is crucial in different design stages, but the requirements on performance analysis differ from stage to stage. In an early design stage, an analysis technique featuring reduced complexity and limited accuracy can be applied, whereas in subsequent design stages more accurate techniques are required.

    In this work, several performance analysis techniques at different levels of abstraction are presented and quantitatively compared. These techniques include a static performance analysis using timing models, a Colored Petri Net-based approach, VHDL- and SystemC-based simulators, and an FPGA-based emulator. Conducting NoC experiments with NoC sizes from 9 to 36 functional units and various traffic patterns, characteristics of these experiments concerning accuracy, complexity, and effort are derived.

    The performance analysis techniques discussed here are quantitatively evaluated and finally assigned to the appropriate design stages in an automated NoC design flow.
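
    The least expensive point in such a spectrum is a static analysis built from simple timing models. As an illustration of that abstraction level only, the sketch below estimates zero-load packet latency from hop count, per-hop router and link delays, and serialization; the formula and parameter values are generic textbook assumptions, not the timing models used in this work.

def zero_load_latency_cycles(hops, packet_flits, router_delay=3, link_delay=1):
    """Static estimate: the head flit traverses each hop, then the body serializes behind it."""
    head_latency = hops * (router_delay + link_delay)
    serialization = packet_flits - 1
    return head_latency + serialization

# Example: 4 hops, 5-flit packet -> 4 * 4 + 4 = 20 cycles
print(zero_load_latency_cycles(hops=4, packet_flits=5))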

    A Unified Operating System for Clouds and Manycore: fos

    Single-chip processors with thousands of cores will be available in the next ten years, and clouds of multicore processors already afford the operating system designer thousands of cores today. Constructing operating systems for manycore and cloud systems faces similar challenges. This work identifies these shared challenges and introduces our solution: a factored operating system (fos) designed to meet the scalability, faultiness, variability of demand, and programming challenges of OSs for single-chip thousand-core manycore systems as well as current-day cloud computers. Current monolithic operating systems are not well suited for manycores and clouds, as they have taken an evolutionary approach to scaling, such as adding fine-grain locks and redesigning subsystems; however, these approaches do not increase scalability quickly enough. fos addresses the OS scalability challenge by using a message-passing design and is composed of a collection of Internet-inspired servers. Each operating system service is factored into a set of communicating servers which, in aggregate, implement a system service. These servers are designed much in the way that distributed Internet services are designed, but provide traditional kernel services instead of Internet services. Also, fos embraces the elasticity of cloud and manycore platforms by adapting resource utilization to match demand. fos facilitates writing applications across the cloud by providing a single system image across both future 1000+ core manycores and current-day Infrastructure-as-a-Service cloud computers. In contrast, current cloud environments do not provide a single system image and introduce complexity for the user by requiring different programming models for intra- vs. inter-machine communication and by requiring the use of non-OS-standard management tools.
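
    The factored-service idea can be illustrated as an OS service split into communicating servers that each hold only a shard of the service's state and interact solely through messages, so the same service can span cores on one chip or machines in a cloud. The sketch below is a toy illustration of that design style with assumed message formats; it is not fos code or its actual naming service.

from queue import Queue

class NameServerShard:
    """One server of a factored naming service; it owns the keys that hash to its shard
    and is reached only through its mailbox (message passing, no shared state)."""

    def __init__(self, shard_id, n_shards):
        self.shard_id, self.n_shards = shard_id, n_shards
        self.table = {}
        self.mailbox = Queue()

    def owns(self, key):
        return hash(key) % self.n_shards == self.shard_id

    def handle_one(self):
        op, key, value, reply_box = self.mailbox.get()  # blocking message receive
        if op == "register" and self.owns(key):
            self.table[key] = value
            reply_box.put("ok")
        elif op == "lookup" and self.owns(key):
            reply_box.put(self.table.get(key))

# Usage: a client registers a name and reads back the acknowledgement.
shard = NameServerShard(shard_id=0, n_shards=1)
reply = Queue()
shard.mailbox.put(("register", "/dev/net0", "core-12", reply))
shard.handle_one()
print(reply.get())  # "ok"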

    A Parallel Routing Algorithm for Torus NoCs

    This paper proposes a parallel routing algorithm for routing multiple data streams over disjoint paths in the torus Network-on-Chip (NoC) architecture. We show how to construct a maximal set of disjoint paths between any two nodes of a torus network topology and then make use of the constructed paths for the simultaneous routing of multiple data streams between these nodes. Analytical performance evaluation results are obtained showing the effectiveness of the proposed parallel routing algorithm in reducing communication delays and increasing throughput when transferring large amounts of data in an NoC-based multi-core system.
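
    A concrete flavour of the disjoint-path idea: in a 2D torus, when source and destination differ in both dimensions, the X-then-Y and Y-then-X dimension-ordered paths share no intermediate node, so two streams can already be sent in parallel. The paper constructs a maximal set of such disjoint paths; the sketch below only illustrates this simplest pair, and the helper names are assumptions.

def ring_walk(a, b, k):
    """Nodes after a up to and including b, walking the shorter way around a k-node ring."""
    step = 1 if (b - a) % k <= k // 2 else -1
    path, cur = [], a
    while cur != b:
        cur = (cur + step) % k
        path.append(cur)
    return path

def xy_path(src, dst, k):
    (sx, sy), (dx, dy) = src, dst
    return [(x, sy) for x in ring_walk(sx, dx, k)] + [(dx, y) for y in ring_walk(sy, dy, k)]

def yx_path(src, dst, k):
    (sx, sy), (dx, dy) = src, dst
    return [(sx, y) for y in ring_walk(sy, dy, k)] + [(x, dy) for x in ring_walk(sx, dx, k)]

# Two streams from (0, 0) to (2, 3) on a 5x5 torus use internally node-disjoint paths.
p1, p2 = xy_path((0, 0), (2, 3), 5), yx_path((0, 0), (2, 3), 5)
assert set(p1[:-1]).isdisjoint(p2[:-1])  # only the destination is shared
print(p1, p2)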