230 research outputs found

    A comparative study of arbitration algorithms for the Alpha 21364 pipelined router

    Full text link
    Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed due to conflicting demands for resources, such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the algorithm running on these arbiters is critical to maximizing network performance. This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processor's on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (Wave-Front Arbiter), the latter implemented in the SGI Spider switch. SPAA outperforms PIM and WFA because it exhibits matching capabilities similar to theirs under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents the network's performance from degrading under saturation at high network loads by prioritizing packets already in the network over new packets generated by caches or memory.
    Mukherjee, S.; Silla Jiménez, F.; Bannon, P.; Emer, J.; Lang, S.; Webb, D. (2002). A comparative study of arbitration algorithms for the Alpha 21364 pipelined router. ACM SIGPLAN Notices, 37(10):223-234. doi:10.1145/605432.605421
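
    The matching problem these arbiters solve, pairing packets at input ports with free output ports, can be sketched as one request/grant/accept round in the style of PIM. The sketch below is a minimal illustration under assumed inputs; the port counts, the requests structure, and the random tie-breaking are assumptions, not the 21364 router's implementation:

```python
import random

def pim_iteration(requests, num_outputs):
    """One request/grant/accept round of Parallel Iterative Matching (sketch).

    requests: dict mapping an input port to the set of output ports it wants.
    Returns the input -> output matches found in this iteration.
    """
    # Grant phase: each requested output grants one contending input at random.
    grants = {}                                   # output -> input
    for out in range(num_outputs):
        contenders = [i for i, outs in requests.items() if out in outs]
        if contenders:
            grants[out] = random.choice(contenders)

    # Accept phase: each input accepts one of the grants it received, at random.
    grants_per_input = {}
    for out, inp in grants.items():
        grants_per_input.setdefault(inp, []).append(out)
    return {inp: random.choice(outs) for inp, outs in grants_per_input.items()}

# Example: a 4x4 crossbar where inputs 0 and 1 both contend for output 2.
print(pim_iteration({0: {2, 3}, 1: {2}, 2: {0}}, num_outputs=4))
```

    Per the abstract, SPAA's advantage is reaching a similar-quality matching in fewer clock cycles and with a computation that pipelines well, rather than finding a better matching as such.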

    Thermal profiling of homogeneous multi-core processors using sensor mini-networks

    Get PDF
    With large-scale integration and high power density in current-generation microprocessors, thermal management is becoming a critical component of system design. Specifically, accurate thermal monitoring using on-die sensors is vital for system reliability and recovery. Achieving an accurate thermal profile of a system with an optimal number of sensors is integral to thermal management. This work focuses on a sensor placement mechanism and an on-chip sensor mini-network that combines temperatures from multiple sensors to determine the full thermal profile of a chip. The sensor placement mechanism proposed in this work uses non-uniform subsampling of thermal maps with k-means clustering. Using this sensing technique with cubic interpolation, an 8-core architecture thermal map was successfully recovered with an average error improvement of 90% over sensor placement via basic k-means clustering. All the simulations were run using HotSpot 5.0, modeling the Alpha 21364 processor as the baseline core. The sensor mini-network using both differential encoding and distributed source coding was analyzed on a 1024-core architecture. Distributed source coding required fewer transmissions than differential encoding and reduced the number of transmitted bits by 36% relative to a sensor mini-network with no compression.
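
    The placement-and-recovery loop can be sketched as: choose sensor sites, sample the map only there, and interpolate the rest. The fragment below is illustrative only; it uses a synthetic thermal map in place of HotSpot output and plain k-means placement (the baseline the paper improves on with non-uniform subsampling), and the grid size, sensor count, and error metric are assumptions:

```python
import numpy as np
from scipy.interpolate import griddata
from sklearn.cluster import KMeans

# Synthetic 64x64 thermal map standing in for a HotSpot output (assumption).
y, x = np.mgrid[0:64, 0:64]
thermal = 70 + 15 * np.exp(-((x - 20) ** 2 + (y - 40) ** 2) / 200.0)

# Place k sensors at the centroids of k-means clusters over the grid points.
points = np.column_stack([x.ravel(), y.ravel()])
centroids = KMeans(n_clusters=16, n_init=10, random_state=0).fit(points).cluster_centers_
sensors = np.clip(np.round(centroids), 0, 63).astype(int)

# Sample the map only at the sensor sites, then rebuild it by cubic interpolation.
readings = thermal[sensors[:, 1], sensors[:, 0]]
recovered = griddata(sensors, readings, (x, y), method='cubic')

valid = ~np.isnan(recovered)   # cubic interpolation is undefined outside the hull
print("mean absolute error (deg C):", np.abs(recovered - thermal)[valid].mean())
```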

    Effects of injection pressure on network throughput

    Get PDF
    ©2006 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes, or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.
    Recent parallel systems use multiple injection ports and various injection policies, but little is known about their impact on network performance. This paper evaluates the influence that these injection interfaces have on maximum sustained throughput in adaptive cut-through torus networks by modeling the number of injection queues (1 or 4) and the allocation of new packets to those queues. Network evaluations for medium to large 2D tori show that designs with multiple injection ports do not improve performance under uniform traffic. On the contrary, they put more pressure from the injection interface on the scarce network resources of an already clogged system. Interestingly, for small networks, a single injection FIFO queue, with the head-of-line blocking (HOLB) it entails, indirectly provides the much-needed injection control. For networks with thousands of nodes and multiple injection channels, such as those being implemented in current massively parallel processors, this implicit form of congestion control is not enough. In such systems, restrictive injection policies are required to prevent routers from being flooded with new packets at loads beyond saturation.
    C. Izu, J. Miguel-Alonso, J.A. Gregorio
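
    A restrictive injection policy amounts to an admission test at the network interface: hold new packets back whenever the local router already looks congested. The toy model below illustrates the idea only; the queue-allocation rule, the occupancy threshold, and the one-packet-per-cycle limit are assumptions, not the evaluated designs:

```python
from collections import deque

class InjectionInterface:
    """Toy model of a node's injection interface with a restrictive policy."""

    def __init__(self, num_queues=4, router_buffer_slots=16, threshold=0.75):
        self.queues = [deque() for _ in range(num_queues)]
        self.router_buffer_slots = router_buffer_slots
        self.threshold = threshold            # fraction of busy router buffer slots

    def enqueue(self, packet):
        # Allocation policy (one of several one could model): shortest queue wins.
        min(self.queues, key=len).append(packet)

    def try_inject(self, router_occupancy):
        """Inject at most one packet this cycle, and only while the local
        router's buffer occupancy stays below the restriction threshold."""
        if router_occupancy / self.router_buffer_slots >= self.threshold:
            return None                       # hold back: network looks congested
        for q in self.queues:
            if q:
                return q.popleft()
        return None

nic = InjectionInterface()
nic.enqueue("pkt0")
print(nic.try_inject(router_occupancy=14))    # None: 14/16 is past the threshold
print(nic.try_inject(router_occupancy=4))     # 'pkt0': enough room downstream
```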

    A Temperature and Reliability Oriented Simulation Framework for Multi-core Architectures

    Get PDF
    The increasing complexity of multi-core architectures demands a comprehensive evaluation of different solutions and alternatives at every stage of the design process, considering several aspects at the same time. Simulation frameworks are attractive tools for fulfilling this requirement due to their flexibility. Nevertheless, state-of-the-art simulation frameworks lack a joint system-level analysis of power, performance, temperature profile, and reliability projection, each focusing only on a specific aspect. This paper presents a comprehensive estimation framework that jointly exploits these design metrics at the system level, considering processing cores, interconnect design, and storage elements. We describe the framework in detail and provide a set of experiments that highlight its capability and flexibility, focusing on temperature and reliability analysis of multi-core architectures supported by a Network-on-Chip interconnect.
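
    At a very coarse level, the joint estimation such a framework performs can be pictured as a per-epoch loop in which power feeds a thermal model and temperature feeds a reliability projection. The single-node RC thermal step and Arrhenius-style aging factor below are generic stand-ins, not the framework's actual models, and every constant is an assumption:

```python
import math

AMBIENT = 45.0          # deg C, assumed package/ambient temperature
R_TH, C_TH = 0.8, 30.0  # assumed thermal resistance (K/W) and capacitance (J/K)
EA_OVER_K = 5800.0      # assumed activation energy over Boltzmann constant, in kelvin

def thermal_step(temp, power, dt=1.0):
    """One explicit-Euler step of a single-node RC thermal model (sketch)."""
    return temp + dt * (power - (temp - AMBIENT) / R_TH) / C_TH

def aging_rate(temp_c):
    """Arrhenius-style wear-out acceleration relative to a 60 deg C reference."""
    t, ref = temp_c + 273.15, 60.0 + 273.15
    return math.exp(EA_OVER_K * (1.0 / ref - 1.0 / t))

temp, consumed_life = 50.0, 0.0
power_trace = [12.0] * 200 + [4.0] * 200        # watts per epoch, made-up phases
for p in power_trace:
    temp = thermal_step(temp, p)                # power/performance -> temperature
    consumed_life += aging_rate(temp)           # temperature -> reliability budget
print(f"final temperature {temp:.1f} C, consumed life {consumed_life:.1f} units")
```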

    Exploiting a new level of DLP in multimedia applications

    Get PDF
    This paper proposes and evaluates MOM: a novel ISA paradigm targeted at multimedia applications. By fusing conventional vector ISA approaches with more recent SIMD-like (Single Instruction Multiple Data) ISAs (such as MMX), we have developed a new matrix-oriented ISA which efficiently deals with the small matrix structures typically found in multimedia applications. MOM exploits a level of DLP not reachable by either conventional vector ISAs or SIMD-like media ISA extensions. Our results show that MOM provides a factor of 1.3x to 4x performance improvement over two different multimedia extensions (MMX and MDMX) on several kernels, which translates into up to a 50% performance gain when measuring full applications (20% on average). Furthermore, the streaming nature of MOM provides additional advantages for executing multimedia applications, such as very low fetch pressure and high tolerance to memory latency, making MOM an ideal candidate for the embedded domain.
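
    The extra level of DLP can be pictured with a toy computation: where a SIMD-like extension handles one packed row per instruction, a matrix-oriented instruction treats a whole small matrix as a single operand. The 8x8 block and the saturating add below are illustrative assumptions, not MOM's actual instruction set:

```python
import numpy as np

a = np.random.randint(0, 256, size=(8, 8), dtype=np.uint16)
b = np.random.randint(0, 256, size=(8, 8), dtype=np.uint16)

# SIMD-like view: one packed row of eight elements per instruction, so the
# 8x8 block costs eight instructions plus the surrounding loop overhead.
simd_result = np.empty_like(a)
for row in range(a.shape[0]):
    simd_result[row] = np.minimum(a[row] + b[row], 255)   # saturating add, one row

# Matrix-oriented view: the whole block is one architectural operand, so the
# same work is expressed as a single matrix operation.
mom_result = np.minimum(a + b, 255)

assert np.array_equal(simd_result, mom_result)
```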

    Temperature Sensor Placement Including Routing Overhead and Sampling Inaccuracies

    Get PDF
    Dynamic thermal management techniques require a collection of on-chip thermal sensors that imply a significant area and power overhead. Finding the optimal number of temperature monitors and their locations on the chip surface to optimize accuracy is an NP-hard problem. In this work we improve the modeling of the problem by including area, power, and networking constraints, along with the consideration of three inaccuracy terms: spatial errors, sampling-rate errors, and monitor-inherent errors. The problem is solved with a simulated annealing algorithm. We apply the algorithm to a test case employing three different types of monitors to highlight the importance of the different metrics. Finally, we present a case study of the Alpha 21364 processor under two different constraint scenarios.
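
    A bare-bones version of the optimization could look like the sketch below, which assumes a known synthetic thermal map and a cost that measures only spatial reconstruction error (the area, power, networking, and monitor-inherent error terms of the full model are left out); nearest-sensor reconstruction and all constants are assumptions for illustration:

```python
import math
import random
import numpy as np

y, x = np.mgrid[0:32, 0:32]
thermal = 60 + 20 * np.exp(-((x - 8) ** 2 + (y - 24) ** 2) / 60.0)   # synthetic map

def cost(sensors):
    """Mean absolute error of a nearest-sensor reconstruction of the map."""
    sx = np.array([s[0] for s in sensors])
    sy = np.array([s[1] for s in sensors])
    nearest = ((x[..., None] - sx) ** 2 + (y[..., None] - sy) ** 2).argmin(axis=-1)
    return np.abs(thermal[sy[nearest], sx[nearest]] - thermal).mean()

def anneal(num_sensors=6, steps=2000, t0=1.0, alpha=0.998):
    sensors = [(random.randrange(32), random.randrange(32)) for _ in range(num_sensors)]
    best, best_cost, temp = list(sensors), cost(sensors), t0
    for _ in range(steps):
        candidate = list(sensors)
        candidate[random.randrange(num_sensors)] = (random.randrange(32),
                                                    random.randrange(32))
        delta = cost(candidate) - cost(sensors)
        # Accept improvements always, and worse moves with a shrinking probability.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            sensors = candidate
            if cost(sensors) < best_cost:
                best, best_cost = list(sensors), cost(sensors)
        temp *= alpha
    return best, best_cost

print(anneal())
```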

    A Light-Weight On-Chip Monitoring Network for Dynamic Adaptation and Calibration

    Full text link
    Current nanometer technologies suffer from within-die parameter uncertainties, varying workload conditions, aging, and temperature effects that cause a serious reduction in yield and performance. In this scenario, monitoring, calibration, and dynamic adaptation become essential, demanding systems with a collection of multi-purpose monitors and exposing the need for light-weight monitoring networks. This paper presents a new monitoring-network paradigm able to perform an early prioritization of the information. This is achieved by introducing a new hierarchy level, the threshing level. Targeting it, we propose a time-domain signaling scheme over a single wire that minimizes the network switching activity as well as the routing requirements. To validate our approach, we make a thorough analysis of the architectural trade-offs and present two complete monitoring systems that achieve an area improvement of 40% and a power reduction of three orders of magnitude compared to previous works.
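
    The single-wire, time-domain idea can be sketched as encoding a reading in the delay between pulses, with the threshing level forwarding only the readings that cross an alarm threshold, so most samples never toggle the shared wire. The quantization step and threshold below are illustrative assumptions, not the paper's circuit:

```python
def encode_time_domain(reading, lsb=0.5):
    """Map a monitor reading to a pulse delay in clock cycles (sketch)."""
    return int(round(reading / lsb))

def thresh(readings, alarm=85.0):
    """Threshing level: pass upstream only the readings worth reporting."""
    return {sensor: value for sensor, value in readings.items() if value >= alarm}

samples = {"core0": 72.5, "core1": 88.0, "core2": 91.5}   # e.g. temperatures in deg C
for sensor, value in thresh(samples).items():
    print(f"{sensor}: pulse delay of {encode_time_domain(value)} cycles on the wire")
```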

    On the design of a high-performance adaptive router for CC-NUMA multiprocessors

    Get PDF
    Copyright © 2003 IEEE. This work presents the design and evaluation of an adaptive packet router aimed at supporting CC-NUMA traffic. We exploit a simple and efficient packet injection mechanism to avoid deadlock, which leads to fully adaptive routing with only three virtual channels. In addition, we selectively use output buffers for implementing the most utilized virtual paths in order to reduce head-of-line blocking. The careful implementation of these features results in a good trade-off between network performance and hardware cost. The outcome of this research is a High-Performance Adaptive Router (HPAR), which adequately balances the needs of parallel applications: minimal network latency at low loads and high throughput at heavy loads. The paper includes an evaluation in which HPAR is compared with other adaptive routers using FIFO input buffering, with or without additional virtual channels to reduce head-of-line blocking. This evaluation considers both the VLSI costs of each router and their performance under synthetic and real application workloads. To make the comparison fair, all the routers use the same efficient deadlock-avoidance mechanism. In all the experiments, HPAR exhibited the best response among the routers tested. The throughput gains ranged from 10 percent to 40 percent with respect to its most direct rival, which employs more hardware resources. Other results show that HPAR achieves up to 83 percent of its theoretical maximum throughput under random traffic and up to 70 percent when running real applications. Moreover, the observed packet latencies were comparable to those exhibited by simpler routers. Therefore, HPAR can be considered a suitable candidate for implementing packet interchange in next generations of CC-NUMA multiprocessors.
    Valentín Puente, José-Ángel Gregorio, Ramón Beivide, and Cruz Izu
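
    Fully adaptive routing with a small number of virtual channels is commonly organized as adaptive channels backed by an escape path: a packet takes any minimal adaptive channel with buffer space and otherwise falls back to a deadlock-free escape channel. The sketch below illustrates that general scheme; the channel roles, port names, and credit check are assumptions, not HPAR's microarchitecture:

```python
# Assumed virtual-channel roles: two adaptive VCs plus one escape VC that
# follows a deadlock-free (e.g. dimension-order) route.
ADAPTIVE_VCS, ESCAPE_VC = (0, 1), 2

def select_output(minimal_ports, credits, escape_port):
    """Pick an (output port, virtual channel) pair for the packet at the head.

    minimal_ports: productive output ports toward the destination.
    credits[port][vc]: free buffer slots downstream on that port and VC.
    """
    # Fully adaptive first: any minimal port whose adaptive VCs have space.
    for port in minimal_ports:
        for vc in ADAPTIVE_VCS:
            if credits[port][vc] > 0:
                return port, vc
    # Otherwise take the escape VC on the deterministic route; its guaranteed
    # drain is what keeps the whole scheme deadlock-free.
    if credits[escape_port][ESCAPE_VC] > 0:
        return escape_port, ESCAPE_VC
    return None                               # blocked this cycle, retry later

credits = {"x+": [0, 0, 1], "y-": [0, 2, 1]}
print(select_output(minimal_ports=["x+", "y-"], credits=credits, escape_port="x+"))
```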