197 research outputs found

    An efficient 2D router architecture for extending the performance of inhomogeneous 3D NoC-based multi-core architectures

    Get PDF
    To meet the performance and scalability demands of the fast-paced technological growth towards exascale and Big-Data processing with the performance bottleneck of conventional metal based interconnects, alternative interconnect fabrics such as inhomogeneous three dimensional integrated Network-on-Chip (3D NoC) has emanated as a cost-effective solution for emerging multi-core design. However, these interconnects trade-off optimized performance for cost by restricting the number of area and power hungry 3D routers. Consequently, in this paper, we propose a low-latency adaptive router with a low-complexity single-cycle bypassing mechanism to alleviate the performance degradation due to the slow 2D routers in inhomogeneous 3D NoCs. By combining the low-complexity bypassing technique with adaptive routing, the proposed router is able to balance the traffic in the network to reduce the average packet latency under various traffic loads. Simulation shows that, the proposed router can reduce the average packet delay by an average of 45% in 3D NoCs

    Priority Based Switch Allocator in Adaptive Physical Channel Regulator for On Chip Interconnects

    Get PDF
    Chip multiprocessors (CMPs) are now popular design paradigm for microprocessors due to their power, performance and complexity advantages where a number of relatively simple cores are integrated on a single die. On chip interconnection network (NoC) is an excellent architectural paradigm which offers a stable and generalized communication platform for large scale of chip multiprocessors. The existing model APCR has three regulation schemes designed at switch allocation stage of NoC router pipelining, such as monopolizing, fair-sharing and channel-stealing. Its aim is to fairly allocate physical bandwidth in the form of flit level transmission unit while breaking the conventional assumptions i.e.its size is same as phit size. They have implemented channel-stealing scheme using the existing round-robin scheduler which is a well known scheduling algorithm for providing fairness, which is not an optimal solution. In this thesis, we have extended the efficiency of APCR model and propose three efficient scheduling policies for the channel stealing scheme in order to provide better quality of service (QoS). Our work can be divided into three parts. In the first part, we implemented ratio based scheduling technique in which we keep track of average number of its sent from each input in every cycle. It not only provides fairness among virtual channels (VCs), but also increases the saturation throughput of the network. In the second part, we have implemented an age based scheduling technique where we prioritize the VC, based on the age of the requesting flits. The age of each request is calculated as the difference between the time of injection and the current simulation time. Age based scheduler minimizes the packet latency. In the last part, we implemented a Static-Priority based scheduler. In this case, we arbitrarily assign random priorities to the packets at the time of their injection into the network. In this case, the high priority packets can be forwarded to any of the VCs, whereas the low priority packets can be forwarded to a limited number of VCs. So, basically Static-Priority based scheduler limits the accessibility on the number of VCs depending upon the packet priority. We study the performance metrics such as the average packet latency, and saturation throughput resulted by all the three new scheduling techniques. We demonstrate our simulation results for all three scheduling policies i.e. bit complement, transpose and uniform random considering from very low (no load) to high load injection rates. We evaluate the performance improvement because of our proposed scheduling techniques in APCR comparing with the performance of basic NoC design. The performance is also compared with the results found in monopolizing, fair-sharing and round-robin schemes for channel-stealing of APCR. It is observed from the simulation results using our detailed cycle-accurate simulator that our new scheduling policies implemented in APCR model improves the network throughput by 10% in case of synthetic workloads, compared with the existing round-robin scheme. Also, our scheduling policy in APCR model outperforms the baseline router by 28X under synthetic workloads

    VLPW: The Very Long Packet Window Architecture for High Throughput Network-On-Chip Router Designs

    Get PDF
    ChipMulti-processor (CMP) architectures have become mainstream for designing processors. With a large number of cores, Network-On-Chip (NOC) provides a scalable communication method for CMPs. NOC must be carefully designed to provide low latencies and high throughput in the resource-constrained environment. To improve the network throughput, we propose the Very Long Packet Window (VLPW) architecture for the NOC router design that tries to close the throughput gap between state-of-the-art on-chip routers and the ideal interconnect fabric. To improve throughput, VLPW optimizes Switch Allocation (SA) efficiency. Existing SA normally applies Round-Robin scheduling to arbitrate among the packets targeting the same output port. However, this simple approach suffers from low arbitration efficiency and incurs low network throughput. Instead of relying solely on simple switch scheduling, the VLPW router design globally schedules all the input packets, resolves the output conflicts and achieves high throughput. With the VLPW architecture, we propose two scheduling schemes: Global Fairness and Global Diversity. Our simulation results show that the VLPW router achieves more than 20% throughput improvement without negative effects on zero-load latency

    Adaptive memory-side last-level GPU caching

    Get PDF
    Emerging GPU applications exhibit increasingly high computation demands which has led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) in equally-sized slices that are shared by all SMs. Although a shared LLC typically results in a lower miss rate, we find that for workloads with high degrees of data sharing across SMs, a private LLC leads to a significant performance advantage because of increased bandwidth to replicated cache lines across different LLC slices. In this paper, we propose adaptive memory-side last-level GPU caching to boost performance for sharing-intensive workloads that need high bandwidth to read-only shared data. Adaptive caching leverages a lightweight performance model that balances increased LLC bandwidth against increased miss rate under private caching. In addition to improving performance for sharing-intensive workloads, adaptive caching also saves energy in a (co-designed) hierarchical two-stage crossbar NoC by power-gating and bypassing the second stage if the LLC is configured as a private cache. Our experimental results using 17 GPU workloads show that adaptive caching improves performance by 28.1% on average (up to 38.1%) compared to a shared LLC for sharing-intensive workloads. In addition, adaptive caching reduces NoC energy by 26.6% on average (up to 29.7%) and total system energy by 6.1% on average (up to 27.2%) when configured as a private cache. Finally, we demonstrate through a GPU NoC design space exploration that a hierarchical two-stage crossbar is both more power- and area-efficient than full and concentrated crossbars with the same bisection bandwidth, thus providing a low-cost cooperative solution to exploit workload sharing behavior in memory-side last-level caches

    Extending the performance of hybrid NoCs beyond the limitations of network heterogeneity

    Get PDF
    To meet the performance and scalability demands of the fast-paced technological growth towards exascale and Big-Data processing with the performance bottleneck of conventional metal based interconnects (wireline), alternative interconnect fabrics such as inhomogeneous three-dimensional integrated Network-on-Chip (3D NoC) and hybrid wired-wireless Network-on-Chip (WiNoC) have emanated as a cost-effective solution for emerging System-on-Chip (SoC) design. However, these interconnects trade-off optimized performance for cost by restricting the number of area and power hungry 3D routers and wireless nodes. Moreover, the non-uniform distributed traffic in chip multiprocessor (CMP) demands an on-chip communication infrastructure which can avoid congestion under high traffic conditions while possessing minimal pipeline delay at low-load conditions. To this end, in this paper, we propose a low-latency adaptive router with a low-complexity single-cycle bypassing mechanism to alleviate the performance degradation due to the slow 2D routers in such emerging hybrid NoCs. The proposed router transmits a flit using dimension-ordered routing (DoR) in the bypass datapath at low-loads. When the output port required for intra-dimension bypassing is not available, the packet is routed adaptively to avoid congestion. The router also has a simplified virtual channel allocation (VA) scheme that yields a non-speculative low-latency pipeline. By combining the low-complexity bypassing technique with adaptive routing, the proposed router is able balance the traffic in hybrid NoCs to achieve low-latency communication under various traffic loads. Simulation shows that, the proposed router can reduce applications’ execution time by an average of 16.9% compared to low-latency routers such as SWIFT. By reducing the latency between 2D routers (or wired nodes) and 3D routers (or wireless nodes) the proposed router can improve performance efficiency in terms of average packet delay by an average of 45% (or 50%) in 3D NoCs (or WiNoCs)

    Savior: A Reliable Fault Resilient Router Architecture for Network-on-Chip

    Full text link
    [EN] The router plays an important role in communication among different processing cores in on-chip networks. Technology scaling on one hand has enabled the designers to integrate multiple processing components on a single chip; on the other hand, it becomes the reason for faults. A generic router consists of the buffers and pipeline stages. A single fault may result in an undesirable situation of degraded performance or a whole chip may stop working. Therefore, it is necessary to provide permanent fault tolerance to all the components of the router. In this paper, we propose a mechanism that can tolerate permanent faults that occur in the router. We exploit the fault-tolerant techniques of resource sharing and paring between components for the input port unit and routing computation (RC) unit, the resource borrowing for virtual channel allocator (VA) and multiple paths for switch allocator (SA) and crossbar (XB). The experimental results and analysis show that the proposed mechanism enhances the reliability of the router architecture towards permanent faults at the cost of 29% area overhead. The proposed router architecture achieves the highest Silicon Protection Factor (SPF) metric, which is 24.4 as compared to the state-of-the-art fault-tolerant architectures. It incurs an increase in latency for SPLASH2 and PARSEC benchmark traffics, which is minimal as compared to the baseline router.This work was supported by the Spanish 'Ministerio de Ciencia Innovacion y Universidades' and FEDER program in the framework of the 'Proyectos de I+D d Generacion de Conocimiento del Programa Estatal de Generacion de Conocimiento y Fortalecimiento Cientifico y Tecnologico del Sistema de I+D+i, Subprograma Estatal de Generacion de Conocimiento' (ref: PGC2018-095747-B-I00).Hussain, A.; Irfan, M.; Baloch, NK.; Draz, U.; Ali, T.; Glowacz, A.; Dunai, L.... (2020). Savior: A Reliable Fault Resilient Router Architecture for Network-on-Chip. Electronics. 9(11):1-18. https://doi.org/10.3390/electronics9111783S118911Borkar, S. (1999). Design challenges of technology scaling. IEEE Micro, 19(4), 23-29. doi:10.1109/40.782564Latif, K., Rahmani, A.-M., Nigussie, E., Seceleanu, T., Radetzki, M., & Tenhunen, H. (2013). Partial Virtual Channel Sharing: A Generic Methodology to Enhance Resource Management and Fault Tolerance in Networks-on-Chip. Journal of Electronic Testing, 29(3), 431-452. doi:10.1007/s10836-013-5389-5Borkar, S. (2005). Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro, 25(6), 10-16. doi:10.1109/mm.2005.110Ali, T., Noureen, J., Draz, U., Shaf, A., Yasin, S., & Ayaz, M. (2018). Participants Ranking Algorithm for Crowdsensing in Mobile Communication. ICST Transactions on Scalable Information Systems, 5(16), 154476. doi:10.4108/eai.13-4-2018.154476Ali, T., Draz, U., Yasin, S., Noureen, J., shaf, A., & Zardari, M. (2018). An Efficient Participant’s Selection Algorithm for Crowdsensing. International Journal of Advanced Computer Science and Applications, 9(1). doi:10.14569/ijacsa.2018.090154Poluri, P., & Louri, A. (2016). Shield: A Reliable Network-on-Chip Router Architecture for Chip Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 27(10), 3058-3070. doi:10.1109/tpds.2016.2521641Valinataj, M., & Shahiri, M. (2016). A low-cost, fault-tolerant and high-performance router architecture for on-chip networks. Microprocessors and Microsystems, 45, 151-163. doi:10.1016/j.micpro.2016.04.009Kim, J., Nicopoulos, C., Park, D., Narayanan, V., Yousif, M. S., & Das, C. R. (2006). A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks. ACM SIGARCH Computer Architecture News, 34(2), 4-15. doi:10.1145/1150019.1136487Polian, I., & Hayes, J. P. (2011). Selective Hardening: Toward Cost-Effective Error Tolerance. IEEE Design & Test of Computers, 28(3), 54-63. doi:10.1109/mdt.2010.120Mohammed, H., Flayyih, W., & Rokhani, F. (2019). Tolerating Permanent Faults in the Input Port of the Network on Chip Router. Journal of Low Power Electronics and Applications, 9(1), 11. doi:10.3390/jlpea9010011Wang, L., Ma, S., Li, C., Chen, W., & Wang, Z. (2017). A high performance reliable NoC router. Integration, 58, 583-592. doi:10.1016/j.vlsi.2016.10.016Shafique, M. A., Baloch, N. K., Baig, M. I., Hussain, F., Zikria, Y. B., & Kim, S. W. (2020). NoCGuard: A Reliable Network-on-Chip Router Architecture. Electronics, 9(2), 342. doi:10.3390/electronics9020342Poluri, P., & Louri, A. (2015). A Soft Error Tolerant Network-on-Chip Router Pipeline for Multi-Core Systems. IEEE Computer Architecture Letters, 14(2), 107-110. doi:10.1109/lca.2014.2360686Feng, C., Lu, Z., Jantsch, A., Zhang, M., & Xing, Z. (2013). Addressing Transient and Permanent Faults in NoC With Efficient Fault-Tolerant Deflection Router. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(6), 1053-1066. doi:10.1109/tvlsi.2012.2204909Liu, J., Harkin, J., Li, Y., & Maguire, L. P. (2016). Fault-Tolerant Networks-on-Chip Routing With Coarse and Fine-Grained Look-Ahead. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35(2), 260-273. doi:10.1109/tcad.2015.2459050Runge, A. (2015). FaFNoC: A Fault-tolerant and Bufferless Network-on-chip. Procedia Computer Science, 56, 397-402. doi:10.1016/j.procs.2015.07.226Binkert, N., Beckmann, B., Black, G., Reinhardt, S. K., Saidi, A., Basu, A., … Wood, D. A. (2011). The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2), 1-7. doi:10.1145/2024716.202471
    • …
    corecore