161 research outputs found

    CD-Xbar : a converge-diverge crossbar network for high-performance GPUs

    Get PDF
    Modern GPUs feature an increasing number of streaming multiprocessors (SMs) to boost system throughput. How to construct an efficient and scalable network-on-chip (NoC) for future high-performance GPUs is particularly critical. Although a mesh network is a widely used NoC topology in manycore CPUs for scalability and simplicity reasons, it is ill-suited to GPUs because of the many-to-few-to-many traffic pattern observed in GPU-compute workloads. Although a crossbar NoC is a natural fit, it does not scale to large SM counts while operating at high frequency. In this paper, we propose the converge-diverge crossbar (CD-Xbar) network with round-robin routing and topology-aware concurrent thread array (CTA) scheduling. CD-Xbar consists of two types of crossbars, a local crossbar and a global crossbar. A local crossbar converges input ports from the SMs into so-called converged ports; the global crossbar diverges these converged ports to the last-level cache (LLC) slices and memory controllers. CD-Xbar provides routing path diversity through the converged ports. Round-robin routing and topology-aware CTA scheduling balance network traffic among the converged ports within a local crossbar and across crossbars, respectively. Compared to a mesh with the same bisection bandwidth, CD-Xbar reduces NoC active silicon area and power consumption by 52.5 and 48.5 percent, respectively, while at the same time improving performance by 13.9 percent on average. CD-Xbar performs within 2.9 percent of an idealized fully-connected crossbar. We further demonstrate CD-Xbar's scalability, flexibility and improved performance perWatt (by 17.1 percent) over state-of-the-art GPU NoCs which are highly customized and non-scalable

    Extending the performance of hybrid NoCs beyond the limitations of network heterogeneity

    Get PDF
    To meet the performance and scalability demands of the fast-paced technological growth towards exascale and Big-Data processing with the performance bottleneck of conventional metal based interconnects (wireline), alternative interconnect fabrics such as inhomogeneous three-dimensional integrated Network-on-Chip (3D NoC) and hybrid wired-wireless Network-on-Chip (WiNoC) have emanated as a cost-effective solution for emerging System-on-Chip (SoC) design. However, these interconnects trade-off optimized performance for cost by restricting the number of area and power hungry 3D routers and wireless nodes. Moreover, the non-uniform distributed traffic in chip multiprocessor (CMP) demands an on-chip communication infrastructure which can avoid congestion under high traffic conditions while possessing minimal pipeline delay at low-load conditions. To this end, in this paper, we propose a low-latency adaptive router with a low-complexity single-cycle bypassing mechanism to alleviate the performance degradation due to the slow 2D routers in such emerging hybrid NoCs. The proposed router transmits a flit using dimension-ordered routing (DoR) in the bypass datapath at low-loads. When the output port required for intra-dimension bypassing is not available, the packet is routed adaptively to avoid congestion. The router also has a simplified virtual channel allocation (VA) scheme that yields a non-speculative low-latency pipeline. By combining the low-complexity bypassing technique with adaptive routing, the proposed router is able balance the traffic in hybrid NoCs to achieve low-latency communication under various traffic loads. Simulation shows that, the proposed router can reduce applications’ execution time by an average of 16.9% compared to low-latency routers such as SWIFT. By reducing the latency between 2D routers (or wired nodes) and 3D routers (or wireless nodes) the proposed router can improve performance efficiency in terms of average packet delay by an average of 45% (or 50%) in 3D NoCs (or WiNoCs)

    Floorplan-Aware High Performance NoC Design

    Full text link
    Las actuales arquitecturas de m�ltiples n�cleos como los chip multiprocesadores (CMP) y soluciones multiprocesador para sistemas dentro del chip (MPSoCs) han adoptado a las redes dentro del chip (NoC) como elemento -ptimo para la inter-conexi-n de los diversos elementos de dichos sistemas. En este sentido, fabricantes de CMPs y MPSoCs han adoptado NoCs sencillas, generalmente con una topolog'a en malla o anillo, ya que son suficientes para satisfacer las necesidades de los sistemas actuales. Sin embargo a medida que los requerimientos del sistema -- baja latencia y alto rendimiento -- se hacen m�s exigentes, estas redes tan simples dejan de ser una soluci-n real. As', la comunidad investigadora ha propuesto y analizado NoCs m�s complejas. No obstante, estas soluciones son m�s dif'ciles de implementar -- especialmente los enlaces largos -- haciendo que este tipo de topolog'as complejas sean demasiado costosas o incluso inviables. En esta tesis, presentamos una metodolog'a de dise-o que minimiza la p�rdida de prestaciones de la red debido a su implementaci-n real. Los principales problemas que se encuentran al implementar una NoC son los conmutadores y los enlaces largos. En esta tesis, el conmutador se ha hecho modular, es decir, formado como uni-n de m-dulos m�s peque-os. En nuestro caso, los m-dulos son id�nticos, donde cada m-dulo es capaz de arbitrar, conmutar, y almacenar los mensajes que le llegan. Posteriormente, flexibilizamos la colocaci-n de estos m-dulos en el chip, permitiendo que m-dulos de un mismo conmutador est�n distribuidos por el chip. Esta metodolog'a de dise-o la hemos aplicado a diferentes escenarios. Primeramente, hemos introducido nuestro conmutador modular en NoCs con topolog'as conocidas como la malla 2D. Los resultados muestran como la modularidad y la distribuci-n del conmutador reducen la latencia y el consumo de potencia de la red. En segundo lugar, hemos utilizado nuestra metodolog'a de dise-o para implementar un crossbar distribuidRoca Pérez, A. (2012). Floorplan-Aware High Performance NoC Design [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/17844Palanci

    Networks on Chips: Structure and Design Methodologies

    Get PDF
    corecore