FlexVC: Flexible virtual channel management in low-diameter networks
Deadlock avoidance mechanisms for lossless low-distance networks typically increase the virtual channel (VC) index with each hop. This restricts the number of usable buffer resources depending on the routing mechanism and limits performance because the buffers are used inefficiently. Dynamic buffer organizations increase implementation complexity and provide only small gains in this context, because a significant amount of buffering needs to be allocated statically to avoid congestion. We introduce FlexVC, a simple buffer management mechanism which permits a more flexible use of VCs. It combines statically partitioned buffers, opportunistic routing and a relaxed distance-based deadlock avoidance policy. FlexVC mitigates Head-of-Line blocking and reduces memory requirements by up to 50%. Simulation results in a Dragonfly network show congestion reduction and up to 37.8% throughput improvement, outperforming more complex dynamic approaches. FlexVC merges different flows of traffic in the same buffers, which in some cases makes it more difficult to identify the traffic pattern in order to support nonminimal adaptive routing. An alternative denoted FlexVCminCred improves congestion sensing for adaptive routing by separately tracking packets routed minimally and nonminimally, raising throughput by up to 20.4% with 25% savings in buffer area.
This work has been supported by the Spanish Government (grant SEV2015-0493 of the Severo Ochoa Program), the Spanish Ministry of Economy, Industry and Competitiveness (contracts TIN2015-65316), the Spanish Research Agency (AEI/FEDER, UE - TIN2016-76635-C2-2-R), the Spanish Ministry of Education (FPU grant FPU13/00337), the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the European Union FP7 programme (RoMoL ERC Advanced Grant GA 321253), the European HiPEAC Network of Excellence and the European Union’s Horizon 2020 research and innovation programme (Mont-Blanc project under grant agreement No 671697).
Peer reviewed. Postprint (author's final draft)
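The baseline policy that the FlexVC abstract starts from (the VC index grows with each hop) can be illustrated with a minimal sketch. This is not the FlexVC mechanism itself; the function names and the 3-hop and 6-hop path lengths are hypothetical values chosen only for illustration.

```python
# Toy model of a strictly increasing, distance-based VC policy.
# All names and path lengths are illustrative assumptions, not taken
# from the FlexVC paper.

def baseline_vc_sequence(path_length):
    """VC index used at each hop: hop i travels on VC i."""
    return list(range(path_length))


def is_strictly_increasing(vc_indices):
    """A packet only ever waits for a higher-index VC, so waits cannot form a cycle."""
    return all(a < b for a, b in zip(vc_indices, vc_indices[1:]))


def vcs_required(max_path_length):
    """Under this policy, one VC is needed per hop of the longest allowed path."""
    return max_path_length


MIN_HOPS = 3      # assumed longest minimal path
NONMIN_HOPS = 6   # assumed longest nonminimal path

minimal = baseline_vc_sequence(MIN_HOPS)
nonminimal = baseline_vc_sequence(NONMIN_HOPS)
```

Under these assumptions, packets on minimal paths never touch the three highest VCs, which is the kind of inefficient buffer use that the abstract attributes to strictly increasing policies.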
Reliability aware NoC router architecture using input channel buffer sharing
To address the increasing demand for reliability in on-chip networks, we propose a novel Reliability Aware Virtual Channel (RAVC) NoC router micro-architecture that enables both dynamic virtual channel allocation and rational sharing among the buffers of different input channels. In particular, when routers fail, the virtual channels of the routers surrounding the faulty routers can be fully recaptured and reassigned to other input ports. Moreover, the proposed RAVC router prevents the faulty router from occupying network bandwidth. Experimental results show that the proposed micro-architecture decreases average latency by 7.1% and 3.1% under uniform and transpose traffic patterns, respectively. When routers of the on-chip network fail, RAVC decreases the average packet latency by 28% and 16% under uniform and transpose traffic patterns, respectively
A Verilog-HDL implementation of virtual channels in a network-on-chip router
As feature size continuously decreases and integration density increases,
interconnections have become a dominating factor in determining the overall
quality of a chip. Because of its limited scalability, the system bus supports only
a limited number of functional units and cannot meet the requirements of current
System-on-Chip (SoC) implementations. Long global wires also cause many
design problems, such as routing congestion, noise coupling, and difficult timing closure.
Network-on-Chip (NoC) architectures have been proposed as an alternative
that solves these problems by using a packet-based communication network. The
processing elements (PEs) communicate with each other by exchanging messages over
the network, and these messages go through buffers in each router. Buffers are one of
the major resources used by routers in virtual channel flow control.
In this thesis, we analyze two kinds of buffer allocation approaches: static and
dynamic buffer allocation. These approaches aim to increase throughput and minimize
latency by means of virtual channel flow control. In a statically allocated buffer
architecture, size and organization are design-time decisions and thus do not perform
optimally for all traffic conditions. In addition, statically allocated virtual channels
waste area and consume significant leakage power. Dynamic buffer allocation
schemes, in contrast, claim that buffer utilization can be increased using dynamic virtual
channels. The dynamic Virtual Channel Regulator (ViChaR) has been proposed to use a
centralized buffer architecture which dynamically allocates virtual channels and buffer
slots in real time depending on traffic conditions. ViChaR's dynamic buffer management
scheme increases buffer utilization, but it also increases design complexity. In
this research, we reexamine the performance, power consumption, and area of ViChaR's
buffer architecture through implementation. We implement a generic router and a
ViChaR architecture using Verilog-HDL. The RTL code is verified by dynamic
simulation and synthesized with Design Compiler to obtain area and power consumption.
In addition, we obtain latency through Static Timing Analysis. The results show that
ViChaR's dynamic buffer management scheme increases latency and power consumption
significantly, even though it could increase buffer utilization. Therefore, we
need a novel design to achieve high buffer utilization without a loss
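The difference between static and dynamic buffer allocation described in this abstract can be illustrated with a small toy model. It is only a sketch: the class names, the four-VC configuration, and the 16-slot budget are hypothetical, and the shared pool below is a generic unified buffer, not ViChaR's actual architecture.

```python
# Toy comparison of static vs. dynamic (shared-pool) buffer allocation.
# Names and sizes are illustrative assumptions only.

class StaticBuffers:
    """Each VC owns a fixed number of slots decided at design time."""

    def __init__(self, num_vcs, slots_per_vc):
        self.slots_per_vc = slots_per_vc
        self.occupancy = [0] * num_vcs

    def try_enqueue(self, vc):
        if self.occupancy[vc] < self.slots_per_vc:
            self.occupancy[vc] += 1
            return True
        return False  # this VC is full even if other VCs sit empty


class SharedPool:
    """All slots form one pool handed out to VCs on demand."""

    def __init__(self, total_slots):
        self.free = total_slots
        self.occupancy = {}

    def try_enqueue(self, vc):
        if self.free > 0:
            self.free -= 1
            self.occupancy[vc] = self.occupancy.get(vc, 0) + 1
            return True
        return False


def accepted(buffers, arrivals):
    """Count how many arriving flits find a free slot."""
    return sum(buffers.try_enqueue(vc) for vc in arrivals)


skewed_traffic = [0] * 8  # eight flits, all arriving on VC 0
static_ok = accepted(StaticBuffers(num_vcs=4, slots_per_vc=4), skewed_traffic)
shared_ok = accepted(SharedPool(total_slots=16), skewed_traffic)
```

With the same 16-slot budget, the static organization accepts only the four flits that fit in VC 0, while the shared pool accepts all eight. The model captures utilization only; it says nothing about the latency and power costs that the thesis measures.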
Efficient bypass mechanisms for low latency networks on-chip
Networks-on-Chip (NoCs) are becoming more important in many-core processors as the number of cores grows. Bypass routers are an efficient way to reduce the latency of these networks by skipping pipeline stages. There are two types of bypass mechanisms: single-hop and multi-hop. Single-hop bypass minimizes the delay of each router by allocating router resources before packets arrive, which skips the allocation stages at each hop. Multi-hop bypass, known as SMART, minimizes the effective number of hops by letting packets traverse multiple routers in a single cycle.
The first proposal of this dissertation is Non-Empty Buffer Bypass (NEBB) for single-hop bypass, which increases bypass utilization without requiring virtual channels (VCs) to match traditional bypass routers.
For multi-hop bypass, it proposes SMART++ and S-SMART++. SMART++ removes SMART's requirement of using many VCs to exploit the bandwidth of the network, enabling low-cost configurations. S-SMART++ relies on speculative allocation to set up multi-hop bypass paths. It thereby reduces latency and its dependence on the maximum length of multi-hops, a key aspect for integrating multi-hop bypass in real designs.
The final contribution is an open-source set of tools for simulating bypass NoCs, called Bypass Simulation Toolset (BST). It comprises extended versions of BookSim and OpenSMART, an API to integrate BookSim into other simulators, and scripts that simplify the design and evaluation of such NoCs.
This work was supported by the Spanish Ministry of Science, Innovation and Universities, FPI grant BES-2017-079971, and contracts TIN2010-21291-C02-02, TIN2013-46957-C2-2-P, TIN2015-65316-P, TIN2016-76635-C2-2-R (AEI/FEDER, UE) and TIC PID2019-105660RB-C22; the European HiPEAC Network of Excellence; the European Community's Seventh Framework Programme (FP7/2007-2013), under the Mont-Blanc 1 and 2 projects (grant agreements nº 288777 and 610402); the European Union's Horizon 2020 research and innovation programme under the Mont-Blanc 3 project (grant agreement nº 671697). Bluespec Inc. provided access to Bluespec tools
On the design of a high-performance adaptive router for CC-NUMA multiprocessors
Copyright © 2003 IEEE
This work presents the design and evaluation of an adaptive packet router aimed at supporting CC-NUMA traffic. We exploit a simple and efficient packet injection mechanism to avoid deadlock, which leads to fully adaptive routing using only three virtual channels. In addition, we selectively use output buffers for implementing the most utilized virtual paths in order to reduce head-of-line blocking. The careful implementation of these features has resulted in a good trade-off between network performance and hardware cost. The outcome of this research is a High-Performance Adaptive Router (HPAR), which adequately balances the needs of parallel applications: minimal network latency at low loads and high throughput at heavy loads. The paper includes an evaluation in which HPAR is compared with other adaptive routers using FIFO input buffering, with or without additional virtual channels to reduce head-of-line blocking. This evaluation considers both the VLSI costs of each router and their performance under synthetic and real application workloads. To make the comparison fair, all the routers use the same efficient deadlock avoidance mechanism. In all the experiments, HPAR exhibited the best response among all the routers tested. The throughput gains ranged from 10 percent to 40 percent with respect to its most direct rival, which employs more hardware resources. Other results showed that HPAR achieves up to 83 percent of its theoretical maximum throughput under random traffic and up to 70 percent when running real applications. Moreover, the observed packet latencies were comparable to those exhibited by simpler routers. Therefore, HPAR can be considered a suitable candidate to implement packet interchange in the next generations of CC-NUMA multiprocessors.
Valentín Puente, José-Ángel Gregorio, Ramón Beivide, and Cruz Iz
MOTIM – An Industrial Application Using NOCs
High-speed networks used to interconnect computers advance at an extraordinary pace, driven by the evolution of several contributing technologies. Due to the ever-increasing complexity of designing parts and equipment for these networks, design complexity management makes scalability and reusability more important issues than performance in most cases. This paper describes MOTIM, a scalable and reusable architecture enabling the implementation of Ethernet switches with low latency and high throughput. The architecture is built around a network-on-chip-based switch fabric, which guarantees scalability. The architecture has been validated by functional simulation and prototyped in FPGAs. The experimental results show that even under severe traffic conditions the architecture achieves packet transmission with low latencies.
Throughput-Efficient Network-on-Chip Router Design with STT-MRAM
As the number of processor cores on a chip increases with the advance of CMOS technology, there has been a growing need for more efficient Network-on-Chip (NoC) design, since communication delay has become a major bottleneck in large-scale multicore systems. In designing efficient input buffers of NoC routers for better performance and power efficiency, Spin-Torque Transfer Magnetic RAM (STT-MRAM) is regarded as a promising solution due to its high density and near-zero leakage power. Previous work that adopts STT-MRAM in NoC router input buffers is limited in how far it can minimize power overhead, even though it succeeds to some degree in achieving high network throughput by using SRAM to hide the long write latency of STT-MRAM.
In this thesis, we propose a novel input buffer design that depends solely on STT-MRAM, without the need for SRAM, to maximize the benefits of low leakage power and area efficiency inherent in STT-MRAM. In addition, we introduce power-efficient buffer refreshing schemes synergized with age-based switch arbitration, which gives higher priority to older flits to remove unnecessary refreshing operations. On average, we observed throughput improvements of 16% on synthetic workloads and benchmarks
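The age-based switch arbitration mentioned in this abstract (older flits get higher priority) can be sketched as follows. This is a minimal illustration under assumed names: the `Flit` fields and the `age_based_grant` function are hypothetical, and the sketch does not model the thesis's STT-MRAM refreshing schemes.

```python
# Toy age-based arbiter: among competing flits, grant the oldest one.
# Field and function names are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Flit:
    input_port: int
    injected_at: int  # cycle at which the flit entered the network


def age_based_grant(requests, now):
    """Return the requesting flit with the largest age, or None if there are no requests.

    Ties go to the earliest request in the list, because max() keeps
    the first maximal element it encounters.
    """
    if not requests:
        return None
    return max(requests, key=lambda flit: now - flit.injected_at)


requests = [Flit(input_port=0, injected_at=10),
            Flit(input_port=1, injected_at=3),
            Flit(input_port=2, injected_at=7)]
winner = age_based_grant(requests, now=12)
```

Here the flit on input port 1 wins because it has been in the network longest. Draining older flits first is the property the abstract relies on to avoid refreshing data that would otherwise linger in the buffer.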
Hardware Support for Efficient Packet Processing
Scalability is the key ingredient to further increase the performance of today’s supercomputers.
As other approaches like frequency scaling reach their limits, parallelization is the
only feasible way to further improve the performance. The time required for communication
needs to be kept as small as possible to increase the scalability, in order to be able to
further parallelize such systems.
In the first part of this thesis, ways to reduce the latency incurred in packet-based
interconnection networks are analyzed, and several new architectural solutions are proposed to
solve these issues. These solutions have been tested and proven in a field-programmable
gate array (FPGA) environment. In addition, a hardware (HW) structure is presented that
enables low latency packet processing for financial markets.
The second part and the main contribution of this thesis is the newly designed crossbar
architecture. It introduces a novel way to integrate the ability to multicast in a crossbar
design. Furthermore, an efficient implementation of adaptive routing that reduces the
vulnerability to congestion in packet-based interconnection networks is shown. The low
latency of the design is demonstrated through simulation and its scalability is proven with
synthesis results.
The third part concentrates on the improvements and modifications made to EXTOLL, a
high performance interconnection network specifically designed for low latency and high
throughput applications. Contributions are modules enabling an efficient integration of
multiple host interfaces as well as the integration of the on-chip interconnect. Additionally,
some of the already existing functionality has been revised and improved to reach better
performance and a lower latency. Micro-benchmark results are presented to underline the
contribution of the modifications made