Abstract: Common distributed shared memory systems using a directory-based protocol rely on unicast messages for write invalidations. The unicast messages serialize the write invalidation transactions, which increases network traffic and latency. This paper proposes an efficient multicast router for single-flit write invalidation messages in on-chip networks. Multicast routing follows a tree-based scheme with a bit-string multidestination encoding. We implemented the tree-based write invalidation router targeting IBM 90 nm technology. In network simulation, the proposed design demonstrated 10.5% lower latency and 3.2% less energy consumption than the unicast and dual-path routers.
I. INTRODUCTION
The computing power available in a many-core chip is delivered through an effective communication scheme. The on-chip interconnection network, which determines the communication scheme, is critical to performance with respect to bandwidth utilization and latency. The bandwidth demands are tightly coupled to the number of network transactions generated, whereas the latency is governed by the number of these transactions that are on the critical path [6]. These two primary performance limiters, latency and bandwidth, can be mitigated with an efficient routing scheme that supports multicast.
In distributed shared memory (DSM) systems, directory-based protocols are often used as a scalable design because of their point-to-point communication nature. The directory protocol maintains the possible sharers associated with each entry at its home node. When a node issues a write request to the home node for a write miss, the home node sends invalidation request messages to all sharers. Once the invalidation acknowledgements are received from the sharers, the home node replies by granting ownership of the block to the requester. Even though the above transactions are described serially, the protocol allows the invalidation requests of a write miss to be delivered concurrently. However, at packet generation time, the home node may serialize the requests. This packet serialization exacerbates the miss latency and wastes bandwidth by generating multiple unnecessary messages in the network. In addition, the multiple packets increase contention at the injection port of the router and the occupancy of the home node, thus creating hot spots [1].
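The write-miss sequence above can be sketched as follows. This is a minimal illustration, not the protocol implementation evaluated in this paper: the `DirectoryEntry` class, the message tuples and both function names are hypothetical, and the sketch only models the unicast case, where the home node creates one invalidation message per sharer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry kept at the home node for one block."""
    sharers: set = field(default_factory=set)
    owner: Optional[int] = None


def handle_write_miss(entry, requester):
    """Return the invalidation messages the home node generates, in order.

    With unicast, one message is created per sharer, so packet
    generation at the home node is serial.
    """
    targets = sorted(entry.sharers - {requester})
    return [("INV", node) for node in targets]


def collect_acks(entry, requester, acks):
    """Grant ownership only after every invalidated sharer has acknowledged."""
    pending = entry.sharers - {requester}
    if set(acks) >= pending:
        entry.sharers = {requester}
        entry.owner = requester
        return ("GRANT", requester)
    return None  # still waiting for acknowledgements
```

With four sharers, `handle_write_miss` returns four separate messages, which is the serialization a single multidestination packet is meant to avoid.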
To reduce the latency of the write operation, several previous works suggested multicast communication for write invalidation. [1] proposed multidestination-based reservation and gather worms in 2D networks. While the reserve worm delivers the invalidation message, it reserves an acknowledgement entry in each router interface. The gather worm then collects the acknowledgements as it passes through the routers. The set of destinations of a multicast message is partitioned by an up-and-down column grouping; thus, a home node requires at most two reserve worms in each column for invalidation requests. [2] and [3] proposed dual-path and multipath algorithms using Hamiltonian path multicast routing. The source node partitions the destinations into several ordered lists, and multiple message worms then visit the destinations in a predefined order through disjoint paths.
The idea presented in this paper uses a tree-based multicast scheme with a single-flit packet. The destination sets are partitioned by the four directions (NEWS), and all destination nodes are reached through a shortest path, in contrast to path-based approaches, which suffer from long path latencies. A single packet is injected into the network to cover all destinations, and the packet is replicated at router branches and forwarded to the requested outputs. The write invalidation message consists of a single-flit packet using a bit-string encoding for the multidestination routing header. Whereas prior tree-based multicast schemes assume small flits and multi-flit packets for write invalidation messages, we pack the multidestination addresses into a single flit. This is possible in a network-on-chip (NoC) environment, where wiring resources are abundant. The length of the bit-string is a function of the number of nodes reachable from a home node and is independent of the number of multicast destinations. If a destination node is outside the reachable network, a level of indirection is used in which the destination node is covered by a unicast packet that carries a binary destination ID. Therefore, the router presented in this paper supports two types of destination address encoding: binary encoding of the destination ID for unicast routing and a bit-string for multicast routing. Since the sender initiates the multidestination packets, the bit-string with the sharers' information can be easily extracted from the directory, which keeps the communication start-up process simple.
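As a minimal sketch of the two encodings, the following function builds the bit-string for destinations inside the reachable network and falls back to binary destination IDs for the rest. The function name, the 0-based node numbering and the 16-node default are assumptions made for illustration; the 16-node limit is the one used later in this paper.

```python
REACHABLE_NODES = 16  # bit-string width, assuming a 16-node reachable network


def encode_destinations(dests, reachable=REACHABLE_NODES):
    """Split destinations into one bit-string plus unicast destination IDs.

    Assumes (hypothetically) that reachable nodes are numbered
    0..reachable-1 and that bit i of the bit-string stands for node i.
    """
    bitstring = 0
    unicast_ids = []
    for node in sorted(set(dests)):
        if 0 <= node < reachable:
            bitstring |= 1 << node    # one bit per reachable node
        else:
            unicast_ids.append(node)  # indirection: binary destination ID
    return bitstring, unicast_ids
```

The bit-string always occupies `reachable` bits, whether it names one sharer or all of them, which is why the header length does not grow with the number of multicast destinations.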
In this paper, we present an efficient routing decision scheme using bit-string encoding and a mechanism for single-flit packet replication for multicast messages. The presented idea is then compared to a dual-path router and a unicast router using circuit simulation. All three designs are implemented in synthesizable VHDL code, and physical layouts are generated. From the circuit simulation, the average power and energy consumption are measured with a small synthetic workload. The results show that the tree-based write invalidation multicast is superior to the other two designs by 10.5% and 3.2% in latency and energy consumption, respectively.
II. INVALIDATION SCHEMES
In this section, we briefly review the three invalidation schemes mentioned in the previous section. As shown in Figure 1 (a), unicast routing (called baseline routing in the rest of the paper) sends multiple packets to cover all the sharers. The baseline routing scheme spends more time creating multiple packets than the other two designs. Figure 1 (b) shows dual-path routing. The home node sends the multicast packets in two groups along directed Hamiltonian paths. Each path contains destinations that can be reached from the source node without cyclic directed dependency. However, a dual-path packet visits all intermediate nodes until it reaches the last destination, which results in long latency. Figure 1 (c) shows tree-based multicast routing. The multidestination packet is routed along a common path, and the packet is copied into different channels at branches. Tree-based routing is susceptible to deadlock in wormhole-routed networks because of the branching. [4] proposed a tree-based multicast that resolves deadlocks through a pruning mechanism. Our proposed idea implements tree-based multicast, but we assume single-flit packets for the multicast message, since wire availability in NoCs enables their use. In the proposed scheme, the single-flit multidestination packet never requests multiple channels in a single cycle, so it is not susceptible to deadlock.
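To make the difference between the baseline and the tree-based scheme concrete, the sketch below counts link traversals for both. It assumes a 4x4 mesh with row-major node IDs and a shortest-path rule that resolves the east/west direction before north/south; both assumptions are inferred from the example used later in this paper, and all function names are hypothetical. Dual-path routing is not modeled.

```python
WIDTH = 4  # assumed 4x4 mesh with row-major node IDs: node = y * WIDTH + x


def coords(node):
    return node % WIDTH, node // WIDTH


def next_hop(current, dest):
    """One shortest-path step: resolve the column (east/west) first, then the row."""
    cx, cy = coords(current)
    dx, dy = coords(dest)
    if dx != cx:
        cx += 1 if dx > cx else -1
    else:
        cy += 1 if dy > cy else -1
    return cy * WIDTH + cx


def unicast_links(home, sharers):
    """Each sharer gets its own packet, so every hop is paid per packet."""
    total = 0
    for dest in sharers:
        node = home
        while node != dest:
            node = next_hop(node, dest)
            total += 1
    return total


def tree_links(home, sharers):
    """One packet follows the common path, so a link shared by several
    destinations is traversed only once."""
    used = set()
    for dest in sharers:
        node = home
        while node != dest:
            nxt = next_hop(node, dest)
            used.add((node, nxt))
            node = nxt
    return len(used)
```

For home node 5 and sharers 3, 4, 10 and 15, this model gives 10 link traversals for four unicast packets and 7 for the single tree-based packet.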
III. PROPOSED ROUTER ARCHITECTURE

A. Header Flit Format
A good multi-address encoding scheme is needed to minimize the message header length overhead and to ease routing decisions. Bit-string encoding is a good option for limiting the header size in a small network, but its overhead grows with the number of nodes that must be addressable. The proposed idea uses bit-string encoding for a reachable network limited to 16 nodes. For larger networks, a level of indirection is needed in which a unicast packet must be sent. Figure 2 shows the header flit formats with the bit-string encoding for multicast packets and the binary destination encoding for unicast packets. The bit-string is a simple bit vector in which each bit position corresponds to one destination node. A type field identifies the position of a flit in a packet as head, body, tail or head-tail. The head-tail flit indicates a single-flit packet, which is used for write invalidation in this paper. Since the head flit carries the packet's routing information, it is handled differently from the body and tail flits. The head flit allocates channel state for the packet and places the acquired channel ID in the virtual channel ID (VCID) field. Body and tail flits carry no routing or sequencing information, so they simply follow the head flit along its route in order.
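The fields named above can be sketched as a packed integer. The field widths, their order and the numeric type codes below are hypothetical, since the actual layout is defined by Figure 2; only the 16-bit destination field follows from the 16-node limit. How the router distinguishes a bit-string from a binary destination ID in that field is not modeled here.

```python
from enum import IntEnum

# All field widths and positions are hypothetical placeholders for Figure 2.
TYPE_BITS, VCID_BITS, DEST_BITS = 2, 2, 16


class FlitType(IntEnum):
    HEAD = 0
    BODY = 1
    TAIL = 2
    HEAD_TAIL = 3  # single-flit packet, used for write invalidation


def pack_header(flit_type, vcid, dest_field):
    """Pack type, VCID and destination field into one integer flit."""
    assert 0 <= vcid < (1 << VCID_BITS)
    assert 0 <= dest_field < (1 << DEST_BITS)
    return (int(flit_type) << (VCID_BITS + DEST_BITS)) | (vcid << DEST_BITS) | dest_field


def unpack_header(flit):
    """Recover (type, VCID, destination field) from a packed flit."""
    dest_field = flit & ((1 << DEST_BITS) - 1)
    vcid = (flit >> DEST_BITS) & ((1 << VCID_BITS) - 1)
    flit_type = FlitType(flit >> (VCID_BITS + DEST_BITS))
    return flit_type, vcid, dest_field
```

A write invalidation would use `FlitType.HEAD_TAIL` with the sharer bit-string in the destination field, so the whole packet fits in one flit.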
B. Pipeline Stages and Routing Computation
This section briefly summarizes the pipeline stages (Figure 3) of the presented router. After the head flit arrives, the routing computation decides the output port to which the packet must be forwarded. The result of the routing computation is used for switch allocation (SA) and virtual channel allocation (VA) in the next cycle. VA is accomplished simply by finding a free output VC for the SA winner, as described in [8]. The packet is then transferred to the next router after traversing the crossbar (switch traversal, ST). At the routing computation stage, to decode the output port for the destination nodes, a router must be aware of the network topology and of its own location. The presented router partitions the bit-string by direction based on the router location. Figure 4 shows how the multidestination packet is copied and forwarded in different directions, considering a multicast message from router 5 to routers 3, 4, 10 and 15 in a 4x4 mesh network. A single packet that includes all destination nodes is generated at source node 5 and injected into the local port of the router. When a flit arrives in the input buffer, the routing computation partitions the destinations by direction, which is why each router must know the topology and its position within the network. In Figure 4 (a), source router 5 separates the directions easily based on the bit position of node 5 in the bit-string. All nodes to the right are hard-wired to the east output port encoding, nodes to the left to west, nodes above to north and nodes below to south. After the routing computation, the resulting multiport bit-string information is fed to the route field of the input VC state, one output port per cycle. In Figure 4 (b), the first output port encoding (10000) is forwarded to the route field of the input VC state at cycle 2. When SA and VA complete, the flit is duplicated and forwarded to the east port at the ST stage with updated destination nodes (Figure 4 (c)).
At the next cycle, the last flit is forwarded to the west port based on the information of the output port encoding (01000). The multidestination packet is forwarded to the proper output port one by one in a pipelined fashion until all requested output ports are satisfied. 
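The routing computation and the one-port-per-cycle replication described above can be sketched as follows. This is an illustration under stated assumptions, not the VHDL design: node IDs are taken to be row-major in the 4x4 mesh, row 0 is taken to be the north edge, and the port service order `PORT_ORDER` is hypothetical beyond the east-before-west order given in the text.

```python
WIDTH = 4  # assumed 4x4 mesh with row-major node IDs: node = y * WIDTH + x
PORT_ORDER = ("E", "W", "N", "S", "L")  # assumed service order, one port per cycle


def coords(node):
    return node % WIDTH, node // WIDTH


def to_bits(nodes):
    bits = 0
    for n in nodes:
        bits |= 1 << n
    return bits


def to_nodes(bits):
    return [n for n in range(WIDTH * WIDTH) if (bits >> n) & 1]


def route_compute(current, bitstring):
    """Partition the destination bit-string by output direction."""
    cx, cy = coords(current)
    ports = {p: 0 for p in PORT_ORDER}
    for node in to_nodes(bitstring):
        x, y = coords(node)
        if x > cx:
            port = "E"
        elif x < cx:
            port = "W"
        elif y < cy:
            port = "N"  # assumes row 0 is the north edge
        elif y > cy:
            port = "S"
        else:
            port = "L"  # the current router is itself a destination
        ports[port] |= 1 << node
    return ports


def replicate(current, bitstring):
    """Yield one (port, updated bit-string) copy per cycle, in PORT_ORDER."""
    ports = route_compute(current, bitstring)
    for port in PORT_ORDER:
        if ports[port]:
            yield port, ports[port]
```

For the Figure 4 example, `replicate(5, to_bits([3, 4, 10, 15]))` yields an east copy carrying destinations 3, 10 and 15, followed by a west copy carrying destination 4, matching the order described in the text.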
IV. EVALUATION
In the previous section, we discussed the pipeline stages of the proposed router architecture. This section describes the performance and power analyses of the implemented designs. A unicast baseline, a dual-path multicast router and the proposed tree-based multicast router were developed in synthesizable VHDL code. The code was synthesized using Synopsys Design Compiler, and layouts were generated with Cadence SOC Encounter targeting the Artisan standard cell library for IBM 90 nm technology. Table I summarizes the common router features and network parameters for synthesis and circuit simulation. The entries in the Routing row correspond to the baseline, dual-path and tree-based routers, respectively. Table II presents the area and timing results of the three designs. The tree-based router takes 16.6% more cell area than the unicast baseline router and 2.2% less area than the dual-path router. The dual-path router suffers from a complicated routing computation and the related input VC control logic compared to the other two designs. The encoded bit-string is more difficult to decode for routing, especially for path-based routers. Therefore, the dual-path router needs a predefined ordered list of destinations, and the home node needs an extra preparation phase to set up that order. The tree-based router, on the other hand, decodes the bit-string relatively easily, as described in the previous section. Supporting two types of destination address encoding costs slightly more area and a slightly longer clock period than the baseline router, but this overhead is trivial when the overall network energy consumption and performance benefits are considered, as shown in the following simulation results. The performance of the three router designs is evaluated with a small synthetic workload. The injected network load is a mix of 94% unicast and 6% multicast packets, which is close to the 5.1% multicast rate reported for the directory protocol [10].
The unicast packet is assumed to have 8 flits, and multicast packets target all destinations in the network. A total of 256 packets are injected into the network at each node. In the baseline router, all multicast packets are translated into multiple individual unicast packets; thus, a higher number of packets are injected in the baseline case. Table III lists the performance results of the three designs with the parameters presented in Table I. The tree-based router outperforms the baseline and dual-path routers in terms of latency and throughput. Compared to the baseline, the tree-based router shows a 10.5% latency reduction at the saturation point and a 13.5% reduction at no-load. The dual-path router exhibits the lowest latency at no-load because, in this workload, it occupies the network for the shortest time by sending multicast packets in two disjoint directions. However, the dual-path router shows poor latency at the maximum injection rate, where stopping at all intermediate destination nodes leads to longer latency than the baseline router. Even though the workload used in this evaluation is synthetic, it possesses many of the qualities we expect in application-driven workloads. We leave the router evaluation with an application-driven workload as future work. Using the same workload, the average power consumption of each router and the total network energy consumption for the 4x4 mesh were measured. Power analysis was performed using Synopsys Nanosim based on layout extraction including wire delays. As can be seen in Table III, the tree-based router consumes 2.7% more average power than the baseline. However, the design with the lowest average power does not necessarily consume the least energy over the whole simulation. According to the measured energy consumption, the tree-based router consumes less energy than the baseline and dual-path routers.
The presented total energy consumption reflects pure network energy under the workload, and it should be considered a primary design constraint in on-chip networks. With energy becoming an increasingly important factor in chip design, the proposed tree-based router is an attractive solution for on-chip routing. The tree-based multicast router achieves lower latency, and the reduced latency yields substantial energy savings under the workload.
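The relation between the power and energy results follows from the definition of energy as average power multiplied by the time needed to complete the workload. As a sketch, with $P$ denoting average network power and $T$ the completion time (symbols introduced here for illustration, not taken from Table III):

```latex
E = P \cdot T,
\qquad
\frac{E_{\mathrm{tree}}}{E_{\mathrm{base}}}
  = \frac{P_{\mathrm{tree}}}{P_{\mathrm{base}}}
    \cdot \frac{T_{\mathrm{tree}}}{T_{\mathrm{base}}}
```

If the 2.7% higher router power were assumed to carry over to the whole network, so that $P_{\mathrm{tree}} / P_{\mathrm{base}} = 1.027$, the tree-based design would consume less energy whenever $T_{\mathrm{tree}} / T_{\mathrm{base}} < 1 / 1.027 \approx 0.974$, that is, whenever it finishes the workload roughly 2.6% sooner.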
Finally, Figure 5 shows the layouts of the three implemented routers. The pictures are sized in proportion to the dimensions shown in Table II.
V. CONCLUSION
In this paper, we presented a tree-based write invalidation multicast router for networks-on-chip. The scheme reduces the write invalidation latency and the number of packets generated. We implemented the router targeting IBM 90 nm technology, and the analysis shows that the proposed idea outperformed the other designs in latency and energy consumption by 10.5% and 3.2%, respectively.
