Abstract-Short and long packets co-exist in cache-coherent NoCs. Existing designs for torus networks do not efficiently handle variable-size packets. For deadlock-free operation, a conventional design uses two VCs, which negatively affects the router frequency. Some optimizations use one VC. Yet, they regard all packets as maximum-length packets, inefficiently utilizing the precious buffers. We propose flit bubble flow control (FBFC), which maintains one free flit-size buffer slot to avoid deadlock. FBFC uses one VC, and does not treat short packets as long ones. It achieves both high frequency and efficient buffer utilization. FBFC performs 92.8 and 34.2 percent better than LBS and CBS for synthetic traffic in a 4 × 4 torus. The gains increase in larger networks; they are 107.2 and 40.1 percent in an 8 × 8 torus. FBFC achieves an average 13.0 percent speedup over LBS for PARSEC workloads. Our results also show that FBFC is more power efficient than LBS and CBS, and a torus with FBFC is more power efficient than a mesh.

INTRODUCTION
OPTIMIZING NoCs [19] based on coherence traffic is necessary to improve the efficiency of many-core coherence protocols [41]. The torus is a good NoC topology candidate [52], [53]. The wraparound links convert plentiful on-chip wires into bandwidth [19], and reduce hop counts and latencies [52]. Its node-symmetry helps to balance network utilization [52], [53]. Several products [21], [28], [32] use a ring or 1D torus NoC. Also, the 2D or high dimensional torus is widely used in off-chip networks [4], [18], [44], [51].
Despite the many desirable properties of a torus, additional effort is needed to handle deadlock due to cyclic dependencies introduced by wraparound links. A deadlock avoidance scheme should support high performance with low overhead. Requiring minimum VCs [16] is preferable, because more VCs increase the router complexity. Buffers are a precious resource [24] , [46] ; an efficient design should maximize performance with limited buffers. There is a gap between existing proposals and these requirements.
A conventional design [20] uses two VCs to remove cyclic dependencies; this introduces large allocators and hurts the router frequency. Optimizations [10], [11] for virtual cut-through (VCT) networks [33] avoid deadlock by preventing the use of the last free packet-size buffer inside rings; only one VC is needed. However, with variable-size packets, each packet must be regarded as a maximum-length packet [4]. This restriction prevents deadlock, but results in poor buffer utilization and performance, especially for coherence traffic, which is dominated by short packets.
In addition to the majority short packets, cache-coherent NoCs also deliver long packets. Even though multiple virtual networks (VNs) [20] may be configured to avoid protocol-level deadlock, these two types of packets still co-exist in a single VN. For example, short read requests and long write-back requests are sent in VN0 of AlphaServer GS320, while long read responses and short write-back acknowledgements are sent in VN1; both VNs carry variable-size packets [25] . Similarly, all VNs in DASH [37] , Origin 2000 [36] , and Piranha [6] deliver variable-size packets.
With a typical 128-bit NoC flit width [24], [39], [46], the control packets, which form the majority, have one flit; the remaining data packets contain a 64B cache line and have five flits. Fig. 1 shows the packet length distribution of some PARSEC workloads [8] with two coherence protocols 1. Both protocols use four VNs. For each protocol, two VNs carry variable-size packets, and the other two have only short packets. The single-flit packet (SFP) ratios of VN0 in the MOESI directory [45], and of VN2 and VN3 in AMD's Hammer [15], are all higher than 90 percent. With such high SFP ratios, regarding all packets as maximum-length packets strongly limits buffer utilization. As shown in Section 6.2, existing designs' buffer utilization in saturation is less than 40 percent. This causes a large performance loss.
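The one-flit and five-flit packet lengths follow from the flit width. A minimal sketch of that arithmetic, assuming (our assumption, not stated in the text) that every packet carries exactly one header flit and that a control packet has no payload; `packet_flits` and its parameters are hypothetical names:

```python
import math

def packet_flits(payload_bytes, flit_bits, header_flits=1):
    """Flits per packet: assumed header flit(s) plus the flits needed for the payload."""
    return header_flits + math.ceil(payload_bytes * 8 / flit_bits)

control_len = packet_flits(0, 128)    # control packet, no payload
data_len = packet_flits(64, 128)      # packet carrying a 64B cache line
print(control_len, data_len)
```

Under the same assumption, a 64-bit flit width gives one-flit and nine-flit packets.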
To address existing designs' limitations, we propose a novel deadlock avoidance theory, flit bubble flow control (FBFC), for torus NoCs. FBFC leverages wormhole flow control [18] . It avoids deadlock by maintaining one free flit-size buffer slot inside a ring. Only one VC is needed, reducing the allocator size and improving the frequency. Furthermore, short packets are not regarded as long packets in FBFC, leading to high buffer utilization. Based on this theory, we provide two implementations: FBFC-L and FBFC-C.
Experimental results show that FBFC outperforms dateline [20], LBS [10] and CBS [11]. FBFC achieves a ~30 percent higher router frequency than dateline. For synthetic traffic, FBFC performs 92.8 and 34.2 percent better than LBS and CBS in a 4 × 4 torus. FBFC's advantage is more significant in larger networks; these gains are 107.2 and 40.1 percent in an 8 × 8 torus. FBFC achieves an average 13.0 percent and maximal 22.7 percent speedup over LBS for PARSEC workloads. FBFC's gains increase with fewer buffers. The power-delay product (PDP) results show that FBFC is more power efficient than LBS and CBS, and a torus with FBFC is more power efficient than a mesh. We make the following contributions:
Analyze the limitations of existing torus deadlock avoidance schemes, and show that they perform poorly in cache-coherent NoCs.
Demonstrate that in wormhole torus networks, maintaining one flit-size free buffer slot can avoid deadlock, and propose the FBFC theory.
Present two implementations of FBFC; both show substantial performance and power efficiency gains over previous proposals.
LIMITATIONS OF EXISTING DESIGNS
Here, we analyze existing designs. Avoiding deadlock inside a ring combined with dimensional order routing (DOR) is the general way to avoid deadlock in tori. We use the ring for discussion.
Dateline
As shown in Fig. 2 , dateline [20] avoids deadlock by leveraging two VCs: VC 0i and VC 1i . It forces packets to use VC 1i after crossing the dateline to form acyclic channel dependency graphs [17] , [20] . Dateline can be used in both packet-based VCT and flit-based wormhole networks. It uses two VCs, which results in larger allocators and lower router frequency.
Localized Bubble Scheme (LBS)
Bubble flow control [10] , [49] is a deadlock avoidance theory for VCT torus networks. It forbids the use of the last free packet-size amount of buffers (a packet-size bubble); only one VC is needed. Theoretically, any free packet-size bubble in a ring can avoid deadlock [10] , [49] . However, due to difficulties of gathering global information and coordinating resource allocation for all nodes, previous designs apply a localized scheme; a packet is allowed to inject only when the receiving VC has two free packet-size bubbles [10] , [49] . Fig. 3 gives an example. Here, three packets, P 0 , P 1 and P 2 , are waiting. Theoretically, they all can be injected. Yet, with a localized scheme, only P 0 can be injected since only VC 1 has two free packet-size bubbles. LBS requires each VC to be deep enough for two maximum length packets.
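The localized rule can be written as two predicates on the receiving VC. This is only a sketch under stated assumptions: the function names and the representation of a VC as a count of free slots are ours, and `bubble_len` stands for the packet-size bubble in slots.

```python
def lbs_can_forward(free_slots, bubble_len):
    """VCT forwarding inside the ring: room for one whole packet-size bubble."""
    return free_slots >= bubble_len

def lbs_can_inject(free_slots, bubble_len):
    """Localized bubble rule: injection needs two free packet-size bubbles."""
    return free_slots >= 2 * bubble_len

# Hypothetical numbers: a 10-slot VC with five-slot bubbles.
print(lbs_can_inject(10, 5), lbs_can_inject(9, 5), lbs_can_forward(5, 5))
```

With these hypothetical numbers, injection is possible only when the 10-slot VC is completely empty, which illustrates why the localized scheme is demanding on buffers.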
Critical Bubble Scheme (CBS)
Critical Bubble Scheme [11] marks at least one packet-size bubble in a ring as critical. A packet can be injected only if its injection will not occupy a critical bubble. Control signals between routers track the movement of critical bubbles. CBS reduces the minimum buffer requirement to one packet-size bubble. In the example of Fig. 4 , the bubble at VC 2 is marked as critical; P 2 can be injected. P 1 cannot be injected since its injection would occupy the critical bubble. Requiring that critical bubbles can be occupied only by packets in a ring guarantees that there is at least one free bubble to avoid deadlock. When P 3 advances into VC 2 , the critical bubble moves into VC 1 . Now, VC 1 maintains one free bubble.
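A minimal sketch of the CBS admission rules, counting buffers in units of packet-size bubbles. The function names and the two counters are illustrative assumptions; the signalling that moves the critical mark between routers is not modelled.

```python
def cbs_can_forward(free_bubbles):
    """A packet already in the ring may take any free bubble, critical or not."""
    return free_bubbles >= 1

def cbs_can_inject(free_bubbles, critical_bubbles):
    """An injecting packet may only take a free bubble that is not critical."""
    return free_bubbles - critical_bubbles >= 1

# Hypothetical VC whose only free bubble is the critical one.
print(cbs_can_forward(1), cbs_can_inject(1, 1))
```

The VC in the example accepts a packet travelling inside the ring but rejects an injection, which is the property that keeps one bubble free.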
Inefficiency with Variable-Size Packets
LBS and CBS are proposed for VCT networks; they are efficient for constant-size packets. Yet, as observed by the BlueGene/L team, LBS deadlocks with variable-size packets due to bubble fragmentation [4]. Fig. 5 shows an example with one-flit packets and two-flit packets. A free full-size (two-slot) bubble exists in VC 2 at cycle 0. When P 0 moves into VC 2, the bubble is fragmented across VC 1 and VC 2. VCT reallocates a VC only if it has enough space for an entire packet. Since VC 2's free buffer size is less than P 1's length, P 1 cannot advance and deadlock results. CBS has a similar problem. To handle this issue, BlueGene/L regards each packet as a maximum-length packet [4]. Now, there is no bubble fragmentation. However, this reduces buffer utilization, especially in coherence traffic, whose majority is short packets.
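The cost of regarding every packet as a maximum-length packet can be illustrated with a back-of-the-envelope calculation. This is our own rough estimate, not a measurement from the paper: it assumes each packet reserves `long_len` slots regardless of its real length and asks what fraction of the reserved slots actually hold flits. The function name and defaults are hypothetical.

```python
def useful_fraction(sfp_ratio, short_len=1, long_len=5):
    """Fraction of reserved slots holding real flits when every packet reserves long_len slots."""
    avg_len = sfp_ratio * short_len + (1.0 - sfp_ratio) * long_len
    return avg_len / long_len

# With 80 percent single-flit packets, only about 36 percent of reserved slots carry flits.
print(round(useful_fraction(0.8), 2))
```

The fraction falls further as the single-flit packet ratio rises, which is the regime of coherence traffic.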
FLIT BUBBLE FLOW CONTROL
We first propose the FBFC theory. Then, we give two implementations. Finally, we discuss starvation.
Theoretical Description
We notice that maintaining one free flit-size buffer slot can avoid deadlock in wormhole networks. This insight leverages a property of wormhole flow control: advancing a packet with wormhole does not require the downstream VC to have enough space for the entire packet [18] . To show this in Fig. 6 , a free buffer slot exists in VC 2 at cycle 0; P 0 advances at cycle 1. A free slot is created in VC 1 due to P 0 's movement. Similarly, P 3 's head flit moves to VC 1 at cycle 2, creating a free slot in VC 0 . This free buffer slot cycles inside the ring, allowing all flits to move.
The packet movement in a ring does not reduce free buffer amounts since forwarding one flit leaves its previously occupied slot free; only injection reduces free buffer amounts. The theorem is stated as follows.
Theorem 1. If packet injection maintains one free buffer slot inside a ring, there is no deadlock with wormhole flow control.
Proof Sketch: A deadlock configuration in wormhole networks involves a set of cyclically dependent flits where no flit can move [17] . In a ring, a cyclic dependency needs the participation of all VCs. Thus, we only need to prove that a flit in any VC can advance.
Proof. Assume there is only one free buffer slot at VC i+1 and all other VCs are full. We label VC i+1's upstream VC in the ring as VC i. There are two possible situations for the flit f at VC i's head.
1) f is a head flit. If f arrives at the destination, it can be ejected. If f needs to advance into VC i+1, we consider the packet P k which most recently utilized VC i+1. Again, there are two possible situations.
1.1) P k was forwarded from VC i into VC i+1. Since now the head flit f of another packet is at the head of VC i, P k's tail flit has already advanced into VC i+1. f can advance with wormhole flow control.
1.2) P k was injected into VC i+1. Its tail flit must have already advanced into VC i+1. Otherwise, the tail flit will occupy the free buffer slot, which violates the premise that the injection procedure maintains one free buffer slot. f can advance.
2) f is a body or tail flit. It can be ejected or forwarded.
In all cases, a flit can move. □
Since one free buffer slot (flit bubble) 2 avoids deadlock, we call this theory flit bubble flow control. DOR removes the cyclic dependency across dimensions; combining DOR with FBFC avoids deadlock in tori. FBFC has no bubble fragmentation; its bubble is flit-size. Thus, FBFC does not regard each packet as a maximum-length packet. Only one VC is needed; this improves the frequency. FBFC uses wormhole to move packets inside a ring. It requires the injection procedure to leave one slot empty. Later, we show two schemes to satisfy this requirement.
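The circulation of the single free slot can be checked with a toy simulation. It is a sketch under simplifying assumptions that are ours, not the paper's: there is no injection or ejection, each VC holds at least two slots, and the ring starts with whole packets laid out contiguously and exactly one slot free. All names (`Flit`, `build_ring`, `step`) are hypothetical.

```python
from collections import deque, namedtuple

Flit = namedtuple("Flit", "pkt idx is_head is_tail")

def build_ring(num_vcs, depth, pkt_lens):
    """Fill the ring with whole packets, leaving one free slot in the last VC."""
    flits = [Flit(p, k, k == 0, k == n - 1)
             for p, n in enumerate(pkt_lens) for k in range(n)]
    assert len(flits) == num_vcs * depth - 1, "exactly one slot must stay free"
    ring, pos = [deque() for _ in range(num_vcs)], 0
    # flits[0] is the most advanced flit; VC i forwards into VC (i + 1) % num_vcs
    for vc in range(num_vcs - 1, -1, -1):
        take = depth - 1 if vc == num_vcs - 1 else depth
        ring[vc].extend(flits[pos:pos + take])
        pos += take
    return ring

def step(ring, depth):
    """Move one flit into a VC that has a free slot; return it, or None if stuck."""
    n = len(ring)
    for i in range(n):
        if len(ring[i]) >= depth:
            continue                          # VC i has no free slot
        upstream = ring[(i - 1) % n]
        if not upstream:
            continue
        flit = upstream[0]
        # wormhole: a head flit enters only after the previous packet's tail arrived
        if flit.is_head and ring[i] and not ring[i][-1].is_tail:
            continue
        ring[i].append(upstream.popleft())
        return flit
    return None

ring = build_ring(num_vcs=4, depth=3, pkt_lens=[1, 5, 1, 1, 3])
stuck = any(step(ring, 3) is None for _ in range(200))
print("deadlock observed:", stuck)
```

The simulation only exercises forwarding inside the ring; it does not replace the proof, which also covers injected packets.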
FBFC-Localized (FBFC-L)
The key point in implementing FBFC is to maintain a free buffer slot inside each ring. We first give a localized scheme: FBFC-Localized. When combined with DOR, a dimension-changing packet is treated the same as an injecting packet. The rules of FBFC-L are as follows: (i) Forwarding of a packet within a dimension is allowed if the receiving VC has one free buffer slot. This is the same as wormhole. (ii) Injecting a packet (or changing its dimension) is allowed only if the receiving VC has one more free buffer slot than the packet length. This requirement ensures that after injection, one free buffer slot is left in the receiving VC to avoid deadlock. Fig. 7 shows an example. Three packets are waiting. The numbers of free slots in VC 2 and VC 3 are two and four; they are one more than the lengths of P 3 and P 4, respectively. P 3 and P 4 can be injected. After injection, at least two free slots are left in the ring. P 2 cannot be injected since VC 1 only has one free slot, which is equal to P 2's length. However, according to wormhole flow control, the free slot in VC 1 allows P 1's head flit to advance. In FBFC-L, each VC must have one more buffer slot than the size of the longest packet.
Fig. 6. A wormhole routing example.
2. We use flit bubble and buffer slot interchangeably.
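The two FBFC-L rules reduce to simple comparisons on the free-slot count of the receiving VC. The sketch below uses hypothetical function names; the numbers in the usage lines are the ones quoted for Fig. 7 (two and four free slots for one-flit and three-flit packets, and one free slot for a one-flit packet).

```python
def fbfc_l_can_forward(free_slots):
    """Rule (i): forwarding within a dimension needs one free slot, as in wormhole."""
    return free_slots >= 1

def fbfc_l_can_inject(free_slots, pkt_len):
    """Rule (ii): injection or dimension change needs pkt_len + 1 free slots."""
    return free_slots >= pkt_len + 1

print(fbfc_l_can_inject(2, 1))   # P3 into VC2
print(fbfc_l_can_inject(4, 3))   # P4 into VC3
print(fbfc_l_can_inject(1, 1))   # P2 into VC1
print(fbfc_l_can_forward(1))     # P1's head flit into VC1
```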
FBFC-Critical (FBFC-C)
To reduce the minimum buffer requirement, we propose a critical design: FBFC-Critical. FBFC-C marks at least one free buffer slot as a critical slot, and restricts this slot to only be occupied by packets traveling inside the ring. The rules of FBFC-C are as follows: (i) Forwarding of a packet within a dimension is allowed if the receiving VC has one free buffer slot, no matter if it is a normal or critical slot. (ii) Injecting a packet is only allowed if the receiving VC has enough free normal buffer slots for the entire packet. After injection, the critical slot must not be occupied. This requirement maintains one free buffer slot. Fig. 8 shows an example. At cycle 0, one critical buffer slot is in VC 1. VC 2 and VC 3 have enough free normal slots to hold P 3 and P 4 respectively; P 3 and P 4 can be injected. They do not occupy the critical slot, which guarantees the existence of a free slot (the critical slot) elsewhere in the ring. Since the only free slot in VC 1 is a critical one, P 2 cannot be injected. Yet, this critical slot allows P 1's head flit to move. At cycle 1, P 1's head flit advances into VC 1, moving the critical slot backward into VC 0. This is done by R0 asserting a signal to indicate to R3 that it should mark the newly freed slot in VC 0 as a critical one. More details are provided in Section 4. The minimum buffer requirement of FBFC-C is the same as CBS; a VC must be able to hold the largest packet. This is one slot less than FBFC-L.
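The FBFC-C rules differ from FBFC-L only in how the reserved slot is accounted for. A minimal sketch with hypothetical names, where `critical_slots` is the number of free slots in the receiving VC currently marked critical; the free-slot counts for VC 2 and VC 3 below are illustrative, since the text does not give them.

```python
def fbfc_c_can_forward(free_slots):
    """Rule (i): any free slot, normal or critical, admits a flit already in the ring."""
    return free_slots >= 1

def fbfc_c_can_inject(free_slots, critical_slots, pkt_len):
    """Rule (ii): the whole packet must fit into free normal (non-critical) slots."""
    return free_slots - critical_slots >= pkt_len

# VC1 in the example: its only free slot is critical.
print(fbfc_c_can_inject(1, 1, 1), fbfc_c_can_forward(1))
# Illustrative VC with three free normal slots and a three-flit packet.
print(fbfc_c_can_inject(3, 0, 3))
```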
The injection of FBFC-L and FBFC-C is similar to VCT; they require enough buffers for packets before injection. After injection, a minimum of one slot is left free for wormhole. They can be regarded as applying VCT for injection (or dimension-changing) in wormhole networks. These hybrid schemes are straightforward ways to address existing designs' limitations.
Starvation
FBFC-L and FBFC-C must deal with starvation. The starvation in FBFC-L is intrinsically the same as in LBS [10]: injecting packets need more buffers than packets traveling inside the ring. Fig. 9 shows a starvation example for FBFC-L. Here, if node R0 continually injects packets, such as P 0, destined for R3, then P 1 cannot be injected. We design a starvation prevention mechanism; if a node detects starvation, it will notify all other nodes in a ring to stop injecting. A sideband network conveys the control signal ('starve'). Once blocked cycles of P 1 exceed a threshold value, R1 asserts the 'starve' signal. R0 stops injecting after receiving 'starve' and forwards it to R3. All nodes except R1 stop injecting. Finally, P 1 can be injected. Then, R1 deasserts 'starve' to resume other nodes' injection. To handle the corner case of multiple nodes simultaneously detecting starvation, the 'starve' signal carries an 'ID' field to differentiate the nodes of a ring. Since the sideband network is non-blocking, a router can identify the sending time slot of 'starve' based on the 'ID' field. The 'ID' field and the sending time slot order 'starve' signals. If the incoming 'starve' has a higher order than the currently serving 'starve,' the router forwards the incoming signal to its neighbor.
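The detection side of this mechanism amounts to a per-node counter. The sketch below is an assumption-laden illustration: the class name, the per-cycle `tick` interface, and the reset behaviour are ours, and the sideband forwarding and 'ID' ordering are not modelled.

```python
class StarvationMonitor:
    """Counts consecutive blocked cycles of a waiting packet and drives 'starve'."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.blocked_cycles = 0
        self.starve = False

    def tick(self, injection_blocked):
        """Call once per cycle; returns the current value of the 'starve' signal."""
        if injection_blocked:
            self.blocked_cycles += 1
            if self.blocked_cycles > self.threshold:
                self.starve = True        # assert once the threshold is exceeded
        else:
            self.blocked_cycles = 0       # packet injected: deassert and start over
            self.starve = False
        return self.starve

monitor = StarvationMonitor(threshold=3)
print([monitor.tick(True) for _ in range(4)], monitor.tick(False))
```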
FBFC-C has another starvation scenario in addition to the previous one; it is due to the critical bubble stall. CBS has a similar issue [13] . Fig. 10 shows that the critical bubble is in VC 3 at cycle 0. The bubble movement depends on the packet advancement. If all packets in VC 2 , such as P 0 , are destined for R2, they will be ejected. Since no packet moves to VC 3 , the critical bubble stalls at VC 3 . P 1 cannot be injected. This can be prevented by proactively transferring the critical bubble backward if the upstream VC has a free normal bubble. As shown in Fig. 10 , a pair of 'N2C' ('NormalToCritical') and 'C2N' ('CriticalToNormal') signals are used. If R2 detects that the critical bubble stall prohibits P 1 's injection, it asserts 'N2C' to R1. If VC 2 has a normal free bubble, R1 will change it into a critical one in cycle 1. The 'C2N' notifies R2 that the critical bubble in VC 3 can now be changed into a normal one. P 1 can be injected. Note that the bubble status is maintained at upstream routers.
ROUTER MICROARCHITECTURE
In this section, we discuss wormhole routers for FBFC. We also discuss VCT routers for LBS and CBS.
FBFC routers
The left side of Fig. 11 shows a canonical wormhole router, which is composed of the input units, routing computation (RC) logic, VC allocator (VA), switch allocator (SA), crossbar and output units [20], [23]. Its pipeline includes RC, VA, SA and switch traversal [20], [23]. The output unit tracks downstream VC status. The 'input_vc' register records the allocated input VC of a downstream VC. The one-bit 'idle' register indicates whether the downstream VC has received the tail flit of the last packet. 'Credits' records credit amounts. Lookahead routing [20] performs RC in parallel with VA. To be fair to VCT routers, wormhole routers try to hold SA grants for entire packets; they prioritize VCs that previously obtained switch access [35].
FBFC mainly modifies output units. As shown in the upper-right side of Fig. 11, two one-bit registers, 'inj_s' and 'inj_l,' are needed for the bi-modal length coherence traffic. They record whether a downstream VC is available for injecting (or dimension-changing) short and long packets. When packets will be injected (or change dimensions), VA checks the appropriate register according to packet lengths. Single-flit packets require at least two credits in a downstream VC. A five-flit packet needs at least six credits. If the incoming 'starve' signal is asserted to prevent starvation for some other router, both registers are reset to forbid injection. Fig. 11 shows the logic. This logic can be pre-calculated and is off the critical path.
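The register logic described above can be sketched as a pure function of the credit count and the incoming 'starve' signal. The function name and the tuple return are illustrative choices; the thresholds of two and six credits come from the one-flit and five-flit packet lengths.

```python
def fbfc_l_injection_flags(credits, starve, short_len=1, long_len=5):
    """Return (inj_s, inj_l) for one downstream VC of an FBFC-L router."""
    if starve:
        return False, False               # another router is starving: forbid injection
    inj_s = credits >= short_len + 1      # packet length plus one free slot
    inj_l = credits >= long_len + 1
    return inj_s, inj_l

print(fbfc_l_injection_flags(credits=2, starve=False))
print(fbfc_l_injection_flags(credits=6, starve=False))
print(fbfc_l_injection_flags(credits=6, starve=True))
```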
The lower-right side of Fig. 11 shows the output unit of the FBFC-C router. Another register, 'CBs,' records critical flit bubble counts. The logic of 'inj_s' and 'inj_l' is modified; packet injection is only allowed if the downstream VC has enough free normal slots. Specifically, 'credits − CBs' is not less than the packet length. FBFC-C routers proactively transfer critical slots to prevent starvation. When there is an incoming 'N2C' signal, the output unit checks whether there are free normal slots. If there are, 'CBs' is increased by 1, and 'C2N' is asserted to inform the neighboring router to change its critical slot into a normal one. The 'mark_cb' signal is asserted when a flit will occupy a downstream critical slot; it informs the upstream router to mark the newly freed slot as critical. Similarly, this logic is off the critical path.
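A small state sketch of the FBFC-C output unit, limited to the 'credits' and 'CBs' registers and the 'N2C'/'C2N' handshake. The class and method names are hypothetical, and the 'starve' and 'mark_cb' paths are deliberately left out.

```python
class FbfcCOutputUnit:
    """Tracks one downstream VC: free slots ('credits') and critical free slots ('CBs')."""

    def __init__(self, credits, cbs=0):
        self.credits = credits
        self.cbs = cbs

    def can_inject(self, pkt_len):
        """Injection needs 'credits - CBs' to be at least the packet length."""
        return self.credits - self.cbs >= pkt_len

    def on_n2c(self):
        """Handle an incoming 'N2C'; return True when 'C2N' should be asserted."""
        if self.credits - self.cbs > 0:   # a free normal slot exists
            self.cbs += 1                 # turn it into a critical slot
            return True
        return False

unit = FbfcCOutputUnit(credits=3)
print(unit.can_inject(3), unit.on_n2c(), unit.can_inject(3), unit.can_inject(2))
```

After the transfer, one of the three free slots is critical, so a three-flit injection is refused while a two-flit injection is still allowed.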
VCT Routers
We discuss VCT routers for LBS and CBS. A typical VCT router [20], [22] is similar to the wormhole one shown in Fig. 11. The main difference is VC allocation: VCT re-allocates a VC only if it guarantees enough space for an entire packet. The advance of a packet returns one credit, which represents the release of a packet-size amount of buffers. We apply some optimizations to favor LBS and CBS. The SA grant holds for an entire packet. Since VA guarantees enough space for an entire packet, once a head flit moves out, that packet's remaining flits can advance without interruption; all of the packet's occupied buffers will be freed in limited time. Thus, a credit is returned once a head flit moves out. The lookahead credit return allows the next packet to use this VC even if there is only one free slot, overlapping the transmission of an incoming packet and an outgoing packet. This optimization brings an injection benefit for CBS, which we discuss in Section 7.2. The LBS router's output unit is similar to that of the FBFC-L router. The difference is that the LBS router only needs one 'inj' register since all packets are regarded as long packets. The 'credits' register records free buffer slots in the unit of packets instead of flits. The CBS router's output unit also has these differences. Since CBS only starves due to the critical bubble stall, there is no incoming 'starve' signal.
METHODOLOGY
We modify BookSim [30] to implement FBFC-L and FBFC-C to compare with dateline, LBS and CBS. We use both synthetic traffic and real applications. Synthetic traffic uses one VN since each VN is independent. The traffic has randomly injected one-flit packets and five-flit packets. The baseline single-flit packet ratio is 80 percent, which is similar to the overall SFP ratio of a MOESI directory protocol. The warmup and measurement periods are 10,000 and 100,000 cycles.
Although FBFC works for high dimensional tori, we focus on 1D and 2D tori as they best match the physical layouts. The routing is DOR. Buffers are precious; most evaluation uses 10 slots at each port per VN. Bubble designs have one VC per VN. Dateline divides 10 slots into two VCs; five slots/VC covers credit round-trip delays [20] . Instead of injecting packets to VC 0i first (Fig. 2) , then switching to VC 1i after the dateline [20] , a balancing optimization is applied to favor dateline; injecting packets choose VCs according to whether they will cross dateline [51] . Packets use VC 1i s if they will cross dateline. Otherwise, they use VC 0i s. CBS and FBFC-C set one critical bubble for each ring; CBS marks five slots as a packet-size critical bubble, and FBFC-C marks one slot as a flit-size critical bubble. The starvation threshold values (STVs) in FBFC-L and LBS are 30 cycles. The STVs due to critical bubble stall in CBS and FBFC-C are three cycles.
VA and SA delays determine router frequencies [7], [48]. Dateline uses two VCs per VN, resulting in large allocators and long critical paths. A technology-independent model [48] is used to calculate the delays, as shown in Table 1. Separable input-first allocators [20] with matrix arbiters [20] are used. VA is independent for each VN [7], making SA the critical path with multiple VNs. Dateline's SA delay with 4 VNs is ~30 percent higher than bubble designs.
To measure full-system performance, we leverage FeS2 [45] for x86 simulation and BookSim for NoC simulation. FeS2 is a module for Simics [40]. We run PARSEC workloads [8] with 16 threads on a 16-core CMP. Since dateline's frequency can differ from that of bubble designs, we do not evaluate dateline for real workloads. The frequency of a simple CMP core can be 2~4 GHz, while the frequency of the NoC is limited by allocator speeds [27], [42]. We assume cores run 2× faster than the NoC. Each core is connected to private, inclusive L1 and L2 caches. Cache lines are 64 bytes; long packets are five flits with a 16-byte flit width. We use a MOESI directory protocol to maintain the coherence among L2 caches; it uses four VNs to avoid protocol-level deadlock. Each VN has 10 slots. Workloads use simsmall input sets. The task runtime is the performance metric. Table 2 gives the configuration.
EVALUATION ON 1D TORI (RINGS)
Performance
Our evaluation for synthetic patterns [20] starts with an eight-node ring. As shown in Fig. 12 , FBFC-L is similar to FBFC-C; although FBFC-C needs one less slot for injection, this benefit is minor since 10 slots are used. FBFC obviously outperforms LBS and CBS. Across all patterns, the average saturation throughput 3 gains of FBFC-C over LBS and CBS are 73.5 and 33.9 percent. LBS and CBS are limited by regarding short packets as long ones. The advance of a short packet in FBFC-C uses one slot, while it uses five slots in LBS and CBS. LBS is also limited by its high injection buffer requirement; CBS shows an average 29.6 percent gain over LBS.
In Fig. 12, dateline is superior to LBS and CBS. The results are reported in cycles and flits/node/cycle. These metrics do not consider router frequencies. According to Table 1, if all routers are optimized to maximum frequencies, dateline is ~30 percent slower than bubble designs. To make a fair comparison, we use metrics expressed in absolute time rather than in cycles. Latency is compared in seconds. Due to its lower frequency, dateline's cycle time is 30 percent longer than that of bubble designs. Thus, dateline's zero-load latency in seconds is 30 percent higher. Throughput is compared in flits/node/second. Since dateline's cycle time is longer, its throughput in flits/node/second drops. For example, its throughput in flits/node/second for uniform random is 7.5 percent lower than CBS.
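The conversion from cycle-based to time-based metrics is a scaling by the cycle time. A minimal sketch with hypothetical function names, using a normalized cycle time of 1.0 for bubble designs and 1.3 for dateline (the ~30 percent figure above); the latency and throughput inputs are illustrative, not results from the paper.

```python
def latency_in_time(latency_cycles, cycle_time):
    """Latency in time units: cycles multiplied by the cycle time."""
    return latency_cycles * cycle_time

def throughput_in_time(flits_per_node_per_cycle, cycle_time):
    """Throughput per time unit: per-cycle throughput divided by the cycle time."""
    return flits_per_node_per_cycle / cycle_time

BUBBLE_CYCLE = 1.0     # normalized cycle time of bubble designs
DATELINE_CYCLE = 1.3   # ~30 percent longer cycle for dateline

# Equal cycle counts translate into 30 percent higher latency for dateline.
print(latency_in_time(10, DATELINE_CYCLE) / latency_in_time(10, BUBBLE_CYCLE))
# Equal per-cycle throughput shrinks by a factor of 1/1.3 for dateline.
print(throughput_in_time(0.26, DATELINE_CYCLE) / throughput_in_time(0.26, BUBBLE_CYCLE))
```

A design that wins in flits/node/cycle can therefore still lose in flits/node/second once the longer cycle is taken into account.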
Dateline divides buffers into two VCs. Shallow VCs make packets span more nodes, which increases chained blocking effect [55] ; this brings a packet forwarding limitation for dateline. Yet, dateline is superior to FBFC for injection and dimension-changing: long packets can inject or change dimensions with one free slot. We introduce a metric: injection/dimension-changing (IDC) count. The IDC count of a packet includes the number of times a packet is injected plus the number of times it changes dimensions.
The trends between dateline and FBFC depend on hop counts and IDC counts of the traffic. FBFC-C outperforms dateline for all patterns. A ring has no dimension changing; all patterns' IDC counts are no more than 1, hiding dateline's merit. The largest gains are 29.2 and 18.8 percent for transpose and tornado. Transpose's IDC and hop counts are 0.75 and 2.25. Tornado's IDC and hop counts are one and four. They reveal dateline's limitation on packet forwarding.
Buffer Utilization
To delve into performance trends, the average buffer utilization of all VCs is shown in Fig. 13 . The maximum and minimum rates are given by error bars. Buffers can support high throughput; higher utilization generally means better performance. For LBS and CBS, the average rates are 13.0 and 19.2 percent in saturation. They inefficiently use buffers. LBS requires more free buffers for injection; its utilization is lower than CBS. Dateline's minimum rate is always 0; one VC is never used. For example, VC 00 in Fig. 2 is not used. This scenario combined with chained blocking limits its buffer utilization. Dateline's average and maximum rates at saturation are 23.2 and 71.8 percent, while these rates are 39.5 and 89.8 percent for FBFC-C.
Latency of Short and Long Packets
FBFC's injection of long packets requires more buffers than short ones. Fig. 14 shows latency compositions. 'InjVC' and 'NI' are delays in injection VCs and network interfaces. 'Network' is all other delays. Long and short packets are treated the same in LBS and CBS; they show similar delays at injection VCs and inside the ring. Long packets have four more flits; they spend ~4 more cycles in network interfaces. With low-to-medium injection rates (≤ 20 percent) in FBFC-C, the delays at injection VCs for long and short packets are similar. With higher loads, the difference increases. Even when saturated, the differences are only 3.4 and 3.2 cycles for the two patterns. FBFC-C does not sacrifice long packets. FBFC-C's acceleration of short packets helps long packets since short and long packets are randomly injected. Indeed, compared with FBFC-C, LBS and CBS sacrifice short packets. For example, with a 20 percent injection rate for uniform random, short packets spend 11.7 and 5.4 cycles in injection VCs for LBS and CBS, and 4.3 cycles for FBFC-C.
3. The saturation point is measured as the injection rate at which the average latency is three times the zero load latency.
EVALUATION ON 2D TORI
Section 6 analyzes the performance for 1D tori with buffer utilization and latency composition. This section thoroughly analyzes the performance for 2D tori with several configurations for further insights.
Performance for a 4 × 4 Torus
Fig. 15 shows the performance for a 4 × 4 torus with the baseline configuration. The error bars in Fig. 15a are average latencies of long and short packets. The average gains of FBFC-C over LBS and CBS are 92.8 and 34.2 percent. FBFC's high buffer utilization brings these gains. CBS shows an average 45.7 percent gain over LBS, and the highest one is 100 percent for transpose. Most transpose traffic is between the same row and column; many packets change dimensions at the same router requiring the same port. CBS's low dimension-changing buffer requirement yields high performance.
Compared with the ring (Fig. 12), the trends between FBFC-C and dateline for a 2D torus change. Dateline performs similarly to FBFC-C for uniform random and transpose. A 2D torus has a dimension-changing step. The IDC counts of these patterns are both 1.5, and their hop counts are three. As a result, injection and dimension-changing factor significantly into performance. Dateline's low injection and dimension-changing buffer requirement brings gains. Dateline outperforms FBFC-C by 5.7 percent for hotspot. This pattern sends packets from different rows to the same column of four hotspot nodes, exacerbating FBFC-C's injection limitation. FBFC-C outperforms dateline by 6.4 percent for bit rotation. Packets of bit rotation change dimensions by requiring different ports; the light congestion mitigates FBFC-C's limitation. These results do not consider frequencies. If routers are optimized to maximum frequencies, dateline's performance will drop.
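The IDC count quoted for uniform random traffic can be reproduced with a short enumeration. The sketch rests on assumptions of ours: destinations are drawn uniformly from all nodes including the source itself, a self-addressed packet counts zero, and under DOR a packet changes dimension exactly once when source and destination differ in both coordinates. Function names are hypothetical.

```python
from itertools import product

def idc_count(src, dst):
    """Injections plus dimension changes for one packet under DOR in a 2D torus."""
    if src == dst:
        return 0                                  # assumed: never enters the network
    changes_dim = src[0] != dst[0] and src[1] != dst[1]
    return 1 + (1 if changes_dim else 0)

def average_idc_uniform(k):
    """Average IDC count over all source/destination pairs of a k x k torus."""
    nodes = list(product(range(k), repeat=2))
    total = sum(idc_count(s, d) for s in nodes for d in nodes)
    return total / (len(nodes) ** 2)

print(average_idc_uniform(4))
```

Under these assumptions the 4 × 4 average is 1.5, in agreement with the value stated for uniform random.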
As shown in Fig. 15a, the delay difference between long and short packets is almost constant for LBS and CBS; long packets have ~4 cycles more delay. Dateline's difference is 6.2 cycles in saturation. FBFC-L's difference is 9.8 cycles. It is ~3 cycles more than the ring; a 2D torus has one dimension-changing step. We measure the behavior after saturation by increasing the load to 1.0 flits/node/cycle. All designs maintain performance after saturation. DOR smoothly delivers injected packets. Adaptive routing may drop performance after saturation because escape paths drain packets at lower rates than injection rates [11].
Sensitivity to SFP Ratios
As discussed in Section 4.2, VCT routers' lookahead credit return overlaps packet incoming and outgoing; a packet can move to a VC with one free slot. It brings an injection or dimension-changing benefit for CBS over FBFC. CBS's packet injection begins with one free slot. FBFC-C's long packet injection needs five normal slots. Thus, SFP ratios affect trends between CBS and FBFC. Fig. 16 conducts sensitivity studies on SFP ratios.
The performance of LBS and CBS increases linearly with reduced ratios; more long packets proportionally improve their buffer utilization. FBFC-C and dateline perform slightly better with lower ratios. They try to hold SA grants for entire packets; long packets reduce SA contention. Although FBFC-C's gains over CBS reduce with lower ratios, FBFC-C is always superior for most traffic patterns. FBFC-C performs 10.1 percent better than CBS for tornado with a 0.2 ratio. Tornado sends traffic from node (i, j) to ((i + 1)%4, (j + 1)%4). Each link only delivers traffic for one node pair; there is no congestion. Transpose is different; CBS outperforms FBFC-C with 0.4 and 0.2 ratios. This pattern congests turn ports, emphasizing CBS's injection benefit. We also experiment with a 64-bit flit width; the short and long packets have one and nine flits. With longer packets, the negative effect of regarding short packets as long ones in LBS and CBS becomes more significant. Meanwhile, FBFC-C needs more slots for long packet injection. These factors result in similar trends for a 64-bit configuration as those in Fig. 16.
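The statement that each link carries tornado traffic for a single node pair can be checked by enumerating the links used. The sketch assumes DOR that travels along the first coordinate and then the second, and it identifies a link by its ordered pair of endpoints; all names are hypothetical.

```python
from collections import Counter

def tornado_dest(i, j, k=4):
    """Tornado destination of node (i, j) in a k x k torus."""
    return ((i + 1) % k, (j + 1) % k)

def tornado_links(i, j, k=4):
    """Directed links on the assumed DOR path: one hop per dimension."""
    mid = ((i + 1) % k, j)                    # after the hop in the first dimension
    return [((i, j), mid), (mid, tornado_dest(i, j, k))]

usage = Counter(link for i in range(4) for j in range(4)
                for link in tornado_links(i, j))
print(len(usage), max(usage.values()))
```

Every one of the 32 links used appears exactly once, so no two node pairs share a link in this pattern.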
Sensitivity to Buffer Size
We perform sensitivity studies on buffer sizes. In Fig. 17a, CBS, FBFC-C and FBFC-L use minimum buffers that ensure correctness. They use five, five and six slots. Dateline uses two three-slot VCs. In Fig. 17b, bubble designs use a 15-slot VC, and dateline uses two eight-slot VCs. Although FBFC-C's injection limitation has higher impact with fewer buffers, dateline only performs 7.6 percent better than FBFC-C with five slots/VC. Dateline's shallow VCs (three slots) cannot cover the credit round-trip delay (five cycles), making link-level flow control the bottleneck [20]. FBFC-C's gain over CBS increases with fewer buffers. The gains are 26.6 percent with 15 slots/VC, 41.4 percent with 10 slots/VC, and 121.8 percent with five slots/VC. Comparing Figs. 15a and 17a, CBS with 10 slots/VC is similar to FBFC-C with five slots/VC. With half as many buffers, FBFC-C is comparable to CBS. LBS with 15 slots/VC (Fig. 17b) only has a 5.2 percent gain over FBFC-C with five slots/VC. LBS almost matches CBS with 15 slots/VC. More buffers mitigate LBS's high injection buffer limitation.
Additional results show that with abundant buffers, bubble designs perform similarly; there is little difference among them as many free buffers are available anyway. The convergence points depend on the traffic. For example, due to congested ports in transpose, at least 30 slots/VC are needed for LBS to match CBS. For uniform random, 50 slots/VC are needed for CBS to match FBFC. Many buffers are required for convergence, which makes high buffer utilization designs, such as FBFC, winners in reasonable configurations.
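The remark that three-slot VCs cannot cover a five-cycle credit round trip can be illustrated with the usual credit-based flow control bound: a VC with F slots and a round trip of T cycles can keep its link busy for at most F/T of the time. This is a minimal sketch, assuming that simple per-VC bound and ignoring pipelining details; the function name is illustrative.

```python
def link_utilization_bound(buffer_slots, credit_round_trip_cycles):
    """Upper bound on the fraction of cycles a single VC can drive its link.

    Assumes one credit per slot and a fixed credit round-trip delay.
    """
    return min(1.0, buffer_slots / credit_round_trip_cycles)


# Dateline VC depths from the two buffer configurations in the text.
for slots in (3, 8):
    bound = link_utilization_bound(slots, 5)
    print(f"{slots} slots, 5-cycle round trip -> at most {bound:.0%} link utilization")
```

Under this bound a three-slot VC is limited to 60 percent of the link bandwidth, whereas an eight-slot VC is no longer limited by credits.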
Scalability for an 8 × 8 Torus
Effect of Starvation
LBS [10] has limited discussion of starvation; CBS [11] relies on adaptive routing and therefore does not address starvation. We analyze starvation in a 4 × 4 torus. Reducing buffers makes starvation more likely. We use the same buffer size as in Fig. 17a (LBS uses 10 slots) with uniform random. Larger networks or other patterns are similar.
Starvation in LBS and FBFC-L is essentially the same. Fig. 19a shows their performance with three starvation threshold values (STVs). They perform poorly with the three-cycle STV. A small STV causes a router to frequently assert the 'starve' signal (Fig. 9) to prohibit other nodes' injection, which negatively affects overall performance. We also evaluate the saturation throughput with several STVs ranging from three to 50 cycles. LBS and FBFC-L perform better with larger STVs until 30 cycles, and then remain almost constant. We set the STVs in LBS and FBFC-L to 30 cycles.
CBS and FBFC-C can starve due to the critical bubble stall. Also, FBFC-C has the same starvation as FBFC-L. FBFC-C uses two STVs, one for each type of starvation. We fix the STV in FBFC-C for the same starvation as FBFC-L to 30 cycles, and analyze the other type of starvation. As shown in Fig. 19b, the smaller the STV, the higher the performance. The proactive transfer of the critical bubble does not prohibit injection; even with many false detections, there is no negative effect. In contrast, the performance drops if packets cannot be injected for a long time. With a 20-cycle STV, if one node suffers starvation, it will move the critical bubble after 20 cycles. Then it starts injecting. This lazy reaction not only increases the zero-load latency, but also limits the saturation throughput. Since the proactive transfer of the critical bubble needs two cycles, we set the critical bubble stall STV to three cycles.
Real Application Performance
Fig. 20 shows the speedups relative to LBS for PARSEC workloads. FBFC supports higher network throughput, but system gains depend on workloads. FBFC benefits applications with heavy loads and bursty traffic. For blackscholes, fluidanimate and swaptions, different designs perform similarly. Their computation phases have few barriers and their working sets fit into caches, creating light network loads. They are unaffected by techniques improving network throughput, such as FBFC.
Network optimizations affect the other seven applications. Both CBS and FBFC see gains. The largest speedup of FBFC over LBS is 22.7 percent for canneal. Two factors bring the gains. First, these applications create bursty traffic and heavy loads. Second, the two VNs with hybrid-size packets have relatively high loads. Across the seven applications, VN0 and VN2 carry on average 70.8 percent of the load, including read request, writeback request, read response, writeback ACK and invalidation ACK. These relatively congested VNs emphasize FBFC's merit in delivering variable-size packets. Across all workloads, FBFC and CBS have average speedups of 13.0 and 7.5 percent over LBS.
Compared with synthetic traffic, the real application gains are lower. This is because the configured CMP places light pressure on network buffers. Other designs, such as concentration [20] or configuring fewer buffers, can increase the pressure. FBFC shows larger gains in these scenarios. For example, we also evaluate performance with five slots/VN. FBFC-C achieves an average 9.8 and maximum 20.2 percent speedup over CBS. Also, the application runtime of FBFC-C with five slots/VN is similar to that of LBS with 10 slots/VN.
Large-Scale Systems and Message Passing
The advance of CMOS technology will integrate hundreds or thousands of cores in a chip [9]. Some current many-core chips, including the 60-core Xeon Phi [28], use the shared memory paradigm. Yet, cache coherence faces scalability challenges with more cores. Message passing is an alternative paradigm. For example, the 64-core TILE64 uses a message passing paradigm [57]. It remains an open problem to design an appropriate paradigm for large-scale systems [41]. We evaluate FBFC for both paradigms. As a case study, we use a 256-core platform organized as a 16 × 16 torus.
In large-scale systems, assuming uniform communication among all nodes is not reasonable [47]. Workload consolidation [38] and application mapping optimizations [14] increase traffic locality; we leverage an exponential locality traffic model [47], which exponentially distributes packet hop counts. For example, with distribution parameter $\lambda = 0.5$, the average hop count is $1/\lambda = 2$, 95 percent of the traffic is within six hops, and 99 percent of the traffic is within 10 hops. We evaluate two distribution parameters, $\lambda = 0.5$ and $\lambda = 0.3$.
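The hop-count figures quoted for the exponential locality model can be reproduced numerically. This is a minimal sketch, assuming hop counts follow a continuous exponential distribution whose rate is the distribution parameter (called `lam` here, an assumed name); the traffic generator in [47] may discretize hop counts differently.

```python
import math


def mean_hops(lam):
    """Mean hop count of an exponential distribution with rate lam."""
    return 1.0 / lam


def fraction_within(hops, lam):
    """Fraction of traffic that travels at most `hops` hops (exponential CDF)."""
    return 1.0 - math.exp(-lam * hops)


for lam in (0.5, 0.3):
    print(
        f"parameter {lam}: mean {mean_hops(lam):.2f} hops, "
        f"{fraction_within(6, lam):.1%} within 6 hops, "
        f"{fraction_within(10, lam):.1%} within 10 hops"
    )
```

With a parameter of 0.5 this gives a mean of two hops, about 95 percent of traffic within six hops and about 99 percent within 10 hops, matching the text; a parameter of 0.3 lengthens the mean to roughly 3.3 hops.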
The packet length distribution of shared memory traffic is the same as the baseline configuration in Section 5; 80 percent of packets have one flit, and the others have five flits. All designs use 10 slots per VN. We assume that the packet length distribution of message passing traffic is similar to BlueGene/L; packet sizes range from 32 to 256 bytes [4]. With a 16-byte flit width, packet lengths are uniformly distributed between two and 16 flits. All designs use 32 slots per VN.
Fig. 21 shows the performance. The overall trends among different designs are similar to an 8 × 8 torus (Fig. 18). LBS and CBS are limited by inefficient buffer utilization. LBS is further limited by its high injection buffer requirement. Dateline's limitation for packet forwarding restricts its performance. FBFC efficiently utilizes buffers, and yields the best performance. The performance gaps between FBFC and other bubble designs depend on the distribution parameter $\lambda$. The smaller the $\lambda$, the larger the average hop count. Larger average hop counts emphasize efficient buffer utilization; thus, FBFC gets more performance gains. For shared memory traffic, FBFC-C performs 47.7 percent better than CBS with $\lambda = 0.5$, and this gain is 66.5 percent with $\lambda = 0.3$. Message passing traffic shows similar trends. FBFC-C performs 18.7 percent better than CBS with $\lambda = 0.5$, and the gain increases to 23.8 percent with $\lambda = 0.3$.
FBFC's gains for message passing traffic are lower than for shared memory traffic. With $\lambda = 0.5$, FBFC-C performs 105.2 percent better than LBS for shared memory traffic, while the gain is 68.8 percent for message passing traffic. The packet length distributions are different. For shared memory traffic, the average packet length is 1.8 flits, and LBS regards each packet as a five-flit packet. The average packet length of message passing traffic is nine flits, and LBS regards each packet as a 16-flit packet. The maximum packet length of shared memory traffic is ~2.8 times the average length, while it is 2 times for message passing traffic. This explains the drop of FBFC-C's gain for message passing traffic.
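The average packet lengths and the maximum-to-average ratios follow directly from the two packet length distributions. This is a minimal sketch using exact fractions; the dictionary layout and names are illustrative.

```python
from fractions import Fraction


def mean_length(dist):
    """Mean packet length in flits for a {length: probability} distribution."""
    return sum(Fraction(length) * prob for length, prob in dist.items())


# Shared memory: 80 percent one-flit packets, 20 percent five-flit packets.
shared_memory = {1: Fraction(8, 10), 5: Fraction(2, 10)}
# Message passing: lengths uniform over two to 16 flits (15 possible values).
message_passing = {n: Fraction(1, 15) for n in range(2, 17)}

for name, dist in (("shared memory", shared_memory),
                   ("message passing", message_passing)):
    avg = mean_length(dist)
    longest = max(dist)
    print(f"{name}: mean {float(avg):.1f} flits, max/mean {float(longest / avg):.2f}")
```

The shared memory ratio comes out at about 2.8, and the message passing ratio at about 1.8, which the text rounds to 2. The larger the ratio, the more buffer space a design wastes when it treats every packet as a maximum-length packet.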
OVERHEADS: POWER AND AREA
This section conducts power and area analysis of our designs. We also compare tori with meshes.
Methodology
We modify a NoC power and area model [5], which is integrated in Booksim [30]. We calculate both dynamic and static power. The dynamic power is formulated as $P = \alpha C V_{dd}^2 f$, with $\alpha$ the switching activity, $C$ the capacitance, $V_{dd}$ the supply voltage, and $f$ the frequency. The switching activities of NoC components are obtained from Booksim. The capacitance, including gate and wire capacitances, is estimated based on canonical modeling of component structures [5].
The static power is calculated as $P = I_{leak} V_{dd}$, with $I_{leak}$ the leakage current. The leakage current is estimated by taking account of both component structures and input states [5]. For example, the inserted repeater, composed of a pair of pMOS and nMOS devices, determines the wire leakage current. Since pMOS devices leak with high input and nMOS devices leak with low input, the repeater leaks in both high and low input states. The wire leakage current is estimated as the average leakage current of a pMOS and an nMOS device [5].
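The dynamic and static power expressions above translate into a few lines of code. This is a minimal sketch: only the 0.9 V supply and 1 GHz frequency come from this section, while the switching activity, capacitance and leakage currents below are hypothetical placeholders rather than parameters of the model in [5].

```python
def dynamic_power(alpha, capacitance_f, vdd_v, freq_hz):
    """Dynamic power in watts: switching activity * C * Vdd^2 * f."""
    return alpha * capacitance_f * vdd_v ** 2 * freq_hz


def static_power(i_leak_a, vdd_v):
    """Static power in watts: leakage current * Vdd."""
    return i_leak_a * vdd_v


def repeater_leakage(i_leak_pmos_a, i_leak_nmos_a):
    """Wire leakage estimated as the average of the pMOS and nMOS leakage."""
    return 0.5 * (i_leak_pmos_a + i_leak_nmos_a)


VDD = 0.9    # volts, as in the methodology
FREQ = 1e9   # hertz, as in the methodology

# Hypothetical wire: half of the cycles toggle a 1 pF load.
p_dyn = dynamic_power(alpha=0.5, capacitance_f=1e-12, vdd_v=VDD, freq_hz=FREQ)
# Hypothetical device leakage currents of 1 uA (pMOS) and 3 uA (nMOS).
p_static = static_power(repeater_leakage(1e-6, 3e-6), VDD)
print(f"dynamic {p_dyn * 1e3:.3f} mW, static {p_static * 1e6:.2f} uW")
```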
The router area is estimated based on detailed floorplans [5]. The wires are routed above other logic; the channel area only includes the repeater and flip-flop areas. The device and wire parameters are obtained from the ITRS report [29] for a 32 nm process, at 0.9 V and 70 °C. All designs are conservatively assumed to operate at 1 GHz.
We assume a 128-bit flit width. The channel length is 1.5 mm; an 8 × 8 torus occupies ~150 mm². Repeaters are inserted to make signals traverse a channel in one cycle. The number and size of repeaters are chosen to minimize energy. VCs use SRAM buffers. We assume four VNs to avoid protocol-level deadlock. Allocators use the separable input-first structure [20]. We leverage the segmented crossbar [56] to allow a compact layout and reduce power dissipation. Packets are assumed to carry random payloads; two sequential flits cause half of the channel wires to switch.
Power Efficiency
[...] overall power. These components are similar for all designs. The allocator consists of combinational logic, and it induces low power.
Bubble designs' credit channel has three wires; two bits encode four VCs and one valid bit. Dateline's credit channel has four wires; three bits encode eight VCs and one valid bit. The starvation channel of LBS and FBFC-L has six wires. Three bits identify one node among eight nodes of a ring. Two bits encode four VNs, and one valid bit. CBS's starvation channel uses six wires. 'N2C' and 'C2N' signals need two bits, and both signals use two bits to encode four VNs. FBFC-C handles two types of starvation; its starvation channel has 12 wires. These credit channels and starvation channels are narrower than flit channels; they induce low power.
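The sideband channel widths listed above follow from simple bit counting. This is a minimal sketch under the encodings described in the text (an identifier field plus a valid bit for credits; node id, VN id and valid bit for ring starvation; two signal bits plus one VN id per signal for CBS); the function names are illustrative.

```python
def id_bits(count):
    """Bits needed to encode one of `count` identifiers."""
    return (count - 1).bit_length()


def credit_wires(num_vcs):
    """Credit channel: VC identifier plus one valid bit."""
    return id_bits(num_vcs) + 1


def ring_starvation_wires(ring_nodes, num_vns):
    """LBS / FBFC-L starvation channel: node id, VN id and one valid bit."""
    return id_bits(ring_nodes) + id_bits(num_vns) + 1


def cbs_starvation_wires(num_vns):
    """CBS starvation channel: 'N2C' and 'C2N' bits plus a VN id for each."""
    return 2 + 2 * id_bits(num_vns)


bubble_credit = credit_wires(4)            # four VCs in total
dateline_credit = credit_wires(8)          # eight VCs in total
lbs_starve = ring_starvation_wires(8, 4)   # eight-node ring, four VNs
cbs_starve = cbs_starvation_wires(4)
fbfc_c_starve = lbs_starve + cbs_starve    # handles both starvation types
print(bubble_credit, dateline_credit, lbs_starve, cbs_starve, fbfc_c_starve)
```

All of these widths are at most 12 wires, compared with the 128-wire flit channels assumed in the methodology.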
To clarify differences, Fig. 22b shows the allocator and sideband channel power. Starvation channels are not needed for dateline. Yet, their activities are low. For example, FBFC-L's starvation channels keep idle until 34 percent injection rate. Dateline uses large allocators. Also, dateline's credit channel has one more wire than bubble designs. Dateline consumes higher power than bubble designs with 20 and 40 percent injection rates.
Although bubble designs' credit channels are narrower than starvation channels, two reasons cause credit channels to consume higher dynamic power. First, starvation channels are not needed for injection/ejection ports. An 8 × 8 torus has 384 credit channels and 256 starvation channels. Second, credit channels' activities are higher. VCT routers return one credit for each packet, and wormhole routers return one credit for each flit. Credit channels of LBS and CBS consume lower dynamic power than FBFC.
Fig. 23a evaluates the power-delay product (PDP). Compared with LBS and CBS, FBFC reduces latencies for heavy loads, improving PDP. With 20 and 30 percent injection rates, FBFC-L's PDP is 5.9 and 56.9 percent lower than LBS. FBFC-L's PDP is 27.6 percent lower than CBS at a 30 percent injection rate. Dateline's power and latency are similar to FBFC. Its PDP is similar to FBFC. We also evaluate other traffic patterns. The trends are similar to uniform random. All designs consume similar power, and FBFC's latency optimization reduces the PDP. For example, with transpose, FBFC-L's PDP is 34.2 percent lower than LBS at a 15 percent injection rate, and its PDP is 28.9 percent lower than CBS at a 20 percent injection rate.
To show the impact of network scaling on power efficiency, Fig. 23b gives the PDP on a 16 × 16 torus. The exponential locality traffic with $\lambda = 0.3$ in Section 7.7 is used; $\lambda = 0.5$ has a similar trend. Since FBFC optimizes latencies, it still offers power efficiency gains in larger NoCs. FBFC-L's PDP is 32.7 and 18.2 percent lower than LBS and CBS for a 30 percent injection rate, and its PDP is 26.6 percent lower than CBS for a 40 percent injection rate.
As shown in Section 7.3, with half as many buffers, FBFC performs the same as CBS in an 8 × 8 torus. With one third of the buffers, FBFC performs similarly to LBS. In summary, with the same buffer size, all designs consume similar power. FBFC's starvation channels induce negligible power. Since FBFC significantly outperforms existing bubble designs, it achieves much lower PDP, in both 8 × 8 and 16 × 16 tori. When bubble designs perform similarly with different buffer sizes, FBFC consumes lower power and offers PDP gains.
Area
Fig. 25 shows the area results. The areas of flit channel and crossbar are similar for all designs. With the same buffer amount, the area differences among designs are mainly due to the allocator, credit channel and starvation channel. Dateline's allocator is ~2 times larger than that of bubble designs; this causes dateline to consume the most area. With 10 slots per VN, dateline's overall area is 7.4 percent higher than FBFC-L. When bubble designs perform similarly, FBFC's area benefit is more significant. CBS's network area with 10 slots per VN is 15.6 percent higher than FBFC-C with five slots per VN. The overall area of LBS with 15 slots per VN is 31.9 percent higher than FBFC-C with five slots per VN.
Comparison with Mesh
We compare tori with meshes. The routing is DOR. The torus uses FBFC to avoid deadlock. Based on a floorplan model [5], the mesh channel length is 33.3 percent shorter than the torus one; it is 1.0 mm. Inserted repeaters make the channel delay one cycle. Two meshes are evaluated. One uses a 128-bit channel width, which is the same as the torus. Yet, its bisection bandwidth is half of the torus. The other mesh uses a 256-bit width to achieve the same bisection bandwidth as the torus. All networks have the same packet size distribution. The 128-bit width mesh uses 10 slots per VN. Its traffic has 80 percent five-flit packets and 20 percent one-flit packets. The 256-bit width mesh uses five slots per VN. Its traffic has 80 percent three-flit packets and 20 percent one-flit packets.
Fig. 26 gives the performance. Since flit sizes of evaluated networks are different, the 'injection rate' is measured in 'packets/node/cycle'. The torus' wraparound channels reduce hop counts. For both patterns, the torus shows ~20 percent lower zero-load latencies than the mesh. With half of the bisection bandwidth, mesh-128bits' saturation throughput is 24.2 and 37.0 percent lower than the torus for uniform random and transpose. With the same bisection bandwidth, mesh-256bits' saturation throughput is similar to the torus for uniform random. The transpose congests the mesh's center portion [43]; mesh-256bits' saturation throughput is still 17.3 percent lower than the torus.
Fig. 27 shows the power and area. With the same channel width, mesh-128bits' static power is 30.5 percent lower than the torus; this is due to the optimization of buffers and flit channels. An 8 × 8 mesh has 288 ports with buffers, while an 8 × 8 torus has 320 ports. Two factors bring power reduction for mesh channels. First, an 8 × 8 mesh has 352 flit channels, while an 8 × 8 torus has 384 flit channels. Second, mesh channels are 33.3 percent shorter than torus ones.
Thus, even 256-bit mesh channels consume less static power than torus ones. Yet, the 256-bit channel width quadratically increases crossbar power. The mesh-256bits' overall static power is 77.7 percent higher than the torus.
With high loads, mesh's center congestion increases dynamic power; mesh-128bits' benefit over FBFC-L decreases. With 10 and 20 percent injection rates, its power is 15.7 and 10.9 percent less than FBFC-L. With a 20 percent injection rate, mesh-256bits' channel power is higher than FBFC-L. The PDP reflects power efficiency. With a 10 percent injection rate, FBFC-L's PDP is 6.4 and 64.9 percent less than mesh-128bits and mesh-256bits. Its power efficiency increases with high loads. With a 20 percent injection rate, FBFC-L's PDP is 33.7 percent less than mesh-128bits.
The network area of mesh-128bits is 17.4 percent less than FBFC-L. The mesh-128bits uses fewer buffers and channels. Due to the large crossbar of mesh-256bits, its network area is 107.2 percent higher than FBFC-L.
We also compare a 16 × 16 mesh with a 16 × 16 torus. The static power and area benefits of mesh-128bits decrease with larger networks. The mesh's benefit of using fewer flit channels and ports decreases. In a 64-node network, the torus has 9.1 percent more flit channels and 11.1 percent more ports than the mesh. These numbers are 4.3 (1,536 versus 1,472) and 5.3 percent (1,280 versus 1,216) in a 256-node network. The static power of mesh-128bits in a 16 × 16 mesh is 24.3 percent less than that of a 16 × 16 torus, and its network area is 13.4 percent less. The PDP favors FBFC even more. FBFC-L's PDP is 24.5 and 57.9 percent less than mesh-128bits and mesh-256bits at a 15 percent injection rate for the exponential locality traffic with $\lambda = 0.3$.
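The flit channel and buffered port counts quoted for both network sizes can be reproduced by counting. This is a minimal sketch, assuming unidirectional network channels, one injection and one ejection channel per node, and buffers at network input ports and injection ports only; these counting rules are inferred from the numbers in the text rather than stated by the power model.

```python
def torus_counts(k):
    """(flit channels, buffered ports) of a k x k torus."""
    nodes = k * k
    network_channels = 4 * nodes                   # every node drives four links
    flit_channels = network_channels + 2 * nodes   # plus injection and ejection
    buffered_ports = network_channels + nodes      # network inputs plus injection
    return flit_channels, buffered_ports


def mesh_counts(k):
    """(flit channels, buffered ports) of a k x k mesh (no wraparound links)."""
    nodes = k * k
    network_channels = 4 * k * (k - 1)             # edge nodes have fewer links
    flit_channels = network_channels + 2 * nodes
    buffered_ports = network_channels + nodes
    return flit_channels, buffered_ports


def percent_more(a, b):
    """How many percent larger a is than b."""
    return 100.0 * (a - b) / b


for k in (8, 16):
    t_ch, t_ports = torus_counts(k)
    m_ch, m_ports = mesh_counts(k)
    print(
        f"{k}x{k}: channels {t_ch} vs {m_ch} (+{percent_more(t_ch, m_ch):.1f}%), "
        f"ports {t_ports} vs {m_ports} (+{percent_more(t_ports, m_ports):.1f}%)"
    )
```

The torus overhead shrinks with network size because the number of wraparound links grows with k while the total number of links grows with k squared.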
In summary, with the same channel width, the mesh consumes less power and area than the torus, but its performance is poor due to limited bisection bandwidth. With FBFC applied, the torus is more power efficient than the mesh for the same channel width. With the same bisection bandwidth, the mesh consumes much higher power than the torus. FBFC on the torus scales well.
DISCUSSIONS AND RELATED WORK
Discussions
FBFC efficiently addresses the limitations of existing designs. It is an important extension to packet-size bubble theory. The insight of 'leaving one slot empty' enables other design choices. For example, combining with dynamic packet fragmentation [24], a packet can inject with one free normal slot. When one flit's injection will consume the critical slot, this packet stops injecting and changes the waiting flit into a head flit. This design allows VC depths to be shallower than the largest packet. Based on a similar insight, an efficient deadlock avoidance design is proposed for wormhole networks [12]. Its basic idea is coloring buffer slots into white, gray or black to convey global buffer status locally. This is different from our design. FBFC uses local buffer status with hybrid flow control which combines VCT and wormhole. Also, we mainly focus on improving the performance for coherence traffic, which consists of both long and short packets.
Similar to dateline and packet-size bubble designs, FBFC is a general flow control. It can be adopted in various topologies as long as there is a ring in the network. For example, Immunet [50] achieves fault tolerance by constructing a ring in arbitrary topologies for connectivity. The ring uses LBS to avoid deadlock. Instead, FBFC can be used; it will support higher performance with fewer buffers. MRR [2] leverages the ring of the rotary router [1] to support multicast. The ring uses LBS. FBFC can be used as well. By configuring a ring in the network, FBFC can support both unicast and multicast for streaming applications [3]. Also, FBFC can support fully adaptive routing [11], [49]. Bubble designs use one VC; there is head-of-line blocking. FBFC can combine with dynamically allocated multi-queue [46], [54] to mitigate this blocking and further improve buffer utilization.
Related Work
The Ivy Bridge [21], Xeon Phi [28] and Cell [32] use ring networks. The ring is much simpler than the 2D or high dimensional torus, and it is easy to avoid deadlock through end-to-end backpressure or centralized control schemes [32], [34]. The ring networks of these chips [21], [28], [32] guarantee injected packets cannot be blocked, which is similar to bufferless networks. Bufferless designs generally do not consider deadlock as packets are always movable [24]. Our research is different. We focus on efficient deadlock avoidance designs for buffered networks, which support higher throughput than bufferless networks. Apart from dateline [20], LBS [10] and CBS [11], there are other designs. Priority arbitration is used for single-flit packets with single-cycle routers [34]. Prevention flow control combines priority arbitration with prevention slot cycling [31]; it has deadlock with variable-size packets. Turn model [26] only allows non-minimal routing in tori. A design allows deadlock formation, and then applies a recovery mechanism [52].
FBFC observes that most coherence packets are short. Several designs use this observation, including packet chaining [42] , the NoX router [27] and whole packet forwarding [39] . Configuring more VNs, such as 7 VNs in Alpha 21364 [44] , can eliminate co-existence of variable-size packets. Yet, additional VNs have overheads; using minimum VNs is preferable. DASH [37] , Origin 2000 [36] , and Piranha [6] all apply protocol-level deadlock recovery to eliminate one VN; they utilize two VNs to implement three-hop directory protocols. These VNs all carry variable-size packets.
CONCLUSION
Optimizing NoCs for coherence traffic improves the efficiency of many-core coherence protocols. We observe two properties of cache coherence traffic: short packets dominate the traffic, and short and long packets co-exist in NoC. Then we propose an efficient deadlock avoidance theory, FBFC, for torus networks. It maintains one free flit-size buffer slot to avoid deadlock. Only one VC is needed, which achieves high frequency. Also, FBFC does not treat short packets as long ones; this yields high buffer utilization. With the same buffer size, FBFC significantly outperforms LBS and CBS, and is more power efficient as well. When bubble designs perform similarly, FBFC consumes much less power and area. With FBFC applied, the torus is more power efficient than the mesh.
Zonglin Liu received the BS and PhD degrees in mechanical engineering from the National University of Defense Technology (NUDT) in 1998 and 2004, respectively. He is currently an associate professor at the School of Computer, NUDT. His research interests include general purpose processor and DSP architecture and designs, and VLSI logic designs.
Natalie Enright Jerger received the BASc degree from the Department of Electrical and Computer Engineering at Purdue University, and the MSEE and PhD degrees from the Department of Electrical and Computer Engineering, University of Wisconsin-Madison. She is currently an assistant professor in the Electrical and Computer Engineering Department at the University of Toronto. Her research interests include on-chip networks, many-core architectures, and cache coherence protocols. She is a senior member of the IEEE and a member of the ACM.
