Many resources are shared among the cores of chip-multiprocessors (CMPs), in particular on-chip caches and memory systems. Efficient intra-chip communication is necessary for efficient resource sharing and for the performance of such systems, especially in future CMPs with hundreds or thousands of cores. Free-space optical networks-on-chip (NoCs) offer the potential to avoid the reduced wire performance and degraded signal integrity facing electronic networks. However, current proposals use fixed-direction lasers and mirrors to realize one-hop all-to-all connectivity, which makes it difficult to scale to larger numbers of processors. In this paper we present two-hop optical strategies that provide better performance than the one-hop strategy while improving on both the required resources and scalability for future large-scale CMPs.
INTRODUCTION
A crucial factor in improving the performance and scalability of chip-multiprocessors (CMPs) is reducing data access latency. Efficient networks-on-chip are essential for CMPs to achieve good performance, since they carry the cache coherence and data traffic. In the ideal case, any two communicating nodes should be one hop away from each other. In a system of N nodes, this one-hop all-to-all connectivity requires the realization of O(N^2) links, which is prohibitively expensive if done using electronics.
BACKGROUND: A 1-HOP FREE-SPACE OPTICAL NOC
We briefly describe in this section the baseline 1-hop free-space optical NoC proposed in [20]. VCSELs (vertical-cavity surface-emitting lasers) and PDs (photodetectors) operate at 40 GHz, so that during one cycle of a 3.3 GHz processor one VCSEL can transmit 12 bits of data. Network packets are classified into meta and data packets. Meta packets represent coherence messages such as data requests, invalidations, and acknowledgments, while data packets carry the actual data of cache lines. Assuming a flit size of at least 64 bits, meta packets are one flit long, while the size of data packets varies with the cache line size or the amount of data transmitted. For example, in [20], a data packet is 5 flits long and carries 32 bytes of data (one cache line). Assuming a system of N nodes and a link width of k VCSELs, each node needs k(N − 1) VCSELs to communicate with all other nodes in 1 hop; thus the number of VCSELs per node scales linearly with the number of nodes in the system. Receivers, however, are shared rather than dedicated. Consequently, each node has only a constant k PDs to receive from all other nodes, regardless of N. Unfortunately, this results in collisions at the receiver that must be detected and, if possible, avoided.
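The link arithmetic in this section can be checked with a short sketch. The function names are illustrative and are not taken from [20]; the value k = 6 used in the loop is the data-lane width quoted later in the evaluation and serves here only as an example.

```python
# Sketch of the 1-hop link arithmetic; all names are illustrative assumptions.

def bits_per_cycle(vcsel_ghz: float = 40.0, core_ghz: float = 3.3) -> int:
    """Whole bits one VCSEL can transmit during one processor cycle."""
    return int(vcsel_ghz / core_ghz)

def vcsels_per_node_1hop(n_nodes: int, k: int) -> int:
    """Dedicated transmitters: k VCSELs for each of the other N - 1 nodes."""
    return k * (n_nodes - 1)

def pds_per_node_1hop(k: int) -> int:
    """Shared receivers: a constant k PDs, independent of N."""
    return k

if __name__ == "__main__":
    print(bits_per_cycle())  # 12
    for n in (64, 100, 144):
        # Transmitters grow linearly with N while receivers stay constant.
        print(n, vcsels_per_node_1hop(n, k=6), pds_per_node_1hop(k=6))
```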
Reducing Probability of Collisions: For packets that need multiple cycles to be transmitted, slotting [14] and lane separation are two techniques for reducing collision probability. For example, transmission of a 5-flit packet can only start at the beginning of a 5-cycle slot. Since meta and data packets differ in length, packet transmission is separated into a meta lane of width kM VCSELs and a data lane of width kD VCSELs, respectively. Collision probability is further reduced through doubling the receiving bandwidth per lane such that half the N − 1 nodes send packets to one receiver and the other half sends to the other.
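As a minimal sketch of slotting, the helper below computes the earliest cycle at which a packet may start transmitting. It assumes that slots are aligned to cycle 0 and that the slot length equals the packet length in cycles; both the alignment and the function name are assumptions made for illustration.

```python
# Slotting sketch: a packet may only start at a slot boundary.
# Assumes slot boundaries fall at multiples of slot_len, starting from cycle 0.

def next_slot_start(cycle: int, slot_len: int) -> int:
    """Earliest cycle >= `cycle` at which a slot of `slot_len` cycles begins."""
    if slot_len <= 0:
        raise ValueError("slot_len must be positive")
    # Ceiling division without floating point.
    return -(-cycle // slot_len) * slot_len

if __name__ == "__main__":
    # A 5-flit data packet that becomes ready at cycle 7 waits for cycle 10.
    print(next_slot_start(7, 5))
```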
Detecting and avoiding collisions: The id of the sending node, denoted by PID, along with its bitwise complement, $\overline{PID}$, is encoded into the first flit of each packet. Due to slotting, the first flit of a packet arrives on the first cycle of the transmission slot. On a successful transmission, i.e., one with no collision, the receiver sends a one-bit confirmation signal to the sender at cycle c + 2, where c is the start cycle of the slot. A collision is detected because light from one flit overrides darkness from a colliding flit, so at least one bit of PID or $\overline{PID}$ is flipped; the two fields then no longer complement each other and no confirmation is sent. The absence of a confirmation signal tells each sender that a collision occurred at the receiver and that it must retransmit its packet. An exponential back-off heuristic is used in [20].
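The detection rule can be illustrated with a small model in which overlapping light is represented as a bitwise OR of the colliding header fields. The 8-bit PID width and all function names are assumptions chosen for the example; the header layout in [20] may differ.

```python
from typing import Iterable, Tuple

PID_BITS = 8                      # assumed field width, enough for 256 nodes
MASK = (1 << PID_BITS) - 1

def encode_header(pid: int) -> Tuple[int, int]:
    """Return the (PID, complement-of-PID) pair placed in the first flit."""
    return pid & MASK, ~pid & MASK

def superimpose(headers: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Light overrides darkness, so the receiver sees the OR of all senders."""
    pid, comp = 0, 0
    for p, c in headers:
        pid |= p
        comp |= c
    return pid, comp

def is_collision_free(pid: int, comp: int) -> bool:
    """The two fields must be exact bitwise complements of each other."""
    return (pid ^ comp) == MASK

if __name__ == "__main__":
    alone = superimpose([encode_header(5)])
    clash = superimpose([encode_header(5), encode_header(9)])
    print(is_collision_free(*alone), is_collision_free(*clash))
```

Two distinct PIDs differ in at least one bit position; at that position both the PID field and the complement field receive light, so the complement check fails.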
A 2-HOP FREE-SPACE OPTICAL NOC
We now describe in detail our proposed 2-hop free-space optical NoC. In the data lane, this interconnect connects any two nodes using at most 2 optical links. To offset the delay introduced by the extra optical link, compared to the 1-hop optical NoC in section 2, we need to at least double the link bandwidth. Increasing the link bandwidth benefits data packets, which consist of multiple flits. For single-flit meta packets, we retain the 1-hop all-to-all optical connectivity.
Topology and Routing: Assuming the N nodes are laid out in a rectangular n × m grid, each node i_{x,y}, which lies on row y and column x of the grid, is directly connected through an optical link to every other node located on the same row, y, or the same column, x, as depicted in Fig. 1(a). Thus, we trade a smaller number of links for increased link bandwidth. For square grids the number of data packet transmitters per node scales linearly with √N − 1, compared to N − 1 per node for the all-to-all network, leading to growing savings in the number of transmitters as the number of nodes increases. In general, the savings depend on the values of n and m, where N = n × m, as is evident from the formulas for the number of data VCSELs per node shown in Table 1.
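The scaling argument can be made concrete by counting links per node directly from the topology description. This is a sketch that counts one link to every other node in the same row and column; the parameter `bw` (the link-width multiplier relative to the baseline) and the function names are assumptions, and the code does not attempt to reproduce the formulas of Table 1.

```python
# Link and VCSEL counts per node, derived only from the row/column topology.

def data_links_per_node_all_to_all(n_nodes: int) -> int:
    """1-hop baseline: one dedicated link to every other node."""
    return n_nodes - 1

def data_links_per_node_2hop(rows: int, cols: int) -> int:
    """2-hop grid: links to the other nodes in the same row and same column."""
    return (cols - 1) + (rows - 1)

def data_vcsels_per_node_2hop(rows: int, cols: int, k_d: int, bw: float) -> int:
    """Each 2-hop link is assumed to be bw times as wide as a baseline link."""
    return round(bw * k_d) * data_links_per_node_2hop(rows, cols)

if __name__ == "__main__":
    for side in (8, 10, 12):
        n = side * side
        print(side,
              data_links_per_node_all_to_all(n),
              data_links_per_node_2hop(side, side),
              data_vcsels_per_node_2hop(side, side, k_d=6, bw=2))
```

For a square grid the 2-hop link count is 2(√N − 1), which is the linear dependence on √N − 1 noted in the text.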
Data packets are routed using the X-Y routing algorithm, where a packet is first transmitted in the X-direction (i.e., horizontally) and then in the Y-direction (i.e., vertically). For example, to send a packet from node i_{x,y} to node i_{x',y'}, where x ≠ x' and y ≠ y', the packet is first transmitted to the intermediate node i_{x',y} and then to the final destination, i_{x',y'}. Of course, if the source and destination nodes are located on the same row or the same column, then transmission occurs only in the X- or the Y-direction, respectively. Collisions are possible in this interconnect, and we use the same mechanisms employed by the 1-hop optical NoC to reduce collision probability, detect collisions, and handle retransmissions of packets.
Preventing Deadlock: Consider the example in Fig. 1(b): node A wants to send packet m1 to node C, and node D wants to send packet m2 to node B. If B and C lie on the same column, A lies on the same row as B, and D lies on the same row as C, then according to the X-Y routing policy, A should send m1 to B, which then relays it to C. Similarly, D should send m2 to C, which relays it to B. If A and D successfully send to B and C, respectively, and the receiver buffers R_B and R_C, which store m1 and m2 in B and C, respectively, have no free space to receive other packets, a deadlock forms: R_C cannot accept m1, which remains in R_B, thus making R_B unable to accept m2, which in turn remains in R_C. To prevent deadlock, we take advantage of the two sets of PD receivers from [20], dedicating the first to X-direction traffic and the second to Y-direction traffic. Specifically, let RDX and RDY denote the two sets of kD PDs for receiving X- and Y-direction traffic, respectively. Only the RDX receiver buffer may send a packet to an RDY receiver, and not vice versa, since packets received by RDY are always delivered to the local node; this prevents deadlock.
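The routing rule and the receiver separation can be sketched as follows. Nodes are modeled as (x, y) pairs and the receiver sets as plain strings; these representations and the function names are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y): column x, row y

def xy_route(src: Node, dst: Node) -> List[Node]:
    """Nodes visited after leaving src: X-direction hop first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    hops: List[Node] = []
    if sx != dx:
        hops.append((dx, sy))  # travel along row sy to the destination column
    if sy != dy:
        hops.append((dx, dy))  # travel along column dx to the destination row
    return hops

def receiver_set(prev: Node, cur: Node) -> str:
    """RDX receives traffic that arrived along a row, RDY along a column."""
    return "RDX" if prev[1] == cur[1] else "RDY"

def trace(src: Node, dst: Node) -> List[Tuple[Node, str]]:
    """Each hop of the route paired with the receiver set that accepts it."""
    out, prev = [], src
    for cur in xy_route(src, dst):
        out.append((cur, receiver_set(prev, cur)))
        prev = cur
    return out

if __name__ == "__main__":
    print(trace((0, 0), (3, 2)))
```

In every route produced this way, a hop received on RDY is the final hop, so an RDY buffer never waits on another node's buffer; only RDX buffers forward packets.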
Router Design: The router design is very similar to that of the baseline 1-hop optical NoC described in section 2, with additional support for relaying packets received by intermediate nodes to their intended recipients. The router has two input buffers, one for data and one for meta packets, and 4 output (receiver) buffers, of which 2 are for data and 2 are for meta packets. We focus on the data packets, since the meta lane component is unchanged from [20]. In the data lane, an arbiter is added between the local node's data input buffer and the RDX output buffer, which is used to buffer data flits for retransmission in the Y-direction. Consider the diagram of the router shown in Fig. 2. The flits of data packets injected by the local node are first stored in the input buffer and then transmitted as per the X-Y routing policy. The output buffers store the flits received by the PDs of RDX and RDY. All packets received by RDY are delivered to the local node, while packets received by RDX may either be intended for the local node or require retransmission to their final destinations.
Regarding the receivers of meta packets, we refer to the two sets of kM PDs as RME and RMO (Fig. 2) , where even numbered nodes send to RME and odd numbered nodes send to RMO.
Reducing Serialization Delay (shifting the slot of data transmission in the Y-direction): Consider the common case where the data packet is transmitted in both the X and Y directions. When the head (first) flit of a data packet is received by RDX, the router examines it to decide whether to deliver the packet to the local node or transmit it to the final destination node. If the transmission slots for sending in the X and Y directions coincide in time, a packet sent in the X-direction at slot s_j cannot be sent in the Y-direction during the same slot, since the router cannot simultaneously receive and retransmit the same flit. As the example in Fig. 3 shows, an L-flit packet then needs 2L cycles to reach its destination (assuming no queuing, contention, or collision delays), as follows: 1 (head flit traverses the optical link to the intermediate node) + (L − 1) (the remaining flits of the packet traverse the first optical link while the head flit waits at the intermediate node for the next transmission slot) + 1 (head flit is transmitted to the destination node) + (L − 1) (the remaining flits follow the head flit) = 2L cycles. However, if the Y-direction transmission slots lag the X-direction slots by 2 cycles, which allows the router to examine the flit and copy it to the transmission buffer if needed, the packet can reach its destination in only L + 2 cycles (see the example in Fig. 3), as follows: 1 (head flit traverses the optical link to the intermediate node) + 2 (head flit is examined at the intermediate node and transmitted to the destination node) + (L − 1) (the remaining packet flits follow the head flit) = L + 2 cycles, again assuming no queuing, contention, or collision delays. Table 2 lists the formulas for the least (best) packet latencies in each NoC as a function of the packet size in flits, L.
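The two best-case latency expressions derived above can be compared numerically. The sketch below encodes only the 2L and L + 2 formulas from the text; the 1-hop expression (L cycles, by the same accounting of one link traversal per flit) is an assumption added for comparison and is not quoted from Table 2.

```python
# Best-case latencies in cycles, ignoring queuing, contention, and collisions.

def latency_1hop(flits: int) -> int:
    """Assumed: head flit takes 1 cycle, the remaining L - 1 flits follow."""
    return 1 + (flits - 1)

def latency_2hop_aligned(flits: int) -> int:
    """X and Y slots coincide: the packet waits for the next slot, 2L total."""
    return 1 + (flits - 1) + 1 + (flits - 1)

def latency_2hop_shifted(flits: int, lag: int = 2) -> int:
    """Y slots lag X slots by `lag` cycles: 1 + lag + (L - 1) = L + 2."""
    return 1 + lag + (flits - 1)

if __name__ == "__main__":
    for flits in (2, 4, 5, 10):
        print(flits,
              latency_1hop(flits),
              latency_2hop_aligned(flits),
              latency_2hop_shifted(flits))
```

The shifted schedule only helps when L + 2 < 2L, that is, for packets longer than 2 flits; at L = 2 both schedules take 4 cycles.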
A MULTICOLOR 2-HOP NOC
We now describe our proposed multicolor 2-hop NoC. Nodes are partitioned into C groups, where each group of nodes is conceptually considered to have a distinct color. We classify links into intra-color and inter-color links. As their names suggest, an intra-color link connects two nodes of the same color, while an inter-color link connects two nodes of different colors. We achieve a maximum of 2-hop communication between any two nodes by having: 1) all-to-all intra-color links connecting the nodes of the same color, and 2) each node of color c connected by an inter-color link to exactly one node of each color c' ≠ c. Thus, any two nodes of the same color and any two nodes that are connected by an inter-color link can communicate in 1 hop. Consider the communication between a node i_c of color c and a node j_{c'} of color c' ≠ c that are not directly connected by an inter-color link. Assume i_c wants to send a data packet, m, to j_{c'}. Node i_c first sends m over an intra-color link to the node k_c, which is also of color c and which is connected to j_{c'} through an inter-color link. Second, k_c sends m to j_{c'} over their connecting inter-color link. In the absence of delays due to queuing or collisions, an L-flit packet traveling on a 2-hop path needs L + 2 cycles to reach its destination, as follows: 1 (head flit traverses the intra-color link to the intermediate node) + 2 (head flit is examined at the intermediate node and transmitted to the destination node over the inter-color link) + (L − 1) (the remaining packet flits follow the head flit) = L + 2 cycles.
Similar to the 2-hop optical NoC described in section 3, we use at least double the link bandwidth of the baseline 1-hop NoC. With free-space optical communication collisions can occur. Similar to the 1-hop NoC, we use slotting and lane separation between data and meta packets (for meta packets we retain the all-to-all 1-hop connectivity among all the N nodes). We also double the receiving bandwidth to reduce collisions, and use the same mechanisms of collision detection and exponential back-off for retransmitting collided packets.
We now compute the number of data transmitters (VCSELs) required per node for a C-color 2-hop optical NoC. Let N_i be the set of nodes of color i, where i ∈ S = {0, ..., C − 1} and S is the set of C colors. By design, we have $\sum_{i=0}^{C-1} |N_i| = N$, where |N_i| is the cardinality of N_i. If N is divisible by C, then |N_i| = N/C ∀ i ∈ S (the formulas in Table 3 assume that N is divisible by C). In general, a node in N_i has |N_i| − 1 intra-color links and C − 1 inter-color links, so the number of required VCSELs for a node ∈ N_i is:

$$x k_D \left( |N_i| - 1 \right) + x k_D \left( C - 1 \right) = x k_D \left( |N_i| + C - 2 \right),$$

where $k_D$ is the number of VCSELs per link in the baseline 1-hop NoC, and $x k_D$ is the number of VCSELs per link in this C-color 2-hop NoC.
Deadlock-Free Routing: Because this is a 2-hop scheme like the X-Y scheme, deadlock can also occur unless we separate the receivers as described above in section 3.
As an example, we describe the topology and router design for a 3-color 2-hop NoC. We assume the N nodes are laid out in a rectangular n × m grid. We partition the nodes into 3 subsets of almost equal size (subset sizes may differ by 1). Each node of color c ∈ S = {0, 1, 2} is directly connected through an intra-color optical link to every other node of the same color, c. We can color the nodes such that the inter-color links run only between adjacent nodes. Although we have described the architecture in terms of optical inter-color links, the locality of these links actually allows them to be implemented more efficiently in electronics rather than optics. Such an implementation would reduce power, as shown by Table 4. In the rest of this paper, we only discuss multicolor architectures with electronic inter-color links.
Fig. 4(a) shows an example of the node coloring scheme: starting with the top left node, visit the nodes in a snake-like fashion, i.e., visit the nodes of the top row from left to right, then visit the nodes of the second row from right to left, and so on. Color the first visited node 0. Set the color of the next visited node, c_{i+1}, to (c_i + 1) mod 3, where c_i is the color of the last visited node. Electronic links are laid in a snake-like fashion, as in Fig. 4(a). We add two extra nodes, i_h and i_t, adjacent to the head node of the snake (the top left node) and to the tail node of the snake, respectively. These extra nodes have optical receivers but no transmitters, and each is connected through an electronic link to the adjacent node at the snake's end (see Fig. 4(a)).
The node i_h is colored 2, while i_t is colored (c_l + 1) mod 3, where c_l is the color of the tail node of the snake. The nodes i_h and i_t act only as intermediate nodes in 2-hop paths; they ensure that the head and tail nodes of the snake are each connected to nodes of both of the other two colors.
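The snake coloring and the choice of intermediate node can be sketched as below. The chain representation (a list with the extra nodes at both ends), the string labels "i_h" and "i_t", and the function names are assumptions made for illustration; the sketch covers only the general snake layout, not the ring-based special case.

```python
from typing import List, Optional, Tuple, Union

Node = Union[Tuple[int, int], str]  # (row, col) grid node, or an extra node label

def snake_order(rows: int, cols: int) -> List[Tuple[int, int]]:
    """Visit row 0 left to right, row 1 right to left, and so on."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order

def build_chain(rows: int, cols: int) -> Tuple[List[Node], List[int]]:
    """Snake of grid nodes with i_h before the head and i_t after the tail."""
    order = snake_order(rows, cols)
    colors = [i % 3 for i in range(len(order))]   # head is 0, then +1 mod 3
    chain: List[Node] = ["i_h"] + order + ["i_t"]
    chain_colors = [2] + colors + [(colors[-1] + 1) % 3]
    return chain, chain_colors

def relay_for(src: Node, dst: Node, chain: List[Node],
              chain_colors: List[int]) -> Optional[Node]:
    """Intermediate node for src -> dst, or None when one hop suffices.

    dst must be a grid node, so both of its chain neighbors exist.
    """
    s, d = chain.index(src), chain.index(dst)
    if chain_colors[s] == chain_colors[d] or abs(s - d) == 1:
        return None  # direct intra-color link or direct electronic link
    for nb in (d - 1, d + 1):
        if chain_colors[nb] == chain_colors[s]:
            return chain[nb]  # same color as src, electronically linked to dst
    raise ValueError("coloring does not provide a relay")

if __name__ == "__main__":
    chain, colors = build_chain(4, 4)
    print(relay_for((0, 0), (1, 2), chain, colors))
    print(relay_for((0, 2), (0, 0), chain, colors))
```

Because colors advance by 1 mod 3 along the chain, the two chain neighbors of any grid node carry the two other colors, so exactly one of them matches the sender's color.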
The i_h and i_t nodes are not needed in the special case where there is an even number of rows and the number of columns is divisible by 3, e.g., a 6x6 CMP (see Fig. 4(b)). Instead of the snake topology, we consider the rows of nodes in pairs starting from the top, and use an electronic ring to connect the nodes of each pair of rows.
Router Design: This router has two additional buffers compared to the 2-hop NoC (section 3) or the baseline 1-hop NoC, for storing the flits received over the two electronic links. Because the buffers used for the electronic links are different from the buffers used for the optical links, deadlock cannot occur. Since we have 2 sets of optical data receivers, we partition them into RDE and RDO, where even-numbered nodes send to RDE and odd-numbered nodes send to RDO. For each packet, m, received by either RDE or RDO, the router decides whether to deliver m to the local node or transmit it over an electronic link to one of the two neighboring nodes. Note that the router arbitrates between the packets in the buffers of RDE and RDO and the data input buffer, as more than one data packet may need to go over the same electronic link. The delivery to the local node considers the packets received through the two electronic links in addition to the RDE, RDO, RME, and RMO receivers. Fig. 5 shows a diagram of the router design.
RELATED WORK
Optical interconnections have been extensively studied for multiprocessor systems (see for example [4, 9, 12, 13]), and advances in nanophotonics [10, 15] have motivated many recent optical NoC proposals. Shacham et al. [16] propose a hybrid waveguide-based optical NoC architecture with a complementary electronic NoC for control messages. Kurian et al. [8] propose a hierarchical interconnection with a global optical (bus-like) network that utilizes wavelength division multiplexing (WDM) to avoid contention. In [11], Morris Jr. and Kodi propose a multilevel nanophotonic interconnect that combines WDM, space division multiplexing (SDM), optical tokens, and nanophotonic crossbars to develop a two-hop network for CMPs. Zhang and Louri [22] propose a multilayer photonic network-on-chip, which leverages 3D integration to provide global crossbar-like connectivity using layers of waveguides.
EVALUATION
In this section we compare our proposed 2-hop optical NoC (2-hop) and multicolor 2-hop NoC (3-color) from sections 3 and 4, respectively, with the baseline 1-hop optical NoC (1-hop).
Evaluation Methodology
To explore the NoC design space we consider several parameters, including: (a) the size of the CMP, ranging from 8x8 to 12x12; (b) the ratio of the link bandwidth of the 2-hop and 3-color NoCs to the link bandwidth of the 1-hop NoC, denoted BW, for which we consider BW = 2, 2.5 and 3.33; (c) different sizes of data packets, corresponding to cache line sizes of 32, 64, and 128 bytes; and (d) various communication-to-computation (C2C) ratios.
Due to this relatively large design space, we take a 2-step approach to evaluation. First, we explore the design space with synthetic traces. Synthetic traces allow us both to keep the simulation time manageable and to easily vary the C2C ratio. In our traces each node injects 20,000 data requests to random destinations. The receiver of a data request replies with a data packet to the requester. We evaluate with the following request injection rates: r = 0.5, 1, 3, 5, and 10 requests per 100 cycles.
Second, we use benchmarks from the SPLASH2 [18] and PARSEC [1] suites to more rigorously evaluate a smaller subset of the design space. We use the functional simulator Simics [17] to simulate running these benchmarks. To keep the simulation time manageable, we simulate a CMP of size 8x8. Each core has a private L1 I/D cache and a slice of the distributed shared L2 cache [6]. The CMP and cache parameters are listed in Table 5.
We base our network parameters on the ones used for the evaluation of the baseline 1-hop NoC [20]. In [20], a meta packet is 72 bits and a data packet is 360 bits. The width of the meta lane is kM = 3 VCSELs, which transmits a meta packet in 2 cycles. The width of the data lane is kD = 6 VCSELs, which transmits a 5-flit data packet in 5 cycles. Since a 5-flit data packet carries the data of a 32-byte cache line [20], we use 10- and 20-flit data packets for transferring the contents of 64-byte and 128-byte cache lines, respectively.
For the 2-hop and 3-color NoCs, the size of data flits depends on the link bandwidth. For example, consider a 5-flit packet in the baseline 1-hop NoC. With BW = 2.5 for the 2-hop and 3-color NoCs, the data lane is 2.5 × 6 = 15 VCSELs wide and data flits become 180 bits, making data packets 2 flits long and requiring 2 cycles to transmit. Table 5 lists the buffer sizes of the network routers.
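The flit arithmetic for each bandwidth ratio can be tabulated with a short sketch. It assumes 12 bits per VCSEL per cycle, a 72-bit baseline data flit, and that BW = 3.33 stands for a 20-VCSEL lane (the product is rounded to the nearest whole VCSEL); the function and constant names are illustrative.

```python
from fractions import Fraction
from typing import Tuple

BITS_PER_VCSEL_PER_CYCLE = 12   # 40 GHz VCSEL, 3.3 GHz core
BASE_KD = 6                     # data-lane width of the 1-hop baseline
BASE_FLIT_BITS = BASE_KD * BITS_PER_VCSEL_PER_CYCLE  # 72 bits
BASE_FLITS = {32: 5, 64: 10, 128: 20}  # cache line bytes -> baseline flits

def data_lane(bw: float, line_bytes: int) -> Tuple[int, int, Fraction]:
    """Return (lane width in VCSELs, flit size in bits, flits per packet)."""
    packet_bits = BASE_FLITS[line_bytes] * BASE_FLIT_BITS
    k_d = round(bw * BASE_KD)            # assumed rounding, e.g. 3.33 * 6 -> 20
    flit_bits = k_d * BITS_PER_VCSEL_PER_CYCLE
    return k_d, flit_bits, Fraction(packet_bits, flit_bits)

if __name__ == "__main__":
    for bw in (2, 2.5, 3.33):
        for line in (32, 64, 128):
            k_d, flit_bits, flits = data_lane(bw, line)
            whole = flits.denominator == 1
            print(bw, line, k_d, flit_bits, flits, "ok" if whole else "non-integral")
```

Under these assumptions, only the 32-byte line with BW = 2 or BW = 3.33 yields a fractional number of flits per packet.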
We evaluate the average data packet latency, execution time, and overall energy consumed in buffers and in data transfer across links (i.e., link energy), using the power parameters in Table 4 to compute link energy. Our proposed network configurations impact link and buffer energy in several ways: 1. Compared to the baseline 1-hop NoC, our networks reduce the overall number of VCSELs (Table 1), which contributes to reducing total link energy. 2. However, wider links require more PDs, which can have the opposite effect of increasing total link energy. 3. Execution time also affects energy consumption; in most cases, as we show below, we observe reduced execution time, which contributes to reducing the overall link energy. 4. Similarly, the network topology impacts the utilization of router buffers: the more network links are available, the more one-hop paths there are in the network and the less time packets spend in router buffers, and vice versa.
Note that we use the optimization of shifting the Y-direction transmission slot by 2 cycles with the 2-hop NoC throughout the results, provided the transmission slot is longer than 2 cycles, since we found that it consistently improves this NoC's performance.
Impact of Different Cache Line Sizes and Different Request Injection Rates
In this evaluation we fix the CMP size to 10x10; the following subsection presents a scalability study on 8x8, 10x10, and 12x12 CMPs. Note that for BW = 2 and 3.33, we do not simulate 2-hop and 3-color systems with 32-byte cache lines, since in these cases the data packet is not divisible into an integral number of flits.
Network Latency: As can be seen in Fig. 6 , with increased injection rates, the average packet latency increases more quickly with the 2-hop NoC compared to the 3-color NoC. Increased injection rates are accompanied by increased packet collisions and retransmissions. However, the additional links connecting pairs of nodes directly in the 3-color NoC, compared to the 2-hop NoC, provide two advantages: 1) Reduced probability of collision, and 2) Higher probability of packets traversing only one link to reach their destinations.
Speedup: Speedup is computed based on execution time or the time until the delivery of the last packet. Lower network latency contributes to faster execution. The network latency's contribution to the execution time is limited by the C2C ratio. Low injection rates mean low C2C, resulting in only modest speedups, as can be seen in Fig. 7 . Conversely, high C2C ratios increase the likelihood of collisions, causing retransmissions and possibly increasing network delays. This is why the speedup of the 2-hop NoC peaks with r < 10 for BW = 2 and 2.5, while the wider links of BW = 3.33 allow the speedup of the 2-hop to continue to improve even with r = 10. On the other hand, the 3-color NoC with its more direct links is able to continue to show greater speedups with even larger cache lines and injection rates for all considered link bandwidths.
Link and Buffer Energy: In almost all configurations, the 2-hop network consumes less total link energy than both the 3-color and the 1-hop NoCs. In Fig. 8 we show the normalized link and buffer energy consumption. With respect to buffer energy, the 3-color NoC consistently consumes the least energy, since packets spend little time in the network buffers thanks to the availability of many wide direct links. Conversely, buffer utilization is higher in the 2-hop NoC. We note in Fig. 8 that for 32-byte cache lines, the 2-hop NoC consumes even more buffer energy than the 1-hop NoC. In the configurations using 32-byte cache lines, the reduction in average packet latency (Fig. 6) is not enough to compensate for the energy consumed by the wider buffer slots (compared to the 1-hop NoC).
[Figure: Normalized Average Packet Latency Relative to 1-hop NoC, comparing the 2-hop and 3-color NoCs]
Scalability
Area Requirements: We start by comparing the area requirements of the 2-hop and 3-color NoCs to that of the 1-hop NoC. Table 6 lists the area savings of the optical transmitters and receivers of the 2-hop and 3-color NoCs for the CMP sizes and link bandwidths we consider. For each pair of BW and CMP size, the table lists three values of area savings for three different PD sizes relative to a VCSEL: 1x, if the PD is constructed from a modified VCSEL of similar size [7], and 5x and 10x, for a photodiode whose area is reported to be around 5-10x larger than that of a VCSEL [2, 21]. The area savings increase with the number of cores, since the network resources of the 2-hop and 3-color NoCs scale at a lower rate than the resources of the 1-hop NoC. Notice that for BW = 3.33, the 3-color NoC needs a significant amount of additional area, and thus is excluded from further evaluations. For the remainder of the scalability study we choose the following subset of parameters: configurations with 64-byte cache lines, since this is the most common size on modern processors (for example [3, 5]); BW = 2 and 2.5; and r = 1 and 5, to represent both low and high request injection rates.
Network Latency: All 3 NoCs experience small increases in average data packet latency with larger CMPs. The 2-hop NoC experiences a greater increase in packet latency than the 3-color NoC, as can be seen in Fig. 9, due to its fewer overall links. Comparing the normalized average packet latency on the 8x8 and 12x12 CMPs, we observed an increase of about 3.5% and 0.5% with the 2-hop and 3-color NoCs, respectively, for both BW = 2 and BW = 2.5.
Speedup: For BW = 2.5, we observed that the speedups of the different systems remain almost the same for the 3 CMP sizes.
Link and Buffer Energy: With larger numbers of cores, more energy savings are achieved by the 2-hop and 3-color NoCs relative to the corresponding 1-hop NoC. In Fig. 11 we plot the normalized link and buffer energy consumption for BW = 2.5. We observed additional link-energy savings of 20% and 15.6% for the 2-hop and 3-color NoCs, respectively, on the 12x12 CMP compared to their energy savings on the 8x8 CMP, while their respective buffer-energy consumption increased by 5.2% and 0.9% relative to the 1-hop NoC. As mentioned above, the network resources of the 2-hop and 3-color NoCs scale with the number of cores at a lower rate than the resources of the 1-hop NoC, which allows more link-energy savings but increases the utilization of the network buffers.
Evaluating with benchmarks
After exploring the impact of the different design parameters above, we evaluate the 2-hop and 3-color networks on an 8x8 CMP whose parameters are given in Table 5. We evaluate the 2-hop and 3-color NoCs with BW = 2 and 2.5.
Speedup: For BW = 2, the geometric mean of the speedup of the 2-hop and 3-color NoCs is 7.6% and 8.4%, respectively, compared to the 1-hop NoC, while with BW = 2.5, their respective speedups are 8.4% and 9.2%.
Link and Buffer Energy: As shown in Figs. 12 and 13, for BW = 2, the 2-hop and 3-color NoCs consume on average 63.3% and 75.2% of the total link energy of the 1-hop NoC, respectively, and on average 85.7% and 86.1% of the total buffer energy of the 1-hop NoC, respectively. For BW = 2.5, the 2-hop and 3-color NoCs consume on average 68.6% and 83.4% of the total link energy of the 1-hop NoC, respectively, and on average 72.4% and 54% of the total buffer energy of the 1-hop NoC, respectively.
These results show that for BW = 2, there is little difference between the 2-hop and 3-color configurations with respect to performance and network energy, which makes the 2-hop NoC more attractive since it is simpler and requires less area to realize. For BW = 2.5, the performance of both the 2-hop and 3-color systems improved by 0.8%. However, the 3-color system reduced buffer-energy consumption considerably, but at the expense of additional area.
CONCLUSION
With the scaling of CMPs to hundreds and thousands of cores, scalable and efficient intra-chip communication becomes more critical to performance. The ideal one-hop, all-to-all connectivity that may be achieved through free-space optics comes at a high cost in hardware and energy resources. Through careful allocation of resources, we propose alternative 2-hop strategies. We explore the tradeoffs of performance, network latency, and resource usage through extensive evaluation with a variety of cache line sizes, link bandwidths, network loads, and CMP sizes. We find that our proposed 2-hop strategies are not only more scalable but also better performing than the 1-hop network. In our future work we plan to study and improve the scalability of both the data and narrow meta lanes.
