ABSTRACT The network-on-chip (NoC) has been introduced as an efficient communication backbone to tackle the increasing challenges of on-chip communication. Nevertheless, a purely metal-based NoC implementation offers only limited performance and power scalability for multicast and broadcast traffic. To meet scalability demands, this paper addresses the system-level challenges of intra-chip multicast communication in a proposed hybrid interconnect architecture. This hybrid NoC combines regular metal on-chip interconnects with a new type of wireless NoC (WiNoC), the Zenneck surface wave interconnect (SWI). Moreover, this paper embeds novel multicast routing and arbitration schemes to address system-level multicast challenges in the proposed architecture. Specifically, a design exploration of contention handling in the SWI layer is considered in both centralized and decentralized manners. Consequently, the hybrid wire-SWI architecture avoids overloading the network, alleviates the formation of traffic hotspots, and avoids the deadlocks that are typically associated with state-of-the-art multicast handling. The evaluation is based on cycle-accurate simulation and a hardware description. It demonstrates the effectiveness of the proposed architecture in terms of power consumption (up to around 10x) and performance (around 22x) compared to regular NoCs. These results are achieved with negligible hardware overheads. This study explores the promising potential of the proposed architecture for current and future NoC-based many-core processors.
INDEX TERMS Networks-on-chip, surface-wave, chip multiprocessors, on-chip interconnects, multicast routing, arbitration
I. INTRODUCTION
Multi/many-core processors have been introduced to provide near-linear performance improvements as complexity increases (overcoming Pollack's rule) while maintaining lower power and frequency budgets [1]. Consequently, the many-core era with hundreds of cores is upon us. However, good utilization of such many-core architectures is becoming a challenge, since the performance and power consumption of many-cores are bound by both the interconnect fabric and the cache coherence protocols. Although networks-on-chip (NoCs) have been adopted as a scalable underlying one-to-one communication structure [2], cache coherence protocols inject a non-trivial percentage of multicast/broadcast packets. This one-to-many (1-to-M) traffic ranges from 3 to 13 percent [3].
In the literature, NoCs conventionally treat 1-to-M traffic as repeated unicast traffic, which is referred to as software multicast. This basic handling of multicast increases NoC power consumption and congestion. Consequently, even small ratios of multicast (1-to-M) or broadcast (1-to-all) traffic have severe effects on the NoC, such as high latency and fast NoC saturation (see Figure 1). Many studies have proposed wire-based NoC schemes that support 1-to-M communication [3], [4]. However, these studies struggle to match wire latency and/or wire energy. This will not be sufficient given the wire issues in terms of latency and energy even for unicast communication [5].
As a result, many researchers are looking for alternative communication fabrics, such as radio frequency (RF)-based interconnects [6], [7], [8] and optical interconnects (ONoC) [9], [10]. However, although such interconnects seem promising, they might not be the ideal solution due to their complexity, incompatibility, power consumption and/or area overheads [5]. The Zenneck surface wave (SW) [11], [12] is an emerging wireless on-chip interconnect technology, which is exploited in this paper to mitigate global 1-to-M communication issues. The remarkable potential of the SW requires innovative designs at different levels of abstraction in order to be utilized for future on-chip interconnects. This paper investigates the potential merits of the SW in handling 1-to-M on-chip communication and the associated challenges at the network abstraction level. A preliminary version of this work has been published previously [13], [14]. This paper offers revised and improved designs in terms of multicast handling, along with a comparative evaluation of the proposed approaches. The major contributions of this paper are:
To develop a hybrid wire-surface wave interconnect (W-SWI) architecture that exploits surface wave features for 1-to-M traffic handling. Moreover, the SW features and challenges are analysed and discussed compared to emerging state-of-the-art interconnects.
To propose arbitration mechanisms, a routing scheme and communication protocols for 1-to-M traffic that efficiently address multicast traffic and maximize W-SWI utilization. In particular, a design exploration of centralized and decentralized arbitration techniques demonstrates that they allow the concurrent utilization of many resources with relatively low circuit complexity and delay.
To evaluate rigorously the W-SWI for both synthetic traffic and real application benchmarks. The proposed architectures are found to surpass previous work, such as the regular mesh, by achieving improvements of about 22x in average delay and about 2-10x in power consumption with relatively insignificant additional hardware cost.
The rest of the paper is organized as follows: Section II provides an overview of the fan-out capability of emerging on-chip interconnects. Section III presents a hybrid wire-SW interconnect architecture, a multicast routing scheme, and a design exploration of SWI arbitration techniques. Section IV evaluates the performance, power consumption and area overhead of the platform. Finally, Section V draws the conclusions and suggests directions for future work.
II. BACKGROUND
A. FAN-OUT CHALLENGES AND EMERGING INTERCONNECTS
Power efficiency decreases significantly as the fan-out increases in cutting-edge emerging on-chip interconnects. First, optical-based multicast architectures vary in topology and in the on-chip devices that support them. For example, the tree topology requires splitters and combiners to fork and join the optical signals [15]. Another example is a bus-based topology that utilizes wavelength-division multiplexing (WDM) and a bank of microring modulators, which can be configured to listen to a selected channel [9]. However, all these architectures have limited fan-out capability because the optical signal decays after each forking or partial drop of the signal to a receiver node [9], [15]. The number of nodes that can receive the signal depends on the signal power budget, which is considered to be relatively high. Second, RF-based interconnects (WiNoC, RF-I, and SWI) appear to be a cost-effective alternative to optical interconnects, since the RF circuitry requires less complex implementation techniques and is less area- and power-hungry. In terms of RF-I multicast architectures, many designs have proposed a worm or cycle layout of transmission lines (TLs) that passes through all the nodes. This layout involves a set of challenges, such as nontrivial area overheads, signal decay and signal latency. In terms of the fan-out feature, to distribute the signal in RF-I, the worm or cycle layout of these thick wires should go through almost every tile in the chip [6], [7]. Although this adds nontrivial area overheads because of the large pitch of the TLs (width and spacing), the main issue is multicast scalability. This is due to the fact that this layout might mitigate but not eliminate the impedance discontinuity. As a result, with each drop point, the signal decays and the latency and signal reflections increase unless careful matching circuits are designed [16].
On the other hand, WiNoCs have a naturally scalable fan-out capability, which makes them preferable for 1-to-M enabled interconnect architectures. As a result, WiNoCs have been suggested for multi/many-core processors with multicast requirements [17]. However, the WiNoC fan-out capability is limited by the antenna radiation pattern and the coverage distance. The high power dissipation of the RF signal in free-space propagation leads to a low ratio of coverage distance to power. Therefore, the transceiver power amplifier and the antenna design should take into consideration the required distance and the directions of the destinations. For instance, some studies have proposed run-time tunable transmitting power based on the required destination [18]. The SWI is a new type of WiNoC that is discussed in the next section.
B. ZENNECK SURFACE WAVE INTERCONNECT (SWI)
The Zenneck surface wave is an inhomogeneous 2D electromagnetic (EM) wave supported by a surface. The designed surface is a waveguide that traps the EM wave in a two-dimensional medium instead of three-dimensional free space. As a result, the electric-field decay rate in the SW from the source horizontally along the boundary is around $1/\sqrt{d}$, as shown in Figure 2, where $d$ is the distance from the source [12]. Thus, since the SW signal is transmitted in all directions, the SWI offers naturally efficient fan-out features compared to other interconnects [19]. In addition, this low power dissipation allows the SWI to offer relatively linear J/bit over this short distance compared to the high scaling of regular global buffered wire interconnects. The surface should be engineered by altering its dimensions, and the materials of the conductor and/or dielectric are chosen so that the characteristic impedance ($Z_0$) will be around $(10 + j300)\,\Omega$. Thus, the surface medium can consist of either a dielectric-coated conductor layer or a corrugated conductor surface [11], [12]. For low fabrication costs and simple geometry, the dielectric-coated surface is preferable. The integrated surface can be realised using either silicon dioxide (SiO$_2$, $\varepsilon_r = 3.9$) or ceramic (Al$_2$O$_3$, $\varepsilon_r = 9.8$) on a metal ground plane of thickness 1 mm. In the case of millimetre-wave applications at 60 GHz or above, the thickness of the dielectric layer for silicon dioxide and ceramic will be 0.8 and 0.7 mm respectively, and the coating process can be integrated with a conventional semiconductor fabrication process. In addition, because the surface roughness of the dielectric layer will not be an issue when the operating frequency of the system is less than 300 GHz, no expensive highly polished wafer (Ra < 0.01 mm) is required. As a result, the cost of the additional process can be neglected.
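To give a feel for the decay rates discussed above, the following minimal sketch compares a normalized field that falls off as 1/sqrt(d) with one that falls off as 1/d. The 1/d curve is the usual far-field free-space assumption and is introduced here only for comparison; the function names, the unit reference field and the sample distances are illustrative assumptions, and material losses are ignored.

```python
import math

def surface_wave_field(d, e0=1.0):
    """Normalized E-field of a 2D guided wave at distance d (~1/sqrt(d))."""
    return e0 / math.sqrt(d)

def free_space_field(d, e0=1.0):
    """Normalized far-field E-field in 3D free space at distance d (~1/d).
    This model is an assumption used only for comparison."""
    return e0 / d

# Relative advantage of the surface wave at a few normalized distances.
distances = [1, 4, 16]
ratios = [surface_wave_field(d) / free_space_field(d) for d in distances]

for d, r in zip(distances, ratios):
    print(f"d = {d:2d}: surface-wave field is {r:.1f}x the free-space field")
```

Under these assumptions the advantage grows as sqrt(d), which is one way to read the fan-out benefit claimed for the SWI over distance.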
Laboratory experiments [11] show that the frequency bandwidth is limited only by the transceiver. However, integrated transceiver carrier frequencies are continuously scaling with the switching speed of CMOS technology. This range of frequencies is necessary to allow a multi-channel realization based on frequency-division multiple access (FDMA) on this shared medium, with the necessary frequency spacing to avoid channel interference. Thus, an integrated transceiver can be designed for a waveguided signal, such as the one proposed and implemented by Chang et al. [20] or Carpenter et al. [7]. The communication channel is designed so that each channel has 32 sub-channels, where each sub-channel transmits a nibble (4 bits) after it has been modulated using 16-QAM (quadrature amplitude modulation). In this way, the channel matches the data bandwidth of the baseline architecture wire link (see Section IV). The communication channel specifications are discussed in detail in previous work [21].
In addition, maximum transmission into the Zenneck surface wave occurs when the incoming wave is incident at or close to the Brewster angle, where reflections are minimized. Therefore, a transducer linked to the transceiver needs to be integrated to launch the wave signal into the surface [11], as shown in Figure 3. For omni-directional transmission, this can be as simple as a coaxial-to-waveguide flange, as described elsewhere [12]. It could also be a dipole or monopole for omni-directional communication, with a parallel-plate waveguide [22]. The transducer layer can be fabricated separately, and then flip-chip bonding and the through-silicon-via (TSV) [23] technique are used to connect it to the integrated transceiver. The transducer and transceiver design is beyond the scope of this paper.
III. PROPOSED SURFACE-WAVE-BASED MULTICAST ARCHITECTURES
A. HYBRID WIRE AND SURFACE-WAVE INTERCONNECT ARCHITECTURE
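The channel organization above can be checked with simple arithmetic. The sketch below only multiplies out the figures given in the text (32 sub-channels, 16-QAM); the variable names are illustrative, and the assumption that one symbol is sent per sub-channel per transfer is ours, not the paper's.

```python
import math

SUB_CHANNELS = 32   # sub-channels per SWI channel (from the text)
QAM_ORDER = 16      # 16-QAM constellation size (from the text)

# A 16-point constellation carries log2(16) = 4 bits per symbol (one nibble).
bits_per_symbol = int(math.log2(QAM_ORDER))

# Assumption: every sub-channel carries one symbol per transfer.
channel_bits = SUB_CHANNELS * bits_per_symbol

print(f"{bits_per_symbol} bits/symbol x {SUB_CHANNELS} sub-channels = {channel_bits} bits")
```

Under that assumption, one transfer on a channel moves 128 bits in parallel, which is the width the wire link would need for the two to match.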
The SWI has significant advantages, especially in terms of fan-out, as mentioned earlier. However, as with all RF-based interconnects, it suffers from limitations in terms of a congested shared medium and a limited range of frequencies. These make it infeasible to completely replace metal wire interconnects in the near future. Moreover, in terms of wire-based interconnects, local communication seems to scale well with technology scaling, unlike global communication [5]. In addition, this type of interconnect has the cheapest implementation cost compared to other fabrics. Therefore, the best solution would be to combine both metal and SWI in hybrid wire-SW interconnects in a multi-layered network architecture, in short W-SWI, as shown in Figure 4. The first layer is a regular mesh topology, since the mesh is preferable for a general-purpose interconnect architecture, is suitable for chip floor planning, and has uniform, manageable wire lengths. The second layer is the surface wave bus topology. Thus, this architecture offers a natural fan-out feature, which is lost when the interconnect system changes from the bus to the NoC. In order to preserve the fan-out feature, all the routers in the NoC are designed to receive information through the SWI. On the other hand, even though enabling all nodes to have transmission capability would increase connectivity, it would also increase the contention on the SWI layer as the NoC size increases. In addition, the volume of multicast communication is relatively low but has a dramatic effect on NoC performance, as mentioned earlier. Therefore, fewer nodes are selected to have the transmission capability, in order to reduce the circuit overhead and comply with the available frequency bandwidth. These nodes will be referred to as masters, while the rest are referred to as slaves.
FIGURE 2. Zenneck surface wave signal decay, which is significantly better than wireless free space signal decay [12].
These slaves can only receive data, but they may transmit some control signals (see Section III-E). Masters are distributed so that the average hop count (Manhattan distance) from all slaves to the nearest master is at a minimum. Such placement of the master nodes reduces the average hop count of the overall on-chip network. In addition, it allows each master to be accessible with a minimum number of hops via wires and routers for critical traffic such as the 1-to-M traffic.
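The placement criterion above (minimum average Manhattan distance from slaves to the nearest master) can be illustrated with an exhaustive search on a small mesh. This is only a sketch: the paper does not state how the placement is computed, the function names are hypothetical, and brute force is practical only for small meshes and few masters.

```python
from itertools import combinations

def avg_hops_to_nearest_master(nodes, masters):
    """Average Manhattan distance from every slave to its nearest master."""
    slaves = [n for n in nodes if n not in masters]
    total = sum(
        min(abs(x - mx) + abs(y - my) for mx, my in masters)
        for x, y in slaves
    )
    return total / len(slaves)

def place_masters(width, height, num_masters):
    """Exhaustively pick the master set with the lowest average hop count.
    Illustrative only; not the placement method used in the paper."""
    nodes = [(x, y) for x in range(width) for y in range(height)]
    return min(
        combinations(nodes, num_masters),
        key=lambda m: avg_hops_to_nearest_master(nodes, m),
    )

# 3x3 mesh with a single master: the centre tile is the best location.
nodes_3x3 = [(x, y) for x in range(3) for y in range(3)]
best = place_masters(3, 3, 1)
best_cost = avg_hops_to_nearest_master(nodes_3x3, best)
corner_cost = avg_hops_to_nearest_master(nodes_3x3, ((0, 0),))
print(best, best_cost, corner_cost)
```

For the 3x3 case the centre tile gives an average of 1.5 hops, against 2.25 hops for a corner tile.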
The wire-based NoC mesh topology with software-multicast is the baseline architecture, and we will refer to it as the MeshS. For the W-SWI architecture proposed in this study, a sixth port needs to be added to the router along with all related control circuits. Also the crossbar switch size needs to be adjusted according to the new requirements. The new data path port is linked to a transceiver for master nodes or receiver for slave nodes.
B. HYBRID MULTICAST ROUTING SCHEME
Multicast traffic benefits from the characteristics of the SWI and the architectural layout, as mentioned in previous sections. However, a smart routing technique is required to direct and deliver multicast traffic to its final destinations and to maximise the benefits at minimum cost. In this work, the routing for the proposed architecture is an improved tree-based scheme where the embedded tree path forks at one point, specifically the nearest master. Therefore, the maximum degree of the routing graph is up to $(N - 1)$, hence the need for the SW fan-out feature. The nearest master then concurrently delivers the flit to all the addressed leaves (slaves) in one hop via the SWI, as illustrated in Figure 5. This provides higher efficiency in handling 1-to-M traffic, for the following reasons. First, each node simply needs to direct the 1-to-M traffic to the nearest master using any routing algorithm (we used a simple partially adaptive algorithm called odd-even, since it offers path diversity [24]). Hence, there is no need for extra circuitry or complicated algorithms to build the multicast tree path and to determine the forking points. Second, because there is one forking point in the routing path (the nearest master), packets are replicated only at the destination routers. This reduces power consumption by eliminating the need for duplicated traffic to travel through costly (power-hungry and already loaded) intermediate wires and routers.
In order to direct the packets to the destinations, each multicast packet header must have multicast-address-bits (MAB). This header field is a bit vector where each bit represents a node, and a bit is set if the corresponding node is a member of the multicast group. Since multicast traffic may be generated in a slave node, if the nearest master is part of the multicast group, it must transmit the flit via the SWI first, before the flit is drained by the local PE. Lastly, although this approach is simple and efficient, 1-to-M routing is a major topic and many novel approaches can be further explored. However, this is beyond the scope of this paper.
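A rough way to see the saving from a single forking point is to count link traversals. The sketch below compares software multicast (one unicast per destination) with the hybrid scheme (wire hops to the nearest master plus one SWI hop). It assumes minimal routing so that wire hops equal the Manhattan distance, and it ignores contention and arbitration delay; the node coordinates and function names are made up for illustration.

```python
def manhattan(a, b):
    """Minimal wire hop count between two mesh tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def software_multicast_hops(src, dests):
    """Repeated unicast: every destination is reached by its own packet."""
    return sum(manhattan(src, d) for d in dests)

def hybrid_multicast_hops(src, masters):
    """Hybrid scheme: wire hops to the nearest master, then one SWI hop."""
    nearest = min(masters, key=lambda m: manhattan(src, m))
    return manhattan(src, nearest) + 1

# Hypothetical 4x4 mesh example.
src = (0, 0)
masters = [(1, 1), (2, 2)]
dests = [(3, 3), (3, 0), (0, 3), (2, 2)]

sw_hops = software_multicast_hops(src, dests)
hy_hops = hybrid_multicast_hops(src, masters)
print(f"software multicast: {sw_hops} hops, hybrid: {hy_hops} hops")
```

In this example the hybrid count does not depend on the number of destinations, which is the property the single forking point is meant to provide.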
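The MAB field described above is a plain bit vector, so it can be sketched with integer bit operations. The bit ordering (bit i stands for node i) and the helper names are assumptions made for this example; the paper does not specify the encoding.

```python
def encode_mab(dest_ids, num_nodes):
    """Build a multicast-address bit vector; bit i is set if node i is a
    member of the multicast group (bit order is an assumption)."""
    mab = 0
    for d in dest_ids:
        if not 0 <= d < num_nodes:
            raise ValueError(f"node {d} is outside the {num_nodes}-node NoC")
        mab |= 1 << d
    return mab

def decode_mab(mab, num_nodes):
    """Return the list of node IDs addressed by a MAB."""
    return [i for i in range(num_nodes) if (mab >> i) & 1]

# An 8-node example addressing nodes 1, 3 and 6.
mab = encode_mab([1, 3, 6], num_nodes=8)
members = decode_mab(mab, num_nodes=8)
print(f"MAB = {mab:08b}, members = {members}")
```

A receiving router would only need to test its own bit, for example `(mab >> node_id) & 1`, to decide whether to accept the flit.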
C. MULTICAST CHALLENGES FACING THE PROPOSED ARCHITECTURE
The SWI layer can be represented logically as a multi-bus topology with multiple master nodes with Tx/Rx capability and multiple slave nodes with Rx capability. Each master has its own dedicated physical bus (frequency channels). However, each slave can receive from only one master at a time, which creates competition between masters. This competition escalates as the average number of destinations of the multicast flows and the number of masters increase. As a result of this competition, two scenarios might develop: channel starvation and a multicast dependency deadlock. The first issue results when one or more masters win the allocation of slaves repeatedly while other master nodes are waiting. The second scenario can be explained with the example shown in Figure 6. This figure demonstrates the deadlock problem resulting from multicast dependencies between two masters for $D_3$. This scenario causes a deadlock, since each master will not release its allocated slaves until it has delivered the rest of the packet flits. For unicast traffic, the SWI utilization problem is master-centric, where the utilization of the masters' transmitters determines the performance. However, for multicast, the problem becomes slave-centric, where the utilization of the slaves' receivers determines the performance. Therefore, the use of virtual channels (VCs) for the SWI (VSWI) might offer the best utilization of the SWI, since it enables slaves to listen virtually to more than one master. In this way, if a master is waiting for a message to be delivered to the rest of the slaves, or this message needs to be drained locally by the same master, idle reserved slaves are not prevented from accepting traffic from another master. This complicates the allocation problem, which becomes finding a legal match in three dimensions (masters $\times$ VSWI $\times$ slaves) so that no two masters are allocated the same VSWI for the same slaves simultaneously.
Therefore, to offer fair deadlock-free arbitration while efficiently utilising the W-SWI, a set of solutions is proposed in the following sections.
D. DEADLOCK-FREE CENTRALIZED CHANNEL ALLOCATION
This section presents the design of the global multi-resource arbiter (GMA) and its rationale in addressing the contention problems mentioned above with a centralized approach. The resulting hybrid architecture with the GMA will be referred to as W-SWI-C. To avoid the multicast cycle dependency scenario mentioned earlier, slaves can be allocated as a group. This can be achieved in the arbitration request masking stage by using the MAB-Check unit. This unit compares each request with the content of the GMA reservation table and does not validate a request from any master unless all the slaves that have been requested are free. In this way, the arbitration problem is also reduced from the three-dimensional matching of masters $\times$ VSWI $\times$ slaves to the two-dimensional matching of masters $\times$ VSWI.
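The group-allocation rule enforced by the MAB-Check unit can be expressed as a single bitwise test. The sketch below is a software illustration only; the function names and the use of one integer bit mask for the reserved slaves of a VSWI are assumptions, not the paper's circuit.

```python
def mab_check(request_mab, reserved_mab):
    """A request is valid only if none of the requested slaves is already
    reserved, i.e. the whole multicast group can be granted at once."""
    return (request_mab & reserved_mab) == 0

def reserve(request_mab, reserved_mab):
    """Mark all requested slaves as reserved (call only after mab_check)."""
    return reserved_mab | request_mab

def release(request_mab, reserved_mab):
    """Free all slaves of a finished multicast group."""
    return reserved_mab & ~request_mab

# Slave 2 is currently reserved on this (hypothetical) VSWI.
reserved = 0b0100
ok_request = 0b0011       # slaves 0 and 1: all free, request is validated
blocked_request = 0b0110  # slaves 1 and 2: slave 2 is busy, request is masked

print(mab_check(ok_request, reserved), mab_check(blocked_request, reserved))
```

Because a group is either granted as a whole or not at all, no master can hold part of a group while waiting for the rest, which is what removes the circular wait.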
The crucial feature in multi-resource allocation is legal matching, where an output is assigned to one input and vice versa. Moreover, in order to minimise the decision latency, high arbitration parallelism is required. These two features can be achieved between a vector of inputs, which represents the masters, and a vector of outputs, which represents the VSWIs, by using two sets of arbitration: one for the inputs and one for the outputs. However, this is likely to lead to a poor legal matching or minimum matching, where fewer resources than optimally possible are allocated. Optimum legal matching (maximum matching) can be achieved by adopting a lonely (or least-requested) output allocator (LOA), which introduces one more stage before input arbitration [2]. This extra stage counts the number of valid requests for each output in order to detect the level of competition over each output. Then, the less popular output is given higher priority in the next arbitration stage. This should minimize conflict and produce maximum matching whenever possible. Figure 7b shows the structure of the proposed GMA, which achieves the best legal match in two cycles (given that the requested resources are free) and with remarkably low circuit complexity. When requests are received from masters via the SWI, they are demodulated and the request data are extracted, such as the requested destination(s) (MAB), a time-stamp and a source ID. The first stage is request masking, which checks whether the master's request is possible for any of the VSWIs by comparing its MAB with the already reserved resources in the reservation table. The next three stages (2-4) represent the LOA, which achieves the maximum matching of master-to-VSWI by minimising conflict between master requests over VSWIs. This is accomplished in stage 2 by counting the valid requests for each VSWI and then generating the priority signal that prioritizes a VSWI that is subject to less competition. Afterwards, in stage 3, each request elects one of the VSWIs to compete over. The elected VSWI should be subject to less competition and have the requested slaves currently free. The final stage of the LOA is stage 4, where the oldest request competing for each VSWI is the winner out of the comparator tree arbiters. The LOA is followed by stage 5, where the winning request from the earlier stages is stored in a reservation table. The size of this table is proportional to the NoC size, the number of master nodes (SWI channels), and the number of VSWIs.
FIGURE 7b. Design of the proposed global multi-resource arbiter (GMA) for SWI channels: stage (1) request masking; stages (2-4) achieve legal match with lonely output allocator [2]; stage (5) generates the grant signals for a fixed period. The figure also shows an example of GMA stages 1-4 with four masters and two VCs. Master ($M_4$) related logic is not drawn, for simplicity, but it is currently allocating some of the slaves requested by $M_2$.
VOLUME 6, NO. 3, JULY-SEPT. 2018
The final stage, stage 6, represents physical channel allocation. This stage alternates the grant signal among subgroups of reserved slaves for a limited number of clock cycles, where each subgroup consists of the slaves reserved under the same VSWI. This stage utilizes the VSWIs to provide higher performance by allowing non-conflicting masters to transmit at the same time using their own channel frequencies. It consists mainly of a simple arbiter, such as a round-robin (RR) arbiter. The duration of the allocation can be tuned so that the allocation period is either (1) one cycle, (2) a fixed period, which would need a frequency divider, or (3) as long as the request is asserted, which would need a hold-release mechanism [2], in short, Hold. Tuning between these options mainly depends on the traffic pattern and system-level evaluation, as discussed in Section IV-A. In the final step, the output is stored in an allocation register and is transmitted as a grant signal through an SWI-specific frequency control channel. The time cost of the arbitration in the case of winning the arbitration is two clock cycles. Otherwise, the delay equals the arbitration cost plus a blocking period ($T_b$), which could be up to Packet size $\times N_v$, where $N_v$ is the number of VCSWs, if there is no congestion in the slaves.
To illustrate the GMA functionality, Figure 7b also shows an example of the first four stages of a GMA that serves four masters with two VSWIs. However, the logic related to the fourth master ($M_4$) is not shown, for simplicity, because it is currently merely reserving $S_3$ via VSWI1. Masters $M_1$, $M_2$ and $M_3$ have requested the slaves $\{S_1, S_2\}$, $\{S_1, S_2\}$ and $\{S_2, S_3\}$ respectively. Since $S_3$ is already reserved via VSWI1, E11 is the only signal deactivated (not coloured red) by the MAB-Check unit. In stage 2, therefore, the output priority (P) is given to the competition over the less-requested VSWI, which is VSWI1. Then, through the priority arbitration in stage 3, M1 and M3 compete over VSWI0 while M2 competes over VSWI1 alone; these are highlighted in blue. The winner of stage 4 (W1) is the master request with the oldest time-stamp (TimeS).
The communication protocol among masters, slaves and the GMA taking place at the SWI level is shown in Figure 7a. In order to utilize the limited available frequency bandwidth, the master interface sends a request on the same master data frequency channel (channel establishing phase). This request is identified by the header ID ($H_{ID}$), which distinguishes between request, release and data flits. In addition, the request packet consists of the data required for the arbitration process, such as the MAB, TimeS and Source ID ($S_{ID}$). When the arbiter grants the request, it generates two types of signals, the master and slave grant signals, which grant the requests for the next cycle. These signals consist of two parts: the grant/release bit (G/R) and the requested master number ($M_{ID}$), which informs the slave which channel it should listen to. After these signals are received, the data handshake phase starts. The master TxRx sends a data flit and waits for acknowledge signals from all the slaves in its multicast group. Thus, the rest of the control signals require a bandwidth of 104 bits (for 4 masters and 24 slaves).
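The first two GMA stages of this example (request masking and per-VSWI request counting) can be replayed in a few lines. The sketch below only models those two stages, using Python sets in place of the hardware bit vectors; stages 3 and 4 are not modelled, and the dictionary layout and variable names are assumptions made for illustration.

```python
# Slaves already reserved on each VSWI: M4 currently holds S3 via VSWI1.
reserved = {0: set(), 1: {"S3"}}

# Pending multicast requests from the example.
requests = {
    "M1": {"S1", "S2"},
    "M2": {"S1", "S2"},
    "M3": {"S2", "S3"},
}

# Stage 1 (request masking): a request is valid for a VSWI only if none of
# its slaves is reserved on that VSWI.
valid = {
    m: {v: group.isdisjoint(busy) for v, busy in reserved.items()}
    for m, group in requests.items()
}

# Stage 2: count the valid requests per VSWI and prioritise the VSWI that
# is subject to the least competition.
counts = {v: sum(valid[m][v] for m in requests) for v in reserved}
priority_vswi = min(counts, key=counts.get)

print(valid)
print(counts, "-> priority to VSWI", priority_vswi)
```

Only the request from M3 to VSWI1 is masked, so VSWI0 sees three valid requests and VSWI1 sees two, which reproduces the priority for VSWI1 stated in the example.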
E. EFFICIENT DECENTRALIZED RESOURCE ALLOCATION
This section proposes an alternative technique to handle the multicast challenges in the proposed architecture, which is stretched multicast. This approach enables a master to transmit a multicast flow to any number of currently free slaves and to retransmit it to the rest later. Although this means partial retransmission, it allows the concurrent execution of several overlapping multicast communications. Consequently, the decision should be made at the slave end in a decentralized manner. This can be realized by any simple, independent, fair arbiter in each slave. Round-robin (RR) has been chosen since it provides stronger fairness than rotating or random arbiters and requires less circuit complexity than matrix and queuing arbiters [2]. There are many possible scenarios where this scheme shows better contention handling, fairness, and higher SWI utilization. For instance, Figure 8 shows a comparison of the contention handling of W-SWI-C with two VCSWs and a decentralized architecture (W-SWI-D). Clearly, even though we assume that the W-SWI-C manages to allocate all $MG_i$ at $T_1$, it offers less fairness, and the multiplexing between flows is limited to the packet level. However, since masters can allocate a subgroup of destinations, this approach might cause the multicast dependency scenario (see Section III-C). In order to break multicast dependencies without the need to allocate all the requested slaves ($MG_i$) at a given time slot, each master should have its own virtual non-blocking channel for every slave. Thus, since each master already has its own physical channel frequency, virtually non-blocking channels can be achieved at the slave router by using statically allocated VCs, where each master transmits via one VC ($N_v = N_M$). However, we developed a more efficient solution in terms of power and area overhead, called ID-tagging-based flow control (Tag).
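A per-slave round-robin arbiter of the kind mentioned above can be sketched as follows. This is a behavioural illustration under our own naming, not the paper's circuit: the search for the next requester starts just after the last granted master, so a master that has just been served gets the lowest priority in the next slot.

```python
class RoundRobinArbiter:
    """Behavioural sketch of a round-robin arbiter local to one slave."""

    def __init__(self, num_masters):
        self.n = num_masters
        self.last = num_masters - 1  # master 0 has top priority initially

    def grant(self, requests):
        """requests[i] is True if master i requests this slave.
        Returns the granted master index, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None

# Four masters all requesting the same slave for five consecutive slots.
arb = RoundRobinArbiter(4)
all_busy = [arb.grant([True] * 4) for _ in range(5)]

# Only masters 0 and 2 request: grants alternate between them.
arb2 = RoundRobinArbiter(4)
sparse = [arb2.grant([True, False, True, False]) for _ in range(3)]

print(all_busy, sparse)
```

With every master requesting, each one is served exactly once every four slots, which is the bounded wait that the fairness argument relies on.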
This technique simply consists of tagging each flit ($F$) with the transmitting master's ID ($M_x$) so that $Tag = M_x$, where $Tag, M_x \in \{1, \dots, N_M\}$. Then, in the reservation table (RT), the allocation entry is distinguished based on the input buffer ($i$), the Tag and the input VC ($V_i$, if the design includes VCSWs). For simplicity, the ID-tag remains unchanged in the router output port while the flit is being drained to the local processing element (PE). Figure 9 shows an example of a router micro-architecture providing two virtual non-blocking channels by using either ID-tagging or permanently allocated VCs. ID-tagging allows designers to choose a design without virtual channels, which requires less area and power overhead. For this example, if buffer resizing is limited to the router port linked to the SWI, then our calculations estimate a reduction in area of about 2 percent and in power of about 5 percent by using ID-tagging. Moreover, ID-tagging gives the freedom to use the VCs for other purposes, such as multiplexing flows from the same master (VCSW) in cases of congestion.
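The effect of keying the reservation table on the tag can be shown with a small dictionary-based model. Everything in this sketch (class name, method names, port labels) is hypothetical; it only illustrates that flows from different masters arriving on the same SWI input port get separate entries, so one does not block the other.

```python
class TaggedReservationTable:
    """Toy reservation table keyed by (input buffer, tag, input VC).
    Hypothetical model for illustration; not the paper's micro-architecture."""

    def __init__(self):
        self.entries = {}

    def reserve(self, in_port, tag, out_port, in_vc=0):
        """Reserve an output for a flow; fails only if the same
        (input, tag, VC) flow already holds an entry."""
        key = (in_port, tag, in_vc)
        if key in self.entries:
            return False
        self.entries[key] = out_port
        return True

    def lookup(self, in_port, tag, in_vc=0):
        return self.entries.get((in_port, tag, in_vc))

    def release(self, in_port, tag, in_vc=0):
        self.entries.pop((in_port, tag, in_vc), None)

rt = TaggedReservationTable()
# Two masters (tags 1 and 2) deliver to the same slave through its SWI port.
first = rt.reserve("SWI", tag=1, out_port="LOCAL")
second = rt.reserve("SWI", tag=2, out_port="LOCAL")
# A second packet from master 1 must wait until its first entry is released.
duplicate = rt.reserve("SWI", tag=1, out_port="LOCAL")
print(first, second, duplicate)
```

Without the tag in the key, the second reservation would collide with the first, which is the blocking that statically allocated VCs would otherwise have to prevent.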
To prove that this scheme breaks the multicast dependencies on the SWI layer, assume that we have requests (Rq i;j 2 RV i for a M i to any slave (S j2f1;...;Ng ). These requests should be granted within a finite time (T f ). This is to prove, that for all Rq i;j ¼ 1, then Gr i;j ¼ 1 within T f , where Gr i;j is a grant signal. The probability that a slave local arbiter (RR) grants a master's request at the next time slot T x is P T x ðS j Þ ¼ 1 N M . In the worst case scenario, M i has just been granted in T xÀ1 and all masters are requesting S j at T x . Thus, M i has to wait until the RR arbiter of S j grants all other masters. Assuming the average delay to serve each master request is T D . Therefore, the maximum waiting period is ¼ N M Â T D and thus P T xþN M ÂT D ðjÞ ¼ 1. Therefore, the serving time ST i needed for a flit to be delivered to all slaves and ejected from the output buffer in M i is: The communication protocol for the decentralized approach has fewer control signal types than that in the centralized approach, but with slight composite interface procedure. Algorithm 1 shows the procedure used for master and slave interfaces. A master (M i ) interface sends a RV i via SWI that consists of a request (Rq i;j ) to each slave (S j ) wherever MABðjÞ ¼ 1; see lines 1 to 7. At the slave end, the local RR arbiter will determine which master it will listen to in the next time slot and send the Gr i;j signal to, see lines 4 to 7. Both 363 Rq i;j and Gr i;j are transmitted via SWI using On-off keying (OOK) modelling for simplicity. When the RR grants the request, the data handshake phase starts by sending the data flit to all slaves who responded to the request and waits for them to acknowledge reception before resetting the requests intended for them. The adopted handshaking protocol is a non-return-to-zero protocol [25] . The master interface keeps track of which slave has been served by updating the MAB of the current flit, as shown in line 10. 
Then, if all $MG_i$ members have received the multicast flit, it will be ejected and a new set of requests might be sent based on the next flit's MAB.
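The decentralized request/grant procedure described above can be sketched as a small slot-based simulation. This is a simplified model under stated assumptions, not the authors' Algorithm 1: each grant plus data handshake is assumed to take exactly one time slot, master IDs are assumed to be 0 to $N_M - 1$, and the names `RoundRobinArbiter` and `serve_flits` are hypothetical.

```python
from typing import Dict, Set


class RoundRobinArbiter:
    """Per-slave round-robin arbiter over master IDs 0..num_masters-1."""

    def __init__(self, num_masters: int) -> None:
        self.num_masters = num_masters
        self.last = num_masters - 1  # so that master 0 has priority first

    def grant(self, requesters: Set[int]) -> int:
        """Return the granted master ID, or -1 if nobody is requesting."""
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last + offset) % self.num_masters
            if candidate in requesters:
                self.last = candidate
                return candidate
        return -1


def serve_flits(mabs: Dict[int, Set[int]], num_slaves: int) -> Dict[int, int]:
    """Simulate one flit per master; return the slot in which each is ejected.

    mabs maps a master ID to the set of slave IDs still to be served (its MAB).
    """
    num_masters = len(mabs)
    pending = {m: set(s) for m, s in mabs.items()}
    arbiters = [RoundRobinArbiter(num_masters) for _ in range(num_slaves)]
    done: Dict[int, int] = {}
    slot = 0
    while len(done) < num_masters:
        slot += 1
        grants = []
        # Each slave independently grants one of the masters requesting it.
        for s, arb in enumerate(arbiters):
            requesters = {m for m, mab in pending.items() if s in mab}
            g = arb.grant(requesters)
            if g >= 0:
                grants.append((g, s))
        # Granted slaves receive the flit and acknowledge; clear their MAB bits.
        for m, s in grants:
            pending[m].discard(s)
        # A flit is ejected once every member of its group has been served.
        for m in pending:
            if not pending[m] and m not in done:
                done[m] = slot
    return done


# Worst case: four masters all broadcasting to the same four slaves.
worst_case = serve_flits({m: {0, 1, 2, 3} for m in range(4)}, num_slaves=4)
```

With four masters contending for the same slaves, the last master is served in slot four, which matches the $N_M \times T_D$ bound on the waiting period for $T_D$ equal to one slot in this simplified model.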
F. THEORETICAL ANALYSIS AND COMPARISONS
The previous two sections presented the techniques to address SWI layer multicast challenges. In this section, these techniques are analytically compared and discussed to serve as ground for understanding the evaluation results in the following sections.
In the centralized approach, a master has to wait until all the requested slaves (the multicast group) are free in order to avoid deadlock. To express the problem in this approach mathematically, assume that a master ($M_i$) sends a request vector ($RV_i$) to the GMA to allocate a multicast group $MG_i \subseteq \{S_0, \ldots, S_{N-1}\}$, where $S_j$ is a slave and $N$ is the NoC size. The probability that a slave ($S_j \in MG_i$) is free in time slot $t$ is denoted by $P_t(S_j)$. Therefore, the probability that the $M_i$ request ($Rq_{i,j}$) will be granted is the intersection probability $P_t(S_x) \times P_t(S_y) \times \cdots \times P_t(S_z)$. This is clearly less than or equal to $P_t(S_j)$ for all $S_j \in MG_i$, which is the grant probability in the case of W-SWI-D. The centralized approach therefore keeps free requested slaves idle until all the members of $MG_i$ are free.
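The inequality above can be checked numerically with a short sketch. The per-slave probabilities used here are made-up illustrative values, the function name is hypothetical, and the product form assumes that the slaves are free independently of each other, as the intersection expression in the text implies.

```python
from math import prod
from typing import Sequence


def centralized_grant_probability(p_free: Sequence[float]) -> float:
    """Probability that every slave in the multicast group is free at once.

    Assumes independent slaves, so the joint probability is the product.
    """
    return prod(p_free)


# Hypothetical example: a three-member group, each slave free half of the time.
p_free = [0.5, 0.5, 0.5]
p_group = centralized_grant_probability(p_free)  # 0.125

# In the decentralized case each slave grants on its own, so the per-slave
# probability P_t(S_j) applies; the joint probability never exceeds it.
all_bounded = all(p_group <= p for p in p_free)
```

The gap between the joint and the per-slave probabilities widens as the multicast group grows, which is the reason free slaves sit idle for longer in the centralized approach.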
In terms of blocking period, although the same flit might be retransmitted (up to $N_M$ times, where $N_M$ is the number of master nodes), the stretched multicast offers higher fairness and utilization of the SWI. This is because the reallocation of slaves on a per-flit basis prevents flows from blocking each other. In contrast, the blocking period ($Time_{block}$) in W-SWI-C can be predicted as in (2), where $N_v$ is the number of VCs, $Slice$ is the time slot per flit, and $Time_{congestion}$ is the duration of any congestion in the master (the channel owner) or the slave that would cause idle time slots. However, $Time_{congestion}$ might equal zero if there is no heavy load at either end. Thus, the stretched multicast improves fairness by reducing the blocking period between flows. However, the overhead of channel establishment per flit in the decentralized approach might outweigh the SWI utilization improvements. Table 2 summarizes the comparison of the main features of the centralized and decentralized approaches. These features help to explain some of the results in the next section.
IV. SYSTEM LEVEL EVALUATION AND DISCUSSION
This section presents results obtained from our cycle-accurate NoC simulator, which was built by modifying the existing Noxim simulator [26] to model the W-SWI-C, W-SWI-D, virtual circuit tree multicast (VCTM), and baseline (MeshS) architectures. MeshS in this paper refers to a wire-based regular mesh NoC that manages the 1-to-M traffic as a software multicast. The Intel single-chip cloud computer (SCC) [27] is adopted as the baseline architecture. This chip is designed for performance-critical many-cores, which makes it well suited to the purpose of this study. Table 3 shows the modified tile specifications; the system tile size is $3.6 \times 5.2$ mm$^2$. Packet sizes of 1, 4 and 12 flits were chosen as examples to demonstrate the behaviour of the proposed architecture under packet-switching, virtual-cut-through-switching and wormhole-switching flow control, respectively. The number of master nodes is based on the available frequency range for 45 nm, which was estimated to be four channels (plus the frequencies specified for control signals). However, this frequency range scales with technology [6]. As a result, the number of master nodes was increased when simulating larger NoCs, assuming that the technology will have been scaled too. In addition, a VSWI number of four was chosen, which realizes a better performance/cost trade-off. Finally, in the evaluation in this section, the number of VCs was chosen to be equal to the VSWI number for simplicity of the router architecture.
The simulation was conducted with the following synthetic traffic patterns: (1) Random, where packets are transmitted randomly with uniform probability to other nodes; (2) Hotspot, which is the same as Random but with specific nodes called hotspots (four in this case) that have a higher probability of traffic being dispatched to them; (3) Transpose, where a node sends packets to the node whose address is the transpose of its own. These synthetic traffic patterns are adjusted to inject a specific percentage of broadcast (1-to-all) or multicast (1-to-M) traffic. The source nodes of this multicast traffic are selected randomly during the simulation, while the rest of the traffic consists of normal unicast packets according to the named synthetic traffic pattern. In the case of multicast, the destinations of these packets are also selected randomly. In addition, the evaluation of the proposed and baseline architectures includes real application benchmarks, whose details are given in Section IV-D.
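To make the traffic definitions above concrete, the following sketch shows one possible way to pick destinations for each pattern. It is not the modified Noxim code: the function names, the `hotspot_weight` parameter and the (x, y) to (y, x) reading of Transpose on a square mesh are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Node = Tuple[int, int]  # (x, y) mesh coordinates


def transpose_destination(src: Node) -> Node:
    """Transpose traffic: (x, y) sends to (y, x); assumes a square mesh."""
    x, y = src
    return (y, x)


def random_destination(src: Node, nodes: Sequence[Node], rng: random.Random,
                       hotspots: Sequence[Node] = (),
                       hotspot_weight: float = 1.0) -> Node:
    """Uniform random destination; hotspot nodes may get a higher weight."""
    candidates = [n for n in nodes if n != src]
    weights = [hotspot_weight if n in hotspots else 1.0 for n in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def pick_destinations(src: Node, nodes: Sequence[Node], rng: random.Random,
                      one_to_m_ratio: float, group_size: int) -> List[Node]:
    """With probability one_to_m_ratio emit a 1-to-M packet, else a unicast."""
    others = [n for n in nodes if n != src]
    if rng.random() < one_to_m_ratio:
        # Random multicast group; group_size == len(others) gives a broadcast.
        return rng.sample(others, group_size)
    return [random_destination(src, nodes, rng)]


rng = random.Random(0)
mesh = [(x, y) for x in range(4) for y in range(4)]
broadcast = pick_destinations((0, 0), mesh, rng, one_to_m_ratio=1.0,
                              group_size=len(mesh) - 1)
```

Setting `one_to_m_ratio` to 0.1 would correspond to the 10 percent 1-to-M mix used in several of the experiments, with the remaining packets following the chosen unicast pattern.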
A. PERFORMANCE IMPROVEMENTS
This section presents a performance evaluation of the proposed architecture under synthetic traffic. Figure 10a shows a much lower average delay for W-SWI-C than for MeshS with Random traffic consisting of 10 percent broadcast and 90 percent unicast. Even at zero-load latency (ZLL), the average delay improvement is approximately $22\times$. Similar improvements are obtained with Transpose (approximately $24\times$) and Hotspot traffic (approximately $21.8\times$). Further improvements can be reported as the PIR increases, since MeshS saturates before W-SWI-C, as shown in Figure 10a. These significant improvements arise because the software multicast in MeshS replicates the multicast traffic to all destinations, which increases the load and the number of hotspots in the NoC. In contrast, in W-SWI, packets are replicated at the multicast destination routers and sent through shortcut links that avoid costly intermediate routers and wire links. In addition, when multicast traffic is used, the resulting improvements were up to $12\times$ over MeshS. This is due to the relatively lower load and fewer hotspots caused by multicast compared to broadcast.
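The load added by software multicast can be illustrated with a simple count. The sketch below assumes minimal (Manhattan-distance) routes, as produced by deterministic routing such as XY, and counts only injected packets and link traversals for the baseline; it does not model the proposed architecture, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

Node = Tuple[int, int]  # (x, y) mesh coordinates


def xy_hops(src: Node, dst: Node) -> int:
    """Number of links on a minimal mesh route between two nodes."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def software_multicast_cost(src: Node, dests: Iterable[Node]) -> Tuple[int, int]:
    """Return (packets injected at the source, total link traversals).

    Software multicast sends one unicast replica per destination, so both
    numbers grow with the size of the multicast group.
    """
    dests = list(dests)
    return len(dests), sum(xy_hops(src, d) for d in dests)


# Broadcast from the corner of a 4 x 4 mesh: 15 replicas, 48 link traversals.
mesh = [(x, y) for x in range(4) for y in range(4)]
packets, traversals = software_multicast_cost(
    (0, 0), (n for n in mesh if n != (0, 0)))
```

All replicas are serialized through the source router, which is consistent with the source-side hotspots and early saturation reported for MeshS.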
On the other hand, Figure 10b shows that W-SWI-C with a Hold allocation performs better than fixed-period allocation with 10 percent multicast; see Section III-D. This is because the hold-release mechanism eliminates the reallocation delay. Figure 10b also shows that, as the allocation period is increased, performance starts to decay because of the inflexibility of resource time scheduling. Figure 10c shows a comparison of W-SWI-C and W-SWI-D for different values of $N_v$. The proposed W-SWI-D shows slightly better performance than W-SWI-C with one VC (an improvement of approximately 5 percent). This is because, with one VC, ID-tagging allows each master to have a virtually non-blocking channel. Moreover, the performance results show that even for higher $N_v$, W-SWI-D is better before reaching $1.5 \times ZLL$. However, after this point the multiplexing delay between masters starts to outweigh the improvements in SWI utilization. Therefore, the W-SWI-C curves show more linear performance against the increase in the PIR. Figure 10d demonstrates the effect of increasing the number of masters in W-SWI-C: Hold under 10 percent multicast. In general, performance improves as the number of SWIs is increased. However, W-SWI-C with two masters and two physical channels (SWI:2) seems to slightly outperform W-SWI-C:3. This could be because the increase in SWI channels increases contention on the shared medium, and the arbitration delay might affect performance. Thus, the optimum number of masters is not always the highest. In this work, SWI:4 is chosen as the design parameter since it offers the best performance/cost trade-off and also fits within the frequency limit for the 45 nm technology.
B. POWER REDUCTION
This section presents the evaluation results for power consumption. The router's static and dynamic power is calculated using the Orion 2.0 [28] area and power models, including the extra RR for the SWI in the case of W-SWI-D. The modelled router power has been calibrated to match the reported power measurements of the implemented NoC [27]. In addition, power dissipation for wire links is calculated for the horizontal links (3.6 mm) and the vertical links (5.2 mm), according to the SCC measurements [27]. The transceiver (TxRx) power consumption projection [6], [20] is used, which is calculated to be 24 mW per sub-channel. The SWI power dissipation is also calculated based on the analytical model introduced previously [21]. The GMA was designed in Verilog, synthesized using the Synopsys Design Compiler and mapped onto the PDK 45 nm technology library to calculate its dynamic power (4.8 mW) and leakage power (59.3 mW). All these values were then used in the adjusted Noxim to calculate the overall NoC power consumption. Figure 11 shows the ratio of the MeshS power consumption to the power consumption of W-SWI-C at $PIR = 2 \times ZLL$ for different NoC sizes, synthetic traffic patterns and percentages of 1-to-M traffic. Significant reductions in NoC power consumption are demonstrated. For instance, the power ratio of MeshS to W-SWI starts from more than double (approximately $2\times$) and increases up to approximately $10\times$ as the NoC size and the broadcast percentage are increased. However, less improvement appears in the case of multicast. This is because the multicast group members are fewer, which reduces the utilization of the SWI fan-out feature. Nonetheless, it still shows remarkable improvements, increasing from approximately $1.5\times$ to approximately $2.3\times$ in proportion to the NoC size and the 1-to-M ratio. On the other hand, W-SWI-D achieved generally lower improvements than W-SWI-C due to retransmission and the arbitration overhead.
However, W-SWI-D might outperform W-SWI-C in the case of a low load (approximately $1.5 \times ZLL$) due to the higher SWI utilization, as shown in our previous work [14]. In general, these new findings show that the W-SWI has remarkable scalability and effectiveness in mitigating 1-to-M communication issues.
C. COMPARISON WITH RELATED WORK
In this study, one of the state-of-the-art wire-based NoCs with a tree-based multicast scheme was replicated in order to compare it with the proposed architecture. This scheme is the VCTM [3] . It has been chosen because of its efficiency and simplicity, where a minimum of modifications to the baseline router architecture are required since the VCTM basically enables a mesh NoC to have a tree-based routing capability.
These modifications mainly include the VCTM table and the control circuits for the forking of one flit/cycle. These features will provide a fair comparison with the proposed architecture. This scheme is based on assigning one of the VCTM table entries in each router to every new multicast group with a unique source. Then, packet forking and routing are conducted according to the VCTM table. This look-up table needs a set-up stage to define its content for each new multicast group introduced to the NoC. The set-up stage uses software multicast and the table entry is set up cumulatively. Thus, the authors acknowledged that interconnect performance depends on the ratio of the VCTM table size in the router to the number of unique multicast groups injected into the NoC [3].
A major limitation of the VCTM is its inability to handle wormhole traffic due to deadlock occurrence. Therefore, the chosen packet sizes were one and four flits to give a fair demonstration of the NoC's performance comparison under packet-switching and virtual-cut-through-switching, respectively. In addition, the VCTM is limited to turn-model routing with no path diversity. Otherwise, deadlock problems could appear because diversity might introduce cycles when building the multicast tree in the set-up stage. In contrast, path diversity is a favourable feature in our architecture and has been supported by using odd-even routing; see Section III-B. Nevertheless, XY routing was used in all of the simulations in this section in order to provide a fair comparison. Moreover, according to Jerger et al. [3], a VCTM with 512 entries per source offers good performance/cost levels in most cases. Therefore, a VCTM with 512 entries was considered in all our evaluations. Table 4 shows a performance comparison of the W-SWI-C and VCTM architectures with different multicast ratios. The multicast source and group members were selected randomly. The average delay and PIR are reported at the NoC saturation edge, where the average delay is double the ZLL. At this PIR, the proposed architecture shows steady improvements of approximately $1.8\times$ and $5.5\times$ for one- and four-flit packet sizes, respectively. Although the proposed architecture performs better with wormhole or virtual-cut-through switching, its performance is still almost double that of the VCTM under basic packet switching.
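The saturation-edge criterion used above (average delay equal to twice the ZLL) can be located on a measured latency curve as in the following sketch. The linear interpolation between sample points, the function name and the sample values are assumptions for illustration; they are not results from this paper.

```python
from typing import Optional, Sequence


def saturation_pir(pirs: Sequence[float], delays: Sequence[float],
                   zll: float, factor: float = 2.0) -> Optional[float]:
    """Return the PIR at which the average delay first reaches factor * ZLL.

    pirs must be increasing; delays[i] is the average delay measured at
    pirs[i]. Linearly interpolates between the two samples that bracket the
    threshold and returns None if the curve never reaches it.
    """
    threshold = factor * zll
    for i, (pir, delay) in enumerate(zip(pirs, delays)):
        if delay >= threshold:
            if i == 0:
                return pir
            p0, d0 = pirs[i - 1], delays[i - 1]
            # d0 < threshold <= delay here, so the denominator is positive.
            return p0 + (threshold - d0) * (pir - p0) / (delay - d0)
    return None


# Made-up latency curve: ZLL of 10 cycles, threshold of 20 cycles.
edge = saturation_pir([0.01, 0.02, 0.03], [10.0, 15.0, 25.0], zll=10.0)
```

Comparing two architectures at their respective saturation edges, as in Table 4, then amounts to applying the same threshold to each latency curve.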
D. EVALUATION BASED ON REAL APPLICATION BENCHMARK
In order to demonstrate the effectiveness of the proposed architecture for real applications, a set of application benchmarks from standard suites, namely PARSEC [29] and SPLASH2 [30], has been considered. These benchmarks are built based on the traffic analysis of communication trace files generated from the many-core simulator [31], where all the benchmark applications were run with the MESI cache coherence protocol. This protocol is well known and used in many multi-processor systems [32]. Based on this traffic analysis, synthetic traffic patterns have been built that have the same injection rates, packet sizes and source/destination(s) as each multicast and unicast traffic flow in these application benchmarks. These synthetic traffic patterns were then run for a million cycles with our cycle-accurate system-level NoC simulator. Figure 12a presents the performance improvement gained using the proposed W-SWI-C and W-SWI-D architectures over MeshS for a NoC size of $10 \times 8$. In general, the average delay improvements of W-SWI-C and W-SWI-D over MeshS are very similar and range from approximately 5 to approximately 99 percent. Moreover, these improvements are clearly proportional to the multicast share of the total PIR (4 to 14.2 percent). An exception is the case of the blackhole benchmark, where the improvement over MeshS is around 99 percent, even though the multicast ratio is 7.8 percent. This is due to the nature of the traffic hotspots (specifically multicast source hotspots) that cause the traffic source to quickly become saturated. In contrast, the proposed architectures are better able to alleviate such traffic source overload and therefore reduce the serialization delay of multicast traffic into separate unicast traffic. On the other hand, Figure 12b demonstrates the average energy/flit improvements over MeshS, where a flit is 128 b as mentioned earlier.
Once again, W-SWI-C and W-SWI-D achieved better energy/flit than MeshS, by up to approximately 10 percent, and the improvements are proportional to the multicast percentage. However, most of these benchmarks run with relatively low PIR. Therefore, as shown in Section IV-A, W-SWI-D outperforms W-SWI-C under low load. As a result, even though W-SWI-D uses retransmissions that increase power consumption, it is better than W-SWI-C in some benchmarks in terms of power. In general, these results demonstrate the potential of the proposed architectures for future NoC-based many-cores.
E. AREA OVERHEAD EVALUATION
It is essential to evaluate the chip area overheads of the extra on-chip circuits required for the proposed architecture. First, it is assumed that the active area calculated for transceivers in previous research [6] is the only part scaled down when moving to 45 nm technology, while the passive parts remain almost the same since they are proportional to the channels' operational frequency range. Therefore, the projected transmitter area is 4,870 $\mu$m$^2$ per sub-channel, while the projected receiver area is 260 $\mu$m$^2$ per sub-channel, where the active area is proportional to the square of the scaling factor [33]. Second, the area of the baseline router and the extra router port (buffer, crossbar and related circuits) is calculated using the Orion 2.0 [28] model as 0.427 mm$^2$. The modelled baseline router area is 6 percent less than the reported implemented router area [27], which is acceptable for the purpose of the comparative evaluation in this paper. Third, the GMA (for W-SWI-C) and the RR (for W-SWI-D) were designed in Verilog, synthesized using the Synopsys Design Compiler and mapped onto the PDK 45 nm technology library to calculate their areas. These were found to be 0.0114 and 0.0002 mm$^2$, respectively, to which the estimated TxRx (for control signals) areas of 0.0438 and 0.0307 mm$^2$, respectively, were added. Likewise, the VCTM with a 512 entry/source lookup table was designed in Verilog and then synthesized. Moreover, for comparison with other emerging interconnects, the RF-I's transmission line area was calculated, with the line considered to be routed through the chip (NoC size $6 \times 4$) as a U shape passing through all nodes [6]. A transmission line with a 12 $\mu$m pitch has been considered in the calculation of the RF-I area overhead. Table 5 shows the area overhead breakdown for MeshS, the proposed W-SWI-C and W-SWI-D, VCTM and RF-I. RF-I has been chosen as an example for comparison with emerging interconnects; a full comparison with other interconnects is given in a previous study [19].
Most of the extra area overhead for W-SWI-C and W-SWI-D, around 1.9 percent, is due to the extra router port. However, the W-SWI-D area overhead is higher than that of W-SWI-C (approximately 2.4 and 2 percent, respectively). This is mostly due to increasing the VC allocation unit area in all routers to implement the ID-tagging scheme. Moreover, W-SWI-C offers a better die area-performance trade-off than RF-I transmission lines that offer the same connectivity [6], since fat transmission lines need to be implemented through the chip. In addition, W-SWI-C has a lower area overhead than the VCTM (around five times less). Therefore, W-SWI-C compares favourably with these architectures in terms of area overhead.

FIGURE 12. Comparison between the average delay and energy improvements of W-SWI-C and W-SWI-D over MeshS under real application benchmarks from PARSEC [29] and SPLASH2 [30] for a $10 \times 8$ NoC.
V. CONCLUSION AND FUTURE WORK
This paper tackles the 1-to-M traffic issues efficiently using the hybrid wire-SWI architecture for on-chip communication. The low power dissipation, high signal propagation speed and fan-out capability of Zenneck surface waves all contribute to significantly mitigating the 1-to-M communication issues from which NoC-based many-core processors in particular suffer. In addition, novel, efficient, and deadlock-free centralized (W-SWI-C) and decentralized (W-SWI-D) arbitration and allocation techniques, along with a multicast routing scheme for this architecture, are proposed and discussed. The evaluation results show significant improvements in terms of average delay, saturated PIR and power consumption with a relatively small die area penalty compared to state-of-the-art architectures. Moreover, the comparison of W-SWI-C and W-SWI-D has shown that the former is preferred for higher traffic loads, while the latter is better suited to low traffic loads. In general, the results demonstrate the high scalability of the W-SWI for the many-core era. Future work should include the investigation of many-to-one traffic patterns.
