Abstract: Computing modules in typical datacenter nodes or server racks consist of several multicore chips either on a board or in a System-in-Package (SiP) environment. State-of-the-art inter-chip communication over wireline channels requires data signals to travel from internal nets to the peripheral I/O ports and then be routed over the inter-chip channels to the I/O port of the destination chip. The data is then routed from the I/O port to the internal nets of the destination chip over a wireline interconnect fabric. This multihop communication increases energy consumption while decreasing data bandwidth in a multichip system. Also, traditional I/O does not scale well with technology generations due to limitations of pitch. Moreover, the intra-chip and inter-chip communication protocols within such a multichip system are often decoupled to facilitate design flexibility. However, a seamless interconnection between on-chip and off-chip data transfer can improve the communication efficiency significantly. Here, we propose the design of a seamless hybrid wired and wireless interconnection network, with on-chip wireless transceivers, for multichip systems with dimensions spanning up to tens of centimeters. We demonstrate with cycle-accurate simulations that such a design increases the bandwidth and reduces the energy consumption in comparison to state-of-the-art wireline I/O based multichip communication.

INTRODUCTION
Computing modules with multiple multicore chips are all-pervasive in hardware infrastructures from servers to datacenters. An example of such a multicore multichip module is the IBM Power Series, a processor system designed for sophisticated clusters. Because the number of individual computing nodes in these systems has scaled up by several orders of magnitude, the interconnection between them has become increasingly complex. Moreover, in High Performance Computing (HPC) environments like datacenters or servers, the lower level cache is physically distributed between all cores. Hence, cache or memory access eventually requires communication between components in different chips. While intra-chip communication infrastructure is seeing a paradigm shift from bus-based systems to Network-on-Chip (NoC) architectures [1], inter-chip communication is also evolving at a rapid pace to cater to increasing bandwidth demands within strict power envelopes. Inter-chip interconnections vary from solder bumps or C4 interconnects in multichip modules within a System-in-Package (SiP) spanning 10 cm in range on one end to Ethernet used in datacenter warehouses spanning about a kilometer on the other, as shown in Fig. 1. Among other chip-to-chip interconnects, Peripheral Component Interconnect (PCI) is the most common standard local I/O bus technology for interconnecting board-level multichip modules. PCI Express (PCIe) is presented as the next generation I/O technology. To meet the scalability, latency and bandwidth requirements of large HPC environments, specialized hardware and communication protocols such as InfiniBand or Myrinet are used commercially.
Recent trends according to the International Technology Roadmap for Semiconductors (ITRS) (http://www.itrs.net/) predict that the pitch of the I/O interconnects in ICs is not scaling as fast as the gate lengths or the pitch of on-chip interconnects. This implies a gap in density and performance of traditional I/O systems relative to on-chip interconnections. The wiring complexity of both on-chip and off-chip interconnects exacerbates the problem by posing design challenges as well as crosstalk and signal integrity issues. Additionally, because of different interconnection frameworks for on-chip and off-chip communication, data from cores located within the chips needs to travel to the I/O blocks, traverse the inter-chip link and then be routed to the final destination inside the destination chip. Besides, switching between protocols is necessary if the off-chip communication protocol is different from the on-chip one. All these factors reduce the efficiency of data transfer between cores in a multichip system in terms of energy consumption as well as latency and bandwidth. Integrated inter and intra-chip photonic interconnections [2] and vertically integrated monolithic 3D ICs [3] are promising solutions to the off-chip interconnection challenges of traditional I/O. However, the pitch of photonic interconnects does not scale well due to the limitations in size of silicon-photonic devices. On the other hand, 3D integration requires sophisticated thermal management techniques and suffers from low yields due to vertical misalignment of the layers.
Research in recent years has demonstrated that on-chip and off-chip wireless interconnects are capable of establishing radio communications within as well as between multiple chips. Wireless data communication links up to 10 m in length with multi-gigahertz bandwidths in millimeter-wave (mm-wave) bands have been fabricated and demonstrated [4]. Wireless NoC architectures using such antennas embedded in the chip have been proposed [5]. These wireless NoCs are shown to improve the energy efficiency and bandwidth of on-chip data communication in multicore chips [6]. In this work, we propose to use wireless interconnects to establish a seamless communication backbone which enables data exchange between cores in a single chip as well as between chips in a multichip system with dimensions spanning up to ten centimeters. The same communication protocols used for on-chip data transfer in the intra-chip NoC will be used for off-chip data as well, eliminating the need to switch between protocols. A few cores inside the chips will be equipped with wireless transceivers, which will be capable of establishing direct one-hop communication with other such cores in the same chip as well as in other chips. By deploying the wireless transceivers in the internal nodes of the chips such that all cores are within a short distance from their nearest transceivers, energy-efficient inter and intra-chip communication can be achieved. Here, we present the design methodologies for such multicore multichip systems and demonstrate through system-level simulations that the proposed design outperforms traditional wired I/O based multichip systems. The scope of our proposed communication framework is encircled in Fig. 1.
The specific contributions of this paper are:
1. Two different interconnect frameworks that utilize wireless interconnects for seamless inter and intra-chip communication.
2. Design of suitable on-chip antennas to establish wireless interconnection in a multichip system.
3. Performance evaluation of the wireless multichip system and comparison with a traditional I/O based multichip system.
4. A methodology to deploy wireless interconnects when the system scales up.
5. Comparative evaluation with respect to emerging multichip integration technologies.
The rest of the paper is organized as follows: Section 2 discusses the most recent and relevant related works, Section 3 describes the proposed wireless multichip interconnect framework, Section 4 presents the evaluations of the proposed inter and intra-chip communication fabric, and finally Section 5 concludes the paper.
RELATED WORK
According to ITRS, the pitch of chip-to-chip I/O does not scale in the same proportion as on-chip global wires. Conventionally, C4 bumps coupled with in-package transmission lines are used to interconnect chips within a multichip system [7]. However, signal quality deterioration due to microwave effects, crosstalk coupling, signal reflections, and frequency-dependent line losses in the transmission line limits the number of concurrent, high density inter-chip I/O [8]. This in turn restricts the possible off-chip bandwidth.
Different interconnect technologies such as vertically integrated 3D integration [3] , photonic interconnects [2] , inductive or capacitive coupling based interconnects [9] and wireless interconnects [10] are being explored to mitigate the performance issues of conventional I/O based multichip systems. In [11] wirelessly connected multichip modules are proposed for a High Performance Computing environment. In [12] transceivers for 60 GHz inter and intra-chip communications are designed. However, system-level performance gains are not evaluated in this work. In [13] on-chip wireless transceivers are used to facilitate fast pre-bonding wafer testing enabled by direct accesses to components under test within the ICs.
Novel interconnect paradigms such as wireless links are envisioned for on-chip data communications as well. Comprehensive surveys of various wireless NoC (WiNoC) architectures and their design principles are presented in [6], [14]. Transmission line based RF interconnects and surface wave based communication channels are proposed in [15], [16]. On-chip antennas made from graphene or Carbon Nanotube (CNT) based structures are predicted to provide high bandwidth wireless communication channels [17], [18]. However, integration of these antennas with standard CMOS processes needs to overcome significant challenges. In contrast, mm-wave CMOS transceivers operating in the sub-THz frequency ranges are a more near-term solution. In [4] mm-wave wireless on-chip embedded antennas for intra-chip and inter-chip communication are designed and evaluated. Medium access mechanisms in wireless NoCs using mm-wave transceivers range from simple token passing based protocols to more sophisticated CDMA based mechanisms [6], [19], [20], [21]. In this work, we propose a hybrid inter and intra-chip communication framework using both on-chip wired links and token-based mm-wave wireless interconnects.
WIRELESS INTERCONNECTION FRAMEWORK FOR MULTICHIP SYSTEMS
The interconnection fabric of the proposed multichip system with wireless interconnects is a hybrid network with both wired and wireless links. Here we describe the topology, physical layer and communication protocols of this multichip interconnection framework.
Topology
In the proposed wireless interconnection framework, the cores within each individual chip are interconnected using an intra-chip NoC. We discuss the interconnection architectures for the multichip systems with two different intra-chip NoC topologies as case studies to exhibit their role in the overall system. The two chosen intra-chip NoC topologies are a Mesh and a Small-World topology. The Mesh is chosen as it is a conventional NoC topology used in several multicore based products [22] and is relatively easy to design, verify and manufacture. The Small-World topology is chosen as it is suitable for designing wireless NoCs, as noted in [23], and is demonstrated to outperform the Mesh based NoC in [24]. The multichip systems with the two chosen intra-chip NoCs are described below.
Multichip System with Intra-Chip Mesh
In the first multichip interconnection framework the intra-chip interconnection topology is a traditional Mesh based NoC. To alleviate the limitations of traditional inter-chip interconnects, we equip NoC switches associated with cores embedded within the chip with wireless interfaces (WIs). In order to deploy the WIs, the intra-chip Mesh NoC in each chip is further subdivided into a certain number of logical subnets. A WI is deployed at the switch at the center of each subnet to avoid long multihop paths from the cores in that subnet. Fig. 2 shows the wireless multichip system with 4 chips. This WI deployment strategy corresponds to the approach in [25] that achieves minimum average distance (MAD) between all switches in an intra-chip NoC. It improves the connectivity of the entire multichip system by establishing direct wireless links between internal switches, eliminating the need to travel to the periphery of the source chip and from the periphery of the destination chip to access the traditional I/O modules.
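The placement rule described above can be illustrated with a short sketch. It assumes an 8 × 8 Mesh per chip (64 cores) divided into four square 4 × 4 subnets, which is one possible reading of the subnet division and not a layout stated in this section. The function names are hypothetical. For each subnet it picks the switch with the smallest average Manhattan distance to all switches in that subnet.

```python
from itertools import product

def subnet_center(x0, y0, width, height):
    """Return the switch of a rectangular subnet with the smallest
    average Manhattan distance to every switch of that subnet."""
    switches = list(product(range(x0, x0 + width), range(y0, y0 + height)))

    def avg_distance(s):
        total = sum(abs(s[0] - t[0]) + abs(s[1] - t[1]) for t in switches)
        return total / len(switches)

    # min() returns the first of several equally good candidates.
    return min(switches, key=avg_distance)

def place_wis(mesh_dim=8, subnets_per_side=2):
    """Assumed layout: a mesh_dim x mesh_dim Mesh cut into square subnets."""
    size = mesh_dim // subnets_per_side
    return [subnet_center(sx * size, sy * size, size, size)
            for sx in range(subnets_per_side)
            for sy in range(subnets_per_side)]

wi_positions = place_wis()
```

In an even-sized subnet the four middle switches tie for the minimum, so the tie-break used here (first candidate found) is arbitrary.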
Multichip System with Intra-Chip Small-World
Insertion of bypass paths or long-range shortcuts realized with metal interconnects is shown to improve the performance of a conventional Mesh-based NoC [24]. Small-World networks are a type of complex network often found in nature, characterized by both short-distance and long-range links. This combination improves the efficiency of the network, as such networks have a very low average number of hops between nodes even for very large network sizes. Hence, such network topologies are suitable for designing scalable, hybrid intra and inter-chip interconnection networks using wireless links, as shown in [23].
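To establish the wireline links within each intra-chip NoC while satisfying the properties of Small-World graphs, we generate the wireline topology according to the inverse power law to minimize wiring costs [26]:

$$P(i,j) = \frac{l_{ij}^{-\alpha} f_{ij}}{\sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} l_{ij}^{-\alpha} f_{ij}} \qquad (1)$$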
where $P(i,j)$ is the probability of establishing a link between two switches $i$ and $j$, $l_{ij}$ is the Manhattan distance between them, $f_{ij}$ is the frequency of communication between switches $i$ and $j$, and $n$ is the total number of switches. As can be seen from (1), the probability of a link insertion between two switches $i$ and $j$ separated by $l_{ij}$ is proportional to the distance raised to a finite negative power. The value of $\alpha$ is chosen such that optimal wiring costs [26] are obtained. The distance is obtained by considering a tile-based floorplan of the cores on the die. The frequency of traffic interaction between the cores, $f_{ij}$, is also factored into (1) so that more frequently communicating cores have a higher probability of having a direct link, optimizing the topology for application-specific traffic. This power-law based link distribution results in both short-distance connections and long-range links due to the non-zero probability of links between far-away nodes. The total number of these wireline links is kept the same as that in a Mesh of the corresponding size to ensure that no undue advantage is granted to the Small-World architecture due to additional links. Also, an upper bound of 7 is imposed on the number of links attached to a particular switch so that no particular switch becomes unrealistically large [23]. The link setup method is repeated until no core or group of cores is left unconnected. In this way, the intra-chip wireline Small-World NoC topology is created. In addition to these wireline links, the wireless transceivers are deployed to form the WIs at the same switches as in the Mesh based intra-chip NoC. This is done to form the same overlaid inter-chip wireless interconnect topology in the Mesh and Small-World based multichip systems.
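As a rough illustration of the link setup described above, the sketch below computes link probabilities proportional to the Manhattan distance raised to a negative power and weighted by the communication frequency, then draws links until a Mesh-equivalent link count is reached while respecting the bound of 7 links per switch. The value of alpha, the uniform traffic matrix, the 4 × 4 tile grid and the function names are illustrative assumptions, not values from the text. The sketch normalizes over unordered switch pairs, which changes only the constant factor, and it omits the final connectivity check.

```python
import random
from itertools import combinations

def link_probabilities(coords, freq, alpha):
    """Probability of a link between switches i and j, proportional to
    l_ij**(-alpha) * f_ij and normalized over all unordered pairs."""
    weights = {}
    for i, j in combinations(range(len(coords)), 2):
        l_ij = abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1])
        weights[(i, j)] = l_ij ** (-alpha) * freq[i][j]
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

def sample_links(probs, n_links, max_degree=7, seed=0):
    """Draw distinct links according to probs, capping the switch degree."""
    rng = random.Random(seed)
    degree = {}
    links = set()
    while len(links) < n_links:
        candidates = [p for p in probs
                      if p not in links
                      and degree.get(p[0], 0) < max_degree
                      and degree.get(p[1], 0) < max_degree]
        if not candidates:
            break  # no admissible pair is left
        i, j = rng.choices(candidates, weights=[probs[p] for p in candidates])[0]
        links.add((i, j))
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    return links

# Illustrative 4 x 4 tile floorplan with uniform traffic (assumed values).
side = 4
coords = [(x, y) for x in range(side) for y in range(side)]
freq = [[1.0] * len(coords) for _ in coords]
probs = link_probabilities(coords, freq, alpha=1.8)
mesh_links = 2 * side * (side - 1)   # link count of a 4 x 4 Mesh
links = sample_links(probs, mesh_links)
```

A complete generator would repeat the draw until the resulting graph is connected, as the text requires.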
Physical Layer
We envision a multichip system in which wireless interconnects enable seamless intra and inter-chip communications. Intra-chip communication will happen over the hybrid wireline and wireless NoC. Wireline links are realized with traditional global-wire based interconnects depending on the specific topology adopted. Several alternative technologies exist for realizing on-chip and off-chip wireless interconnections [6], [11], [17], [18], [19]. We envision the use of on-chip embedded miniature antennas operating in the 60 GHz mm-wave band that can be fabricated within the chip to establish direct communication channels between internal switches of the chips. The chosen on-chip antenna has to provide the best power gain for the smallest area overhead. Several on-chip antenna designs in the mm-wave bands have been investigated [4], [11], [27], [28]. A linear dipole occupies a large area proportional to the wavelength of the carrier frequency. A patch antenna is directional, mostly radiating perpendicular to its plane and not laterally towards other antennas. A log-periodic antenna can have higher power gains but is highly directional. We intend the chosen antenna to be compact as well as non-directional, because we want to communicate between antennas that are located in different chips and potentially at different angles with respect to each other's axes. A metal mm-wave zigzag antenna has been demonstrated to possess these characteristics, as it is more compact than a linear dipole due to the zigzag folding of its arms. In addition, such mm-wave antennas fabricated using top layer metals are CMOS process compatible, making them suitable as a near-term solution to the wired interconnect problem [4]. Such 60 GHz mm-wave antennas are shown to have a bandwidth of 16 GHz for both intra-chip [5] and inter-chip [11] communication links.
We have designed mm-wave zigzag on-chip antennas and their co-planar feed structure to resonate at 60 GHz and provide a high bandwidth of 16 GHz, and studied their characteristics in terms of return loss and path loss in a multichip system. A co-planar feed structure is chosen for the antenna as it has low losses compared to other feed structures such as microstrips. These antennas are also shown to be non-directional [4]. This enables each WI to communicate with any other WI in the system, making the wireless medium a shared channel. Fig. 3 shows the specific dimensions of the antenna and its co-planar feed structure. A trace width of 5 μm is used for all arms.
To ensure high throughput and energy efficiency, the WI transceiver circuitry has to provide a very wide bandwidth as well as low power consumption. Hence, we adopt the transceiver design from [5] where low power design considerations are taken into account at the architecture level. Non-coherent on-off keying (OOK) modulation is chosen, as it allows relatively simple and low-power circuit implementation.
Flow Control and Routing
The routing protocol for the proposed multichip system is a seamless intra and inter-chip data communication mechanism. We adopt wormhole switching for wireline links in the multichip system, where data packets are broken down into flow control units or flits [29]. Wormhole switching is known to reduce the buffering requirements at the switches because, unlike in packet switching, whole packets are not stored and forwarded. This makes the on-chip NoC switches consume less power and occupy less area. All switches have bidirectional ports for all links attached to them. All cores in the system have unique addresses. We adopt wormhole switching for the wireless links as well, which enables the energy-efficient token-based sleep/awake transceiver modes of operation discussed in the next section.
As the overall system is not a regular network, we adopt a shortest path routing to optimize network performance. We use a forwarding-table based routing over precomputed shortest paths determined by Dijkstra's algorithm. Dijkstra's algorithm extracts a minimum spanning tree, which provides the shortest path between any pair of nodes in a graph. The exact minimum spanning tree depends on the chosen start node for the algorithm but the length of paths between any particular pair, along the tree does not depend on the start node. Hence, it is chosen randomly from among all the switches in the system. However, for a specific start node the shortest path along the extracted tree is always unique as the minimum spanning tree inherently eliminates loops. Consequently, deadlock is avoided by transferring flits along the shortest path routing tree extracted by Dijkstra's algorithm, as it is inherently free of cyclic dependencies.
As a result of using shortest path routing, the wireless links can also be used for intra-chip communication if they reduce the path lengths compared to a completely wireline path. Each switch only forwards the header flits to the next switch in the path to the final destination. The body flits simply follow the path laid out by the header according to the adopted wormhole switching protocol. Hence, each switch only has local forwarding information eliminating the need for maintaining non-scalable global routing information.
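The forwarding-table approach can be sketched as follows. The sketch runs Dijkstra's algorithm from one root switch, keeps the resulting tree, and derives for every switch a purely local table that maps a destination to the next hop along that tree. The graph, the link weights and the function names are hypothetical, and the table layout is an assumption made for illustration rather than the design used in this work.

```python
import heapq

def dijkstra_tree(adj, root):
    """Run Dijkstra from root and return the parent of every switch
    in the resulting tree. adj maps a switch to (neighbor, cost) pairs."""
    dist = {root: 0}
    parent = {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

def forwarding_tables(parent):
    """For every switch, map each destination to the next hop on the tree."""
    tree = {u: [] for u in parent}
    for v, p in parent.items():
        if p is not None:
            tree[v].append(p)
            tree[p].append(v)
    tables = {}
    for src in tree:
        table = {}
        for first_hop in tree[src]:
            # Everything reachable through first_hop is forwarded to it.
            stack = [(first_hop, src)]
            while stack:
                node, prev = stack.pop()
                table[node] = first_hop
                stack.extend((nxt, node) for nxt in tree[node] if nxt != prev)
        tables[src] = table
    return tables

# Hypothetical 4-switch example; costs could model link delay in cycles.
adj = {
    0: [(1, 1), (2, 3)],
    1: [(0, 1), (2, 1)],
    2: [(0, 3), (1, 1), (3, 1)],
    3: [(2, 1)],
}
tables = forwarding_tables(dijkstra_tree(adj, root=0))
```

Because every route follows a single tree, each table needs one entry per destination and no switch stores complete paths, which matches the local forwarding described above.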
Wireless Communication Protocol
In mm-wave interconnects wireless bandwidth is limited by the state-of-the-art transceiver design and on-chip antenna technology. In order to improve performance, multiple wireless transceivers need to access the wireless medium to communicate via the energy-efficient wireless interconnects. Consequently, multiple transceivers share a single wireless frequency channel. Therefore, an efficient and collision-free medium access control (MAC) mechanism is needed.
Wireless Medium Access Control Scheme
Several MAC protocols have been investigated in the context of wireless NoCs. To enable Frequency Division Multiple Access (FDMA) using mm-wave bands, transceivers tuned to multiple carrier frequencies need to be designed. Power efficient design of such transceivers is a non-trivial challenge. The system-level performance of CDMA based on-chip and off-chip wireless interconnection architectures has been evaluated in [10], [21]. However, such CDMA schemes require precise synchronization between the transceivers to avoid inter-channel interference by preserving the orthogonality of the code channels. Such synchronization is difficult to achieve in transceivers distributed across multiple chips. Similarly, synchronized classical Time Division Multiple Access (TDMA) is difficult to adopt in a multichip system for the same reason. Therefore, Asynchronous TDMA (A-TDMA) schemes based on token passing [6] or Carrier Sense Multiple Access/Collision Detection (CSMA/CD) [30] have been proposed. However, CSMA based A-TDMA does not perform well in the presence of high traffic density due to exponential back-off [31]. Therefore, to access the wireless channel in a distributed fashion, without the need for precise synchronization or centralized arbitration while avoiding collisions, a token based medium access mechanism is proposed in [6] for WiNoCs. Hence, in this work we adopt a similar token-based A-TDMA medium access mechanism for the multichip systems using wireless interconnection. In a token-based medium access mechanism, access to the wireless medium is granted by the possession of a token. Only the WI possessing the token can transmit via the wireless medium. No separate request mechanism or priority is considered as a part of the token passing scheme, which avoids the need for a central grant or arbitration unit and enables a distributed access mechanism.
To enable autonomous token passing among the WIs with fairness in accessing the wireless medium, the WIs are numbered sequentially in a virtual token ring. The token circulates autonomously between the WIs as a wireless flit in a round robin fashion. When a WI receives a token, it determines which virtual channels (VCs) in its wireless port have complete packets. The WI holds the token for the duration (time slots) required to transmit all the complete packets in those VCs. This ensures that the body flits can follow their header to the destination WI and preserves the integrity of the wormhole switching. Therefore, each WI can hold the token for a maximum duration of

$$T_{\max} = n \, \ell_{\max} \, t_{flit} \qquad (2)$$
where $n$ is the number of VCs in the wireless port of the WI, $\ell_{\max}$ is the maximum packet size in flits and $t_{flit}$ is the time (number of cycles) required to transmit a single flit over the wireless medium. The upper limit on the token holding duration prevents bandwidth starvation of WIs not currently possessing the token. This token based MAC prevents unutilized time slots and therefore maximizes utilization of the wireless bandwidth. It also helps in further increasing the energy efficiency of the wireless transceivers as discussed below.
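The token holding rule can be made concrete with a small sketch. The bound is the number of VCs times the maximum packet size in flits times the per-flit transmission time, as defined above. The function names are hypothetical, the sketch assumes at most one complete packet per VC, and the per-flit time of 5 cycles is an illustrative value rather than a figure from this section.

```python
def max_token_hold_cycles(num_vcs, max_packet_flits, flit_cycles):
    """Upper bound on the token holding time in clock cycles."""
    return num_vcs * max_packet_flits * flit_cycles

def token_hold_cycles(complete_packet_flits, flit_cycles):
    """Actual holding time for one token visit.

    complete_packet_flits lists, per VC, the size in flits of the complete
    packet waiting there, or 0 if the VC holds no complete packet."""
    return sum(complete_packet_flits) * flit_cycles

def next_token_holder(current_wi, num_wis):
    """Round-robin successor in the virtual token ring."""
    return (current_wi + 1) % num_wis

# Illustrative values: 4 VCs, packets of at most 64 flits, 5 cycles per flit.
bound = max_token_hold_cycles(num_vcs=4, max_packet_flits=64, flit_cycles=5)
hold = token_hold_cycles([64, 0, 16, 0], flit_cycles=5)
```

With these assumed values the bound is 1280 cycles, while a WI with one 64-flit and one 16-flit packet ready holds the token for 400 cycles before passing it on.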
Improving Energy-Efficiency of the Wireless Interconnect Using the MAC
The energy efficiency of the wireless interconnects can be further increased by using sleep-transistor-based power-gated transceivers instead of keeping them always awake. In [32], [33] such a sleepy transceiver is implemented where control signals to turn the transceivers on and off are sent over specialized low latency wired Global-Line wires. However, in a multichip environment where WIs are distributed across different chips, communicating the sleep/awake control signals over wired lines is challenging as it requires additional pin and I/O overhead. Therefore, we utilize the wireless communication protocol itself to enable a power efficient sleep mode in the transceivers when they are not used. Each header flit sent across the wireless medium from a transmitting WI contains the number of flits in the packet and the address of the destination WI. All other WIs, which are not the destination of that particular header, will also receive the header due to the broadcast nature of the non-directional antennas. On decoding the number of flits contained in that packet, all WIs except the source and destination WIs will go to sleep for the duration of the packet transmission. They will wake up after this duration to receive the next header (if there are other packets to send from other VCs) or the token (if the token is being passed) and react accordingly. The flit-type field in the header, body or token flits enables this feature. As the VCs contain entire packets, as noted in the previous section, only full packets are transmitted together over the wireless medium. Therefore, a new header or a token flit is transmitted and received by all WIs when they wake up. The architecture of the WI that enables the token based medium access is shown in Fig. 4. The Token Unit is the main logical unit responsible for managing the token passing mechanism.
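A minimal sketch of the sleep decision is given below. It assumes that the flit count carried in the header includes the header itself, so a non-participating WI sleeps for the remaining flits, and it treats the field names, the function name and the per-flit time as hypothetical.

```python
from dataclasses import dataclass

@dataclass
class HeaderFlit:
    dst_wi: int      # address of the destination WI
    num_flits: int   # flits in the packet (assumed to include the header)

def sleep_cycles(my_wi, header, flit_cycles, transmitting=False):
    """Cycles a WI may power-gate its transceiver after decoding a header.

    The transmitting WI and the destination WI stay awake. Every other WI
    sleeps until the remaining flits of the packet have been sent."""
    if transmitting or my_wi == header.dst_wi:
        return 0
    return (header.num_flits - 1) * flit_cycles

header = HeaderFlit(dst_wi=1, num_flits=64)
bystander = sleep_cycles(2, header, flit_cycles=5)
receiver = sleep_cycles(1, header, flit_cycles=5)
sender = sleep_cycles(0, header, flit_cycles=5, transmitting=True)
```

After the sleep interval expires, the WI wakes up in time to receive either the next header or the token, as described in the text.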
EXPERIMENTAL RESULTS
In this section, we evaluate the performance and energy efficiency of the wireless multichip systems. We compare the wireless interconnect based multichip system with both conventional I/O based and alternative emerging interconnection based multichip systems. The chip-to-chip I/O is adopted from [34] and is shown to have a bandwidth of 15 Gbps and an energy consumption of 5 pJ/bit. The delay and energy dissipation on the intra-chip wireline links, on the other hand, are obtained through Cadence simulations taking into account the specific lengths of each link based on the established topology in the 20 × 20 mm dies. The wireless transceiver adopted from [5], [32] is designed and simulated using the TSMC 65 nm CMOS process and is shown to dissipate 2.31 pJ/bit while sustaining a data rate of 16 Gbps with a bit-error rate (BER) of less than $10^{-15}$ and occupying an area of 0.3 mm$^2$. The network switches and the Token Unit are synthesized from an RTL-level design using 65 nm standard cell libraries from Circuits Multi-Projets (http://cmp.imag.fr), using Synopsys. The delay and power dissipation of these digital components, including both dynamic and static power consumption, are then incorporated in a cycle-accurate simulator to evaluate the performance and energy efficiency of different multichip systems. The simulator characterizes the multichip architecture and models the progress of the flits over the switches and links per cycle, accounting for those flits that reach the destination as well as those that are stalled. Ten thousand iterations were performed, eliminating transients in the first thousand iterations. In our experiments, we consider each core to be connected to a three-stage pipelined network switch [35]. The switches are connected with other switches according to the topology. The WI is modelled as a port connected to the network switches where the WIs are deployed.
We consider each input and output port of a switch to have 4 VCs with a buffer depth of 2 flits [35] for all the architectures considered in this paper. The ports associated with the WIs have an increased buffer depth of 64 flits to accommodate whole packets of maximum size. We consider a representative maximum packet size of 64 flits with a flit size of 32 bits in our experiments unless mentioned otherwise. All the digital components are driven by a 2.5 GHz clock and a 1 V power supply, which are the nominal frequency and voltage in the 65 nm technology node. In the Mesh based NoCs all wired links are considered to be single-cycle links, whereas the long intra-chip wireline links in the Small-World architectures are pipelined by insertion of FIFO buffers.
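A few derived quantities follow directly from the parameters stated above and can be checked with a short calculation. The sketch assumes that only serialization time matters, that a flit occupies a whole number of clock cycles, and that link power is taken at full utilization. These are simplifying assumptions for illustration and the function names are hypothetical.

```python
FLIT_BITS = 32      # flit size stated in the text
CLOCK_MHZ = 2500    # 2.5 GHz clock stated in the text

def flit_cycles(rate_mbps):
    """Clock cycles needed to serialize one flit, rounded up."""
    return -(-FLIT_BITS * CLOCK_MHZ // rate_mbps)  # integer ceiling division

def flit_energy_pj(energy_pj_per_bit):
    """Energy to move one flit across the link, in pJ."""
    return FLIT_BITS * energy_pj_per_bit

def link_power_mw(energy_pj_per_bit, rate_gbps):
    """Power at full utilization: pJ/bit times Gbit/s gives mW."""
    return energy_pj_per_bit * rate_gbps

# Wireless link: 16 Gbps at 2.31 pJ/bit. Wired I/O: 15 Gbps at 5 pJ/bit.
wireless = (flit_cycles(16000), flit_energy_pj(2.31), link_power_mw(2.31, 16))
wired_io = (flit_cycles(15000), flit_energy_pj(5.0), link_power_mw(5.0, 15))
```

Under these assumptions a flit takes 5 cycles and 73.92 pJ on the wireless link against 6 cycles and 160 pJ on the wired I/O, and the fully utilized links draw 36.96 mW and 75 mW respectively.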
Wireless Channel Characteristics and Wireless Link Budget Analysis
In this section we present a link budget analysis to determine the transmitted power that is required to achieve an acceptable BER on the intra and inter-chip wireless links. Fig. 5 shows the 4-chip system that we have used to design the on-chip antennas required for inter and intra-chip communication. As seen in Fig. 5a, the individual chips are 20 × 20 mm and are separated from each other by 10 mm. Fig. 5b shows the side view of the multichip system with the various layers and materials considered in our evaluation model. We have considered the chips to be housed on a substrate of FR4 epoxy, which is the typical material for Printed Circuit Boards (PCBs) [36]. The individual chips are considered to be packaged in a dielectric material called RXP4 [37], which allows electromagnetic wave propagation, enabling the inter-chip wireless communication. The antennas are considered to be embedded in a 2 μm layer of silicon dioxide (silica) over a 633 μm thick silicon substrate of the chips. The transmitted power, $P_t$ in dBm, on the wireless channels is given by the following equation

$$P_t = SNR + PL + N_f \qquad (3)$$
where $SNR$ is the signal to noise ratio at the receiver in dB, $PL$ is the path loss in dB and $N_f$ is the receiver noise floor in dBm. An SNR of 15 dB results in a BER of less than $10^{-15}$ for the OOK modulation scheme adopted here. A BER of $10^{-15}$ is comparable to wireline data transfer in current technologies. Hence, we consider a required SNR of 15 dB in our link budget analysis. Fig. 6a, 6b, and 6c show the radiation pattern, return loss, and worst-case path loss, respectively, of the designed mm-wave antennas in a 4-chip system, which we use for system-level analysis in this work. The characteristics of the antennas are simulated using HFSS [38]. The radiation pattern shows that the antenna radiates substantially in all directions, as required for the proposed communication system. The return loss shows that the antennas, designed with the dimensions mentioned in Section 3.2, are tuned to resonate at 60 GHz. The worst-case path loss, $PL$, between the two antennas which are farthest apart in the 4-chip system as shown in Fig. 5a is 34.9 dB. The noise floor of the receiver is -59 dBm [39]. Consequently, the output power of the transmitter is -19.43 dBm in the worst case. The power consumption of the transceivers capable of generating this transmitted power, as shown in [5], is considered in the following sections for system-level performance evaluation. We have observed that the return loss of the antennas is 0 dB at low frequencies between 0 and 10 GHz. This eliminates the possibility of interference with digital signals in the ICs due to their non-overlapping operational bands.
Comparative Performance Evaluation
We evaluate the wireless multichip systems in terms of peak achievable bandwidth per core, average packet latency and average packet energy consumption and compare them with several wireline I/O based multichip systems.
Architectures for Comparison
We consider six multichip systems with different interchip connection configurations for a comparative performance evaluation. These configurations are shown in Table 1. In these configurations, each multicore chip is considered to have 64 cores, where each core is connected to a NoC switch. We consider two different intra-chip topologies, Mesh and Small-World, to evaluate the effect of the intra-chip network on the performance of the multichip system.
Among the six configurations, four use I/O based wireline chip-to-chip interconnection and two use wireless interconnection for chip-to-chip communication. Due to the large pitch of substrate-to-board pins [7], the number of pins dedicated to I/O operations is limited. More importantly, crosstalk between parallel interchip interconnects, which can be several tens of millimeters long, severely limits signal integrity. As shown in [8], signal integrity can be maintained in high-speed I/O based interchip communication only in the total absence of crosstalk. Therefore, to completely eliminate crosstalk, only a single interchip interconnect line is considered to exist between a pair of chips. To achieve this in the Bus I/O based wireline configurations, only one switch along one edge of each chip (excluding the corners) is connected to the I/O module. One of the middle switches is chosen, as it is connected to three neighbors in the Mesh based intra-chip NoC as shown in Fig. 7a. For the Small-World based configurations, the same switches are chosen for the I/O modules to implement the same interchip architecture.
To investigate the effect of increased bandwidth of the interchip wired links, we consider the Network I/O configuration, where we equip multiple switches in each chip with I/O modules. However, between any particular pair of chips there is only a single interchip link, which eliminates signal crosstalk. The chips are in turn connected in a mesh configuration among themselves via switches along the edges, using the I/O based interchip interconnects as shown in Fig. 7b.
In the Bus I/O based wireline configuration, interchip communication happens through a shared bus. For bus access we have adopted an independent guaranteed-bandwidth arbitration appropriate for high-speed I/O busses, which combines a distributed TDMA approach with round-robin access [40]. A simple slotted TDMA scheme is not realistic in a multichip system because it is impossible to achieve precise synchronization between multiple chips in current and future technologies. Therefore, an asynchronous and distributed access mechanism is necessary. However, the traditional request/grant based asynchronous centralized arbitration common in on-chip SoC busses is impractical, as it needs control lines to the arbiter in addition to the data lines. In high-speed I/O, as discussed before, implementing additional control (request/grant) lines would need additional I/O ports and pins and would exacerbate the crosstalk noise, causing severe signal integrity issues. Therefore, we enabled the distributed TDMA with a control flit, broadcast to all the chips on the bus, that passes the access to the next chip at the end of the transmission from the current chip. In order to avoid bandwidth starvation, each chip can access the bus for a maximum duration as given by (2), similar to the WIs. Also, the VC configurations of the switches attached to the bus are the same as in the WIs.
Unlike the Bus I/O based configuration, wormhole switching is adopted for the interchip communication in the Network I/O configuration. On the other hand, for both the wireless configurations, we have considered 4 WIs per chip located at the center of the subnets within the chips, as discussed in the design methodology earlier.
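The bus access scheme described above can be illustrated with a small sketch. This is only a simplified model under stated assumptions, not the simulator used in this work: the function name `round_robin_bus`, the per-chip backlog list, and the `max_hold` parameter (a stand-in for the maximum access duration bounded by (2), expressed here in flits) are all hypothetical.

```python
def round_robin_bus(queues, max_hold):
    """Simplified model of distributed round-robin bus access.

    queues:   flits waiting at each chip (hypothetical input).
    max_hold: most flits a chip may send per turn; a stand-in for the
              maximum access duration bounded by (2) in the text.
    Returns the list of (chip, flits_sent) transmissions in order.
    """
    pending = list(queues)
    schedule = []
    chip = 0
    n = len(pending)
    while any(pending):
        # The chip holding access sends up to max_hold flits.
        send = min(pending[chip], max_hold)
        if send:
            pending[chip] -= send
            schedule.append((chip, send))
        # Models the broadcast control flit that passes access
        # to the next chip on the bus, without a central arbiter.
        chip = (chip + 1) % n
    return schedule


if __name__ == "__main__":
    for chip, sent in round_robin_bus([5, 0, 3], max_hold=2):
        print(f"chip {chip} sends {sent} flit(s)")
```

Capping each turn at `max_hold` is what prevents a heavily loaded chip from starving the others, while the hand-off needs no request/grant lines.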
Achievable Performance
In this section we evaluate the performance of the multichip systems with wireless interconnections. First, we evaluate the peak achievable bandwidth per core at network saturation using uniform random traffic, as shown in Fig. 8. The peak achievable bandwidth per core is measured as the maximum sustainable data rate, in number of bits successfully routed per core per second, at network saturation. It can be observed that the systems with wireless interconnections have higher bandwidth than all the wireline I/O interconnections for all system sizes. This is because the wireless nodes connect switches inside the chips directly over single-hop links for both intra-chip and interchip data transfer. Therefore, even in the single-chip case, where there is no interchip traffic, the configurations with wireless interconnects have higher bandwidth than the solely wireline intra-chip NoCs. This is in agreement with several mm-wave wireless intra-chip NoC papers [6], [20]. On the other hand, for all the wireline I/O based multichip systems, the data packets need to travel from internal cores to the peripheral I/O module, get routed over the interchip link, and then travel to internal nodes at the destination chip. Among the wireline configurations, the Bus based multichip systems have the lowest performance and are not scalable due to the non-scalable bus-based interconnection. The Network I/O based wireline configuration has higher performance than the Bus. This is because the Network configuration allows concurrent communication between the neighboring chips and has multiple switches equipped with I/O modules. It can be observed that the wireless multichip system is able to sustain a bandwidth per core higher than 10 Gbps even for a 4-chip system. The degradation also seems to be asymptotic at 10 Gbps. However, with the conventional I/O the bandwidth is more than 10X lower.
This demonstrates the significantly higher bandwidth provided by the direct chip-to-chip wireless links. Fig. 9 shows the average packet latency for the various multichip systems with 4 chips under uniform random traffic. Due to the different average distances between cores in the different multichip interconnection architectures, the latency characteristics are different. This is demonstrated by the average latencies at low injection loads. It can be observed that the wireless multichip system has the lowest latency compared to the systems with interchip wireline interconnections. This is because of the shorter average path lengths due to WIs located inside the chips providing single-hop links between cores located inside distant chips.
The difference in bandwidth between the multichip systems with Mesh based and Small-World based intra-chip NoC architectures diminishes as the number of chips increases, as can be seen from Fig. 8. This is further verified by Fig. 9, where the variation in average latency is negligible between the 4-chip systems with either Mesh or Small-World based intra-chip NoCs. This is because, as the number of chips increases, the impact of the local NoC in each individual chip on the overall system performance decreases. We expect this trend to continue and hence the importance of the intra-chip NoC architecture to diminish relative to the interchip interconnection as the system size scales up.
Average Packet Energy
In this section, we compare the overall average packet energy dissipation of different multichip systems interconnected with I/O based wireline interconnects and wireless interconnects. Average packet energy is the energy consumed, on average, to transfer an entire packet from source to destination in the multichip system. Fig. 10 shows the overall average packet energy dissipation of the different multichip systems investigated in this paper. The average packet energy dissipation for all system sizes, including the single-chip case, is lower for the wireless multichip systems than for all the I/O based multichip systems. In the single-chip case, the use of wireless links for intra-chip data routing lowers the packet energy consumption due to the single-hop energy-efficient wireless interconnects. This phenomenon is observed in various works on wireless intra-chip NoCs [6], [20]. The difference in average packet energy between the wireline and wireless multichip systems becomes more evident with an increase in system size, as the packet energy dissipation for the I/O based multichip systems increases significantly with the number of chips. In contrast, the average packet energy in the wirelessly connected system does not increase as drastically. This is due to the direct energy-efficient wireless links between cores embedded in the multicore chips. As the number of chips increases, the share of interchip traffic also increases, from zero percent in the single-chip scenario to 75 percent in the 4-chip case, due to the spatially uniform traffic. Because of this, in the I/O based systems a large proportion of traffic travels to and from the I/O modules using multihop wired paths over the intra-chip NoCs. This multihop path is shortened by the use of the WIs deployed inside the chips. This is the main architectural factor behind the energy savings of the wireless multichip systems.
Among all the I/O based configurations, the Network I/O has the lowest average packet energy. This is because, in the networked interconnection based multichip systems, the I/O buffers are less congested than in the Bus configurations, so data packets move faster and occupy buffers and interconnection resources for shorter durations. As both the performance and the energy efficiency of the Network I/O based configuration are better than those of the Bus I/O based configurations discussed in this paper, we consider this configuration as the baseline I/O based configuration for comparison in the following sections.
On the other hand, the Small-World NoC based wireless interconnection has lower packet energy than the Mesh NoC based wireless interconnection. This is because the Small-World nature of the topology reduces the average hop-count of the network by establishing long-range single-hop direct links. This effect is also demonstrated in recent literature [23], [41]. Hence, in the following sections, we consider the Small-World+Token architecture to evaluate the performance of the wireless multichip system.
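The growth of interchip traffic with system size under uniform random traffic can be checked with a short calculation. The sketch below assumes that each core picks its destination uniformly among all other cores in the system (excluding itself); the function name and this exclusion rule are illustrative assumptions rather than details taken from the simulator.

```python
def interchip_fraction(num_chips, cores_per_chip=64):
    """Fraction of packets that cross a chip boundary under uniform
    random traffic, assuming the destination is drawn uniformly from
    all cores other than the source (an assumption of this sketch)."""
    total = num_chips * cores_per_chip
    off_chip = total - cores_per_chip  # cores outside the source chip
    return off_chip / (total - 1)


if __name__ == "__main__":
    for chips in (1, 2, 4):
        print(chips, "chip(s):", round(100 * interchip_fraction(chips), 1), "%")
```

With 64 cores per chip this gives zero for a single chip and approximately 75 percent for four chips, in line with the figures quoted in the text.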
Effect of Increase in Flit Width on Overall System Performance
In this section we analyze the effect of increasing the flit width for the Small-World+Token based wireless multichip system with uniform random traffic patterns and compare it with the Small-World+I/O (Network) architecture. A 2-chip system is considered in this section. For this experiment, we used four different flit sizes of 32, 64, 128, and 256 bits. This range is chosen because, as noted in [42], flit widths beyond 128 bits are shown to provide only marginal gains in the performance of a NoC based system. In the case of wireline intra-chip interconnections, widening the physical channel to accommodate a larger flit width increases the data rate on the wireline links. In the case of the conventional I/O based interchip interconnect, the increase in flit width translates into increasing the bandwidth of the interconnection by using multiple channels per link. However, signal deterioration due to crosstalk coupling effects, microwave effects, and frequency-dependent losses in the transmission lines limits the number of parallel lines in the I/O modules. Here, we only analyze the system-level performance metrics, namely bandwidth per core and average packet energy, in Fig. 11.
On the other hand, the data rate of the wireless links is governed by the speed of the transceiver and the bandwidth of the antennas, which do not change with flit size. Hence, while the wireline communication becomes faster with an increase in flit size, the wireless communication speed remains constant. This results in a reduction in the relative gains of the wireless multichip communication architecture with respect to the conventional I/O based system, as shown in Fig. 11. However, even with a flit width of 256 bits (8 parallel I/O channels per link) we see a relative improvement by a factor of 4.6 in data bandwidth and 3.1 in average packet energy for a 2-chip system. In addition, we note that the reduction in relative gains for both bandwidth and average packet energy displays an asymptotic behavior. This means that although the gains of using wireless interconnections decrease with an increase in flit size, the gains stabilize beyond a point, as the performance of the wireline interconnection does not continue to improve with flit size beyond 128 or 256 bits.
Deployment of the Wireless Interconnection with Scaling of System Size
In this section, we discuss the deployment methodology of the wireless interconnection for multichip systems as the system scales up. In our earlier experiments, we keep the number of WIs per chip constant and scale up the system, i.e., increase the number of chips per system. However, in this approach the total number of WIs keeps increasing, which will negatively affect the performance beyond a point, as it takes an increasingly long time for each individual WI to possess the token and gain access to the medium. Hence, we have considered an alternative approach to deploying the WIs when the system scales up. In this second approach, we keep the total number of WIs per system constant and distribute the WIs among the chips. For the first approach, we consider 4 WIs per chip and increase the system size. In the second approach, distributing 4 WIs among the chips results in 1 WI per chip for a 4-chip system, which significantly degrades the performance. Hence, to study the deployment methodology more comprehensively, we consider another configuration with 16 WIs for the whole system and evaluate its performance. The bandwidth per core and average packet energy for 1-, 2-, and 4-chip systems for the two wireless interconnection deployment methodologies are shown in Fig. 12. For this study, the Small-World+Token architecture is considered in all cases. The peak bandwidth per core is higher for the system with a constant number of WIs per chip than for the alternative approach. This is because, in the first approach, the number of WIs increases with the system size. Although increasing the number of WIs increases the token return period, it also helps to distribute the interchip traffic among the WIs. On the other hand, for the second approach with 4 WIs for the whole system, the volume of interchip communication increases with the system size whereas the number of WIs per chip decreases.
This increases congestion at the wireless interfaces and adversely affects the bandwidth. This also causes a relative increase in the packet energy. To study the impact of the second methodology with a higher number of WIs, we deploy 16 WIs in the whole system. However, even in this case, the peak bandwidth per core is lower than that of the system with a constant number of WIs per chip. In the single-chip case, the performance with 16 WIs is lower than that with 4 WIs, as each WI has to wait much longer to access the wireless channel. The constant-per-chip approach and the 16-WI approach are equal in peak bandwidth per core and average packet energy in the 4-chip case because the two systems are identical (4 WIs per chip in both). Hence, for the system sizes considered in this experiment, having a constant number of WIs per chip is the better deployment approach for the wireless multichip system.
To further investigate the effect of this deployment policy on the scaling of system size and dimensions, we evaluate a multichip system with 9 chips. Each chip is considered to be 20 × 20 mm, and a space of 10 mm is assumed between the edges of the chips as well as between the chips and the edge of the substrate board. Thus the overall dimensions of the board are 10 × 10 cm. Fig. 13 shows the relative gains in bandwidth and average packet energy of the Small-World based wireless multichip system with respect to the Small-World based wireline (Network) multichip system for various system sizes. With an increase in the number of chips and the consequent increase in the number of WIs, the token based wireless interconnection suffers a degradation in performance. However, it can be seen that the relative gains do not decrease significantly with an increase in size, because the performance of the wireline multichip systems also decreases with an increase in size.
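The WI counts implied by the two deployment approaches can be tabulated with a short sketch. It assumes the fixed system-wide WI budget is split evenly among the chips; the function and key names are hypothetical and chosen only for illustration.

```python
def deployment_table(chip_counts=(1, 2, 4), per_chip=4, totals=(4, 16)):
    """For each system size, list (WIs per chip, total WIs) under
    (a) a constant number of WIs per chip and
    (b) a constant system-wide total, split evenly among the chips
        (the even split is an assumption of this sketch)."""
    rows = {}
    for n in chip_counts:
        row = {"constant_per_chip": (per_chip, per_chip * n)}
        for total in totals:
            row[f"constant_total_{total}"] = (total // n, total)
        rows[n] = row
    return rows


if __name__ == "__main__":
    for chips, row in deployment_table().items():
        print(chips, "chip(s):", row)
```

For the 4-chip system, the constant-per-chip approach and the 16-WI budget both yield 4 WIs per chip and 16 WIs in total, which is why their results coincide, whereas the 4-WI budget leaves only 1 WI per chip.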
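The board dimensions quoted above follow from simple arithmetic if the 9 chips are arranged in a 3 × 3 grid, which is an assumption of the following sketch (the function name is likewise illustrative).

```python
import math


def board_side_mm(num_chips, chip_side_mm=20, gap_mm=10):
    """Side length of a square board holding num_chips chips in a
    square grid (assumed layout), with gap_mm between adjacent chips
    and between the outer chips and the board edge."""
    per_side = math.isqrt(num_chips)
    if per_side * per_side != num_chips:
        raise ValueError("sketch only handles square grids of chips")
    # per_side chips plus (per_side + 1) gaps along each dimension
    return per_side * chip_side_mm + (per_side + 1) * gap_mm


if __name__ == "__main__":
    print(board_side_mm(9), "mm per side")
```

For 9 chips this gives 3 × 20 mm + 4 × 10 mm = 100 mm per side, i.e., a 10 × 10 cm board.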
Performance Evaluation with Non-Uniform Traffic
In this section we analyze the bandwidth and average packet energy of the Small-World+Token based multichip system with non-uniform traffic patterns and compare it with the Small-World+I/O (Network) system. First, we use hotspot and transpose synthetic traffic patterns to evaluate these multichip systems. In the hotspot pattern, 5 percent of all traffic generated from all cores has the same destination, which is the hotspot core. A single core was chosen randomly from the system as the hotspot. All other packets are destined to other cores following a uniform random distribution. This type of traffic pattern is fairly common in directory-based cache-coherent shared-memory multiprocessor systems, where communication among the on-chip cores and the memory subsystem is more frequent [43]. To generate the transpose traffic pattern, each core generates packets destined only to the core that is diametrically opposite to it in the whole system. For example, the ith core will only send data packets to the (N-i+1)th core, where N is the total number of cores in the entire multichip system. The bandwidth per core and average packet energy at network saturation for the Small-World NoC based multichip system for the one- and two-chip cases are shown in Figs. 14a and 14b, respectively. As can be seen from the results, the wireless Small-World system outperforms the I/O based multichip system for all the non-uniform traffic patterns. In the 2-chip system with both hotspot and transpose traffic patterns, a significant portion of the traffic accesses the interchip communication medium. Hence, the distributed wireless interconnects improve the bandwidth and average packet energy in both cases compared to the wireline interchip communication. In the transpose traffic pattern, all data packets from all cores travel across the interchip communication medium. Hence, the relative gains of the wireless interchip interconnection are most evident with this traffic pattern.
Our observations with the uniform and non-uniform traffic patterns indicate a strong correlation between the overall performance of the multichip system and the proportion of interchip traffic. However, it is hard to estimate or predict the proportion of interchip versus intra-chip traffic in the set of applications suitable for modern and future multichip systems. Hence, we study the change in performance by varying the degree of localization in the traffic as a direct parameter. We define the localization parameter as the percentage of data packets from each core that have a destination randomly chosen from among the cores within the same chip. Fig. 15 shows the bandwidth per core and average packet energy as the localization parameter is varied from 25 percent to 100 percent for a 2-chip system, with each chip having 64 cores interconnected with the Small-World architecture. This captures the possible spectrum of traffic patterns while demonstrating how the performance depends on it. The performance of both the wired and wireless systems increases with an increase in localization, due to the reduced dependence on interchip interconnects. However, for low localization, i.e., increased interchip traffic, the role of the interchip interconnections becomes important, as one would expect, and the gains of the wireless chip-to-chip links increase compared to the wired I/O system.
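The two synthetic patterns described above can be sketched as destination generators. This is an illustrative sketch with 1-indexed cores, not the traffic generator used in the experiments: the function names are hypothetical, and the handling of the hotspot core's own packets (it never targets itself) is an assumption.

```python
import random


def transpose_destination(i, n_cores):
    """Transpose pattern from the text: core i sends to core N - i + 1
    (cores numbered 1..N)."""
    return n_cores - i + 1


def hotspot_destination(src, n_cores, hotspot, rng, hotspot_fraction=0.05):
    """Hotspot pattern: with probability hotspot_fraction the packet goes
    to the hotspot core; otherwise the destination is uniform random
    over the cores other than the source. The hotspot core itself
    always uses the uniform branch (an assumption of this sketch)."""
    if src != hotspot and rng.random() < hotspot_fraction:
        return hotspot
    dest = rng.randrange(1, n_cores + 1)
    while dest == src:
        dest = rng.randrange(1, n_cores + 1)
    return dest


if __name__ == "__main__":
    rng = random.Random(0)
    print(transpose_destination(1, 128))
    print([hotspot_destination(1, 128, 7, rng) for _ in range(5)])
```

With N = 128 cores split into two 64-core chips numbered consecutively, cores 1 to 64 map to cores 128 down to 65 under the transpose rule, so every packet crosses the chip boundary, consistent with the observation above.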
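The localization parameter can likewise be expressed as a destination generator. The sketch below uses 0-indexed cores and assumes that packets which are not localized go to a uniformly chosen core on another chip; that rule, along with all names, is a hypothetical choice made for illustration.

```python
import random


def localized_destination(src, n_chips, cores_per_chip, localization, rng):
    """Pick a destination for core src (0-indexed).

    localization: probability in [0, 1] that the destination lies in
    the same chip as src. Non-local packets go to a uniformly chosen
    core on another chip (an assumption of this sketch)."""
    chip = src // cores_per_chip
    if n_chips == 1 or rng.random() < localization:
        base = chip * cores_per_chip
        dest = base + rng.randrange(cores_per_chip)
        while dest == src:  # a core does not send to itself
            dest = base + rng.randrange(cores_per_chip)
        return dest
    other_chips = [c for c in range(n_chips) if c != chip]
    target = rng.choice(other_chips)
    return target * cores_per_chip + rng.randrange(cores_per_chip)


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [localized_destination(5, 2, 64, 0.25, rng) for _ in range(10000)]
    local = sum(d // 64 == 0 for d in sample) / len(sample)
    print("measured localization:", local)
```

Sweeping `localization` from 0.25 to 1.0 in such a generator corresponds to the 25 percent to 100 percent range explored in Fig. 15.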
Comparative Evaluation with Respect to Emerging Multichip Integration Technologies
In this section we compare the token based mm-wave wireless multichip interconnection system with various emerging alternatives. In the proposed token passing based wireless medium access mechanism, only a single transmitter can access the wireless channel at any given instant of time, while multiple transceivers are deployed over the entire system. This limits the potential performance benefits of the wireless architecture. Enabling concurrent communication channels without any interference can ensure better utilization of the available bandwidth. This can be achieved either by designing a MAC protocol such as a Direct Sequence Spread Spectrum (DSSS) based Code Division Multiple Access (CDMA) channel access mechanism [10], [21] or by Frequency Division Multiple Access using novel antenna technology such as Carbon Nanotube (CNT) based nano-antennas operating in multiple THz frequency bands [18]. To study the potential performance improvement with these advanced techniques, we evaluate the same interconnection framework with Small-World (SW) intra-chip NoCs, replacing only the token-based wireless transceivers with CDMA based (SW+CDMA) and CNT antenna based ones (SW+CNT). Off-chip photonic interconnects have emerged as another enabling technology for chip-to-chip communication [2]. Hence, we compare the proposed wireless interconnection architecture with a photonic multichip system as well. In the photonic multichip system, the interchip communication happens through high-bandwidth photonic interfaces, with intra-chip SW NoCs within each chip (SW+Photonic). To connect these interface switches through a single waveguide, we consider these switches to be located at one edge of the chip. For our experiment, we consider four photonic interfaces per chip and one waveguide with 16-way Wavelength Division Multiplexing (WDM) channels.
The energy per bit for a single point-to-point link and the possible aggregate physical bandwidth provided by each of these technologies are summarized in Table 2. Fig. 16 shows the peak bandwidth per core and overall system average packet energy for 4-chip systems with these different interconnect technologies. It can be seen that the SW+Token system has the lowest bandwidth per core and the highest average packet energy among all the configurations considered here. This is because only a single transmitter can access the wireless channel at any given instant of time. Designing more complex MAC schemes like CDMA or using a novel antenna technology can improve this bandwidth to an extent, due to concurrent communication among the WIs. Due to its more efficient physical layer, the SW+CNT based system provides higher bandwidth and consumes lower average packet energy than the SW+CDMA system.
The SW+Photonic system outperforms both the token and CDMA based wireless multichip systems due to the presence of high-bandwidth concurrent links. However, the performance of the photonic multichip system is lower than that of the CNT based wireless system. This is because, in the wireless system, data packets can be routed from internal switches using the wireless links, whereas in the photonic system the data packets have to reach the photonic interfaces at the periphery of the chip. Moreover, in the CNT based wireless multichip system, intra-chip communication is also possible using the wireless channels without requiring any additional overheads. The aggregate physical bandwidth of the CNT based wireless and the photonic interconnection frameworks can be increased by deploying more CNT based antennas and by using denser WDM, respectively, resulting in further improvements in both systems.
While improvement in performance and energy efficiency is possible by implementing a complex MAC or utilizing novel CNT or photonic technology, there are several other challenges. To maintain the orthogonality among different code channels in a CDMA based MAC, transmitters are required to be precisely synchronized to eliminate interchannel interference. This is difficult to achieve in a multichip environment, as the WIs are distributed across different chips. Similarly, the integration of CNT antennas, or of on-chip silicon photonic devices for both intra-chip and interchip photonic communication, in standard CMOS fabrication processes needs to overcome significant challenges [18]. On the other hand, the token-based wireless system utilizing metal zigzag antennas is CMOS compatible and outperforms conventional wireline interchip communication systems. Hence, it is a nearer-term alternative as the communication backbone for multichip systems, providing significant gains in performance.
Area Overheads
In this section, we estimate the comparative area overheads of the various architectures studied in this paper. The number of wired intra-chip links in all configurations is the same as that of a conventional Mesh NoC, i.e., the number of intra-chip links in the Small-World architecture is constrained to be the same as that of the conventional Mesh. The only differences are the I/O modules, the wireless transceivers, and the area of the ports associated with them. Fig. 17 shows the total area overhead of the various interconnection architectures for the different multichip configurations considered in this paper for a 4-chip system. In the token-based architectures, each transceiver occupies an area of 0.3 mm^2 [5], whereas in the I/O based architectures each transceiver has an area of 0.088 mm^2 [34]. For the wireless multichip systems of the largest configuration, the total area of the interconnection network is 1.92 percent of the entire system, while the wireless overhead is only 0.46 percent, assuming each chip is 20 × 20 mm. The proportions of the various area overheads remain similar for other system sizes using the wireless interconnections, as the number of WIs per chip remains the same.
CONCLUSION AND FUTURE WORK
HPC environments and datacenters employ modules with multiple multicore chips in a package or on a board. The density and bandwidth of high-speed I/O for interchip interconnections are becoming the power-performance bottleneck for such multichip systems. In this paper we explore the advantages possible if interchip communication in multichip modules can be realized with state-of-the-art mm-wave wireless links operating in the 60 GHz band. The wireless links are capable of establishing direct communication channels between cores in different chips via on-chip embedded antennas. Moreover, the wireless links can also be used for seamless data transfer between cores in the same chip, augmenting the traditional NoC backbone for intra-chip communications. These factors result in significant gains in performance and energy efficiency in both intra-chip and interchip data communications. The energy efficiency of the wireless interconnects has been improved by careful design of the wireless data transfer protocol, which puts unused WIs to sleep using power-gated transceivers. In the future, it can be further enhanced by using variable levels of power amplification [44] depending upon the length of the wireless interconnects and the associated path losses.
Md Shahriar Shamim received the BSc degree in electrical and electronic engineering from Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2011. He is currently working toward the PhD degree in computing and information sciences at Rochester Institute of Technology, Rochester, NY. His research interests include designing scalable interconnection architectures with emerging technologies for multichip integration. He is a student member of the IEEE.
Naseef Mansoor received the BSc degree in computer science and engineering from Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2009. He is currently working toward the PhD degree in computing and information science at Rochester Institute of Technology, Rochester, NY. His research interests include wireless and photonic network-on-chip architectures and robust and dynamic medium access mechanisms for on-chip wireless interconnect. He is a student member of the IEEE.
Rounak Singh Narde received the BTech degree in electronics and telecommunications from National Institute of Technology, Raipur, India, in 2013 and the MS degree in electrical engineering from Rochester Institute of Technology, Rochester, NY, in 2016, where he is currently working toward the PhD degree in engineering. His research focuses on embedded on-chip antennas and characterizing wireless channel for intra-chip and inter-chip communications. He is a student member of the IEEE.
Vignesh Kothandapani received the BE degree in electronics and communication engineering from Sri Venkateswara College of Engineering, Chennai, Tamil Nadu, India, in 2013 and the MS degree in computer engineering from Rochester Institute of Technology, Rochester, NY, in 2016. His research interest is focused on wireless network-on-chip architectures for inter and intra-chip communication.
Jayanti Venkataraman is a professor in the Electrical and Microelectronic Engineering Department, Rochester Institute of Technology, Rochester, NY. She is the director of the Electromagnetics, Microwave and Antenna Laboratory, where she has developed the area of electromagnetics for the EE undergraduate and graduate programs. Her research interests include antennas and microwave circuits, composite right/left handed materials and bioelectromagnetics. She is a senior member of the IEEE.
Amlan Ganguly
