### TOWARDS A SCALABLE AND RELIABLE WIRELESS NETWORK-ON-CHIP

By

### AMLAN GANGULY

A dissertation submitted in partial fulfillment of the requirements for the degree of

### **DOCTOR OF PHILOSOPHY**

### WASHINGTON STATE UNIVERSITY

School of Electrical Engineering and Computer Science

December 2010

To the Faculty of Washington State University:

The members of the committee appointed to examine the dissertation of AMLAN GANGULY find it satisfactory and recommend that it be accepted.

Partha Pratim Pande, Ph.D., Chair

Benjamin Belzer, Ph.D.

Deuk Hyoun Heo, Ph.D.

#### ACKNOWLEDGEMENT

I would like to take this opportunity to express my gratitude to my advisor, Dr. Partha Pratim Pande, for having guided me through the curriculum so well. His active involvement in my research and incessant inspiration have made this work possible. I also thank him for having allowed me freedom of thought and choice of research direction.

Special thanks go to Dr. Benjamin Belzer for having helped me with his expertise in communication and coding theory providing a strong buttress to my work. I thank Dr. Alireza Nojeh for his inputs regarding CNT antennas. Dr. Christof Teuscher deserves a special note of thanks for his help in the domain of complex networks. I would also like to thank Dr. Deuk Heo for helpful insights about analog and RF components essential to my work. My work was partially supported by the US National Science Foundation (NSF) CAREER grant CCF-0845504 and NSF grant CCF-0635390.

I would also like to thank my colleagues Mr. Kevin Chang and Mr. Sujay Deb for their frequent help and brainstorming which always helped me to strengthen the foundations of my conceptual understanding of the problems.

My parents, Mr. Ashutosh Ganguly and Mrs. Uma Ganguly have always been extremely inspiring. Through their experience and caring they have made it possible for me to pursue research at a school of higher learning. Without their support none of this work would have been possible.

Last but most importantly I thank my fiancée Miss Rini Mukhopadhyay for her patience and understanding in patiently awaiting attention from a graduate student. Her unflinching faith in me and curiosity about my work and publications made my research experience even more rewarding.

# TOWARDS A SCALABLE AND RELIABLE WIRELESS NETWORK-ON-CHIP

Abstract

by Amlan Ganguly, Ph.D. Washington State University December 2010

Chair: Partha Pratim Pande

Multi-core platforms are an emerging trend in the design of Systems-on-Chip (SoCs). Interconnect fabrics for these multi-core SoCs play a crucial role in achieving the target performance. The Network-on-Chip (NoC) paradigm has been proposed as a promising solution for designing the interconnect fabric of multi-core SoCs. But the performance requirements of NoC infrastructures in future technology nodes cannot be met by relying only on material innovation with traditional scaling. The continuing demand for low power and high speed interconnects with technology scaling necessitates looking beyond the conventional planar metal/dielectric-based interconnect infrastructures. Among different possible alternatives, the on-chip wireless communication network is envisioned as a revolutionary methodology, capable of bringing significant performance gains for multi-core SoCs. Wireless NoCs (WiNoCs) can be designed by using miniaturized on-chip antennas as an enabling technology. In this work, design methodologies and technology requirements for scalable WiNoC architectures are presented and their performance is evaluated. It is demonstrated that WiNoCs outperform their wired counterparts in terms of network throughput and latency, and that energy dissipation improves by orders of magnitude under various experimental and real-life scenarios. A major challenge that NoC design is expected to face is related to the intrinsic unreliability of the interconnect infrastructure under technology limitations. The devices and components of the WiNoCs are expected to suffer high failure rates. By incorporating error control coding (ECC) schemes along the interconnects, NoC architectures will be able to provide correct functionality even in the presence of different sources of transient noise and yet have low energy dissipation. In this work, designs of novel joint crosstalk avoidance and multiple error correction/detection codes as well as burst error correction codes are proposed and their performance is evaluated in different WiNoC fabrics. It is demonstrated that by using the proposed codes WiNoCs can achieve the same reliability as a wireline NoC with much less energy dissipation and higher performance.

## TABLE OF CONTENTS

- Acknowledgement
- Abstract
- List of Tables
- List of Figures
- Chapter 1 Introduction
  - 1.1 System-on-Chip Design Issues
  - 1.2 The Network-on-Chip Paradigm
  - 1.3 Limitations of Conventional NoCs
  - 1.4 One Possible Solution
  - 1.5 Signal Integrity in WiNoCs
  - 1.6 Contributions
  - 1.7 Thesis Organization
  - 1.8 Reference
- Chapter 2 Related Work
  - 2.1 Background
  - 2.2 Reference
- Chapter 3 Wireless NoC Architecture
  - 3.1 Design Methodologies
    - 3.1.1 Topology
    - 3.1.2 Wireless link insertion and optimization
    - 3.1.3 On-Chip Antennas
    - 3.1.4 Routing and Communication Protocols
  - 3.2 Experimental Results
    - 3.2.1 Establishment of Wireless Links
    - 3.2.2 Performance Metrics
    - 3.2.3 Performance Evaluation
    - 3.2.4 Comparison with wired NoCs
    - 3.2.5 Comparative analysis with other emerging NoC paradigms
    - 3.2.6 Traffic dependent wireless link insertion
    - 3.2.7 Area Overheads
  - 3.3 Conclusions
  - 3.4 Reference
- Chapter 4 Signal Integrity of WiNoC
  - 4.1 Error Control Coding for the Wireline Links
    - 4.1.1 Error Detection Scheme
    - 4.1.2 Duplicate Add Parity and Modified Dual Rail Code
    - 4.1.3 Boundary Shift Code
    - 4.1.4 Crosstalk Avoidance Double Error Correction Code
      - CADEC Encoder
      - CADEC Decoder
    - 4.1.5 Joint Crosstalk Avoidance and Triple Error Correction Code
      - JTEC Encoder
      - JTEC Decoder
      - Optimization of the Code
    - 4.1.6 Joint Triple Error Correction and Simultaneous Quadruple Error Detection
      - JTEC-SQED Encoder
      - JTEC-SQED Decoder
    - 4.1.7 Performance evaluation of the ECC schemes in a wireline NoC
      - Voltage Swing Reduction Due to Increased Reliability
      - Residual Probability of Word Error
      - Energy Dissipation of ECC schemes in a wireline NoC
  - 4.2 Error Control Coding for the Wireless Links
    - 4.2.1 Wireless Channel Model
    - 4.2.2 Proposed Product Code for the Wireless Links
    - 4.2.3 Residual BER of the wireless channel with H-PC
  - 4.3 Experimental Results
  - 4.4 Conclusions
  - 4.5 Reference
- Chapter 5 Conclusions and Future Work
  - 5.1 Conclusions
  - 5.2 Future Directions
    - 5.2.1 Wireless NoCs with millimeter-wave Interconnects
    - 5.2.2 Extension of the ECC schemes
    - 5.2.3 Complex Network based WiNoC architectures
  - 5.3 Summary
  - 5.4 Reference
- Appendix A Publications
  - Book Chapters
  - Journals
  - Conferences

## List of Tables

| Table     | Caption                                                                                         |
|-----------|-------------------------------------------------------------------------------------------------|
| Table 3.1 | Average distance for optimized WiNoCs.                                                          |
| Table 3.2 | Delays on wired links in the WiNoCs.                                                            |
| Table 3.3 | Packet energy dissipation for flat wired mesh, WiNoC and hierarchical G-line NoC architectures. |
| Table 3.4 | Percentage of packet energy dissipation on long-range links.                                    |
| Table 3.5 | Total area overhead of wireless ports.                                                          |
| Table 4.1 | Coded flit structure for different coding schemes.                                              |
| Table 4.2 | Delay for Each Coding Scheme.                                                                   |
| Table 4.3 | Area Overhead of the Codec for Each Coding Scheme.                                              |

## List of Figures

| Figure      | Caption |
|-------------|---------|
| Figure 1.1  | A regular tile based Mesh NoC. |
| Figure 3.1  | (a) Mesh topology of subnet with a hub connected to all switches in the subnet. (b) Network topology of hubs connected by a small-world graph with both wired and wireless links. |
| Figure 3.2  | Flow diagram for the simulated annealing based optimization of WiNoC architectures. |
| Figure 3.3  | Adopted communication protocol for the wireless channel. |
| Figure 3.4  | The optimal wireless link arrangement for (a) 1, (b) 6, and (c) 24 wireless links among 16 hubs. Note that symmetrical solutions with the same performance are possible. |
| Figure 3.5  | (a) Number of iterations required to reach optimal solution by the SA and exhaustive search methods (b) Convergence with different temperatures |
| Figure 3.6  | The components of a WB with multiple wired and wireless ports at input and output. |
| Figure 3.7  | (a) Throughput and (b) latency of 256 core WiNoCs with different numbers of wireless links. |
| Figure 3.8  | Throughput of 256 core WiNoC for various hierarchical configurations. |
| Figure 3.9  | Saturation throughput with varying (a) number of subnets and (b) size of each subnet. |
| Figure 3.10 | Packet energy dissipation with varying (a) number of subnets and (b) size of each subnet. |
| Figure 3.11 | Components of packet energy dissipation for a (a) flat mesh and (b) WiNoC. Values of energy dissipation are labeled in nJ units. |
| Figure 3.12 | Energy dissipation per bit on G-line and wireless links with varying link lengths. |
| Figure 3.13 | Packet energy dissipation per bandwidth and achievable NoC bandwidth for various types of emerging NoC paradigms for system size of 128 cores. |
| Figure 3.14 | Throughput of 128 core WiNoC with various traffic patterns. |
| Figure 3.15 | Total silicon area in WiNoC hubs and switches. |
| Figure 3.16 | Wiring requirement of WiNoC. |
| Figure 4.1  | (a) Duplicate Add Parity (DAP) encoder (b) decoder. |
| Figure 4.2  | (a) BSC encoder, (b) decoder. |
| Figure 4.3  | (a) CADEC Encoder. (b) CADEC Decoder. |
| Figure 4.4  | CADEC decoding algorithm. |
| Figure 4.5  | JTEC encoder schematic. |
| Figure 4.6  | Flowcharts for the decoding schemes for (a) JTEC and (b) Optimized JTEC |
| Figure 4.7  | JTEC encoder schematic. |
| Figure 4.8  | H-matrix for (a) (39, 32) Hamming SEC-DED code and (b) (39, 32) Hsiao SEC-DED code |
| Figure 4.9  | JTEC encoded bits. |
| Figure 4.10 | Plot of voltage swing reduction as a function of word error rates |
| Figure 4.11 | Voltage Swing Reduction as a function of error correction capability. |
| Figure 4.12 | Pipelined data path through a NoC switch including codecs. |
| Figure 4.13 | Energy Dissipation Profile for the Mesh based NoC. |
| Figure 4.14 | Variation of Average Message Latency with injection Load for a wireline Mesh NoC. |
| Figure 4.15 | Multi-path channel model for on-chip wireless links |
| Figure 4.16 | SNR over the die area due to multipath radiation from a transmitter placed in the first subnet at its centre (X=2.5mm, Y=2.5mm). |
| Figure 4.17 | SNR vs. BER plot of the wireless channel with and without Coding. |
| Figure 4.18 | Schematic structure of the proposed H-PC encoder |
| Figure 4.19 | Different correctable error patterns. |
| Figure 4.20 | Packet energy dissipation and worst case channel BER for WiNoC and mesh architectures with and without ECC. |
| Figure 4.21 | Latency characteristics of mesh and WiNoC architectures with and without ECC. |

#### Dedication

This dissertation is dedicated to my parents and Rini

for whom this was possible

# Chapter 1 INTRODUCTION

#### 1.1 System-on-Chip Design Issues

State-of-the-art commercial System-on-Chip (SoC) designs integrate a large number of intellectual property (IP) blocks, commonly known as cores, on a single die [1] [2]. This number, which is currently between ten and a hundred depending on the application, is likely to go up in the near future. An important feature of such Multi-Processor SoCs (MP-SoCs) is the interconnect fabric, which must allow seamless integration of numerous cores performing various functions at different clock frequencies. The growing complexity of integration, together with aggressive technology scaling, introduces multiple challenges for the design of such large multi-core SoCs.

One of the major problems associated with future SoC designs arises from non-scalable global wire delays [3]. Global wires carry signals across a chip, but these wires typically do not scale in length with technology scaling. Though gate delays scale down with technology, global wire delays typically increase exponentially or, at best, linearly when repeaters are inserted. Even after repeater insertion, the delay may exceed one clock cycle or even multiple clock cycles. In ultra-deep submicron processes, eighty percent or more of the delay of critical paths is due to interconnects. As supply voltages continue to scale down and global wires become thinner, the delay in transmitting signals over these wires will seriously affect system performance. Long wires with lengths on the order of the die dimensions can have delays of multiple clock cycles. This large delay and the inherent complexity of integrating the IP cores necessitated new research into means of seamlessly integrating multi-core SoCs.

#### 1.2 The Network-on-Chip Paradigm

The Network-on-Chip (NoC) paradigm has emerged as an enabling solution to this integration problem and has captured the attention of academia and industry [4]. The common characteristic of NoC architectures is that the processor/storage cores communicate with each other through switches and links, as shown in Figure 1.1. Communication between constituent cores in a NoC takes place through packet switching. Generally, wormhole switching is adopted for NoCs, which breaks a packet down into fixed-length flow control units, or *flits*. The first flit, or the *header*, contains routing information that helps to establish a path from the source to the destination, which is subsequently followed by all the other *payload* flits. By design, the inter-switch wires are kept short enough to enable communication in less than one clock cycle, which allows a pipelined communication infrastructure.



Figure 1.1. A regular tile based Mesh NoC
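The flit decomposition used in wormhole switching can be illustrated with a minimal sketch. The `Flit` class, the `packetize` function, the 4-byte flit size and the zero padding of the last flit are all assumptions made for illustration; they are not taken from the NoC implementation evaluated in this work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    kind: str          # "header" or "payload"
    data: bytes        # payload bytes (empty for the header flit)
    src: int = -1      # source node id, only meaningful in the header
    dst: int = -1      # destination node id, only meaningful in the header


def packetize(payload: bytes, src: int, dst: int, flit_bytes: int = 4) -> List[Flit]:
    """Split a packet into one header flit followed by fixed-length payload flits."""
    # The header carries the routing information that sets up the path.
    flits = [Flit("header", b"", src, dst)]
    for i in range(0, len(payload), flit_bytes):
        # Pad the final chunk so every payload flit has the same length.
        chunk = payload[i:i + flit_bytes].ljust(flit_bytes, b"\x00")
        flits.append(Flit("payload", chunk))
    return flits


flits = packetize(b"0123456789", src=0, dst=15)
print([f.kind for f in flits])
```

Only the header flit carries source and destination identifiers in this sketch; the payload flits simply follow the path that the header establishes.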

#### 1.3 Limitations of Conventional NoCs

Despite their several advantages, an important performance limitation in traditional NoCs arises from planar metal interconnect-based multi-hop communications, wherein data transfer between two distant blocks causes high latency and power consumption. To alleviate this problem, insertion of long-range links in a standard mesh NoC using conventional metal wires has been proposed [5]. Another effort to improve the performance of multi-hop NoCs was undertaken by introducing ultra-low-latency and low-power express channels between communicating nodes [6, 7]. But these express channels are also basically metal wires, though they are significantly more power and delay efficient than their more conventional counterparts. According to the International Technology Roadmap for Semiconductors (ITRS) [8], in the longer term improvements in metal wire characteristics will no longer satisfy performance requirements, and new interconnect paradigms are needed. Different approaches have already been explored, such as 3D and photonic NoCs and NoC architectures with multiband RF interconnect [9-11]. Though all these emerging methodologies are capable of improving the power and latency characteristics of the traditional NoC, they need further and more extensive investigation to determine their suitability for replacing and/or augmenting existing metal/dielectric-based planar multi-hop NoC architectures. Consequently, it is important to explore further alternative strategies to address the limitations of planar metal interconnect-based multi-hop NoCs.

#### 1.4 One Possible Solution

In this work, we propose a novel approach that simultaneously addresses the latency, power consumption and interconnect routing problems: replacing multi-hop wired paths in a NoC with high-bandwidth, single-hop, long-range wireless links.

Over the last few years there have been considerable efforts in the design and fabrication of miniature antennas operating in the range of tens of gigahertz to hundreds of terahertz, opening up the possibility of designing on-chip wireless links [12-14]. It is also predicted that the intra-chip communication bandwidth achievable with conventional CMOS-based RF technology is not going to be sufficient [15]. Hence, the need to explore alternative technologies arises. Recent research has uncovered excellent emission and absorption characteristics leading to dipole-like radiation behavior in carbon nanotubes (CNTs), making them promising for use as antennas for on-chip wireless communication [14]. In this work, the design principles of Wireless Network-on-Chip (WiNoC) architectures using CNT antennas are presented. Modern complex network theory [16] provides us with a powerful method to analyze network topologies and their properties. Between a regular, locally interconnected mesh network and a completely random Erdős-Rényi topology, there are other classes of graphs [16], such as small-world and scale-free graphs. Networks with the small-world property have a very short average path length, which is commonly measured as the number of hops between any pair of nodes. Also, such networks have a high clustering parameter, which is an index of the connectivity of the topology. The average shortest path length of small-world graphs is bounded by a polynomial in log(N), where N is the number of nodes, which makes them particularly interesting for efficient communication with minimal resources [17, 18]. Most complex networks, such as social networks, the Internet, as well as certain parts of the brain, exhibit the small-world property. It has been shown that such "shortcuts" in NoCs can significantly improve the performance compared to locally interconnected mesh-like networks [5, 18] with fewer resources than a fully connected system. This feature of small-world graphs makes them particularly attractive for constructing scalable WiNoCs, because miniature transceivers make it possible to establish long-range, low-power wireless links across the chip, creating the shortcuts that enable small-world based topologies.
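The effect of shortcuts on average path length can be illustrated with a small sketch that compares the average hop count of a plain mesh with the same mesh augmented by a few long-range links. The 4x4 size, the two corner-to-corner shortcuts and the function names are hypothetical choices made for illustration; they are not the optimized link placements derived later in this work.

```python
from collections import deque
from itertools import product


def mesh_edges(n):
    """Undirected links of an n x n mesh; node id = row * n + col."""
    edges = set()
    for r, c in product(range(n), repeat=2):
        if c + 1 < n:
            edges.add((r * n + c, r * n + c + 1))      # link to right neighbor
        if r + 1 < n:
            edges.add((r * n + c, (r + 1) * n + c))    # link to lower neighbor
    return edges


def average_hops(num_nodes, edges):
    """Average shortest-path hop count over all ordered pairs of distinct nodes."""
    adj = {v: [] for v in range(num_nodes)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total = 0
    for s in range(num_nodes):
        # Breadth-first search gives hop counts from s to every other node.
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (num_nodes * (num_nodes - 1))


n = 4
mesh = mesh_edges(n)
shortcuts = {(0, 15), (3, 12)}   # hypothetical corner-to-corner long-range links
base = average_hops(n * n, mesh)
augmented = average_hops(n * n, mesh | shortcuts)
print(f"mesh: {base:.3f} hops, mesh + shortcuts: {augmented:.3f} hops")
```

Adding links can never lengthen a shortest path, so the augmented network always has an average hop count no larger than the mesh; how much it improves depends on where the shortcuts are placed.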

The performance benefits of these WiNoCs due to the utilization of high-speed wireless links in a small-world based topology are evaluated through cycle-accurate simulations. On-chip wireless links enable one-hop data transfers between distant nodes and hence reduce the hop counts in inter-core communication. In addition to reducing interconnect delay, eliminating multi-hop long-distance wired communication reduces energy dissipation as well. In the future, the number of cores in a SoC is expected to increase manifold. Consequently, it is imperative to have a communication infrastructure that scales without significantly affecting system performance. This work proposes a scalable WiNoC architecture and evaluates its performance with respect to conventional wired NoCs. It is demonstrated that by utilizing the wireless medium efficiently, it is possible to minimize the effects of scaling up the system size on the performance of the WiNoCs. It is possible to create various configurations for the WiNoC depending on the number of available wireless channels and their placement in the network. The various WiNoC architectures considered in this work are shown to dissipate significantly less energy and to achieve notable improvements in throughput and latency compared to traditional wired NoCs.

#### 1.5 Signal Integrity in WiNoCs

The ITRS [8] has predicted signal integrity to be a major challenge in current and future technology generations. Transient errors are becoming increasingly important due to increases in crosstalk, ground bounce and timing violations. These transient events are becoming more probable for several reasons. With increased device density, the layout dimensions are shrinking and hence the charge used for storing the information bits in memory as well as logic is reducing in magnitude [19]. Shrinking storage charges also make the chips vulnerable to events like alpha particle hits. Increasing gate counts force designers to lower the supply voltages to keep power dissipation reasonable, which reduces noise margins. Highly packed wires increase coupling between adjacent wires, and opposing transitions induce crosstalk-generated faults on these lines. Faster switching rates cause ground bounce and timing violations, which manifest as transient errors. There are several ways to address signal integrity issues in an on-chip environment, such as minimization of radiation exposure, careful layout, use of new materials and error control coding schemes. Moreover, the performance of the wireless links in the WiNoC depends on the CNT antennas. Like other nanodevices, CNT antennas are expected to have higher manufacturing defect rates, operational uncertainties and process variability [20]. Error control coding (ECC) enables us to address the transient sources of errors at a higher level of abstraction in the system design phase rather than at a post-design, layout phase. For an on-chip environment we need simple coding schemes that will not impose a limiting overhead due to the encoding and decoding complexity. Different error rates due to distinct events in the wireless and wireline links require different ECC schemes. In this work we evaluate the error rates of both types of links in the WiNoCs and design appropriate ECC schemes to enable corrective intelligence in the WiNoC fabric.
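As a minimal illustration of how a simple code can correct a transient error on a link, the sketch below implements the textbook Hamming (7,4) single-error-correcting code. It is only an assumed stand-in for the general idea of lightweight ECC; it is not one of the codes proposed in this work, and the function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(word):
    """Correct at most one flipped bit; return (data bits, error position or 0)."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome equals the 1-based error position
    if pos:
        c[pos - 1] ^= 1         # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], pos


data = [1, 0, 1, 1]
codeword = encode(data)
corrupted = list(codeword)
corrupted[4] ^= 1               # inject a single transient error at position 5
recovered, position = decode(corrupted)
print(recovered == data, position)
```

The encoder and decoder here are a handful of XOR operations, which is the kind of low complexity an on-chip codec needs; codes with stronger correction capability trade additional redundancy and logic for higher reliability.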

#### 1.6 Contributions

The principal contributions of this thesis can be summarized as follows:

- Architecture space exploration to enhance the performance of NoCs with wireless links
  - Design of hybrid Wireless NoC (WiNoC) with hierarchical Small-World topologies with wireless shortcuts.
  - Design of efficient communication and routing protocols for such a NoC.
  - Design of NoC components like switches and Wireless Base Stations (WBs) for the WiNoCs.
  - Optimized deployment of wireless transceivers with respect to varying traffic patterns.
  - Analysis and minimization of associated overheads for wireless link deployment.
- Comparative analysis of radical interconnect technologies
  - A comparative study of achievable performance advantages of NoC architectures with various radical interconnect technologies like 3D integration, photonic NoCs and RF-Interconnect based NoCs.
  - Comparison of alternatives and establishment of benchmarks with various parameters like system size and traffic patterns.
  - Conclusive arguments for choice of best technology for particular environments.
- Reliability in WiNoC fabrics
  - Design of a novel Joint Crosstalk Avoidance Triple Error Correction and Quadruple Error Detection code (JTEC-SQED) that has higher transient error resilience, with crosstalk avoidance characteristics similar to those of the best pure crosstalk avoidance codes.
  - Design of a Product Code based multiple or burst error correction code to address the reliability issues of the wireless links.

#### 1.7 Thesis Organization

The thesis is organized in five chapters. The first chapter introduces the complexity of the problem and the possible means of addressing it. A literature survey is presented in the second chapter. The third chapter presents the main design methodologies and the performance of the proposed hybrid wireless NoC architectures. In this chapter it is demonstrated that the WiNoCs outperform their wireline counterparts in network performance as well as by several orders of magnitude in energy dissipation. The fourth chapter addresses the signal integrity issues of the WiNoC pertaining to both the wireline and wireless links. It has been observed that the wireless links prove to be the bottleneck in terms of reliability of the WiNoC. Consequently, it is shown that by using novel error control coding schemes it is possible to restore the reliability of the WiNoC to that of the wireline NoCs and still achieve significant performance benefits. Finally, the last chapter summarizes the important conclusions and points out directions for future research.

#### 1.8 Reference

- [1] P. Magarshack and P.G. Paulin, "System-on-Chip beyond the Nanometer Wall," Proceedings of Design Automation Conference (DAC 03), ACM Press, 2003, pp. 419-424.
- [2] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," IEEE Computer, Jan. 2002, pp. 70-78.
- [3] R. Ho, K. W. Mai, M.A. Horowitz, "The Future of Wires", Proceedings of the IEEE, Vol. 89 Issue: 4, April 2001 pp. 490–504.
- [4] W. J. Dally, B. Towles, "Route Packets, Not Wires: On-chip Interconnection Networks", Proceedings of Design Automation Conference (DAC 01), ACM Press, 2001, pp. 684-689.
- [5] U. Y. Ogras and R. Marculescu, "'It's a Small World After All': NoC Performance Optimization Via Long-Range Link Insertion", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 14, No. 7, July 2006, pp. 693-706.
- [6] A. Kumar et al., "Toward Ideal On-Chip Communication Using Express Virtual Channels," IEEE Micro, Vol. 28, Issue 1, January-February 2008, pp. 80-90
- [7] T. Krishna et al., "NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication," Proceedings of IEEE Symposium on High Performance Interconnects, HOTI, 26-28 August, 2008, pp. 11-20.
- [8] ITRS 2007, http://www.itrs.net/Links/2007ITRS/Home2007.htm
- [9] V. F. Pavlidis and E. G. Friedman, "3-D Topologies for Networks-on-Chip," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 15, Issue 10, October 2007, pp. 1081-1090.
- [10] A. Shacham et al., "Photonic Network-on-Chip for Future Generations of Chip Multi-Processors," IEEE Transactions on Computers, Vol. 57, no. 9, 2008, pp. 1246-1260.
- [11] M. F. Chang et al., "CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect," Proc. of IEEE International Symposium on High-Performance Computer Architecture (HPCA), 16-20 February, 2008, pp. 191-202.
- [12] J. Lin et al., "Communication Using Antennas Fabricated in Silicon Integrated Circuits," IEEE Journal of Solid-State Circuits, vol. 42, no. 8, August 2007, pp. 1678-1687.
- [13] P. J. Burke et al., "Quantitative Theory of Nanowire and Nanotube Antenna Performance," IEEE Transactions on Nanotechnology, Vol. 5, No. 4, July 2006, pp. 314-334.
- [14] K. Kempa, et al., "Carbon Nanotubes as Optical Antennae," Advanced Materials, vol. 19, 2007, pp. 421-426.
- [15] K.K.O et al., "The feasibility of on-chip interconnection using antennas," Proc. of IEEE/ACM International Conference on Computer-Aided Design, 2005. ICCAD-2005, pp. 979-984.
- [16] R. Albert and A.-L. Barabasi. "Statistical mechanics of complex networks," Reviews of Modern Physics, 74:47–97, January 2002.
- [17] M. Buchanan. "Nexus: Small Worlds and the Groundbreaking Theory of Networks." Norton, W. W. & Company, Inc, 2003.
- [18] C. Teuscher, "Nature-Inspired Interconnects for Self-Assembled Large-Scale Network-on-Chip Designs," Chaos, 17(2):026106, 2007.
- [19] E. Dupont, M. Nicolaidis, P. Rohr, "Embedded Robustness IPs for Transient-Error-Free ICs", IEEE Design and Test of Computers, Volume 19, Issue 3, May-June 2002 pp: 54 – 68.
- [20] R. I. Bahar et al. "Architectures for Silicon Nanoelectronics and Beyond," IEEE Computer, Vol. 40, Issue 1, January 2007, pp. 25-33.

# Chapter 2 Related Work

#### 2.1 Background

Conventional NoCs use multi-hop packet switched communication. At each hop the data packet goes through a router/switch, which contributes considerable power, throughput and latency overhead. To improve performance, the concept of express virtual channels is introduced in [1]. It is shown that by using virtual express lanes to connect distant cores in the network, it is possible to avoid the router overhead at intermediate nodes, and thereby greatly improve NoC performance in terms of power, latency and throughput. Performance is further improved by incorporating ultra low-latency, multi-drop on-chip global lines (G-lines) for flow control signals [2]. In [3, 4], performance of NoCs has been shown to improve by insertion of long range wired links following principles of small world graphs [5]. Despite significant performance gains, the schemes in [2], [3] and [4] still require laying out long wires across the chip and hence performance improvements beyond a certain limit may not be achievable.

The performance improvements due to NoC architectural advantages will be significantly enhanced if 3D integration is adopted as the basic fabrication methodology. The amalgamation of two emerging paradigms, namely NoCs in a 3D IC environment, allows for the creation of new structures that enable significant performance enhancements over traditional solutions [6], [7, 8]. Despite these benefits, 3D architectures pose new technology challenges such as thinning of the wafers, inter-device layer alignment, bonding, and interlayer contact patterning [9]. Additionally, the heat dissipation in 3D structures is a serious concern due to increased power density [9, 10] on a smaller footprint. There have been some efforts to achieve near speed-of-light communications through on-chip wires [11, 12]. Though these techniques achieve very low delay in data exchange along long wires, they suffer from significant power and area overheads from the signal conditioning circuitry. Moreover, the speed of communication is actually about one-half the speed of light in silicon dioxide. By contrast, on-chip data links at the true velocity of light can be designed using recent advances in silicon photonics [13, 14]. The design principles of a photonic NoC are elaborated in [14] and [15]. The components of a complete Photonic NoC, e.g., dense waveguides, switches, optical modulators and detectors, are now viable for integration on a single silicon chip. It is estimated that a Photonic NoC will dissipate an order of magnitude less power than an electronic planar NoC. Although the optical interconnect option has many advantages, some aspects of this new paradigm need more extensive investigation. The speed of light in the transmitting medium, losses in the optical waveguides, and the signal noise due to coupling between waveguides are other important issues that need more careful investigation.
Moreover, Photonic NoCs demonstrated in [14] and [15] still require an underlying electrical network to establish the path through the photonic links due to lack of optical storage elements. However, in [16] a completely photonic Clos network is shown to achieve significant performance benefits over the wireline counterparts. In [17] CORONA, an amalgamation of 3D architecture and photonic NoC, is presented and demonstrated to deliver high network bandwidths for various real-application based traffic models. Another alternative is NoCs with multi-band RF interconnects [18]. Various implementation issues of this approach are discussed in [19]. In this particular NoC, instead of depending on the charging/discharging of wires for sending data, electromagnetic (EM) waves are guided along on-chip transmission lines created by multiple layers of metal and dielectric stack [18]. As the EM waves travel at the effective speed of light, low latency and high bandwidth communication can be achieved by this concept. This type of NoC, too, is predicted to dissipate an order of magnitude less power than the traditional planar NoC with significantly reduced latency.

On-chip wireless interconnects were demonstrated first in [20] for distributing clock signals. Recently, the design of a wireless NoC based on CMOS Ultra Wideband (UWB) technology was proposed [21]. The particular antennas used in [21] achieve a transmission range of 1 mm with a length of 2.98 mm. Consequently, for a NoC spreading typically over a die area of 20 mm × 20 mm, this architecture essentially requires multi-hop communication through the on-chip wireless channels. Moreover, the overheads of a wireless link are difficult to justify for 1 mm range of on-chip communication compared to a wired channel. Having wireless nodes spread all over the die will introduce significant overhead due to antennas and associated transceiver circuits. However, the performance of silicon integrated on-chip antennas for intra- and inter-chip communication with longer range has already been demonstrated by the authors of [21]. They have primarily used metal zig-zag antennas operating in the range of tens of GHz. In [22], the feasibility of designing on-chip wireless communication network with miniature antennas and simple transceivers that operate at the sub-THz range of 100-500 GHz has been demonstrated. The propagation mechanisms of radio waves over intra-chip channels with integrated antennas were also investigated [23]. Depending on antenna configuration and substrate characteristics, achievable frequency of the wireless channel can be in the range of 50-100 GHz. A relatively long intra-chip communication range facilitates single-hop communication between widely separated blocks. This is essential to achieve the full benefit of on-chip wireless networks for multi-core systems by reducing long distance multi-hop wireline communication. Despite all these advantages, in the mm-wave range the antenna size (~1-2 mm) is still a limitation. If the transmission frequencies can be increased to THz/optical range then the corresponding antenna sizes decrease, occupying much less chip real estate. One possibility is to use nanoscale antennas based on CNTs operating in the THz/optical frequency range [24, 25]. Consequently, building an on-chip wireless interconnection network using optical frequencies for inter-core communications becomes feasible with much less overhead than the mm-wave antennas. But unlike the mm-wave antennas, CNTs will face significant manufacturing challenges. All these investigations regarding miniaturized antennas motivated us to undertake a series of studies on the design of novel wireless communication infrastructures for multi-core SoCs.

#### 2.2 Reference

- [1] U. Y. Ogras and R. Marculescu, "'It's a Small World After All': NoC Performance Optimization Via Long-Range Link Insertion", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 14, No. 7, July 2006, pp. 693-706.
- [2] A. Kumar et al., "Toward Ideal On-Chip Communication Using Express Virtual Channels," IEEE Micro, Vol. 28, Issue 1, January-February 2008, pp. 80-90.
- [3] U. Y. Ogras and R. Marculescu, "'It's a Small World After All': NoC Performance Optimization Via Long-Range Link Insertion", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 14, No. 7, July 2006, pp. 693-706.
- [4] C. Teuscher, "Nature-Inspired Interconnects for Self-Assembled Large-Scale Network-on-Chip Designs," Chaos, 17(2):026106, 2007.
- [5] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature 393, 440–442, 1998.
- [6] V. F. Pavlidis and E. G. Friedman, "3-D Topologies for Networks-on-Chip", IEEE Transactions on Very Large Scale Integration (VLSI), Vol. 15, Issue 10, October 2007, pp. 1081-1090.
- [7] B. Feero and P. P. Pande, "Networks-on-Chip in a Three-Dimensional Environment: A Performance Evaluation", IEEE Transactions on Computers, Vol. 58, No. 1, January 2009, pp. 32-45.
- [8] D. Park et al., "MIRA: A Multi-layered On-Chip Interconnect Router Architecture", IEEE International Symposium on Computer Architecture, ISCA, 21-25 June 2008, pp. 251-261.

- [9] W. R. Davis et al., "Demystifying 3D ICs: The pros and cons of going vertical," IEEE Design and Test of Computers, Vol. 22, Issue 6, November-December 2005, pp. 498-510.
- [10] A. W. Topol et al., "Three-dimensional integrated circuits," IBM Journal of Research & Development. Vol. 50 No. 4/5 July/September 2006.
- [11] A. P. Jose et al., "Pulsed Current-Mode Signaling for Nearly Speed-of-Light Intrachip Communication", IEEE Journal of Solid-State Circuits, Vol. 41, No. 4, April 2006, pp. 772-780.
- [12] R. T. Chang et al., "Near Speed-of-Light Signaling Over On-Chip Electrical Interconnects", IEEE Journal of Solid-State Circuits, Vol. 38, No. 5, May 2003, pp. 834-838.
- [13] I. O'Connor et al., "Systematic Simulation-Based Predictive Synthesis of Integrated Optical Interconnect", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 15, No. 8, August 2007, pp. 927-940.
- [14] M. Briere et al., "System Level Assessment of an Optical NoC in an MPSoC Platform", Proceedings of IEEE Design, Automation & Test in Europe Conference & Exhibition, DATE, 16-20 April, 2007, pp. 1084-1089.
- [15] A. Shacham et al., "Photonic Network-on-Chip for Future Generations of Chip Multi-Processors", IEEE Transactions on Computers, Vol. 57, no. 9, 2008, pp. 1246-1260.
- [16] A. Joshi et al., "Silicon-Photonic Clos Network for Global On-Chip Communication", Proceedings of the 3<sup>rd</sup> International Symposium on Networks-on-Chip (NOCS-3), May 2009, pp. 124-133.
- [17] D. Vantrease et al., "Corona: System Implications of Emerging Nanophotonic Technology," Proc. of IEEE International Symposium on Computer Architecture (ISCA), 21-25 June, 2008, pp. 153-164.
- [18] M. F. Chang et al., "RF Interconnects for Communications On-Chip", Proceedings of International Symposium on Physical Design, 13-16 April 2008, pp. 78-83.
- [19] M. F. Chang et al., "CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect", Proceedings of IEEE International Symposium on High-Performance Computer Architecture (HPCA), 16-20 February, 2008, pp. 191-202.
- [20] B. A. Floyd et al., "Intra-Chip Wireless Interconnect for Clock Distribution Implemented With Integrated Antennas, Receivers, and Transmitters", IEEE Journal of Solid-State Circuits, Vol. 37, No. 5, May 2002, pp. 543-552.

- [21] D. Zhao and Y. Wang, "SD-MAC: Design and Synthesis of A Hardware-Efficient Collision-Free QoS-Aware MAC Protocol for Wireless Network-on-Chip", IEEE Transactions on Computers, vol. 57, no. 9, September 2008, pp. 1230-1245.
- [22] S. B. Lee et al., "A Scalable Micro Wireless Interconnect Structure for CMPs", Proceedings of ACM Annual International Conference on Mobile Computing and Networking (MobiCom), September, 2009, pp. 20-25.
- [23] J. Lin et al., "Communication Using Antennas Fabricated in Silicon Integrated Circuits", IEEE Journal of Solid-State Circuits, vol. 42, no. 8, August 2007, pp. 1678-1687.
- [24] P. J. Burke et al., "Quantitative Theory of Nanowire and Nanotube Antenna Performance", IEEE Transactions on Nanotechnology, Vol. 5, No. 4, July 2006, pp. 314-334.
- [25] K. Kempa, et al., "Carbon Nanotubes as Optical Antennae," Advanced Materials, vol. 19, 2007, pp. 421-426.

# Chapter 3 Wireless NoC Architecture

In a generic wired NoC the constituent embedded cores communicate via multiple switches and wired links. This multi-hop communication results in data transfers with high energy dissipation and latency. To alleviate this problem we propose long-distance high bandwidth wireless links between distant cores in the chip. In the following subsections we will explain the design of a scalable architecture for WiNoCs of various system sizes.

#### 3.1 Design Methodologies

In the following subsections, the design methodologies essential for design of a WiNoC are discussed.

#### 3.1.1 Topology

Modern complex network theory [1] provides us with a powerful method to analyze network topologies and their properties. Between a regular, locally interconnected mesh network and a completely random Erdős-Rényi topology, there are other classes of graphs [1], such as small-world and scale-free graphs. Networks with the small-world property have a very short average path length, which is commonly measured as the number of hops between any pair of nodes. The average shortest path length of small-world graphs is bounded by a polynomial in log(N), where N is the number of nodes, which makes them particularly interesting for efficient communication with minimal resources [2, 3]. This feature of small-world graphs makes them particularly attractive for constructing scalable WiNoCs. Most complex networks, such as social networks, the Internet, as well as certain parts of the brain exhibit the small-world property. A small-world topology can be constructed from a locally connected network by re-wiring connections randomly to any other node, which creates short cuts in the network [4]. These random long-range links between nodes can also be established following probability distributions depending on the distance separating the nodes [29]. It has been shown that such "shortcuts" in NoCs can significantly improve the performance compared to locally interconnected mesh-like networks [3, 6] with fewer resources than a fully connected system.

Our goal here is to use the "small-world" approach to build a highly efficient NoC based on both wired and wireless links. Thus, for our purpose, we first divide the whole system into multiple small clusters of neighboring cores and call these smaller networks subnets. As subnets are smaller networks, *intra-subnet* communication will have a shorter average path length than a single NoC spanning the whole system. Figure 3.1(a) shows a subnet with mesh topology. This mesh subnet has NoC switches and links as in a standard mesh based NoC. The cores are connected to a centrally located hub through direct links and the hubs from all subnets are connected in a 2<sup>nd</sup> level network forming a hierarchical network. This upper level of the hierarchy is designed to have characteristics of small-world graphs. Due to a limited number of possible wireless links, as discussed in later subsections, neighboring hubs are connected by traditional wired links forming a bi-directional ring and a few wireless links are distributed between hubs separated by relatively long distances. Reducing long-distance multi-hop wired communication is essential in order to achieve the full benefit of on-chip wireless networks for multi-core systems. As the links are initially established probabilistically, the network performance might not be optimal. Hence, after the initial placement of the wireless links the network is further optimized for performance by using Simulated Annealing (SA) [7]. The particular probability distribution and the heuristics followed in establishing the network links are described in the next subsection. Key to our approach is establishing optimal overall network topology under given resource constraints, i.e., a limited number of wireless links. Figure 3.1(b) shows a possible interconnection topology with 8 hubs and three wireless links. Instead of the ring used in this example, the hubs can be connected in any other possible interconnect architecture. The size and number of subnets are chosen such that neither the subnets nor the upper level of the hierarchy become too large. This is because if either level of the hierarchy becomes too large then it causes a performance bottleneck by limiting the data throughput in that level. However, since the architecture of the two levels can be different, causing their traffic characteristics to differ from each other, the exact hierarchical division can be obtained by performing system level simulations as shown in section 3.2.

Figure 3.1. (a) Mesh topology of subnet with a hub connected to all switches in the subnet. (b) Network topology of hubs connected by a small-world graph with both wired and wireless links.

We propose a hybrid wired/wireless NoC architecture. The hubs are interconnected via both wireless and wired links while the subnets are wired only. The hubs with wireless links are equipped with wireless base stations (WBs) that transmit and receive data packets over the wireless channels. When a packet needs to be sent to a core in a different subnet it travels from the source to its respective hub and reaches the hub of the destination subnet via the small-world network consisting of both wireless and wired links, where it is then routed to the final destination core. For inter-subnet and intra-subnet data transmission, wormhole routing is adopted. Data packets are broken down into smaller parts called flow control units or *flits* [8]. The header flit holds the routing and control information. It establishes a path, and subsequent payload or body flits follow that path. The routing protocol is described in Section 3.1.4.

#### 3.1.2 Wireless link insertion and optimization

As mentioned above, the overall interconnect infrastructure of the WiNoC is formed by connecting the cores in the subnets with each other and to the central hub through traditional metal wires. The hubs are then connected by wires and wireless links such that the  $2^{nd}$  level of the network has the small-world property. The placement of the wireless links between a particular pair of source and destination hubs is important as this is responsible for establishing high-speed, low-energy interconnects on the network, which will eventually result in performance gains. Initially the links are placed probabilistically; i.e., between each pair of source and destination hubs, *i* and *j* respectively, the probability  $P_{ij}$  of having a wireless link is proportional to the distance measured in number of hops along the ring,  $h_{ij}$ , as shown in equation 3.1.

$$P_{ij} = \frac{h_{ij}}{\sum_{i,j} h_{ij}} \tag{3.1}$$

The probabilities are normalized such that their sum is equal to one. Such a distribution is chosen because in the presence of a wireless link, the distance between the pair becomes a single hop and hence it reduces the original distance between the communicating hubs through the ring. Depending on the number of available wireless links, they are inserted between randomly chosen pairs of hubs, which are chosen following the probability distribution mentioned above.
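The initial placement described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names are hypothetical, wireless links are treated as unordered hub pairs, and pairs that are already ring neighbors are excluded from the draw (an assumption made here because a wired link already joins them).

```python
import itertools
import random


def ring_hops(i, j, num_hubs):
    """Hop count h_ij between hubs i and j along a bidirectional ring."""
    d = abs(i - j)
    return min(d, num_hubs - d)


def link_probabilities(num_hubs):
    """Return {(i, j): P_ij} with P_ij proportional to h_ij, as in equation 3.1."""
    pairs = list(itertools.combinations(range(num_hubs), 2))
    total = sum(ring_hops(i, j, num_hubs) for i, j in pairs)
    return {(i, j): ring_hops(i, j, num_hubs) / total for i, j in pairs}


def place_wireless_links(num_hubs, num_links, rng):
    """Draw num_links distinct hub pairs, weighted by P_ij."""
    probs = link_probabilities(num_hubs)
    # Assumption: ring neighbors are skipped because a wired link already joins them.
    candidates = [p for p in probs if ring_hops(p[0], p[1], num_hubs) > 1]
    chosen = []
    while len(chosen) < num_links:
        weights = [probs[p] for p in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)  # each hub pair receives at most one wireless link
    return chosen


if __name__ == "__main__":
    print(place_wireless_links(num_hubs=8, num_links=3, rng=random.Random(0)))
```

Because the weights grow with ring distance, diametrically opposite hubs are the most likely to receive a wireless link, which is where a single-hop shortcut saves the most wired hops.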

Once the network is initialized, an optimization by means of SA heuristics is performed. Since the subnet architectures are independent of the top level network, the optimization can be done only on the top level network of hubs and hence the subnets can be decoupled from this step. The optimization step is necessary as the random initialization might not produce the optimal network topology. SA offers a simple, well established and scalable approach for the optimization process as opposed to a brute force search.

If there are N hubs in the network and n wireless links to distribute, the size of the search space S is given by

$$\left|S\right| = \binom{\binom{N}{2} - N}{n}. \tag{3.2}$$
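To give a feel for how quickly this count grows, equation 3.2 can be evaluated directly. The short sketch below does so for a few system sizes; the function name and the particular (N, n) values are chosen here only for illustration.

```python
from math import comb


def search_space_size(num_hubs, num_links):
    """|S| from equation 3.2: ways to place num_links links on non-adjacent hub pairs."""
    candidate_pairs = comb(num_hubs, 2) - num_hubs  # all hub pairs minus the N ring links
    return comb(candidate_pairs, num_links)


# Illustrative sizes only; they are not configurations taken from the experiments.
for hubs, links in [(8, 3), (16, 6), (32, 12)]:
    print(hubs, links, search_space_size(hubs, links))
```

For 8 hubs and 3 wireless links there are 20 candidate pairs and hence 1140 possible placements, which is still enumerable; the count rises steeply as the number of hubs and links increases.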

Thus, with increasing *N*, it becomes increasingly difficult to find the best solution by exhaustive search. In order to perform SA, a metric has been established, which is closely related to the connectivity of the network. The metric to be optimized is the average distance, measured in number of hops, between all source and destination hubs. To compute this metric the shortest distances between all hub pairs are computed following the routing strategy outlined in Section 3.1.4. In each iteration of the SA process, a new network is created by randomly rewiring a wireless link in the current network. The metric for this new network is calculated and compared to the metric of the current network. The new network is always chosen as the current optimal solution if the metric is lower. However, even if the metric is higher we choose the new network probabilistically. This reduces the probability of getting stuck in a local optimum, which could happen if the SA process were to never choose a worse solution. The exponential probability shown in equation 3.3 is used to determine whether or not a worse solution is chosen as the current optimal:

$$P(h,h',T) = \exp[(h-h')/T]. \tag{3.3}$$

The optimization metrics for the current and new networks are h and h' respectively. T is a temperature parameter, which decreases with the number of optimization iterations according to an *annealing schedule*. In this work we have used Cauchy scheduling, where the temperature varies inversely with the number of iterations [7]. The algorithm used to optimize the network is shown in figure 3.2.
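The optimization loop can be illustrated with the following sketch. It is a simplified stand-in rather than the exact procedure of this work: wireless links are assumed bidirectional, the metric counts paths with at most one wireless link, the initial temperature `t0` and the iteration count are arbitrary, and the best network seen so far is tracked separately as a convenience. All function names are hypothetical.

```python
import itertools
import math
import random


def ring_hops(i, j, n):
    """Hop count between hubs i and j along a bidirectional ring of n hubs."""
    d = abs(i - j)
    return min(d, n - d)


def path_hops(src, dst, links, n):
    """Shortest hop count using the ring and at most one wireless link."""
    best = ring_hops(src, dst, n)
    for a, b in links:  # assumption: every wireless link can be used in both directions
        best = min(best,
                   ring_hops(src, a, n) + 1 + ring_hops(b, dst, n),
                   ring_hops(src, b, n) + 1 + ring_hops(a, dst, n))
    return best


def average_hops(links, n):
    """Optimization metric: mean hop count over all hub pairs."""
    pairs = list(itertools.combinations(range(n), 2))
    return sum(path_hops(s, d, links, n) for s, d in pairs) / len(pairs)


def rewire(links, n, rng):
    """Copy links and move one randomly chosen link to an unused non-adjacent pair."""
    candidates = [p for p in itertools.combinations(range(n), 2)
                  if ring_hops(p[0], p[1], n) > 1 and p not in links]
    new_links = list(links)
    new_links[rng.randrange(len(new_links))] = rng.choice(candidates)
    return new_links


def anneal(links, n, iterations=2000, t0=1.0, seed=0):
    rng = random.Random(seed)
    current, h = list(links), average_hops(links, n)
    best, h_best = current, h
    for k in range(1, iterations + 1):
        temperature = t0 / k  # Cauchy schedule: T varies inversely with the iteration count
        candidate = rewire(current, n, rng)
        h_new = average_hops(candidate, n)
        # A lower metric is always accepted; a higher one with probability exp((h - h') / T).
        if h_new < h or rng.random() < math.exp((h - h_new) / temperature):
            current, h = candidate, h_new
            if h < h_best:
                best, h_best = current, h
    return best, h_best


if __name__ == "__main__":
    start = [(0, 2), (1, 3), (4, 6)]  # hub pairs stored as (i, j) with i < j
    print(average_hops(start, 8), anneal(start, 8, iterations=500))
```

Early iterations, when the temperature is high, accept many worse networks and explore the search space broadly; as the temperature falls, the acceptance probability for worse networks shrinks toward zero and the search settles.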

Here we assume a uniform spatial traffic distribution where a packet originating from any core is equally likely to have any other core on the die as its destination. However, with other kinds of spatial traffic distributions, where the network loads are localized in different clusters, the metric for optimization has to be changed accordingly to account for the non-uniform traffic patterns as discussed later in the chapter.

An important component in the design of the WiNoCs is the on-chip antenna for the wireless links. In the next section we describe various alternative on-chip antenna choices and their pros and cons.

#### 3.1.3 On-Chip Antennas

Suitable on-chip antennas are necessary to establish wireless links for WiNoCs. In [9] the authors demonstrated the performance of silicon integrated on-chip antennas for intra- and inter-chip communication. They have primarily used metal zig-zag antennas operating in the range of tens of GHz. Design of an ultra wideband (UWB) antenna for inter- and intra-chip communication is elaborated in [10]. This particular antenna was used in the design of a wireless NoC [11] mentioned earlier in chapter 2. The above mentioned antennas principally operate in the millimeter wave (tens of GHz) range and consequently their sizes are on the order of a few millimeters.

Figure 3.2. Flow diagram for the simulated annealing based optimization of WiNoC architectures.

If the transmission frequencies can be increased to THz/optical range then the corresponding antenna sizes decrease, occupying much less chip real estate. Characteristics of metal antennas operating in the optical and near-infrared region of the spectrum of up to 750 THz have been studied [13]. Antenna characteristics of carbon nanotubes (CNTs) in the THz/optical frequency range have also been investigated both theoretically and experimentally [13, 14]. Bundles of CNTs are predicted to enhance performance of antenna modules by up to 40dB in radiation efficiency and provide excellent directional properties in far-field patterns [15]. Moreover these antennas can achieve a bandwidth of around 500 GHz, whereas the antennas operating in the millimeter wave range achieve bandwidths of tens of GHz [15]. Thus, antennas operating in the THz/optical frequency range can support much higher data rates. CNTs have numerous characteristics that make them suitable as on-chip antenna elements for optical frequencies. Given wavelengths of hundreds of nanometers to several micrometers, there is a need for virtually one-dimensional antenna structures for efficient transmission and reception. With diameters of a few nanometers and any length up to a few millimeters possible, CNTs are the perfect candidate. Such thin structures are almost impossible to achieve with traditional microfabrication techniques for metals. Virtually defect-free CNT structures do not suffer from power loss due to surface roughness and edge imperfections found in traditional metallic antennas. In CNTs, ballistic electron transport leads to quantum conductance, resulting in reduced resistive loss, which allows extremely high current densities in CNTs, namely 4-5 orders of magnitude higher than copper. This enables high transmitted powers from nanotube antennas,
which is crucial for long-range communications. By shining an external laser source on the CNT, radiation characteristics of multi-walled carbon nanotube (MWCNT) antennas are observed to be in excellent quantitative agreement with traditional radio antenna theory [14], although at much higher frequencies of hundreds of THz. Using various lengths of the antenna elements corresponding to different multiples of the wavelengths of the external lasers, scattering and radiation patterns are shown to be improved. Such nanotube antennas are good candidates for establishing on-chip wireless communications links and are henceforth considered in this work. Chemical vapor deposition (CVD) is the traditional method for growing nanotubes in specific locations by using lithographically patterned catalyst islands. The application of an electric field during growth or the direction of gas flow during CVD can help align nanotubes. However, the high-temperature CVD could potentially damage some of the pre-existing CMOS layers. To alleviate this, localized heaters can be used in the CMOS fabrication process to enable localized CVD of nanotubes without exposing the entire chip to high temperatures [16].

As mentioned above, the NoC is divided into multiple subnets. Hence, the WBs in the subnets need to be equipped with transmitting and receiving antennas, which will be excited using external laser sources. As mentioned in [17], the laser sources can be located off-chip or bonded to the silicon die. Hence their power dissipation does not contribute to the chip power density. The requirements of using external sources to excite the antennas can be eliminated if the electroluminescence phenomenon from a CNT is utilized to design linearly polarized dipole radiation sources [18]. But further investigation is necessary to establish such devices as successful transceivers for on-chip wireless communication.

To achieve line of sight communication between WBs using CNT antennas at optical frequencies, the chip packaging material has to be elevated from the substrate surface to create a vacuum for transmission of the high frequency EM waves. Techniques for creating such vacuum packaging are already utilized for MEMS applications [19], and can be adopted to make creation of line of sight communication between CNT antennas viable. In classical antenna theory it is known that the received power degrades inversely with the 4<sup>th</sup> power of the separation between source and destination due to ground reflections beyond a certain distance. This threshold separation, $r_0$, between source and destination antennas assuming a perfectly reflecting surface, is given by equation 3.4.

$$r_0 = \frac{2\pi H^2}{\lambda} \tag{3.4}$$

Here *H* is the height of the antenna above the reflecting surface and  $\lambda$  is the wavelength of the carrier. Thus, if the antenna elements are at a distance of *H* from the reflective surfaces like the packaging walls and the top of the die substrate, the received power degrades inversely with the square of the distance until it is  $r_0$ . Thus *H* can be adjusted to make the maximum possible separation smaller than the threshold separation  $r_0$  for a particular frequency of radiation used. Considering the optical frequency ranges of CNT antennas, depending on the separation between the source and destination pairs in a single chip, the required elevation is a few tens of microns only.
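As a rough numerical check of this estimate, equation 3.4 can be solved for *H*. The sketch below assumes a carrier wavelength of 1.5 µm and a maximum antenna separation of 20 mm; both values are illustrative assumptions rather than design parameters of this work, and the function names are hypothetical.

```python
import math


def threshold_separation(height_m, wavelength_m):
    """r0 from equation 3.4 for an antenna at height H above a reflecting surface."""
    return 2 * math.pi * height_m ** 2 / wavelength_m


def min_elevation(max_separation_m, wavelength_m):
    """Smallest H for which r0 reaches max_separation (equation 3.4 solved for H)."""
    return math.sqrt(max_separation_m * wavelength_m / (2 * math.pi))


wavelength = 1.5e-6     # assumed near-infrared carrier wavelength, in meters
max_separation = 20e-3  # assumed largest source-destination separation, in meters
elevation = min_elevation(max_separation, wavelength)
print(f"H >= {elevation * 1e6:.1f} um")
```

Under these assumptions the required elevation comes out at roughly 69 µm, which agrees with the statement that a few tens of microns suffice.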

#### 3.1.4 Routing and Communication Protocols

In the proposed WiNoC, intra-subnet data routing depends on the topology of the subnets. For example, if the cores within a subnet are connected in a mesh, then data routing within the subnet follows dimension order (e-cube) routing. Inter-subnet data is routed through the hubs, along the shortest path between the source and destination subnets in terms of number of links traversed. The hubs in all the subnets are equipped with a pre-routing block to determine this path through a search across all potential paths between the hubs of the source and destination subnets. In the current work, paths involving only a single wireless link and zero or more wired links on the ring are considered. All such paths as well as the completely wired path on the ring are compared and the one with the minimum number of link traversals is chosen for data transfer. For a data packet requiring inter-subnet routing, this computation is done only once for the header flit at the hub of the originating subnet. The header flit needs to have a field containing the address of the intermediate hub with a WB that will be used in the path. This information alone is sufficient, as the header follows the default wireline path along the ring from its source to that hub with the WB, which is also the shortest path along the ring. Since each WB has a single, unique destination, the header reaches that destination and is then again routed via the wireline path to its final destination hub using normal ring routing. The rest of the flits follow the header, as wormhole routing is adopted in this work. Considering only those paths that have a single wireless link reduces computational overheads in the WB routers as it limits the search space. As the wireless links are placed as long-distance shortcuts they are always comparable in length to the diameter of the ring.
Hence the probability that a path with multiple wireless links between any source/destination pair will be shorter than paths with a single wireless link is extremely low. So in order to achieve the best trade-off between the router complexity and network performance, only paths with single wireless link are considered. Also, if two alternatives have the same number of hops, the one with the wireless link is chosen, as this will have less energy dissipation. In this routing scheme the path is predetermined at the source hub and hence, no cycles are possible. Consequently, there is no possibility of a deadlock or livelock.
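The pre-routing decision made at the source hub can be sketched as below. This is an illustrative model only: `wireless` is a hypothetical dictionary mapping each hub that has a WB to the single hub its wireless link reaches, and the returned WB address stands for the field carried in the header flit.

```python
def ring_hops(i, j, n):
    """Hop count between hubs i and j along a bidirectional ring of n hubs."""
    d = abs(i - j)
    return min(d, n - d)


def pre_route(src, dst, wireless, n):
    """Choose the path at the source hub.

    Returns (hops, wb_hub); wb_hub is None when the all-wired ring path is used,
    otherwise it is the hub whose wireless link the packet should take.
    """
    best_hops, best_wb = ring_hops(src, dst, n), None
    for wb, far_end in wireless.items():
        # Wired ring to the WB, one wireless hop, then wired ring to the destination.
        hops = ring_hops(src, wb, n) + 1 + ring_hops(far_end, dst, n)
        # Take a strictly shorter path, or a wireless path that ties with the wired one.
        if hops < best_hops or (hops == best_hops and best_wb is None):
            best_hops, best_wb = hops, wb
    return best_hops, best_wb


if __name__ == "__main__":
    shortcuts = {0: 4, 4: 0}  # assumed example: one bidirectional link between hubs 0 and 4
    print(pre_route(1, 5, shortcuts, 8))
```

With eight hubs and a shortcut between hubs 0 and 4, a packet from hub 1 to hub 5 is sent toward the WB at hub 0 for a three-hop path, while a packet from hub 1 to hub 2 stays on the ring.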

An alternative routing approach is to avoid the one-time evaluation of the shortest path at the original source hub and adopt a distributed routing mechanism. In this scenario, the path is determined at each node by checking for the existence of a wireless link at that node which, if taken, will shorten the path length to the final destination. If no such wireless link exists, or if it does not shorten the path compared with the wireline path from that node, then the default routing mechanism along the ring is followed to the next node. This mechanism performs a check at every node by computing and comparing the path lengths obtained with the default wireline routing and with the wireless link. The adopted centralized routing instead performs all the checks at the original source hub, covering all the wireless links and the wireline path from the source to the destination. We will present the comparative performance evaluation of these two schemes later.
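The per-node decision of the distributed scheme can be illustrated with a short sketch. The names `next_hop` and `route`, the dictionary `wireless_out` mapping a hub to the destination of its wireless link, and the clockwise tie-break on the ring are assumptions introduced for this example only.

```python
from typing import Dict, List


def ring_distance(a: int, b: int, n_hubs: int) -> int:
    """Shortest wired hop count between hubs a and b on a bidirectional ring."""
    d = abs(a - b) % n_hubs
    return min(d, n_hubs - d)


def next_hop(current: int, dst: int, n_hubs: int, wireless_out: Dict[int, int]) -> int:
    """Decide the next hub locally, using only information at the current hub."""
    if current == dst:
        return current
    wired = ring_distance(current, dst, n_hubs)
    target = wireless_out.get(current)
    # Take the local wireless link only if it strictly shortens the path.
    if target is not None and 1 + ring_distance(target, dst, n_hubs) < wired:
        return target
    # Otherwise follow the default ring routing (clockwise on a tie: assumption).
    clockwise = (dst - current) % n_hubs
    step = 1 if clockwise <= n_hubs - clockwise else -1
    return (current + step) % n_hubs


def route(src: int, dst: int, n_hubs: int, wireless_out: Dict[int, int]) -> List[int]:
    """Follow next_hop decisions from src until dst is reached."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst, n_hubs, wireless_out))
    return path
```

With 16 hubs and one wireless link from hub 0 to hub 8, `route(0, 9, 16, {0: 8})` uses the shortcut, whereas `route(1, 9, 16, {0: 8})` stays on the ring for eight hops because hub 1 has no wireless link of its own. This illustrates how purely local decisions can miss a shorter path that a search at the source hub would find.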

By using multiband laser sources to excite CNT antennas, different frequency channels can be assigned to pairs of communicating subnets. This will require using antenna elements tuned to different frequencies for each pair, thus creating a form of frequency division multiplexing (FDM) creating dedicated channels between a source and destination pair. This is possible by using CNTs of different lengths, which are multiples of the wavelengths of the respective carrier frequencies. High directional gains of these antennas, demonstrated in [14, 15], aid in creating directed channels between source and destination pairs. In [20], 24 continuous wave laser sources of different frequencies are used. Thus, these 24 different frequencies can be assigned to multiple wireless links in the WiNoC in such a way that a single frequency channel is used only once to avoid signal interference on the same frequencies. This enables concurrent use of multiband channels over the chip. The number of wireless links in the network can therefore vary from 24 links, each with a single frequency channel, to a single link with all 24 channels. Assigning multiple channels per link increases the link bandwidth. Currently, high-speed silicon integrated Mach-Zehnder optical modulators and demodulators, which convert electrical signals to optical signals and vice versa are commercially available [21]. The optical modulators can

provide 10Gbps data rate per channel on these links. At the receiver a low noise amplifier (LNA) can be used to boost the power of the received electrical signal, which will then be routed into the destination subnet. As noted in [20], this data rate is expected to increase manifold with future technology scaling. The modulation scheme adopted is non-coherent on-off keying (OOK), and therefore does not require complex clock recovery and synchronization circuits. Due to limitations in the number of distinct frequency channels that can be created through the CNT antennas, the flit width in NoCs is generally higher than the number of possible channels per link. Thus, to send a whole flit through the wireless link using a limited number of distinct frequencies, a proper channelization scheme needs to be adopted. In this work we assume a flit width of 32 bits. Hence, to send the whole flit using the distinct frequency channels, time division multiplexing (TDM) is adopted. The various components of the wireless channel, viz., the electro-optic modulators, the TDM modulator/demodulator, the LNA and the router for routing data on the network of hubs, are implemented as a part of the WB. Figure 3.3 illustrates the adopted communication mechanism for the inter-subnet data transfer. In this WiNoC example, we use a wireless link with 4 frequency channels. In this case, one flit is divided into 8 four-bit nibbles, and each nibble is assigned a 0.1ns timeslot, corresponding to a bit rate of 10 Gbps. The bits in each nibble are transmitted simultaneously over four different carrier frequencies.

Figure 3.3. Adopted communication protocol for the wireless channel

The routing mechanism discussed in this section is easily extendable to incorporate other addressing techniques like multicasting. The performance of traditional NoC architectures incorporating multicasting has already been investigated [22], and multicasting can similarly be used to enhance the performance of the WiNoC developed in this work. For example, let us consider a subnet in a 16-subnet system, which tries to send packets to 3 other subnets such that one of them is diagonally opposite to the source subnet and the other two are on either side of it. In the absence of long-range wireless links, using multicasting the zero load latency for the delivery of a single flit is 9 cycles, whereas without multicasting the same flit will need 11 cycles to be delivered to the respective destinations. Here the communication takes place only along the ring. However, if a wireless link exists along the diagonal from the source to the middle destination subnet, then with multicasting the flit can be transferred in 5 cycles if there are 8 distinct channels in the link. Four cycles are needed to transfer a 32-bit flit to the diagonally opposite hub via the wireless links and one more hop along the ring to the final destinations on either side. The efficiency of using multicasting varies with the number of channels in the link as it governs the bandwidth of the wireless link.
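The TDM channelization described in this section, in which a 32-bit flit is sent as a sequence of timeslots with one bit per frequency channel in each slot, can be sketched as below. The function names, the most-significant-group-first ordering, and the zero padding used when the flit width is not a multiple of the channel count are assumptions for illustration.

```python
from typing import List


def channelize(flit: int, flit_width: int = 32, n_channels: int = 4) -> List[int]:
    """Split a flit into timeslots; each slot carries n_channels bits in parallel."""
    n_slots = -(-flit_width // n_channels)  # ceiling division
    mask = (1 << n_channels) - 1
    # Most significant group first; missing high bits are implicitly zero.
    return [(flit >> (n_channels * (n_slots - 1 - i))) & mask for i in range(n_slots)]


def reassemble(slots: List[int], n_channels: int = 4) -> int:
    """Rebuild the flit at the receiver from the received timeslots."""
    flit = 0
    for slot in slots:
        flit = (flit << n_channels) | slot
    return flit


def flit_transfer_time_ns(
    flit_width: int = 32, n_channels: int = 4, slot_ns: float = 0.1
) -> float:
    """Time to send one flit, with one slot_ns timeslot per group of bits."""
    return (-(-flit_width // n_channels)) * slot_ns
```

With the defaults, `channelize` produces the 8 four-bit nibbles of the 4-channel example, and `flit_transfer_time_ns()` evaluates to eight 0.1 ns timeslots per flit. Passing `n_channels=24` models the single-link configuration, where a 32-bit flit needs two timeslots.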

## 3.2 Experimental Results

In this section we analyze the characteristics of the proposed WiNoC architectures and study trends in their performance with scaling of system size. For our experiments, we have considered three different system sizes, namely 128, 256, and 512 cores on a die of size 20mmx20mm. We observe results of scaling up the system size by increasing both the number of subnets as well as the number of cores per subnet. Hence, in one scenario, we have considered a fixed number of cores per subnet to be 16 and varied the number of subnets between 8, 16, and 32. In the other case, we have kept the number of subnets fixed at 16 and varied the size of the subnets from 8 to 32 cores. These system configurations are chosen based on the experiments explained later in section 3.2.3. Establishment of wireless links using simulated annealing, however, depends only on the number of hubs on the $2^{nd}$ level of the network.

## 3.2.1 Establishment of Wireless Links

Initially the hubs are connected in a ring through normal wires and the wireless links are established between randomly chosen hubs following the probability distribution given by (1). We then use simulated annealing to achieve an optimal configuration by finding the positions of the wireless links which minimize the average distance between all source and destination pairs in the network. Figure 3.4 shows the location of 1, 6 and 24 wireless links with 24, 4 and 1 channels respectively in a network of 16 hubs. We followed the same optimization methodology for all the other networks. The corresponding average distances for the optimized networks with different system sizes are shown in table 3.1. It should be noted that the particular placement of wireless links to obtain the optimal network configuration is not unique because of symmetry considerations in our setup, i.e., there are multiple configurations with the same optimal performance.

Figure 3.4. The optimal wireless link arrangement for (a) 1, (b) 6, and (c) 24 wireless links among 16 hubs. Note that symmetrical solutions with the same performance are possible.

Table 3.1. Average distance for optimized WiNoCs

| No. of subnets (N) | Avg. distance (hops), 1 wireless link | Avg. distance (hops), 6 wireless links | Avg. distance (hops), 24 wireless links |
|--------------------|---------------------------------------|----------------------------------------|-----------------------------------------|
| 8                  | 1.7188                                | 1.3125                                 | 1.1250 *                                |
| 16                 | 3.2891                                | 2.1875                                 | 1.5625                                  |
| 32                 | 6.3301                                | 3.8789                                 | 2.6309                                  |

\* In case of 8 subnets only 12 wireless links are used with 2 channels per link

In order to establish the performance of the SA algorithm used, we compared the resultant optimization metric with the metric obtained through exhaustive search for the optimized network configuration for various system sizes. The SA algorithm produces network configurations with total average hop count exactly equal to that generated by the exhaustive search technique for the system configurations considered in this work. However, the obtained WiNoC configuration in terms of topology is non-unique as different configurations can have the same average hop count. Figure 3.5(a) shows the number of iterations required to arrive at the optimal solution with SA and exhaustive search algorithms. Clearly the SA algorithm converges to the optimal configuration much faster than the exhaustive search technique. This advantage will increase for larger system sizes. Figure 3.5(b) shows the convergence of the metric for different values of the initial temperature to illustrate that the SA approach converges robustly to the optimal value of the average hop count with numerical variation in the temperature. This simulation was performed for a system with 32 subnets with 1 wireless link. With higher values of the initial temperature it can take longer to converge. Naturally, for large enough values of the initial temperature, the metric does not converge. On the other hand, lower values of the initial temperature make the system converge faster but at the risk of getting stuck in a local optimum. Using the network configurations developed in this subsection, we will now evaluate the performance of the WiNoC based on well-established performance metrics.

Figure 3.5. (a) Number of iterations required to reach optimal solution by the SA and exhaustive search methods (b) Convergence with different temperatures
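The optimization of wireless link positions can be sketched with a small simulated annealing loop. This is an illustration under stated assumptions, not the setup used for the reported results: links are modeled as bidirectional, the initial placement is drawn uniformly at random rather than from the probability distribution referred to in the text, the cooling schedule is geometric, and all parameter values and function names are hypothetical. The metric averages hop counts over all ordered hub pairs including self-pairs, an assumption chosen because it reproduces the single-link entries of table 3.1 when the link joins diametrically opposite hubs.

```python
import math
import random
from collections import deque
from typing import List, Sequence, Tuple

Link = Tuple[int, int]


def average_distance(n_hubs: int, shortcuts: Sequence[Link]) -> float:
    """Average hop count over all ordered hub pairs on a ring plus shortcuts."""
    adj = {h: {(h - 1) % n_hubs, (h + 1) % n_hubs} for h in range(n_hubs)}
    for a, b in shortcuts:
        adj[a].add(b)
        adj[b].add(a)
    total = 0
    for src in range(n_hubs):
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n_hubs * n_hubs)


def anneal_links(
    n_hubs: int,
    n_links: int,
    t_init: float = 1.0,
    cooling: float = 0.995,
    steps: int = 2000,
    seed: int = 0,
) -> Tuple[List[Link], float]:
    """Search for shortcut positions that minimize the average distance."""
    rng = random.Random(seed)

    def random_link() -> Link:
        a, b = rng.sample(range(n_hubs), 2)
        return (min(a, b), max(a, b))

    current = [random_link() for _ in range(n_links)]
    cur_cost = average_distance(n_hubs, current)
    best, best_cost = list(current), cur_cost
    temp = t_init
    for _ in range(steps):
        candidate = list(current)
        candidate[rng.randrange(n_links)] = random_link()  # move one link
        cost = average_distance(n_hubs, candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
        temp *= cooling
    return best, best_cost
```

As a check on the metric, `average_distance(8, [(0, 4)])` evaluates to 110/64 ≈ 1.7188 and `average_distance(16, [(0, 8)])` to 842/256 ≈ 3.2891, in agreement with the single-link column of table 3.1.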

#### 3.2.2 Performance Metrics

To characterize the performance of the proposed WiNoC architectures, we consider three network parameters: latency, throughput, and energy dissipation. Latency refers to the number of clock cycles between the injection of a message header flit at the source node and the reception of the tail flit at the destination. Throughput is defined as the average number of flits successfully received per embedded core per clock cycle. Energy dissipation per packet is the average energy dissipated by a single packet when routed from the source to destination node; both the wired subnets and the wireless channels contribute to this. For the subnets, the sources of energy dissipation are the inter-switch wires and the switch blocks. For the wireless channels, the main contribution comes from the WBs, which include antennas, transceiver circuits and other communication modules like the TDM block and the LNA. Energy dissipation per packet,  $E_{pkt}$ , can be calculated according to equation 3.5 below.

$$E_{pkt} = \frac{N_{\text{intrasubnet}} E_{subnet,hop} h_{subnet} + N_{\text{intersubnet}} E_{sw} h_{sw}}{(N_{\text{intrasubnet}} + N_{\text{intersubnet}})}$$
(3.5)

In equation 3.5, $N_{intrasubnet}$ and $N_{intersubnet}$ are the total number of packets routed within the subnet and between subnets respectively. $E_{subnet,hop}$ is the energy dissipated by a packet traversing a single hop on the wired subnet including a wired link and switch, and $E_{sw}$ is the energy dissipated by a packet traversing a single hop on the 2<sup>nd</sup> network level of the WiNoC, which has the small-world properties. $E_{sw}$ also includes the energy dissipation in the core-to-hub links. In equation 3.5, $h_{subnet}$ and $h_{sw}$ are the average number of hops per packet in the subnet and the small-world network.
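Equation 3.5 is a traffic-weighted average and can be evaluated directly. The sketch below is only a transcription of the formula; the function and parameter names are chosen for this example, and any numerical inputs a reader supplies are their own assumptions rather than values from this work.

```python
def packet_energy(
    n_intrasubnet: float,
    n_intersubnet: float,
    e_subnet_hop: float,
    h_subnet: float,
    e_sw: float,
    h_sw: float,
) -> float:
    """Average energy per packet following equation 3.5.

    Energies must share one unit (for example nJ); the result is in that unit.
    """
    total_packets = n_intrasubnet + n_intersubnet
    if total_packets == 0:
        raise ValueError("at least one packet is required")
    intra = n_intrasubnet * e_subnet_hop * h_subnet  # wired subnet contribution
    inter = n_intersubnet * e_sw * h_sw              # upper-level contribution
    return (intra + inter) / total_packets
```

For instance, with purely hypothetical inputs, `packet_energy(1, 1, 2.0, 3.0, 4.0, 5.0)` gives (6 + 20) / 2 = 13.0.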

## 3.2.3 Performance Evaluation

The network architectures developed earlier in this section are simulated using a cycle accurate simulator which models the progress of data flits accurately per clock cycle accounting for flits that reach destination as well as those that are dropped. One hundred thousand iterations were performed to reach stable results in each experiment, eliminating the effect of transients in the first few thousand cycles.

The mesh subnet architecture considered is shown in figure 3.1 (a). The width of all wired links is considered to be the same as the flit size, which is 32 bits in this work. The particular NoC switch architecture, adopted from [23] for the switches in the subnets, has three functional stages, namely, input arbitration, routing/switch traversal, and output arbitration. The input and output ports, including the ones on the wireless links, have four virtual channels per port, each having a buffer depth of 2 flits [23]. Each packet consists of 64 flits. Similar to the intra-subnet communication, we have adopted wormhole routing in the wireless channel too. Consequently, the hubs have similar architectures to the NoC switches in the subnets. Hence, each port of the hub has the same input and output arbiters, and an equal number of virtual channels with the same buffer depths as the subnet switches. The number of ports in a hub depends on the number of links connected to it. The hubs also have three functional stages, but as the number of cores increases in a subnet the delays in arbitration and switching for some cases are more than a clock cycle.



Figure 3.6. The components of a WB with multiple wired and wireless ports at input and output.

Depending on the subnet sizes, traversal through these stages needs multiple cycles, and this has been taken into consideration while evaluating the overall latency of the WiNoC. The wireless ports of the WBs are assumed to be equipped with antennas, TDM modules, and electro-optic modulators and demodulators. The various components of a WB are shown in figure 3.6. A hub consisting of only ports to wired links is also highlighted in the figure to emphasize that a WB has additional components compared to a hub. A simple flow control mechanism is adopted uniformly for wireless links, in which the sender WB stops transmitting flits only when a *full* signal is asserted from the receiver WB. This full signal is embedded in a control flit sent from the receiver to the sender only when the receiver buffer is filled above a predefined threshold. When the full signal is asserted, flits do not move and are blocked spanning multiple switches or hubs. This in turn can block other messages in the network, as in wormhole routing. In case all buffers are full, the newly injected packets from the cores are dropped until new buffer space is available. A more advanced flow control mechanism could be incorporated to improve WiNoC performance further [24]. The NoC switches, the hubs, and the wired links are driven with a clock of frequency 2.5 GHz.

Figure 3.7 shows throughput and latency plots as a function of injection load for a system with 256 cores divided into 16 subnets, each with 16 cores. The delays incurred by the wired links from the cores to the hub for varying number of cores in the subnets for different system sizes are shown in table 3.2. The delays in the inter-hub wires for varying number of subnets are also shown. As can be seen these delays are all less than the clock period of 400ps and it may be noted that the lengths of both core-to-hub and inter-hub wireline links will reduce with increase in the number of subnets as then each subnet becomes smaller in area and the subnets also come closer to each other. The delays incurred by the electro-optic signal conversions with the MZM devices are 20ps. When computing the overall system latency and throughput of the WiNoCs the



Figure 3.7. (a) Throughput and (b) latency of 256 core WiNoCs with different numbers of wireless links.

delays of these individual components are taken into account. This particular hierarchical topology was selected as it provided optimum system performance. Figure 3.8 shows the saturation throughputs for alternative ways of dividing the 256 core WiNoC into different numbers of subnets with a single wireless link. As can be seen from the plot all alternative configurations achieve worse saturation throughput. The same trend is observed if we vary the number of wireless links. Using the same method the suitable hierarchical division that achieves best performance is determined for all the other system sizes. For system sizes of 128 and 512, the hierarchical divisions considered here achieved much better performance compared to the other possible divisions with either lower or higher number of subnets.

By varying the number of channels in the wireless links, various WiNoC configurations are created. We have considered WiNoCs with 1, 6, and 24 wireless links in our experiments. Since the total number of frequencies considered in this work is 24, the number of channels per link is 24, 4 and 1 respectively. As can be seen from figure 3.7, the WiNoCs with different possible configurations outperform the single wired monolithic flat mesh architecture. It can also be observed that with increasing number of wireless links, throughput improves slightly. It should

be noted that even though increasing the number of links does increase the number of concurrent wireless communication links, the bandwidth on each link decreases as the total number of channels is fixed by the number of off-chip laser sources. This causes the total bandwidth over all the wireless channels to remain the same. The only difference is in the

Table 3.2. Delays on wired links in the WiNoCs

| System size | No. of subnets | Subnet size | Core-hub link delay (ps) | Inter-hub link delay (ps) |
|-------------|----------------|-------------|--------------------------|---------------------------|
| 128         | 8              | 16          | 96                       | 181/86*                   |
| 128         | 16             | 8           | 60                       | 86                        |
| 256         | 16             | 16          | 60                       | 86                        |
| 512         | 16             | 32          | 60                       | 86                        |
| 512         | 32             | 16          | 48                       | 86/43*                    |

<sup>\*</sup>for 8 and 32 subnets the inter-subnet distances are different along the two planar directions

degree distribution across the network. Consequently, network throughput increases only slightly with increasing number of wireless links. However, the hardware cost increases with increasing numbers of links, as discussed in section 3.2.7. Thus, depending upon whether the demand on performance is critical, the designer can choose to trade off the area overhead of



\*\* NS = number of subnets, SS = subnet size

Figure 3.8. Throughput of 256 core WiNoC for various hierarchical configurations.

deploying the maximum number of wireless links possible. However, if the constraints on area overhead are really stringent then one can choose to employ only one wireless link and consequently provide more bandwidth per link and have only a little negative effect on performance.

In order to observe trends among various WiNoC configurations, we performed further analysis. Figure 3.9 (a) shows the throughput at network saturation for various system sizes while keeping the subnet size fixed for different numbers of wireless links. Figure 3.9(b) shows the variation in throughput at saturation for different system sizes for a fixed number of subnets. For comparison, the throughput at network saturation for a single traditional wired mesh NoC of each system size is also shown in both of the plots. As in figure 3.7 it may be noted from figure 3.9 that for a WiNoC of any given size, number of subnets and subnet size, the throughput increases with increase in number of wireless links deployed.



Figure 3.9. Saturation throughput with varying (a) number of subnets and (b) size of each subnet.

As can be observed from the plots, the maximum achievable throughput in WiNoCs degrades with increasing system size for both cases. However, by scaling up the number of subnets, the degradation in throughput is smaller compared to when the subnet size is scaled up. By increasing the subnet size, we are increasing congestion in the wired subnets and load on the hubs and not fully using the capacity of the high speed wireless links in the upper level of the network. When the number of subnets scales up, traffic congestion in the subnets does not get worse and the optimal placement of the wireless links makes the top level network very efficient for data transfer. The effect on throughput with increasing system size is therefore marginal.

To determine the energy dissipation characteristics of the WiNoCs, we first estimated the energy dissipated by the antenna elements. As noted in [14], the directional gain of MWCNT antennas we propose to use is very high. The ratio of emitted power to incident power is around -5dB along the direction of maximum gain. Assuming an ideal line-of-sight channel over a few millimeters, transmitted power degrades with distance following the inverse square law. Therefore the received power  $P_R$  can be related to the transmitted power  $P_T$  as

$$P_R = \frac{G_T A_R}{4\pi R^2} P_T.$$
(3.6)

In equation 3.6, $G_T$ is the transmitter antenna gain, which can be assumed to be -5dB [14]. $A_R$ is the area of the receiving antenna and R is the distance between the transmitter and receiver. The energy dissipation of the transmitting antennas therefore depends on the range of communication. The area of the receiving antenna can be found by using the antenna configuration used in [14]. It uses a MWCNT of diameter 200nm and length $7\lambda$, where $\lambda$ is the optical wavelength. The length $7\lambda$ was chosen as it was shown to produce the highest directional gain, $G_T$, at the transmitter. In one of the setups in [14], the wavelength of the laser used was 543.5nm, and hence the length of the antenna is around 3.8µm. Using these dimensions, the area of the receiving antenna, $A_R$, can be calculated.

The noise floor of the LNA [25] is -101dBm. Considering that the MZM demodulators cause an additional loss of up to 3dB over the operational bandwidth, the receiver sensitivity turns out to be -98dBm in the worst case. The length of the longest possible wireless link considered among all WiNoC configurations is 23mm. For this length and receiver sensitivity, a transmitted power of 1.3mW is required. Considering the energy dissipation at the transmitting and receiving antennas, and the components of the transmitter and receiver circuitry such as the MZM, TDM block and the LNA, the energy dissipation of the longest possible wireless link on the chip is 0.33 pJ/bit. The energy dissipation of a wireless link, $E_{Link}$, is given as
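The link budget arithmetic of equation 3.6 can be expressed as a pair of helper functions. This is a hedged sketch: the function names are invented for this example, and the receiving area is left as a parameter because its numerical value is not stated in the text, so the helpers are not expected to reproduce the quoted transmit power without the area actually used.

```python
import math


def db_to_ratio(db: float) -> float:
    """Convert a gain in dB to a linear power ratio."""
    return 10 ** (db / 10)


def dbm_to_watts(dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 1e-3 * 10 ** (dbm / 10)


def watts_to_dbm(watts: float) -> float:
    """Convert a power level in watts to dBm."""
    return 10 * math.log10(watts / 1e-3)


def received_power_w(p_t_w: float, gain_db: float, rx_area_m2: float, distance_m: float) -> float:
    """Equation 3.6: P_R = G_T * A_R / (4 * pi * R^2) * P_T."""
    return db_to_ratio(gain_db) * rx_area_m2 / (4 * math.pi * distance_m ** 2) * p_t_w


def required_tx_power_w(sensitivity_dbm: float, gain_db: float, rx_area_m2: float, distance_m: float) -> float:
    """Equation 3.6 solved for P_T when P_R equals the receiver sensitivity."""
    p_r_w = dbm_to_watts(sensitivity_dbm)
    return p_r_w * 4 * math.pi * distance_m ** 2 / (db_to_ratio(gain_db) * rx_area_m2)


# Quantities stated in the text.
ANTENNA_LENGTH_M = 7 * 543.5e-9        # 7 wavelengths at 543.5 nm, about 3.8 um
RECEIVER_SENSITIVITY_DBM = -101 + 3    # LNA noise floor plus up to 3 dB MZM loss
```

Calling `required_tx_power_w(RECEIVER_SENSITIVITY_DBM, -5, rx_area_m2, 0.023)` with the appropriate receiving area gives the transmit power needed for the longest, 23 mm, link.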

$$E_{Link} = \sum_{i=1}^{m} (E_{antenna,i} + E_{transceiver,i}), \qquad (3.7)$$

where *m* is the number of frequency channels in the link and  $E_{antenna,i}$  and  $E_{transceiver,i}$  are the energy dissipations of the antenna element and transceiver circuits for the *i*th frequency in the link.

The network switches and hubs are synthesized from an RTL-level design using 65nm standard cell libraries from CMP [26], using Synopsys Design Vision and assuming a clock frequency of 2.5 GHz. A large set of data patterns was fed into the gate-level netlists of the network switches and hubs, and by running Synopsys<sup>TM</sup> Prime Power, their energy dissipation was obtained.

The energy dissipation of the wired links depends on their lengths. The lengths of the interswitch wires in the subnets can be found by using the formula

$$l_M = \frac{l_{edge}}{M - 1}.$$
(3.8)

Here, M is number of cores along a particular edge of the subnet and  $l_{edge}$  is the length of that edge. A 20mmx20mm die size is considered for all system sizes in our simulations. The interhub wire lengths are also computed similarly as these are assumed to be connected by wires parallel to the edges of the die in rectangular dimensions only. Hence, to compute inter-hub distances along the ring, parallel to a particular edge of the die, equation 3.8 is modified by changing M to the number of hubs along that edge and  $l_{edge}$  to the length of that particular edge. In each subnet the lengths of the links connecting the switches to the hub depend on the position of the switches as shown in figure 3.1(a). The capacitances of each wired link, and subsequently their energy dissipation, were obtained through HSPICE simulations taking into account the specific layout for the subnets and the 2<sup>nd</sup> level of the ring network.
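Equation 3.8 and its inter-hub variant amount to the same spacing calculation, sketched below. The function name is chosen for this example, and the numerical cases in the usage note (four hubs along a 20 mm die edge, and a 5 mm subnet edge with four cores along it) are illustrative assumptions rather than layouts taken from this work.

```python
def link_length_mm(edge_mm: float, nodes_along_edge: int) -> float:
    """Equation 3.8: spacing between adjacent nodes placed along an edge.

    nodes_along_edge is M, the number of cores along a subnet edge, or the
    number of hubs along a die edge when computing inter-hub wire lengths.
    """
    if nodes_along_edge < 2:
        raise ValueError("at least two nodes are needed to define a link")
    return edge_mm / (nodes_along_edge - 1)
```

Under the assumptions above, `link_length_mm(20.0, 4)` gives an inter-hub wire of about 6.67 mm, and `link_length_mm(5.0, 4)` gives an inter-switch wire of about 1.67 mm.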

Figures 3.10 (a) and (b) show the packet energy dissipation for each of the network configurations considered in this work. The packet energy for the flat wired mesh architecture is not shown as it is higher than that of the WiNoCs by orders of magnitude, and hence cannot be shown on the same scale. The comparison with the wired case is shown in table 3.3 in the next subsection along with another hierarchical wired architecture. From the plots it is clear that the packet energy dissipation increases with increasing system size. However, scaling up the number

| System<br>Size | Subnet<br>Size | No. of<br>Subnets | Flat<br>Mesh<br>(nJ) | WiNoC<br>(nJ) | NoC with<br>G-Line (nJ) |
|----------------|----------------|-------------------|----------------------|---------------|-------------------------|
| 128            | 16             | 8                 | 1319                 | 22.57         | 490.3                   |
| 256            | 16             | 16                | 2936                 | 24.02         | 734.5                   |
| 512            | 16             | 32                | 4992                 | 37.48         | 1012.8                  |

Table 3.3. Packet energy dissipation for flat wired mesh, WiNoC and hierarchical G-line NoC architectures

of subnets has a lower impact on the average packet energy. The reason for this is that the throughput does not degrade much and the average latency per packet also does not change significantly. Hence, the data packets occupy the network resources for less duration, causing only a small increase in packet energy. However, with an increase in subnet size, the throughput degrades noticeably, and so does latency. In this case, the packet energy increases significantly as each packet occupies network resources for a longer period of time. With an increase in the number of wireless links while keeping the number of subnets and subnet size constant, the packet energy decreases. This is because higher connectivity of the network results in higher throughput (or lower latency), which means that packets get routed faster, occupy network



Figure 3.10. Packet energy dissipation with varying (a) number of subnets and (b) size of each subnet.

resources for less time, and consume less energy during the transmission. Since the wireline subnets pose the major bottleneck as is made evident by the trends in the plots of figures 3.9 and 3.10 their size should be optimized. In other words, smaller subnets imply better performance and lower packet energies. Hence, as a designer one should target to limit the size of the subnets as long as the size of the upper level of the network does not impact the performance of the overall system negatively. The exact optimal solution also depends on the architecture of the upper level of the network, which need not be restricted to the ring topology chosen in this work as an example.

The adopted centralized routing strategy is compared with the distributed routing discussed in section 3.1.4 for a WiNoC of size 256 cores split into 16 subnets with 16 cores in each. Twenty-four wireless links were deployed in the exact same topology for both cases. With distributed routing the throughput was 0.67 flits/core/cycle, whereas with centralized routing it was 0.72 flits/core/cycle as already noted in figure 3.9, which is 7.5% higher. Distributed routing uses non-optimal paths in some cases, because each decision is made locally at the current node. Hence, the distributed routing has lower throughput. The distributed routing dissipates a packet energy of 31.2 nJ compared to 30.8 nJ with centralized routing. This is because on average the number of path length computations per packet is higher with the distributed routing, as this computation occurs at every intermediate WB. However, with centralized routing each hub has additional hardware overhead to compute the shortest path by comparing all the paths using the wireless links. This hardware area cost is discussed in section 3.2.7.

## 3.2.4 Comparison with wired NoCs

We evaluated the performance of the WiNoCs in terms of energy dissipation compared to different wired NoC architectures. As demonstrated in the last sub-section, with increase in system size, increasing the number of subnets while keeping the subnet size fixed is a better scaling strategy; hence, we followed that in the following analysis.

Figure 3.11. Components of packet energy dissipation for a (a) flat mesh and (b) WiNoC. Values of energy dissipation are labeled in nJ units.

The first wired architecture considered was the conventional flat mesh architecture. Table 3.3 quantifies the energy dissipation per packet of the WiNoC and the wired architectures for various system sizes. The WiNoC configuration with 24 wireless links was chosen because it has the lowest packet energy dissipation among all the possible hybrid wired/wireless configurations. It is evident that the WiNoC consumes orders of magnitude less energy compared to the flat wired mesh network. Figure 3.11 shows the contributions of the various components of the packet energy dissipation for the WiNoC with 24 wireless links and the flat mesh architecture for a system size of 256 cores. The contributions of the antenna and the transceiver, which constitute the wireless link energy, are shown separately from the wireline links of the upper level small-world network. The largest contribution to packet energy in WiNoC is from the wireless and wireline link traversals combined in the upper level small-world network. This is because on an average a large portion of the packets travel through the upper level of the WiNoC to reach other subnets. However as this level has very small average path length due to its small-world nature and due to the low power wireless channels the absolute value of this energy dissipation is very

small.

The performance of the flat mesh NoC architectures can be improved by incorporating express virtual channels (EVC), which connect the distant cores in the network by bypassing intermediate switches/routers. It is demonstrated that the switch/router energy dissipation of the baseline mesh architecture is improved by about 25-38% depending on the system size by using dynamic EVCs. The energy dissipation profile is improved by another 8% over the EVC scheme by using low-swing, multi-drop, ultra low latency global interconnect (G-Lines) for the flow control signals [24]. Recently a number of papers have shown the possibility of communicating near speed-of-light across several millimeters on a silicon substrate. Among them, low swing, long range, and ultra-low-latency communication wires as proposed in [27] achieve higher bandwidth at lower power consumption [24]. G-lines use a capacitive pre-emphasis transmitter that increases the bandwidth and decreases the voltage swing without the need of an additional power supply. To avoid cross-talk, differential interconnects are implemented with a pair of twisted wires. A decision feedback equalizer is employed at the receiver to further increase the achievable data rate. It is evident that though introduction of EVCs improves the energy dissipation profile of a flat wired mesh NoC, the achievable performance gain is still limited compared to the gains achieved by the WiNoCs. This is because the basic architecture is still a flat mesh and the savings in energy principally arises from bypassing the intermediate NoC switches.

As a next step, we undertook a study where we compared the energy dissipation profile of the proposed hybrid NoC architecture using wireless links to that of the same hierarchical network using G-Lines as long-range communication links. To do so, we replaced the wireless links of the WiNoCs by the G-Lines while maintaining the same hierarchical topology with shortcuts in the upper level. Here, each G-line link is designed such that it has the same bandwidth as the wireless link it replaces. Thus the overall throughput and end-to-end latency of the hierarchical NoC with G-line links is the same as that of the WiNoC. We performed simulations in 65nm technology. The lumped wire resistance is 20 ohms/mm, and the capacitance is 400fF/mm. The simulated power dissipation is found to be 0.6mW/transmitter and 0.4mW/receiver. In order to achieve the same bandwidth as the wireless links in our experiments, multiple G-line links are used in place of a single wireless channel between a pair of source and destination hubs. For example, a single G-line can sustain a bandwidth of around 2.5 Gbps for a wire length of 11 mm, whereas each wireless channel can sustain a bandwidth of 10 Gbps. Therefore, to maintain the same datarate as provided by a single wireless channel, we need 4 G-lines between a source and destination pair separated by 11 mm. Moreover, since each G-line works on differential signals, we will need 8 wires to replace a single wireless link in this case.
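The link-count arithmetic above can be sketched as follows. This is only an illustration of the calculation in the text; the function name `glines_needed` is hypothetical, and the bandwidth figures are the ones quoted above for an 11 mm link.

```python
import math

def glines_needed(wireless_gbps, gline_gbps):
    """Return (number of G-line links, number of physical wires) needed
    to match the bandwidth of one wireless channel."""
    links = math.ceil(wireless_gbps / gline_gbps)
    wires = 2 * links  # each G-line is a differential pair of wires
    return links, wires

# 10 Gbps wireless channel replaced by 2.5 Gbps G-lines (11 mm wire length)
links, wires = glines_needed(10.0, 2.5)
print(links, wires)
```

For the values quoted in the text this gives 4 G-lines and 8 wires per replaced wireless link.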

The packet energy dissipation for the WiNoC and for the hierarchical NoC with G-line links is also shown in table 3.3 for various system sizes. The WiNoC's energy consumption per packet is one order of magnitude lower than that of the hierarchical NoC with G-line links of the same bandwidth as the wireless channels. This experiment was conducted to highlight the savings in energy dissipation due to two factors, viz., the architectural innovation proposed here and the use of on-chip wireless links in place of highly optimized wired connections. The difference in energy dissipation between the flat wired mesh NoC and the hybrid NoC with G-lines arises primarily due to the architecture proposed here. The difference in energy dissipation between the WiNoCs and the hybrid NoC with G-lines is solely due to the use of wireless channels. Figure 3.12 shows the energy dissipation of a wireless link considered in this work and that of the G-line link as a function of communication distance between the source and destination WBs. This shows how high the energy dissipation of a G-line link of the same bandwidth as the wireless link is. The impact of this is reflected in the packet energy dissipation profiles shown in table 3.3, which are obtained after full system simulation using these links.

Figure 3.12. Energy dissipation per bit on G-line and wireless links with varying link lengths.

Table 3.4 shows the percentage of total packet energy dissipated on the wired and wireless links for a WiNoC with 128 cores divided into 16 subnets. The percentage of packet energy dissipated on the G-line links replacing the wireless links is also shown to signify the trade-off in energy dissipation as more wireline (G-line) links are replaced with wireless links for a single network configuration. As shown in tables 3.3 and 3.4, the hierarchical NoC with G-line links dissipates higher packet energy than the WiNoC, and the long-distance G-line links dissipate a considerably larger proportion of that high packet energy.

Table 3.4. Percentage of packet energy dissipation on long-range links

| No. of links | 1    | 6    | 24   |
|--------------|------|------|------|
| Wireless     | 0.5  | 2.8  | 3.4  |
| G-Line       | 47.3 | 95.8 | 98.5 |

Another wireline architecture developed in [6] uses long range wired shortcuts to design a small world network over a basic mesh topology. We considered a system size of 128 cores and eight wireline shortcuts were optimally deployed on a basic wireline mesh following the scaling trend outlined in [6]. The chosen WiNoC configuration was 16 subnets with 8 cores in each with 8 wireless links. The throughput of the wireline small-world NoC proposed in [6] was 0.26 flits/core/cycle, which is 18.7% more than that of a flat mesh NoC. In comparison the WiNoC had a throughput of 0.75 flits/core/cycle. This huge gain was due to the hierarchical division of the whole NoC as well as the high bandwidth wireless links used in creating the shortcuts. The packet energy dissipation for the NoC proposed in [6] for the configuration mentioned above is 984nJ. This energy dissipation is about 25% less than the packet energy dissipation in a flat mesh. However, even this packet energy is an order of magnitude higher than that of the WiNoC for the same size of 128 cores as shown in table 3.3.

From the above analysis it is clear that the proposed WiNoC architectures outperform their corresponding wired counterparts significantly in terms of all the relevant network parameters. Moreover, the WiNoC is much more energy efficient compared to an exactly equivalent hierarchical wired architecture implemented with the recently proposed high bandwidth low latency G-lines as the long-range communication links.

#### 3.2.5 Comparative analysis with other emerging NoC paradigms

There are several emerging paradigms which enhance the performance of NoCs using nontraditional technology such as three-dimensional integration, photonic interconnects, RF interconnect (RF-I) and on-chip wireless communication using UWB links. In this subsection we perform a comparative analysis to establish the relative performance benefits achieved by using these alternative techniques with specific system parameters. We consider a system with 128 cores and packet size of 64 flits. We map this to a 3D mesh-based NoC with four layers as in [28]. The photonic NoC architecture was adopted from [17]. For the RF-I NoC, we followed the architecture of [29] with eight sectors. For the UWB NoC, we followed the design shown in [11]. Figure 3.13 shows the achievable overall network bandwidth and the packet energy dissipation per unit bandwidth for all the different NoCs using alternative interconnect technologies. We considered the packet energy dissipation per unit bandwidth as the various NoCs are capable of achieving different throughputs. The WiNoC network considered in this comparison had 16 subnets. Due to the multi-hop nature of communication, relatively higher transceiver energy, and lower achievable bandwidth, the UWB NoC dissipates 432.9 nJ/Tbps, which is orders of magnitude more compared to all the other alternative solutions. For this reason, the UWB NoC energy is not shown in the same plot. The achievable bandwidth of this architecture is 1.04Tbps, which is also lower than that of the other emerging alternatives considered here.

Figure 3.13. Packet energy dissipation per bandwidth and achievable NoC bandwidth for various types of emerging NoC paradigms for system size of 128 cores.

It can be observed that among all the emerging NoC architectures, the hybrid WiNoC proposed in this work has the lowest packet energy dissipation per unit bandwidth and it also has the highest peak bandwidth. This is because in the WiNoC each of the 24 wireless channels can sustain a data rate of 10Gbps. WiNoC reduces the average hop count compared to both the 3D and RF-I NoCs. The photonic NoC considered in the comparative evaluation requires an electrical control network to configure photonic switching elements which uses a flat wireline mesh NoC. This causes overheads and hence limits its performance. However, this overhead can be reduced for longer packets. The lower bandwidth of the RF-I NoC compared to the WiNoC is due to the fact that it is essentially a flat wireline architecture as it uses a waveguide overlaid on an existing wireline mesh. The drop points to the RF-I become hotspots limiting the performance of the RF-I NoC.

However, an alternative photonic NoC architecture, Corona, demonstrated in [30] employs an optical network amalgamated onto a 3D chip. This particular architecture achieves a higher bandwidth than the WiNoC as it takes advantage of both photonic links and 3D integration simultaneously. For a uniform traffic distribution, Corona with 256 cores is shown to achieve a bandwidth of 4.5TBps. In comparison a 256 core WiNoC segmented into 16 subnets with 24 wireless links can achieve a peak bandwidth of 1.8TBps for the same traffic pattern. A more detailed performance benchmarking across all emerging NoC paradigms for various system parameters is the subject of a future investigation.

#### 3.2.6 Traffic dependent wireless link insertion

So far we assumed a uniformly random spatial distribution of traffic between the hubs. We also principally considered the distance between the hubs to be the deciding factor for choosing the positions of the wireless links. However, in reality there could be non-uniform traffic distributions with a particular pair of hubs communicating more frequently between themselves than with the others. In order to optimize our network for such non-uniform traffic scenarios we modify (1) and also the optimization metric, which was based only on distances between cores earlier. Equation (1) is modified as shown below in (9).

$$P_{ij} = \frac{h_{ij} f_{ij}}{\sum_{i,j} h_{ij} f_{ij}}$$
(9)

In (9), $f_{ij}$ is the frequency of communication between the *i*<sup>th</sup> source and *j*<sup>th</sup> destination. This frequency is expressed as the percentage of traffic generated from *i* that is addressed to *j*. This frequency distribution is based on the particular application mapped to the overall NoC and is hence set prior to wireless link insertion. Therefore, the *a priori* knowledge of the traffic pattern is used to optimize the WiNoC. This optimization approach establishes a correlation between traffic distribution across the NoC and network configuration as in [31]. The optimization metric, which was just the sum of distances between all possible source and destination pairs of hubs in the previous experiments, needs to be modified to factor in the effect of non-uniform traffic:

$$\mu = \sum_{i,j} h_{ij} f_{ij} .$$
 (10)

where,  $\mu$  is the optimization metric. In this particular case, equal weight is attached to distance as well as frequency of communication in the metric. Using this modified metric the simulated annealing algorithm is used to insert the wireless links for optimized performance. To represent non-uniform traffic patterns we considered both synthetic and application-specific traffic patterns.
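As a small numerical illustration of (9) and (10), the sketch below evaluates both expressions for a hypothetical four-hub ring. It assumes that $h_{ij}$ is the distance in hops between hubs and that $f_{ij}$ is given as a fraction of the traffic generated at hub *i*; the function names and the example matrices are illustrative only and are not taken from the experiments.

```python
def link_probabilities(h, f):
    """Eq. (9): P_ij = h_ij * f_ij / sum over all (i, j) of h_ij * f_ij."""
    n = len(h)
    weights = [[h[i][j] * f[i][j] for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def metric(h, f):
    """Eq. (10): mu = sum over all (i, j) of h_ij * f_ij."""
    n = len(h)
    return sum(h[i][j] * f[i][j] for i in range(n) for j in range(n))

# Hypothetical example: 4 hubs on a ring, h_ij = hop distance on the ring.
h = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
third = 1.0 / 3.0
# Hub 0 sends half of its traffic to hub 2; the other hubs send uniformly.
f = [[0.0, 0.25, 0.5, 0.25],
     [third, 0.0, third, third],
     [third, third, 0.0, third],
     [third, third, third, 0.0]]

P = link_probabilities(h, f)
print(metric(h, f))   # weighted distance to be minimized
print(P[0][2])        # the distant, frequently communicating pair gets the largest P_ij
```

In this example the pair (0, 2) is both the most distant and the most frequently communicating, so it receives the highest probability of being assigned a long-range link.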

We considered two types of synthetic traffic to evaluate the performance of the proposed WiNoC architecture. First, a *transpose* traffic pattern [6] was considered where a certain number of hubs were considered to communicate more frequently with each other. We considered 1, 3 and 5 such pairs and called them *transpose1*, *transpose3* and *transpose5* respectively. The system size considered was 128 with 16 subnets and 4 wireless links. Fifty percent of packets generated from one of these hubs were targeted towards the other in the pair. The other synthetic traffic pattern considered was the *hotspot* [6], where each hub communicates with a certain number of hubs more frequently than with the others. We have considered three such hotspot locations to which all other hubs send 50% of the packets that originate from them. To represent application-based traffic patterns, two scientific applications were mapped onto the 128-core NoC considered here. First, a 256-point fast Fourier transform (FFT) application was considered, wherein each core performs a 2-point radix-2 FFT computation. Secondly, the traffic pattern generated in performing multiplication of two 128x128 matrices was considered.
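The two synthetic patterns can be expressed as frequency matrices $f_{ij}$ along the following lines. This is a sketch under stated assumptions: the remaining 50% of each hub's traffic is spread uniformly over the other hubs, the hotspot share is divided equally among the hotspot hubs, and the hotspot hubs themselves generate uniform traffic. None of these details is specified in the text, and the function names and hub indices are hypothetical.

```python
def uniform_traffic(n):
    """f_ij for uniformly random traffic among n hubs."""
    return [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]

def transpose_traffic(n, pairs, share=0.5):
    """Each hub of a pair sends `share` of its packets to its partner."""
    f = uniform_traffic(n)
    rest = (1.0 - share) / (n - 2)  # assumption: remainder spread uniformly
    for a, b in pairs:
        for src, dst in ((a, b), (b, a)):
            f[src] = [0.0 if j == src else share if j == dst else rest
                      for j in range(n)]
    return f

def hotspot_traffic(n, hotspots, share=0.5):
    """Every non-hotspot hub sends `share` of its packets to the hotspots."""
    f = uniform_traffic(n)  # assumption: hotspot hubs keep uniform traffic
    for src in range(n):
        if src in hotspots:
            continue
        others = [j for j in range(n) if j != src and j not in hotspots]
        row = [0.0] * n
        for j in hotspots:
            row[j] = share / len(hotspots)      # assumption: equal split
        for j in others:
            row[j] = (1.0 - share) / len(others)
        f[src] = row
    return f

# 16 hubs: transpose3 with pairs on ring diagonals, and three hotspot hubs.
transpose3 = transpose_traffic(16, [(0, 8), (2, 10), (4, 12)])
hotspot = hotspot_traffic(16, [1, 6, 11])
```

Each row of the resulting matrices sums to one, so a row can be read directly as the destination distribution of the packets generated at that hub.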

For all the above non-uniform traffic distributions, the SA algorithm achieves the optimal configuration faster than the exhaustive search, though it takes more iterations than the case with uniform traffic distribution.
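A minimal sketch of how a simulated annealing search over link placements might look is given below. It is not the exact algorithm used in this work: it assumes a ring of hubs in the upper level, takes $h_{ij}$ to be the shortest-path hop count on the ring plus the inserted links, and uses a simple geometric cooling schedule. All names and parameter values (`anneal`, `steps`, `t0`, `alpha`) are illustrative assumptions.

```python
import math
import random
from collections import deque
from itertools import combinations

def hop_counts(n, links):
    """All-pairs hop counts on an n-hub ring plus shortcut links (BFS)."""
    adj = [{(i - 1) % n, (i + 1) % n} for i in range(n)]
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    dist = []
    for s in range(n):
        d = [-1] * n
        d[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if d[v] < 0:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist.append(d)
    return dist

def mu(n, links, f):
    """Metric (10) evaluated on the ring with the given shortcut links."""
    h = hop_counts(n, links)
    return sum(h[i][j] * f[i][j] for i in range(n) for j in range(n))

def anneal(n, k, f, steps=2000, t0=1.0, alpha=0.995, seed=0):
    """Place k shortcut links so as to reduce mu; returns (links, cost)."""
    rng = random.Random(seed)
    # Candidate links: all hub pairs that are not already ring neighbours.
    candidates = [p for p in combinations(range(n), 2)
                  if (p[1] - p[0]) % n not in (1, n - 1)]
    current = rng.sample(candidates, k)
    cost = mu(n, current, f)
    best, best_cost = list(current), cost
    t = t0
    for _ in range(steps):
        trial = list(current)
        trial[rng.randrange(k)] = rng.choice(candidates)  # move one link
        if len(set(trial)) == k:  # skip moves that duplicate a link
            trial_cost = mu(n, trial, f)
            # Always accept improvements; accept worse moves with
            # probability exp(-increase / temperature).
            if trial_cost <= cost or rng.random() < math.exp((cost - trial_cost) / t):
                current, cost = trial, trial_cost
                if cost < best_cost:
                    best, best_cost = list(current), cost
        t *= alpha  # geometric cooling
    return best, best_cost

# Hypothetical usage: 16 hubs, one heavily communicating diagonal pair.
n = 16
f = [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]
for a, b in ((0, 8), (8, 0)):
    f[a] = [0.0 if j == a else 0.5 if j == b else 0.5 / (n - 2) for j in range(n)]
links, cost = anneal(n, k=4, f=f)
print(links, cost, mu(n, [], f))
```

Each step evaluates a single candidate configuration, whereas an exhaustive search would have to evaluate every combination of `k` links out of all candidate pairs, which grows combinatorially with `k`.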

Figure 3.14 shows the maximum throughput at network saturation for non-uniform traffic distributions with and without wireless links. For the transpose traffic, the pairs of nodes are chosen along the diagonals of the ring topology of the upper level of the network to incorporate the worst-case effect on the throughput due to non-uniform traffic. Without any long-range wireless links, the throughput of the network decreases with increase in the number of transpose pairs. As these pairs are placed along the diagonals of the ring the amount of multi-hop communication increases and that affects the throughput. However, by inserting the wireless links between these highly communicating hubs, the throughput can be increased significantly. It is evident from figure 3.14 that with increase in the number of highly communicating pairs, insertion of the wireless links brings significantly more performance gain. When the traffic is uniform, insertion of the wireless links improves the performance by around 104%. In the case of non-uniform traffic distribution with 5 transpose pairs, the performance improvement is 243%. The optimization of the network configuration for hotspot traffic also yields similar results where the gain in throughput becomes 209%. For the FFT and the matrix multiplication applications, the percentage improvements in throughput are 209% and 226% respectively. In case of the application-specific non-uniform traffic patterns, the wireless links are inserted depending on the mutual interactions among various cores of the system giving a preference to more frequently communicating subnets. For each application-specific traffic pattern, the wireless link insertion process finds a corresponding optimum network configuration. Due to this inherent correlation between the optimization of the network configuration and the traffic pattern, the gains in throughput with non-uniform traffic are larger than that with uniform traffic.

Figure 3.14. Throughput of 128 core WiNoC with various traffic patterns.

#### 3.2.7 Area Overheads

In this section we present a detailed analysis of the area overhead involved in overall wireless deployment in the hierarchical WiNoC. A key advantage of CNT antennas is that they are nanoscale structures with diameters of only a few hundreds of nanometers. The length of the nanotubes required for the WiNoCs varies in the range of a few microns. Hence the areas of the antenna elements themselves are very small. The area of the transceiver circuits required per wireless port is the total area required for the TDM modulator/demodulator, the MZM modulator/demodulator and the LNAs. The area overhead for a link varies depending upon the number of channels per link, as the TDM modulator/demodulator complexity changes with the number of wireless links. The area complexity increases with increase in the number of wireless links. The area complexity is less than the silicon area of the hubs.

Table 3.5. Total area overhead of wireless ports

| Number of wireless links | Total area of wireless ports ($\mu m^2$) |
|--------------------------|------------------------------------------|
| 1                        | 151                                      |
| 6                        | 5343                                     |
| 24                       | 85443                                    |

The arbitration, data routing and storage components of the hubs have an area cost that depends upon the number of ports per hub. The number of wireline ports, in turn, varies with the number of cores in the subnets. The number of wireline ports is equal to two more than the number of cores in each subnet, because each hub is connected to its two neighboring hubs in addition to the cores in its subnet. This area cost is independent of the deployment of the wireless links, as the wireless links are just extra ports deployed to the hubs whose area is already quantified. Figure 3.15 shows the total area cost of WiNoCs of various system sizes with different numbers of subnets. Since flat mesh switches still exist in the WiNoCs, the total area is the sum of the areas of the flat mesh switches and the hubs. The additional silicon area overhead due to the hubs is at most 25% for the system configurations considered in this work. This area cost analysis is done for hubs which implement the centralized routing strategy adopted throughout the work. However, due to the higher complexity of the routers in this strategy compared to that in the distributed routing strategy, the area of each hub is up to 10% higher with centralized routing than with distributed routing.

<sup>\*\*</sup> NS = number of subnets, SS = subnet size

Figure 3.15. Total silicon area in WiNoC hubs and switches.

The core-hub and the inter-hub wireline links are additional wiring overheads, which depend on the number and size of subnets. Figure 3.16 shows the total wiring requirements of various lengths for a 20mm x 20mm die for the various system configurations considered in this work. The wiring requirements for a flat mesh architecture are shown for comparison. There are only a very few 10 mm long links in the 128 core WiNoC and hence they are not apparent in the figure. It may be noted that in the WiNoC there are no inter-subnet direct core to core links, as inter-subnet communication occurs through the hubs. Hence, WiNoCs eliminate a number of wireline links along the subnet boundaries which are present in the flat mesh topology.

<sup>\*\*</sup> NS = number of subnets, SS = subnet size of WiNoC

Figure 3.16. Wiring requirement of WiNoC.

## 3.3 Conclusions

In this chapter, we propose and evaluate the performance of Wireless Network-on-Chip (WiNoC) architectures used as communication backbones for multi-core systems. By establishing long-range wireless links between distant cores and incorporating small-world network architectures, the WiNoCs are capable of outperforming their more traditional wired counterparts in terms of network throughput, latency, and energy dissipation. With increase in system size, increasing the number of subnets provides an efficient scaling technique without significantly degrading system performance. The architectural innovations proposed in this work are made possible by the use of low power and high speed wireless links capable of communicating directly between distant parts of the chip in a single hop. The gains in network performance metrics are in part due to the architecture and the rest is due to the adopted high bandwidth, energy efficient wireless links. Optimum placement of the wireless links based on non-uniform traffic distribution improves the performance of the WiNoC significantly more compared to a uniform traffic scenario.

As a part of this work, we evaluated performance of the WiNoC with respect to other emerging NoC architectures for a specific system configuration. It can be concluded that in the future, a hybrid NoC with wired as well as wireless links is one of the possible alternatives that will deliver the target performance of multi-core chips by utilizing the benefits of both.

## 3.4 References

- [1] R. Albert and A.-L. Barabasi. "Statistical mechanics of complex networks," Reviews of Modern Physics, 74:47–97, January 2002.
- [2] M. Buchanan. "Nexus: Small Worlds and the Groundbreaking Theory of Networks." Norton, W. W. & Company, Inc, 2003.
- [3] C. Teuscher, "Nature-Inspired Interconnects for Self-Assembled Large-Scale Network-on-Chip Designs," Chaos, 17(2):026106, 2007.
- [4] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature 393, 440–442, 1998.
- [5] T. Petermann and P. De Los Rios, "Physical realizability of small-world networks," Physical Review E, 73:026114, 2006.
- [6] U. Y. Ogras and R. Marculescu, "'It's a Small World After All': NoC Performance Optimization Via Long-Range Link Insertion," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 14, No. 7, July 2006, pp. 693-706.
- [7] S. Kirkpatrick et al., "Optimization by Simulated Annealing," Science, New Series, vol. 220, no. 4598, 1983, pp. 671-680.
- [8] J. Duato et al., Interconnection Networks An Engineering Approach, Morgan Kaufmann, 2002.

- [9] J. Lin et al., "Communication Using Antennas Fabricated in Silicon Integrated Circuits," IEEE Journal of Solid-State Circuits, vol. 42, no. 8, August 2007, pp. 1678-1687.
- [10] M. Fukuda et al., "A 0.18 μm CMOS Impulse Radio Based UWB Transmitter for Global Wireless Interconnections of 3D Stacked-Chip System," Proc. of International Conference Solid State Devices and Materials, Sept. 2006, pp. 72-73.
- [11] D. Zhao and Y. Wang, "SD-MAC: Design and Synthesis of A Hardware-Efficient Collision-Free QoS-Aware MAC Protocol for Wireless Network-on-Chip", IEEE Transactions on Computers, vol. 57, no. 9, September 2008, pp. 1230-1245.
- [12] G. W. Hanson, "On the Applicability of the Surface Impedance Integral Equation for Optical and Near Infrared Copper Dipole Antennas," IEEE Transactions on Antennas and Propagation, vol. 54, no. 12, December 2006, pp. 3677-3685.
- [13] P. J. Burke et al., "Quantitative Theory of Nanowire and Nanotube Antenna Performance," IEEE Transactions on Nanotechnology, Vol. 5, No. 4, July 2006, pp. 314-334.
- [14] K. Kempa, et al., "Carbon Nanotubes as Optical Antennae," Advanced Materials, vol. 19, 2007, pp. 421-426.
- [15] Y. Huang et al., "Performance Prediction of Carbon Nanotube Bundle Dipole Antennas," IEEE Transactions on Nanotechnology, Vol. 7, No. 3, May 2008, pp. 331-337.
- [16] Y. Zhou et al., "Design and Fabrication of Microheaters for Localized Carbon Nanotube Growth", Proc. of IEEE conference on Nanotechnology, 2008, pp. 452-455.
- [17] A. Shacham et al., "Photonic Network-on-Chip for Future Generations of Chip Multi-Processors," IEEE Transactions on Computers, Vol. 57, no. 9, 2008, pp. 1246-1260.
- [18] M. Freitag, et al., "Hot carrier electroluminescence from a single carbon nanotube," Nano Letters, vol. 4 (6), 2004, pp.1063 -1066.
- [19] T. S. Marinis et al., "Wafer level vacuum packaging of MEMS sensors," Proc. of Electronic Components and Technology Conference, 31 May-3 June 2005, Vol. 2, pp. 1081-1088.
- [20] B.G. Lee et al., "Ultrahigh-Bandwidth Silicon Photonic Nanowire Waveguides for On-Chip Networks," IEEE Photonics Technology Letters, vol. 20, no. 6, Mar. 2008, pp. 398-400.
- [21] W. M. J. Green et al., "Ultra-compact, low RF power, 10Gb/s silicon Mach-Zehnder modulator," Optics Express, Vol. 15, No. 25, pp. 17106-17113.
- [22] Z. Lu et al., "Connection-oriented Multicasting in Wormhole-switched Networks on Chip," Proc. of IEEE Computer Society Annual Symposium on VLSI, 2006, pp. 205-210.
- [23] P. P. Pande, et al., "Performance Evaluation and Design Trade-offs for Network-on-chip Interconnect Architectures," IEEE Transactions on Computers, Vol. 54, No. 8, August 2005, pp. 1025-1040.

- [24] T. Krishna et al., "NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication," Proceedings of IEEE Symposium on High Performance Interconnects, HOTI, 26-28 August, 2008, pp. 11-20.
- [25] A. Ismail and A. Abidi, "A 3 to 10GHz LNA Using a Wideband LC-ladder Matching Network," Proc. of IEEE International Solid-State Circuits Conference, 15-19 February, 2004, pp. 384-534.
- [26] Circuits Multi-Projects. http://cmp.imag.fr
- [27] E. Mensink et al., "A 0.28pJ/b 2Gb/s/ch transceiver in 90 nm CMOS for 10 mm on-chip interconnects," Proc. of IEEE Solid-State Circuits Conference, February 2007, pp. 412-413.
- [28] B. Feero and P. P. Pande, "Networks-on-Chip in a Three-Dimensional Environment: A Performance Evaluation," IEEE Transactions on Computers, Vol. 58, No. 1, January 2009, pp. 32-45.
- [29] M. F. Chang et al., "CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect," Proc. of IEEE International Symposium on High-Performance Computer Architecture (HPCA), 16-20 February, 2008, pp. 191-202.
- [30] D. Vantrease et al., "Corona: System Implications of Emerging Nanophotonic Technology," Proc. of IEEE International Symposium on Computer Architecture (ISCA), 21-25 June, 2008, pp. 153-164.
- [31] P. Bogdan and R. Marculescu, "Quantum-Like Effects in Network-on-Chip Buffers Behavior," Proc. of IEEE Design Automation Conference, DAC, 4-8 June, 2007, pp. 266-267.

# Chapter 4 Signal Integrity of WiNoC

Aggressive scaling in the nanometer technology nodes enables a high degree of integration in the current and future generations of Multi-Processor System-on-Chips (MP-SoCs) or Chip Multi-Processors (CMPs). However, the shrinking dimensions and use of nanodevices result in inherently unreliable or defect-prone designs of NoCs in general. Error Control Coding (ECC) has been proposed for mitigating the inherent reliability issues in on-chip communication [1]. In the following subsections we elaborate how ECC schemes can be applied to enhance the overall reliability of WiNoC architectures. As mentioned earlier, intra-subnet communication in a WiNoC takes place through the wireline links and the wireless channels are utilized for inter-subnet data exchanges. Consequently, suitable ECC schemes need to be developed targeting wireline and wireless links.

# 4.1 Error Control Coding for the Wireline Links

It is well known that with shrinking geometry, NoC architectures will be increasingly exposed to different sources of transient noise affecting signal integrity and system reliability. Data-dependent crosstalk between adjacent wires is a major source of such transient noise. Worst case crosstalk happens when the two neighbors transition in opposite directions with respect to the victim wire. With shrinking geometry, the inter-wire spacing decreases rapidly [2] while the height and width of the wires do not scale at the same rate. This in turn tends to increase the cross-sectional aspect ratio, increasing the effective coupling capacitance between intra-layer adjacent wires with negative effects not only on signal integrity but also on delay and energy dissipation. The fact that the dielectric constant does not scale down at the same rate also contributes to the increase in coupling capacitance between adjacent wires in the same metal level. Besides crosstalk, there are several other important sources of transient errors like ground bounce, supply voltage scaling, electromagnetic radiation and alpha particle hits etc. [3], which can cause random data upset. As noted in [4] due to shrinking feature size in future technologies the soft error rate (SER) due to high energy particles is predicted to increase by several orders of magnitude. As these soft errors are not necessarily correlated, higher SER can cause uncorrelated multiple bit errors in data blocks. By incorporating Crosstalk Avoidance Coding (CAC) in NoC data streams the effective coupling capacitance of the wire segments and hence the communication energy can be reduced, as they are linearly related [5]. But CACs are not sufficient to protect the NoC from other transient errors. In the current generation of NoCs, simple single-error correction (SEC) codes are applied to achieve both reliability and low power [6] [7]. 
But these SECs are not capable of reducing the effective coupling capacitance of the wires of the communication channel. Moreover, with the reduction of feature sizes and power-supply voltages and the increase in operating frequencies, circuits are much more susceptible to transient noise. This results in much higher error rates that ultimately overwhelm SECs, rendering them insufficient for future NoCs. In this section we discuss some single error detecting/correcting codes for wireline NoC links existing in the literature and then propose the design of joint crosstalk avoidance and multiple error correction codes (CAC/MEC) and quantify their performance in making NoC fabrics reliable and energy efficient.

## 4.1.1 Error Detection Scheme

This scheme implements a Hamming code for error detection and retransmits if the scheme detects that the flit is in error [8]. As an example, the (38,32) Hamming code implemented for a 32-bit wide flit has double error detection capability: it can reliably detect, but not correct, up to two errors in the flit. The ED scheme only detects the errors; on detection of any error pattern, it sends an automatic repeat request (ARQ) signal for retransmission of the flit. The encoder is essentially only a (38,32) Hamming encoding block. The decoder is also a standard syndrome decoder for the Hamming encoded flit. Evidently, this scheme does not have any crosstalk avoidance properties.
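The detection behaviour described above can be illustrated with the following sketch of a (38,32) Hamming code used purely for error detection. It assumes the textbook positional layout, with check bits at positions 1, 2, 4, 8, 16 and 32 and data bits at the remaining positions; the actual bit ordering of the encoder in [8] may differ, and the function names are hypothetical.

```python
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32)
N = 38  # codeword length: 32 data bits + 6 check bits

def hamming_encode(data_bits):
    """Encode 32 data bits into a 38-bit codeword (list of 0/1 values)."""
    code = [0] * (N + 1)  # index 0 unused; positions run from 1 to 38
    it = iter(data_bits)
    for pos in range(1, N + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            code[pos] = next(it)
    for p in CHECK_POSITIONS:
        parity = 0
        for pos in range(1, N + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity  # even parity over all positions with bit p set
    return code[1:]

def syndrome(received):
    """XOR of the (1-based) positions of all received 1 bits."""
    s = 0
    for pos, bit in enumerate(received, start=1):
        if bit:
            s ^= pos
    return s

def needs_retransmission(received):
    """ED scheme: any non-zero syndrome triggers an ARQ, no correction."""
    return syndrome(received) != 0

flit = [1, 0] * 16
codeword = hamming_encode(flit)
corrupted = list(codeword)
corrupted[5] ^= 1
corrupted[20] ^= 1
print(needs_retransmission(codeword), needs_retransmission(corrupted))
```

Because two distinct positions can never XOR to zero, every single and every double bit error produces a non-zero syndrome, which is why the code can detect up to two errors when it is not used for correction.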

## 4.1.2 Duplicate Add Parity and Modified Dual Rail Code

The Duplicate Add Parity (DAP) code is a joint coding scheme that uses duplication to reduce crosstalk [9]. Duplication reduces the worst-case crosstalk-induced switching capacitance of a wire segment from $(1+4\lambda)C_L$ to $(1+2\lambda)C_L$. Also, duplication alone achieves a Hamming distance [10] of two, and with the addition of a single parity bit the Hamming distance increases to three. Consequently, DAP has single error correction capability. The DAP encoder and decoder are shown in figures 4.1(a) and (b) respectively. Encoding involves calculating the parity and duplicating the bits of the incoming word. Similarly, in decoding, the parity bit is recreated from one set of the data flit. As shown in figure 4.1(b), bit $y_8$ is the previously-calculated parity, and the other signal entering the exclusive-or gate is the newly-calculated parity of the more significant set (bits $y_1$, $y_3$, $y_5$, and $y_7$). The new parity is compared with the original parity calculated in the encoder, and the error-free set is chosen. For example, in case of an error in the more significant set, the parities will differ, and the less significant set will be chosen as the decoded flit. On the other hand, if the error occurs in the less significant set, the more significant set will be chosen. Thus, considering a link of $k$ information bits, $m = k + 1$ check bits are added, leading to a code word length of $n = k + m = 2k + 1$.

Figure 4.1. (a) Duplicate Add Parity (DAP) encoder (b) decoder

We define the $k + 1$ check bits with the following equations:

$$c_i = d_i, \text{ for } i = 0 \text{ to } k - 1$$
$$c_k = d_0 \oplus d_1 \oplus \dots \oplus d_{k-1}$$

The *Modified Dual Rail (MDR)* code is very similar to the DAP [11]. In the MDR code, two copies of the parity bit $c_k$ are placed adjacent to the other codeword bits in order to reduce crosstalk.
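The DAP encoding and decoding steps, and the effect of duplication on worst-case crosstalk, can be sketched as follows. The bit layout (each data bit on two adjacent wires, parity as the last bit $y_{2k}$) follows the description above. The crosstalk model is an assumption of this sketch: the effective capacitance of a switching wire is taken as $(1 + c\lambda)C_L$, where $c$ sums $|\Delta_i - \Delta_j|$ over its two neighbours and $\Delta$ is the transition ($-1$, $0$ or $+1$) of a wire. All function names are hypothetical.

```python
from itertools import product

def parity(bits):
    p = 0
    for b in bits:
        p ^= b
    return p

def dap_encode(data):
    """Return [y_0, ..., y_2k]: each data bit duplicated, parity bit last."""
    code = []
    for d in data:
        code += [d, d]
    return code + [parity(data)]

def mdr_encode(data):
    """MDR: as DAP, but with two adjacent copies of the parity bit."""
    return dap_encode(data) + [parity(data)]

def dap_decode(code):
    low_set = code[0:-1:2]    # y_0, y_2, ... (less significant set)
    high_set = code[1:-1:2]   # y_1, y_3, ... (more significant set)
    # Keep the set whose recomputed parity matches the received parity bit.
    return high_set if parity(high_set) == code[-1] else low_set

def coupling_factor(before, after, i):
    """Multiplier c of lambda for wire i under the assumed model."""
    di = after[i] - before[i]
    if di == 0:
        return 0
    total = 0
    for j in (i - 1, i + 1):
        if 0 <= j < len(before):
            total += abs(di - (after[j] - before[j]))
    return total

def worst_case(words):
    """Largest coupling factor over all word-to-word transitions."""
    return max(coupling_factor(a, b, i)
               for a in words for b in words for i in range(len(a)))

k = 3
data_words = [list(w) for w in product((0, 1), repeat=k)]
uncoded = worst_case(data_words)                         # uncoded bus
coded = worst_case([dap_encode(w) for w in data_words])  # DAP-coded bus
print(uncoded, coded)
```

Under this model the exhaustive search gives a worst-case factor of 4 for the uncoded bus and 2 for the DAP-coded bus, in line with the reduction from $(1+4\lambda)C_L$ to $(1+2\lambda)C_L$ quoted above, since every wire has at least one neighbour that either carries the same bit or is absent.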

## 4.1.3 Boundary Shift Code

The *Boundary Shift Code (BSC)* coding scheme attempts to reduce crosstalk-induced delay by avoiding a shared boundary between successive codewords. As shown in [12], this technique achieves a reduction in the worst-case crosstalk-induced switching capacitance from $(1+4\lambda)C_L$ to $(1+2\lambda)C_L$. It is very similar to DAP in that it uses duplication and one parity bit to achieve crosstalk avoidance and single-error correction. However, the fundamental difference is that at each clock cycle, the parity bit is placed on the opposite side of the encoded flit. In BSC, the dependent boundaries are the boundaries between encoded bits. Refer to table 4.1, which shows examples of different code words with parity bits in bold. In clock cycle 1, dependent boundaries exist between bits $y_0$ and $y_1$, $y_2$ and $y_3$, $y_4$ and $y_5$, and $y_6$ and $y_7$. Conversely, in the second clock cycle, dependent boundaries are between bits $y_1$ and $y_2$, $y_3$ and $y_4$, $y_5$ and $y_6$, and $y_7$ and $y_8$. As can be seen in table 4.1, this coding scheme does not allow dependent boundaries in subsequent codewords. Encoding is achieved by duplicating bits and completing a parity calculation as in DAP. However, every second clock cycle will result in a one-bit shift. Similarly, the decoding structure is equivalent to that of DAP with the addition of a one-bit shift every other clock cycle before the parity check. Figures 4.2(a) and 4.2(b) depict the encoder and decoder respectively.

Table 4.1. Coded flit structure for different coding schemes

| Clock Cycle | Flit | BSC                | DAP               | MDR                |
|-------------|------|--------------------|-------------------|--------------------|
| 1           | 0010 | <b>1</b> 00001100  | <b>1</b> 00001100 | <b>11</b> 00001100 |
| 2           | 0010 | 00001100 <b>1</b>  | <b>1</b> 00001100 | <b>11</b> 00001100 |
| 3           | 1100 | <b>0</b> 11110000  | <b>0</b> 11110000 | <b>00</b> 11110000 |
| 4           | 1010 | 11001100 <b>1</b>  | <b>1</b> 11001100 | <b>11</b> 11001100 |
| 5           | 0100 | <b>1</b> 00110000  | <b>1</b> 00110000 | <b>11</b> 00110000 |
| 6           | 0011 | 00001111 <b>0</b>  | <b>0</b> 00001111 | <b>00</b> 00001111 |

One of the principal differences between the CAC schemes and the joint codes is that for the joint codes we do not have to divide the whole link into different sub-channels and then perform partial coding. We can perform DAP/BSC/MDR coding/decoding on the link as a whole.

As a part of this research, we have proposed a family of joint crosstalk avoidance and multiple error correction codes, which are described below.

# 4.1.4 Crosstalk Avoidance Double Error Correction Code

The Crosstalk Avoidance Double Error Correction Code (CADEC) is a joint coding scheme that performs crosstalk avoidance and double error correction simultaneously [13]. It achieves crosstalk avoidance by duplication of the bits. The same technique also increases the minimum Hamming distance between codewords, enabling a higher error correction capability.

# **CADEC Encoder**

Figure 4.2: (a) BSC Encoder; (b) BSC Decoder.

Figure 4.3. (a) CADEC Encoder. (b) CADEC Decoder

The encoder is a simple combination of Hamming coding followed by DAP or BSC encoding to provide protection against crosstalk. As shown in figure 4.3(a), the incoming 32-bit flit is first encoded using a standard (38, 32) shortened Hamming code, and then each bit of the 38-bit Hamming codeword is duplicated and a parity bit is appended. The (38, 32) Hamming code has a minimum Hamming distance of 3 between any two codewords. On duplication this becomes 6 and after adding the extra parity bit this distance becomes 7. A Hamming distance of 7 enables triple error correction, but at a somewhat higher complexity cost than the double-error correcting schemes considered here. Consequently, as a first step we considered only the double error correction capability. The extra parity bit, which is a part of the DAP or BSC schemes, is added to make the decoding process very energy efficient, as explained below.
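The distance argument above can be written out explicitly. The following is a short worked derivation, where the symbols $d_H$ (distance between two Hamming codewords) and $d$ (distance between the corresponding duplicated-plus-parity codewords) are introduced here for illustration only. Duplication doubles the distance, and the overall parity bits of two Hamming codewords differ exactly when $d_H$ is odd:

$$d = 2\,d_H + (d_H \bmod 2), \qquad d_H \ge 3$$

$$d_H = 3 \Rightarrow d = 7, \qquad d_H = 4 \Rightarrow d = 8, \qquad \text{so } d_{min} = 7 \text{ and } \left\lfloor \frac{d_{min}-1}{2} \right\rfloor = 3.$$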

#### **CADEC** Decoder

The decoding procedure for the CADEC encoded flit can be explained with the help of the flow diagram shown in figure 4.4. The decoding algorithm consists of the following simple steps:

(i) The parity bits of the individual Hamming copies are calculated and compared with the sent parity.

(ii) If the two parities obtained in step (i) differ, then the copy whose parity matches the transmitted parity is selected as the output copy of the first stage.

(iii) If the two parities are equal, then any one copy is sent forward for double error detection (DED) by the (38, 32) Hamming syndrome detection block.

(iv) If the syndrome from the DED block obtained for this copy is zero then this copy is selected as the output of the first stage. Otherwise, the alternate copy is selected.

(v) The output of the first stage is sent for (38, 32) single error correcting Hamming decoding, finally producing the decoded CADEC output.

The circuit implementing the decoder is schematically shown in figure 4.3(b).

Figure 4.4. CADEC decoding algorithm

The use of the DAP or BSC parity bit actually makes the decoder more energy efficient, compared to a scheme without the parity bit, which always requires a syndrome to be computed on both copies. When the parity bits generated from individual Hamming copies fail to match, the DED-syndrome block need not be used at all, thus on average making the overall decoding process more energy efficient. This situation arises when there is a single error in either one of the two Hamming copies, which, generally, will be the most probable case. We note that the circuit diagram of figure 4.3(b) and the flowchart of figure 4.4 show only the logic for error correction.

## 4.1.5 Joint Crosstalk Avoidance and Triple Error Correction Code

Aggressive scaling of device dimensions and the consequent increase in vulnerability to transient errors makes exploration of multiple error correcting codes imperative. However, higher order error correcting codes alone are not enough to ensure reliable performance of NoCs in the current and future technology nodes. Crosstalk avoidance must be made an integral part of any multiple error correction schemes. An important point to note here is that the proposed joint CAC/MEC scheme is not just the design of another multiple error correcting code, but one that reduces worst case crosstalk as well with little computational complexity. It has been shown in [9] that only a linear CAC can be implemented after any error control coding scheme to enable error correction and crosstalk avoidance simultaneously. Furthermore it has been proven that to achieve maximum possible reduction in crosstalk there is no linear coding scheme with fewer wires than duplication [9]. Below, we propose a simple combined crosstalk avoiding triple error correction scheme called Joint Crosstalk Avoidance and Triple Error Correction (JTEC) code.

## **JTEC Encoder**

The encoder for the JTEC scheme utilizes the facts that the minimum Hamming distance between any two codewords of a Single Error Correcting (SEC) Hamming code is 3 and also that duplication avoids worst case crosstalk between adjacent wires. First the information bits, say k in number, are encoded with the SEC Hamming code. Then each of these Hamming encoded bits is duplicated. Finally, an overall parity bit, calculated from either one of the Hamming copies, is appended to the encoded bits. Thus if the initial SEC Hamming code was an (n, k) code, the final number of bits in the encoded flit is 2n+1. For example, if the original information word consisted of 32 bits then after encoding with a SEC (38, 32) shortened Hamming code it becomes 38 bits, and after the duplication and addition of the overall parity bit it becomes 77 bits. Thus for an uncoded 32 bit wide flit, JTEC is a (77, 32) coding scheme. The Hamming distance of the (38, 32) SEC Hamming code is 3. The duplication process increases this to 6, and addition of an overall parity bit makes the final minimum Hamming distance between the codewords 7. This enables triple error correction. The duplication simultaneously serves to avoid opposite bit transitions in adjacent wires so that the worst case transition of a bit pattern from 101 to 010 and vice-versa can be avoided. Consequently the worst case effective crosstalk capacitance of a wire segment of the communication channel can be reduced from  $(1+4\lambda)C_L$  to  $(1+2\lambda)C_L$ . The encoding mechanism for the JTEC code is shown in figure 4.5 through a schematic diagram.

Figure 4.5. JTEC encoder schematic.

#### **JTEC Decoder**

The decoder for this scheme requires syndrome computation on the two copies and comparisons of the transmitted overall parity bit with the locally generated parities recomputed at the decoder from each individual copy. The algorithm for the JTEC decoder is shown through a flowchart in figure 4.6(a) and is outlined below:

- The two Hamming copies A, B and the transmitted overall parity bit P<sub>0</sub> are isolated. Also, two parity bits are calculated separately from A and B, say P<sub>A</sub> and P<sub>B</sub>.
- If the syndrome of copy A (S<sub>A</sub>) is non-zero then it implies that A can have 1 or 2 errors.
  - Now, if P<sub>0</sub> is equal to P<sub>A</sub> then it means A has 2 errors and B can have at most a single error. So, copy B is chosen for the final SEC Hamming decoding stage, which will correct this single error.
  - However, if P<sub>0</sub> is not equal to P<sub>A</sub> then the syndrome of copy B (S<sub>B</sub>) is computed. Copy B is chosen if S<sub>B</sub> is zero, and copy A is chosen if S<sub>B</sub> is non-zero, as A then has a single error.
- If the syndrome of copy A is zero then A can have either no errors or 3 errors.
  - In this case, if P<sub>0</sub> is the same as P<sub>A</sub> then copy A is chosen.
  - But if the two parity bits do not match then the syndrome of copy B is computed. Copy A is chosen if it is non-zero, and copy B is chosen if it is zero.

Figure 4.6. Flowcharts for the decoding schemes for (a) JTEC and (b) Optimized JTEC

The final chosen copy is sent for Single Error Correcting Hamming decoding to produce the triple error corrected output.
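The copy-selection rules of the JTEC decoder can be sketched in Python as follows. This is an illustrative model rather than the circuit of figure 4.6(a): the particular (38, 32) parity-check columns, the flit layout [A | B | P0] (a physical link would interleave the two copies), and all function names are assumptions. In the rules listed for the decoder, the choice depends on the syndrome of A only when the parities agree and on the syndrome of B only when they disagree, so the sketch tests the parity first.

```python
from functools import reduce
from operator import xor

K, R = 32, 6  # information bits and Hamming check bits
# Assumed parity-check column labels: data columns take the first 32 values
# that are not powers of two, check columns take 1, 2, 4, 8, 16, 32.
DATA_COLS = [v for v in range(1, 64) if v & (v - 1)][:K]
COLS = DATA_COLS + [1 << i for i in range(R)]  # 38 distinct non-zero labels


def parity(bits):
    return reduce(xor, bits, 0)


def syndrome(word):
    """XOR of the column labels at the positions holding a 1."""
    return reduce(xor, (c for c, b in zip(COLS, word) if b), 0)


def hamming_encode(data):
    s = reduce(xor, (c for c, b in zip(DATA_COLS, data) if b), 0)
    return list(data) + [(s >> i) & 1 for i in range(R)]


def hamming_correct(word):
    """Single error correction, returning only the information bits."""
    word, s = list(word), syndrome(word)
    if s in COLS:
        word[COLS.index(s)] ^= 1
    return word[:K]


def jtec_encode(data):
    a = hamming_encode(data)
    return a + a + [parity(a)]  # two Hamming copies plus the overall parity


def jtec_decode(flit):
    n = len(COLS)
    a, b, p0 = flit[:n], flit[n:2 * n], flit[2 * n]
    if parity(a) == p0:
        chosen = b if syndrome(a) else a       # P_A equals P_0
    else:
        chosen = a if syndrome(b) else b       # P_A differs from P_0
    return hamming_correct(chosen)
```

With this model, `jtec_encode` produces a 77-bit flit from 32 information bits, and `jtec_decode` recovers the information bits after flipping up to three of the 77 bits.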

Both the encoding and the decoding processes discussed above essentially necessitate the use of long chains of XOR gates to compute the overall parity bits. This happens because the overall parity bits are modulo-2 summation of all the Hamming encoded bits. Thus for large flit widths, this may imply prohibitively complex hardware with negative effect on energy dissipation and timing. The hardware complexity and critical path delay of the codec block can be reduced by adopting an optimization method as outlined in the next subsection.

## **Optimization of the Code**

Both the encoder and the decoder for the JTEC scheme use long chains of XOR gates. The complexity of both the circuits can be optimized by using a two-fold approach. First, the overall parity bit in conjunction with one of the (n, k) Hamming coded copies is used as a (n+1, k) single error correction and double error detection (SEC-DED) code. For the specific example of 32 original information bits, the (38, 32) Hamming coded bits become a (39, 32) SEC-DED code after appending the overall parity. This modification is shown in figure 4.7. A syndrome computation on this SEC-DED code can be used to indicate a single or a double error in those 39 bits. If there is a single error then it can be corrected using the syndrome. If there are 2 errors in these 39 bits then the other copy cannot have more than a single error for a triple error correction code to be able to correct the error pattern. This can then be corrected by the syndrome computation on that copy. If the first 39 SEC-DED bits have all three errors then this triple error cannot be corrected by the SEC-DED code, but then the other copy will be error free and can be accepted. This algorithm is explained through a flowchart in figure 4.6(b). This modified decoding approach reduces hardware complexity considerably as the step of locally recomputing the overall parity bits  $P_A$  and  $P_B$  is avoided. Also, the last step of a Hamming SEC decoding becomes redundant in the optimized scheme. Thus the decoding circuit can be simplified by this step.

Figure 4.7. JTEC encoder schematic.

The second level of optimization consists of replacing the (39, 32) Hamming SEC-DED with the (39, 32) Hsiao SEC-DED code [14]. The last parity bit of the Hamming SEC-DED scheme is basically an overall parity bit computed as the XOR sum of all the 38 bits of the Hamming encoded flit. This is indicated by the last row of the H-matrix for the Hamming SEC-DED code in figure 4.8(a), which has all '1' entries. However, if the Hamming SEC-DED is replaced by the Hsiao SEC-DED code then the number of XOR gates required to compute any of the parity bits can be restricted to the average number of XOR gates for all the 7 parity bits [14]. For the (39, 32) Hsiao Code this average number of XOR gates turns out to be 14.7, and hence some of the 7 parity bits need 14 and others 15 XOR gates, as shown by the H-matrix for this scheme in figure 4.8(b). Consequently, the number of XOR gates can be drastically reduced by using the Hsiao Code instead of Hamming SEC-DED, and the delays along the critical paths of both the encoder and decoder are also reduced as they do not have long chains of 38 XOR gates any more.

Another important point to be noted here is that the second copy which was originally a duplicated (38, 32) Hamming SEC code will now just be a duplication of the 38 bits from (39, 32) Hsiao code including the 32 original information bits and any 6 of the 7 parity bits generated by the Hsiao coding. It is shown in Appendix I that these 38 bits will still have single error correction capability, which is vital for the overall triple error correction as discussed earlier.

This two-fold approach reduces the delay and hardware requirements for not only the decoder but also the encoder. The encoder now will have to encode using the generator matrix of the Hsiao Code which has either 14 or 15 XOR gates for each parity bit, unlike the Hamming SEC-DED code which used an overall parity bit using 38 such gates for the seventh parity bit. Though the above optimization technique is explained with the specific example of the (39, 32) Hsiao SEC-DED code, the principle generally holds for flits of all lengths as, in essence, this optimization methodology uses the fact that the Hsiao SEC-DED code is more optimized in terms of hardware complexity compared to the standard Hamming SEC-DED.

$$H_{\text{Hamming SEC-DED}} = \left[\begin{array}{ccccccccccccccccccccccccccccccc}
1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & \cdots & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & \cdots & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & \cdots & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \cdots & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \cdots & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{array}\right] \quad \text{(a)}$$

$$H_{\text{Hsiao SEC-DED}} = \left[\begin{array}{ccccccccc}
\cdots & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
\cdots & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
\cdots & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
\cdots & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
\cdots & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
\cdots & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
\cdots & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}\right] \quad \text{(b)}$$

Figure 4.8. H-matrix for (a) (39, 32) Hamming SEC-DED code and (b) (39, 32) Hsiao SEC-DED code

# 4.1.6 Joint Triple Error Correction and Simultaneous Quadruple Error Detection

The JTEC scheme explained above can be modified to achieve simultaneous triple error correction and quadruple error detection, so that uncorrectable error patterns are detected if they occur. Thus the Joint Triple Error Correction and Simultaneous Quadruple Error Detection code (JTEC-SQED) can correct all error patterns of up to 3 errors on the fly as well as detect all 4-error patterns that cannot be corrected by the JTEC scheme alone. The modification and associated overheads are discussed in the following subsections.

#### **JTEC-SQED** Encoder

The encoder uses the Hsiao SEC-DED code of appropriate size to achieve simultaneous triple-error correction and quadruple-error detection. The original information bits are first encoded according to Hsiao SEC-DED where the minimum Hamming distance between codewords becomes 4. Then all the encoded bits are duplicated to increase the Hamming distance to 8 which will enable detection of quadruple error patterns. This code will also have the same crosstalk avoidance capability as the JTEC. Hsiao SEC-DED is used because of the advantages in optimization mentioned in Section 4.1.5. Essentially, the encoded flit now contains two Hsiao SEC-DED copies. The JTEC-SQED scheme achieves simultaneous triple error correction and quadruple error detection, as it differs from JTEC only in appending a second copy of the last parity bit of the Hsiao SEC-DED code to the JTEC bits, preserving all the bits necessary for the JTEC decoding scheme.

#### **JTEC-SQED** Decoder

The decoder needs to set a flag whenever it encounters a 4-error pattern that can not be corrected by the triple error correcting algorithm. Below we discuss the several cases that may lead to this and how each of the cases can be detected.

• When each of the two Hsiao SEC-DED encoded copies have double errors, then the syndromes of both copies will be able to detect the presence of such double error patterns.

• When there is a single error in one copy and a triple error in the other, the triple error pattern in the Hsiao SEC-DED code will always give an odd-weight syndrome; this fact is proved in Appendix II. The syndromes are used to decode each individual copy. If both decoded copies do not match then there must have been a triple error in one of the copies indicating an overall quadruple error pattern.

• The only other possibility is when there are 4 errors in one copy and none in the other. In that case, the syndrome of the erroneous copy can be either zero, if the errors make it another Hsiao codeword, or non-zero. If it is zero then the copies will be different indicating a quadruple error pattern. If the syndrome of the erroneous copy is non-zero then the JTEC decoding algorithm will be able to select the correct copy.

The JTEC-SQED scheme simultaneously corrects triple errors and detects quadruple error patterns with additional hardware as compared to the JTEC scheme alone. The result of the triple error correction has to be discarded if a quadruple error pattern is detected because that result may be inaccurate if there is a quadruple error pattern in the flit. In the following subsection a performance evaluation of the above ECC schemes is presented so that the one with the best performance can be selected to be used on the wireline links of the WiNoC.

## 4.1.7 Performance evaluation of the ECC schemes in a wireline NoC

To evaluate the performance of the ECC schemes on NoC platforms first we evaluate their error correcting capability and residual flit/word error rates. Increase in reliability by incorporating coding can be translated into a reduction in voltage swing on the interconnect wires as they can tolerate lower noise margins. The residual word error rate is used to compute the reduced voltage swing that can be used with each scheme. Then the corresponding energy savings due to the reduced swing and reduced crosstalk coupling is presented for a completely wireline NoC consisting of 64 cores. The performance evaluation is done in terms of three principal metrics: energy savings, area overhead and timing. Messages were injected with a Self-Similar temporal distribution for the sake of simulation of a real NoC environment. The routing mechanism used for the MESH was the e-cube (dimension order) routing. Simulations were performed using 65nm technology node parameters. The codec blocks were synthesized with the CMP [15] standard cell libraries. All the three different metrics are discussed in the following subsections.

#### Voltage Swing Reduction Due to Increased Reliability

Incorporation of error control coding enhances the reliability of the communication channel as it becomes robust against transient malfunctions. In the Ultra Deep Submicron (UDSM) technology nodes reliability and energy dissipation are two inseparable issues. Increase in reliability by incorporating coding can be translated into a reduction in voltage swing on the interconnect wires as they can tolerate lower noise margins. Hence this results in savings in energy dissipation as it depends quadratically on the voltage swing. In this section we quantify these gains by modeling the voltage swing reduction as a function of increased error correction capability.

The cumulative effect of all transient UDSM noise sources can be modeled as an additive Gaussian noise voltage  $V_N$  with variance  $\sigma_N^2$  [9]. Using this model, the bit error rate (BER),  $\varepsilon$  depends on the voltage swing  $V_{dd}$  according to the following relation:

$$\varepsilon = Q\left(\frac{V_{dd}}{2\sigma_N}\right), \tag{4.1}$$

where the *Q*-function is given by

$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} e^{-\frac{y^2}{2}} dy. \tag{4.2}$$

The word error probability is a function of the channel BER  $\varepsilon$ . If  $P_{UNC}(\varepsilon)$  is the residual probability of word error in the uncoded case and  $P_{ECC}(\varepsilon)$  is the residual probability of word error with error control coding, then it is desirable that  $P_{ECC}(\varepsilon) \leq P_{UNC}(\varepsilon)$ . Using equation 4.1, we can reduce the supply voltage in presence of coding to  $\hat{V}_{dd}$ , given by [9]

$$\hat{V}_{dd} = V_{dd} \frac{Q^{-1}(\hat{\varepsilon})}{Q^{-1}(\varepsilon)}. \tag{4.3}$$

In equation 4.3,  $V_{dd}$  is the nominal supply voltage in the absence of any coding,  $\hat{V}_{dd}$  is the reduced voltage swing with coding and  $\hat{\varepsilon}$  is the BER such that

$$P_{ECC}(\hat{\varepsilon}) = P_{UNC}(\varepsilon). \tag{4.4}$$

Use of lower voltage swing makes the probability of multi-bit error patterns higher, necessitating the use of multiple error correcting codes in order to maintain the same word error probability as the uncoded case. To compute  $\hat{V}_{dd}$  for various coding schemes with different error correction capability, the residual word error probability  $P_{ECC}(\varepsilon)$  for each of the schemes needs to be computed. In the following subsections we compute the residual word error probability for the ED, DAP, CADEC, JTEC and the JTEC-SQED schemes.

#### **Residual Probability of Word Error**

To compute the possible voltage swing reduction in presence of the codes we compute the residual probability of word errors for these schemes. The probability of word error can be easily computed by first calculating the probability of correct decoding. The set of correctly decoded words is always complementary to the set of residual word errors. Hence the residual word error probability can be computed using the equation below:

$$P_{ECC} = 1 - P_{correct} \,, \tag{4.5}$$

where  $P_{ECC}$  is the residual word error probability in presence of coding and  $P_{correct}$  is the probability of correct decoding.

#### ED

As pointed out in [10], any (n, k) linear code can detect  $2^n - 2^k$  error patterns of length n. The probability of undetected error for any (n, k) linear code can be computed from the weight distribution polynomial of the code, A(z), given by

$$A(z) = A_0 + A_1 z + \dots + A_n z^n, \tag{4.6}$$

where  $A_i$  is the number of codewords with weight (i.e., the number of 1s in the codeword) equal to *i*. The dual of the linear code also has an associated weight distribution, B(z), given by

$$B(z) = B_0 + B_1 z + \dots + B_n z^n. \tag{4.7}$$

The weight distribution of the original code and its dual code are related by [10]

$$A(z) = 2^{-(n-k)} (1+z)^n B\left( \frac{1-z}{1+z} \right). \tag{4.8}$$

The probability of undetected word error  $P_{ED}(\varepsilon)$  for an error detection scheme using a linear code with dual weight distribution B(z) is [10]

$$P_{ED}(\varepsilon) = 2^{-(n-k)} B(1-2\varepsilon) - (1-\varepsilon)^n , \qquad (4.9)$$

where  $B(1-2\varepsilon)$  is given by

$$B(1-2\varepsilon) = \sum_{i=0}^{n} B_i (1-2\varepsilon)^i. \tag{4.10}$$

The ED scheme proposed in [8] uses the (38, 32) shortened Hamming code for error detection, so the coefficients  $B_i$  in equation 4.10 are obtained by using the H-matrix of that code. Using equation 4.5, the probability of undetected error for the ED code, for small values of BER  $\varepsilon$ , turns out to be

$$P_{ED}(\varepsilon) = (n-k)\varepsilon^2 \tag{4.11}$$

where n=38 and k=32 for the (38,32) shortened Hamming code.

#### DAP

Since DAP, BSC and MDR are all joint crosstalk avoidance and single error correction codes, we only consider DAP in the subsequent comparisons. To compute the residual probability of word error for the DAP scheme, let us call the two copies of the original data bits A and B, as shown in figure 4.9, and let us suppose that the parity in the decoder is regenerated from copy A.

Figure 4.9. JTEC encoded bits.

For error free decoding when A is error free, the parity sent should also be error free. So, the probability of error free decoding with no errors in A is given by

$$P_{A} = \sum_{i=0}^{k} \binom{k}{i} \varepsilon^{i} (1 - \varepsilon)^{2k+1-i}. \tag{4.12}$$

If on the other hand copy B is error free and has to be selected for correct decoding, then the exclusive-or of the received parity and the regenerated parity must be 1, which is possible only when the number of errors occurring in the k+1 bits formed by A and the received parity is odd. This event has the probability

$$P_{B} = \sum_{i=0}^{\lfloor k/2 \rfloor} \binom{k+1}{2i+1} \varepsilon^{2i+1} (1-\varepsilon)^{2k-2i}. \tag{4.13}$$

Therefore, the probability of error is given by

$$P_{DAP} = 1 - P_A - P_B \tag{4.14}$$

For small probabilities of bit errors,  $\varepsilon$  the higher order terms are ignored and the residual probability for DAP can be approximated as

$$P_{DAP} \approx \frac{3k(k+1)}{2}\varepsilon^2 \tag{4.15}$$
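The small-$\varepsilon$ approximation in equation 4.15 can be checked numerically against equations 4.12 through 4.14. The sketch below is illustrative only; the function names and the test values of k and $\varepsilon$ are assumptions, not values taken from the evaluation in this chapter.

```python
from math import comb


def dap_word_error_exact(k, eps):
    """Residual DAP word error from equations 4.12 to 4.14."""
    # Copy A and the parity bit are error free, copy B has i errors.
    p_a = sum(comb(k, i) * eps**i * (1 - eps)**(2 * k + 1 - i)
              for i in range(k + 1))
    # Copy B is error free, an odd number of errors hits A plus the parity bit.
    p_b = sum(comb(k + 1, 2 * i + 1) * eps**(2 * i + 1)
              * (1 - eps)**(2 * k - 2 * i)
              for i in range(k // 2 + 1))
    return 1.0 - p_a - p_b


def dap_word_error_approx(k, eps):
    """Leading-order term of equation 4.15."""
    return 1.5 * k * (k + 1) * eps**2


if __name__ == "__main__":
    k, eps = 32, 1e-6  # assumed example values
    print(dap_word_error_exact(k, eps), dap_word_error_approx(k, eps))
```

For small bit error rates the two values should agree closely, since the neglected terms are of order $\varepsilon^3$ and higher.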

# CADEC

The probability of correct decoding can be found by considering each of the cases where the decoder can correctly decode flits despite errors. The cases where the decoder can correctly decode words with more than two errors also need to be considered. The complement of the set of correctly decoded words constitutes the set of undetected errors. This probability is given by  $P_{CADEC}$  ( $\varepsilon$ ). So, we have the relation:

$$P_{CADEC}(\varepsilon) = 1 - P_{correct}. \tag{4.16}$$

In the following derivation, the width of the original flit is denoted by k, where k is 32, which is first Hamming coded to 38 bits, denoted by n. Each bit of the n-bit Hamming codeword is duplicated and an overall parity bit is appended. All possibilities of correct decoding are broadly divided into three categories:

(i) Error-free transmitted parity bit: One of the copies has no error while the other has anywhere from zero to all bits in error. This can be correctly decoded as in the DAP scheme, which is integrated into the CADEC scheme.

(ii) Single bit error in each copy: There is a single error in both copies, irrespective of whether the parity bit is in error.

(iii) Erroneous transmitted parity bit: There are multiple cases under this scenario:

- no errors in either copy;
- up to one error in one copy and an even number of errors in the other, from 2 to *n* errors;
- a single error in one copy and an odd number of errors in the other.

The complete probability of correct decoding,  $P_{correct}$  is given by the sum of the probabilities corresponding to the above mutually exclusive cases. In the limit of small channel BER  $\varepsilon$ , this can be expressed as

$$P_{correct} = 1 - n^2 (n - 4) \varepsilon^3. \tag{4.17}$$

From (4.16) and (4.17), the word error probability is

$$P_{CADEC}(\varepsilon) = n^2 (n-4)\varepsilon^3. \tag{4.18}$$

# JTEC

The JTEC coding scheme is capable of correcting up to 3 errors in a single flit. Taking into consideration all the cases where correct decoding is possible the residual error probability of the coding scheme is computed. The formulations below hold for any flit of k information bits which are first coded by (n+1, k) Hsiao SEC-DED into n+1 bits and then only n bits are duplicated to make the total encoded flit 2n+1 bits wide.

Correct decoding in case of JTEC is possible when the count of errors in the entire flit is 3 or less. It might also be able to correct some higher number of errors. Thus the lower bound on the probability of correct decoding,  $P_{correct}$  is given by

$$P_{correct} \ge P(2n+1,0) + \dots + P(2n+1,3), \tag{4.19}$$

where the probability of *m* errors in *n* bits with a BER of  $\varepsilon$  is given by

$$P(n,m) = \binom{n}{m} \varepsilon^m (1-\varepsilon)^{n-m}. \tag{4.20}$$

Therefore the probability of residual word error is given in accordance with equation 4.16, using  $P_{correct}$  for the JTEC scheme from equation 4.19. For small values of  $\varepsilon$ , this probability can be approximated as

$$P_{JTEC} = \binom{2n+1}{4} \varepsilon^4. \tag{4.21}$$

#### **JTEC-SQED**

To compute the residual word error probability for the JTEC-SQED scheme let us assume that the total number of bits in the flit is 2n+2, where there are 2 copies of (n+1, k) SEC-DED code. Since JTEC-SQED can either correct or detect up to four errors, the lower bound on the probability of correct decoding can be obtained as

$$P_{correct} \ge P(2n+2,0) + \dots + P(2n+2,4). \tag{4.22}$$

Using equation 4.16 and equation 4.22 the residual word error probability of the JTEC-SQED scheme for small values of  $\varepsilon$  can be approximated as:

$$P_{SQED} = \binom{2n+2}{5} \varepsilon^5. \tag{4.23}$$
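The leading-order approximations for JTEC and JTEC-SQED can be compared with the binomial tail implied by the bounds in equations 4.19 and 4.22. The following sketch is a hedged numerical check; the helper names and the example value of $\varepsilon$ are assumptions, and n = 38 corresponds to the 32-bit flit example used in this chapter.

```python
from math import comb


def binomial_tail(total_bits, t, eps):
    """Probability of more than t bit errors among total_bits bits.

    This is one minus the lower bound on correct decoding, summed directly
    to avoid subtracting two nearly equal numbers.
    """
    return sum(comb(total_bits, m) * eps**m * (1 - eps)**(total_bits - m)
               for m in range(t + 1, total_bits + 1))


def leading_term(total_bits, t, eps):
    """First neglected term, as in equations 4.21 and 4.23."""
    return comb(total_bits, t + 1) * eps**(t + 1)


if __name__ == "__main__":
    n, eps = 38, 1e-4  # assumed example values
    for name, bits, t in (("JTEC", 2 * n + 1, 3), ("JTEC-SQED", 2 * n + 2, 4)):
        print(name, binomial_tail(bits, t, eps), leading_term(bits, t, eps))
```

Because the tail is an upper bound on the residual error (some heavier patterns may still be handled), the leading term should be read as an estimate that becomes tight only as $\varepsilon$ becomes small.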

Using equation 4.3 and the residual probability of word errors, the voltage swing reduction for the proposed schemes can be computed. Figure 4.10 shows the reduction in voltage swing,  $\hat{V}_{dd}$ , as a function of word error probability for all the ECC schemes discussed above. The nominal voltage swing was assumed to be 1V.

Figure 4.10. Plot of voltage swing reduction as a function of word error rates.

As the error correction capability of the coding scheme increases the residual word error probability commensurately decreases. Hence, the voltage swing can also be reduced. Consequently JTEC and JTEC-SQED can achieve more voltage reduction than the existing schemes. However, the voltage swing cannot be reduced to arbitrarily low values by increasing the error-correction capability of the code, due to the saturating nature of the inverse-Q function used in equation 4.3. Figure 4.11 depicts the reduction in voltage swing against the error correction capability of the codes using the model described in (1) through (3). The value of the word error rate chosen for this plot is $10^{-20}$ [9]. The plot is made by considering the fact that the residual probability of word error of any ECC is proportional to $\varepsilon^{j+1}$, where *j* is the error correcting capability of the corresponding code.
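The saturating behaviour can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: equation 4.3 is not reproduced in this section, so the sketch assumes the commonly used Gaussian-noise form $\hat{V}_{dd} = V_{dd}\, Q^{-1}(\hat{\varepsilon}) / Q^{-1}(\varepsilon)$, takes the uncoded word error probability as $k\varepsilon$ with k = 32, and sets the proportionality constant in $\varepsilon^{j+1}$ to one. The function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def q_inv(p):
    """Inverse of the Gaussian tail function Q(x) = P(X > x)."""
    return -_STD_NORMAL.inv_cdf(p)

def swing_ratio(j, target_word_error=1e-20, flit_bits=32):
    """Reduced swing relative to nominal for a code correcting j errors.

    Assumptions (not taken from the text): uncoded word error ~ k * eps,
    coded residual word error ~ eps**(j + 1) with unit constant.
    """
    eps_uncoded = target_word_error / flit_bits
    eps_coded = target_word_error ** (1.0 / (j + 1))
    return q_inv(eps_coded) / q_inv(eps_uncoded)

ratios = {j: swing_ratio(j) for j in range(1, 6)}
for j, ratio in ratios.items():
    print(f"j={j}: swing / nominal swing = {ratio:.3f}")
```

Under these assumptions the step from single to triple error correction lowers the swing by noticeably more than the step from triple to quintuple, which is the diminishing return described above.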

According to figure 4.11, the achievable reduction in voltage swing shows an asymptotic trend as the correction capability of the code is increased. For example, the difference in voltage swing between triple and quintuple error correction is much less than that between single and triple. As the voltage swing reduction along the wire segments is the predominant source of energy savings in the NoC, beyond the quadruple error correction/detection code the energy dissipation in the codecs may overshadow the savings in the interconnects. Hence it may not be advantageous to use arbitrarily high order error correction codes.

Figure 4.11. Voltage Swing Reduction as a function of error correction capability.

It should be noted that well-known multiple error correcting codes (MEC) like BCH codes have no inherent crosstalk avoidance properties. Single error correcting BCH codes are equivalent to the SEC Hamming codes used in JTEC. On the other hand, MEC BCH codes have substantially higher parity bit overhead requirements than the Hamming codes employed in JTEC. Hence, implementation of a linear CAC (e.g., duplication) on BCH codewords would require significantly more parity overhead than JTEC and JTEC-SQED, though it would provide more than triple error correction. Furthermore, MEC BCH codes have substantially higher decoding complexity than the SEC Hamming codes [10]. But figure 4.11 shows that there is a diminishing return on the amount of voltage swing reduction achievable for a given error correction capability t, and that very small reductions occur for values of t > 4. Since voltage swing reduction is the main cause of energy savings in CAC/MEC schemes, a linear BCH-based CAC/MEC scheme could actually increase the energy dissipation, due to the increased parity and computational requirements of BCH codes. Consequently, linear BCH-based CAC/MEC schemes will be unsuitable for implementation in NoC interconnects.

#### Energy Dissipation of ECC schemes in a wireline NoC

In NoC architectures the functional cores communicate with each other through switches. We assume wormhole routing [16] as the data transport mechanism where the packet is divided into fixed length flow control units or flits. When flits travel between the switches on the interconnection network, both the inter-switch wires and the logic gates in the switches toggle, resulting in energy dissipation. To quantify the energy dissipation characteristics of the proposed schemes, we need to determine the energy dissipated per cycle by the entire NoC fabric. In the uncoded case, the energy dissipated per cycle is given by

$$E_{NoC/cycle}^{uncoded} = E_{link}^{uncoded} \eta_{interswitch} + \frac{E_{switch}}{N_{switch}} \eta_{intraswitch} .$$
(4.24)

Where, $E_{link}^{uncoded}$ and $E_{switch}$ are the energy dissipation of the interswitch link and the NoC switches respectively. The numbers of flits traversing the interswitch and the intraswitch stages in a single cycle are given by $\eta_{interswitch}$ and $\eta_{intraswitch}$ respectively. The NoC switch architecture adopted in this work has multiple pipelined stages as discussed later in this section. Since a single flit cannot occupy more than one stage in one cycle, the energy dissipation of the switch per flit per cycle is obtained by dividing $E_{switch}$ by the number of stages that it is pipelined into, $N_{switch}$. After incorporating the coding schemes the energy dissipation per cycle can be obtained as shown below.

$$E_{NoC/cycle}^{coded} = E_{link}^{coded} \eta_{interswitch} + \left(\frac{E_{codec} + E_{interface}}{N_{codec}} + \frac{E_{switch}}{N_{switch}}\right) \eta_{intraswitch}$$
(4.25)

Where, $E_{codec}$ and $E_{interface}$ are the energy dissipations of the codecs and the interface circuitry used to obtain low voltage on the interconnects. Similar to the switch, the energy dissipation of the codecs per cycle needs to be considered and is hence divided by the number of stages, $N_{codec}$. The pipelined architecture in presence of coding is shown in figure 4.12.

The main reason for incorporating coding in NoCs is to achieve the dual purpose of enhancing reliability and lowering energy dissipation. The principal source of lowered energy dissipation is the reduced voltage swing on the interconnects enabled by increased reliability through coding. Additionally, lowering the effective crosstalk capacitance of inter-switch wires augments the gains in energy savings. However, while computing the energy dissipation profiles the overheads caused by the coding schemes must also be taken into account. The coding schemes introduce redundant bits in the flits and hence increase the number of wires. The extra wires also dissipate energy and hence are considered as a part of  $E_{link}^{coded}$  in equation 4.25. The encoders and decoders including the interface circuitry used to achieve a lower voltage swing on the wires also dissipate energy and are included in the computation in equation 4.25. Following this, the savings in energy compared to the uncoded case in each cycle,  $E_{savings/cycle}$  is given as

$$E_{savings/cycle} = E_{NoC/cycle}^{uncoded} - E_{NoC/cycle}^{coded} .$$
(4.26)

Figure 4.12. Pipelined data path through a NoC switch including codecs.

$E_{NoC/cycle}^{uncoded}$ can be calculated using equation 4.24 considering the fact that there is no codec and interface overhead, while $E_{NoC/cycle}^{coded}$ can be calculated from equation 4.25 considering all the overheads. Since the switch term $E_{switch}/N_{switch}$ appears identically in both expressions, it can be seen from equations 4.24, 4.25 and 4.26 that the savings in energy dissipation compared to the uncoded case do not depend on the energy dissipation of the NoC switches.
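The cancellation of the switch term can be seen in a small Python sketch of equations 4.24 to 4.26. All numerical values below are hypothetical placeholders in arbitrary energy units, chosen only to exercise the formulas; they are not measurements from this work, and the function names are illustrative.

```python
def energy_uncoded(e_link, e_switch, n_switch, eta_inter, eta_intra):
    """Equation 4.24: energy per cycle of the uncoded NoC."""
    return e_link * eta_inter + (e_switch / n_switch) * eta_intra

def energy_coded(e_link_coded, e_codec, e_interface, n_codec,
                 e_switch, n_switch, eta_inter, eta_intra):
    """Equation 4.25: energy per cycle including codec and interface overheads."""
    codec_term = (e_codec + e_interface) / n_codec
    return e_link_coded * eta_inter + (codec_term + e_switch / n_switch) * eta_intra

def savings_per_cycle(e_switch):
    """Equation 4.26, evaluated with hypothetical placeholder values."""
    eta_inter, eta_intra, n_switch = 20, 30, 3
    uncoded = energy_uncoded(8.0, e_switch, n_switch, eta_inter, eta_intra)
    coded = energy_coded(3.0, 1.5, 0.5, 1, e_switch, n_switch, eta_inter, eta_intra)
    return uncoded - coded

# The savings are the same for a small and a large switch energy.
print(savings_per_cycle(10.0), savings_per_cycle(500.0))
```

With these placeholder numbers the link saves 100 units per cycle and the codec plus interface cost 60 units, leaving 40 units of savings regardless of the switch energy.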

The energy dissipated in each switch,  $E_{switch}$ , and each codec,  $E_{codec}$  is determined using Synopsys<sup>TM</sup> Prime Power. The interconnect energy,  $E_{link}$ , depends on the length of each interswitch wire segment which varies depending on the NoC topology [16]. For Mesh architecture the inter-switch wire length is given by

$$l = \frac{\sqrt{Area}}{\sqrt{M} - 1}.\tag{4.27}$$

Where, Area is the area of the silicon die used and M is the number of cores in the NoC.
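As a quick worked instance of equation 4.27, the Python sketch below evaluates the link length for the 64-core mesh on a 20mm x 20mm die considered later in this section. The function name is illustrative, and the sketch assumes a square mesh so that $\sqrt{M}$ is an integer.

```python
from math import isqrt, sqrt

def mesh_link_length_mm(die_area_mm2, num_cores):
    """Inter-switch wire length for a square mesh (equation 4.27)."""
    side = isqrt(num_cores)
    if side * side != num_cores or side < 2:
        raise ValueError("num_cores must be a perfect square of at least 4")
    return sqrt(die_area_mm2) / (side - 1)

# 64 cores on a 20mm x 20mm die: 20 / (8 - 1), roughly 2.86 mm per link
print(mesh_link_length_mm(20.0 * 20.0, 64))
```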

The capacitances of each interconnect stage and subsequently $E_{link}$ were obtained through HSPICE simulations taking into account the specific layout for each topology [16]. The energy dissipated by the low-swing interface circuitry was also obtained through HSPICE simulations. To obtain the number of flits traversing each stage per cycle, $\eta_{interswitch}$ and $\eta_{intraswitch}$, a cycle-accurate network simulator is employed. It is flit-driven and uses wormhole routing. The simulator is capable of handling different types of traffic injection processes. Messages can be injected by each core into the network following different stochastic distributions. In our experiments the traffic injected by the functional cores followed self-similar distributions [17]. This type of traffic has been observed in the bursty traffic typical of on-chip modules in MPEG-2 video applications [18], as well as various other networking applications [19]. It has been shown to closely model real traffic.

In order to characterize the performance of the proposed coding schemes in NoC communication infrastructures, we considered a system consisting of 64 cores and mapped them onto a Mesh based wireline NoC. We assumed the NoC to be spread over a die size of 20mm x 20mm. We compared the performance of ED, DAP, CADEC, JTEC and JTEC-SQED schemes. Since DAP, MDR and BSC are all joint crosstalk avoidance and single error correction codes their performance is very similar and hence we have shown only one representative scheme, namely DAP, for the sake of comparison. The routing mechanism used in the simulations is *e-cube* (dimension order) routing. The particular switch architecture adopted [16] had three functional stages, namely, input arbitration, routing/switch traversal and output arbitration. The input and output ports have 4 virtual channels, each having buffer depth of 2 flits. The pipelined data path of a flit through this switch architecture along with the encoder and decoder blocks is shown in figure 4.12. The energy dissipations as functions of injection load are plotted for each of the three NoC architectures mentioned above. The injection load is measured as the number of flits injected by each core into the network in each cycle. The energy dissipation profiles give the energy dissipated by all messages in the NoC per simulation cycle.
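For readers unfamiliar with e-cube routing, the following Python sketch shows dimension order routing on a 2D mesh. It is a generic illustration and not the simulator used in this work; the (x, y) coordinate convention and the choice of resolving the X dimension first are assumptions.

```python
def ecube_next_hop(current, dest):
    """Next switch on the dimension-ordered path: correct X first, then Y."""
    x, y = current
    dest_x, dest_y = dest
    if x != dest_x:
        return (x + (1 if dest_x > x else -1), y)
    if y != dest_y:
        return (x, y + (1 if dest_y > y else -1))
    return current  # already at the destination

def ecube_path(src, dest):
    """Full list of switches visited from src to dest, inclusive."""
    path = [src]
    while path[-1] != dest:
        path.append(ecube_next_hop(path[-1], dest))
    return path

print(ecube_path((0, 0), (3, 2)))
```

Because every packet resolves the dimensions in the same fixed order, the route between a given source and destination is deterministic and its hop count equals the Manhattan distance.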

Simulations were performed using 65nm standard cell libraries from CMP [15]. The clock cycle was assumed to be 400 ps, which is typical for this process [20]. A large set of data patterns were fed into the gate-level netlists of the switch blocks and codecs and by running Synopsys<sup>TM</sup> Prime Power their energy dissipation was obtained.

All the schemes have different numbers of bits in the encoded flit. A fair comparison in terms of energy savings demands that the redundant wires also be taken into account while comparing the energy dissipation profiles. The metric used for comparison thus takes into account the savings in energy due to the reduced crosstalk and reduced voltage level on the wires, the additional energy dissipated by the codecs, the extra redundant wires and the interface circuitry used to achieve reduced voltage swing on the interconnect. Energy dissipated by the retransmission buffers and control signals requesting retransmissions for the ED and JTEC-SQED schemes is also considered. An uncoded 32-bit wide flit is considered as the standard for comparison. The switch blocks and the codecs are driven with the nominal V<sub>dd</sub> of 1V, whereas the inter-switch wires are driven by the lowered voltage swing as explained. To achieve the lower voltage swing on the interconnects the Level Converting Register (LCR) [21] interface was incorporated in the switch blocks. This particular interface circuitry enables a quadratic reduction in the energy dissipation on the inter-switch wires due to the use of NMOS only push-pull drivers driven by a lower voltage signal [21]. As the coding schemes under consideration have different numbers of encoded bits in a flit their interface energy values also vary. The total NoC energy dissipation in a single clock cycle can be obtained using equations 4.24 and 4.25.

Figure 4.13. Energy Dissipation Profile for the Mesh based NoC.

It may be noted that, as shown in equation 4.26, the absolute value of the savings in energy dissipation remains unchanged irrespective of the particular switch implementation. However, the percentage savings over the uncoded baseline case depends on the energy dissipation of the switch and hence may vary with the particular implementation style. The energy dissipation of NoC switches has been shown to vary widely [22] [23] [24]. Irrespective of the particular switch design, however, the absolute savings in energy due to coding remain unchanged.

Figure 4.13 shows the energy dissipation profile per cycle for all the coding schemes (ED, DAP, CADEC, JTEC and JTEC-SQED) in a Mesh-based NoC architecture. The channel BER is assumed to be  $10^{-20}$  [9] in these simulations.

The energy expenditure per cycle is least in the case of JTEC-SQED, followed by JTEC, as those can reduce the voltage swing more than any of the other schemes due to their quadruple error detection and triple error correction capability, as discussed in section 4.1.7. In addition to this, the joint codes (DAP, CADEC, JTEC and JTEC-SQED) also reduce the effective mutual switching capacitances on the inter-switch wire segments, which is another contributing factor in lowering the energy dissipation. Hence, in the WiNoC the JTEC-SQED scheme is used on the wireline links as it achieves the minimum energy dissipation among all the ECC schemes.

It may be noted at this point that the ECC codecs add timing overheads to the NoC links. Consequently, the latency of packet transfer increases for all the coding schemes compared to the uncoded NoC. Figure 4.14 shows the latency characteristics of all the ECC schemes described here. The codecs of the JTEC and JTEC-SQED schemes are very similar and hence incur the same latency penalties. Hence, they are shown as a single plot labeled JTEC. Due to the relatively complex codecs of the CADEC and JTEC schemes their timing overheads are higher than those of ED or DAP. However, as shown in figure 4.13, JTEC-SQED is the most energy efficient scheme.

Figure 4.14. Variation of Average Message Latency with injection Load for a wireline Mesh NoC.

# 4.2 Error Control Coding for the Wireless Links

The performance of the wireless links in the WiNoC depends on the CNT antennas. Like any other nanodevices, CNT antennas are expected to have higher manufacturing defect rates, operational uncertainties and process variability [25]. As mentioned in the earlier subsection, when TDM and FDM are used for channelization of the flit, one antenna element is responsible for the transmission of multiple bits in a flit. Thus, malfunction of one antenna element will affect multiple contiguous bits. In addition, due to multipath reflections from the surface of the substrate and the packaging material, the ratio of the intended line-of-sight (LOS) power to the reflected power can be quite low depending on the locations of the transceivers. Therefore the coding scheme should be robust against both random and burst errors.

As a first step, in the next subsection we present a model of the on-chip wireless channel and evaluate the corresponding bit error rates.

# 4.2.1 Wireless Channel Model

By elevating the chip packaging material from the substrate to create a vacuum for transmission of the high frequency EM waves, LOS communication between WBs using CNT antennas at optical frequencies can be achieved. Techniques for creating such vacuum packaging are already utilized for MEMS applications [26], and can be adopted to make LOS communication between CNT antennas viable. However, reflections from the surfaces of the substrate and the packaging material interfere with the LOS transmitted power. In the channel model we account for the multipath reflections from all 6 surfaces of the packaging as well as the thermal noise coupled to the received signal from the LOS transmission. Figure 4.15 shows the multipath transmission of the signal from the transmitter to the receiver with all the possible reflected rays from the four walls, ceiling and ground. The total received power is given by



Figure 4.15. Multi-path channel model for on-chip wireless links.

$$\begin{aligned} P_{R} = \frac{A_{R}}{4\pi}P_{T} \Bigg| \, & G_{T,LOS} \frac{e^{-j\frac{2\pi}{\lambda}R_{LOS}}}{R_{LOS}} + \Gamma_{1}G_{T1} \frac{e^{-j\frac{2\pi}{\lambda}R_{1}}}{R_{1}} + \Gamma_{2}G_{T2} \frac{e^{-j\frac{2\pi}{\lambda}R_{2}}}{R_{2}} + \Gamma_{3}G_{T3} \frac{e^{-j\frac{2\pi}{\lambda}R_{3}}}{R_{3}} + \Gamma_{4}G_{T4} \frac{e^{-j\frac{2\pi}{\lambda}R_{4}}}{R_{4}} \\ & + \Gamma_{ceiling}G_{T,ceiling}\frac{e^{-j\frac{2\pi}{\lambda}R_{ceiling}}}{R_{ceiling}} + \Gamma_{ground}G_{T,ground}\frac{e^{-j\frac{2\pi}{\lambda}R_{ground}}}{R_{ground}} \Bigg|^{2} \end{aligned}$$
(4.28)

Where, $G_{T,LOS}$ is the transmitter antenna gain along the LOS, which is shown to be -5dB [27]. $A_R$ is the area of the receiving antenna and $R_{LOS}$ is the LOS distance between the transmitter and receiver. $R_1$, $R_2$, $R_3$, $R_4$, $R_{ceiling}$ and $R_{ground}$ are the distances along the different reflected paths. $\Gamma_1$, $\Gamma_2$, $\Gamma_3$, $\Gamma_4$, $\Gamma_{ceiling}$ and $\Gamma_{ground}$ are the coefficients of reflection of the surfaces. $G_{T1}$, $G_{T2}$, $G_{T3}$, $G_{T4}$, $G_{T,ceiling}$ and $G_{T,ground}$ are the antenna gains along the directions of the reflected paths, which are all less than -10dB. The thermal noise power is given by the equation below.

$$N_0 = kT_0F = kT_0 \left[ \frac{T_{antenna}}{T_0} + F_r \right]$$
(4.29)

Where, k is the Boltzmann constant, $T_0$ is the room temperature taken as 290K, $T_{antenna}$ is the temperature of the antenna assumed to be 330K [28] and $F_r$ is the receiver noise figure of 4dB [29].

Figure 4.16. SNR over the die area due to multipath radiation from a transmitter placed in the first subnet at its centre (X=2.5mm, Y=2.5mm).

The coupling of the chip switching noise to the wireless channels is negligible as the wireless channels are in very high frequency bands of a few THz. Hence, the Signal-to-Noise Ratio (SNR) is given by

$$SNR = \frac{P_R}{N_0}.$$
(4.30)
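The channel model of equations 4.28 to 4.30 can be sketched in Python as below. This is an illustrative sketch under assumptions rather than the model implementation used in this work: the gains are converted from dB to linear power ratios and used exactly as they appear in equation 4.28, the bracket in equation 4.29 is read as the sum $T_{antenna}/T_0 + F_r$ with $F_r$ in linear units, and the transmit power, antenna area, wavelength and distances in the usage lines are hypothetical placeholders.

```python
import cmath
import math

BOLTZMANN = 1.380649e-23  # J/K

def db_to_linear(db):
    """Convert a power ratio in dB to linear units."""
    return 10.0 ** (db / 10.0)

def received_power(p_t, a_r, wavelength, paths):
    """Equation 4.28. paths holds (gain, reflection_coeff, distance) per ray;
    the LOS ray uses reflection_coeff = 1."""
    total = sum(
        gamma * gain * cmath.exp(-2j * math.pi * dist / wavelength) / dist
        for gain, gamma, dist in paths
    )
    return a_r / (4.0 * math.pi) * p_t * abs(total) ** 2

def noise_density(t_antenna=330.0, noise_figure_db=4.0, t0=290.0):
    """Equation 4.29 with the temperatures and noise figure quoted in the text."""
    return BOLTZMANN * t0 * (t_antenna / t0 + db_to_linear(noise_figure_db))

def snr(p_r, n0):
    """Equation 4.30."""
    return p_r / n0

# Hypothetical LOS-only example: doubling the distance quarters the received power.
los_gain = db_to_linear(-5.0)
near = received_power(1e-3, 1e-10, 1e-6, [(los_gain, 1.0, 5e-3)])
far = received_power(1e-3, 1e-10, 1e-6, [(los_gain, 1.0, 10e-3)])
print(near / far, noise_density())
```

Reflected rays can be added to the `paths` list with their own gains, reflection coefficients and path lengths to study how much they perturb the LOS-only result.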

Figure 4.16 shows the variation of the SNR over the 20mm x 20mm die area for a fixed position of the transmitter. The transmitter in this case is placed at the centre of the subnet at the near corner in figure 4.16, where the received SNR is the maximum. Even though there is multipath reflection from the substrate and packaging walls, the reflected power is negligible, as signified by the quadratic variation of the received SNR with distance in figure 4.16. This is due to the low coefficient of reflection of the reflecting surfaces and the high directivity of the antennas, which attenuate the radiation in directions other than the LOS. The SNR vs. BER characteristics for the wireless links correspond to the adopted modulation scheme, which is non-coherent OOK. Figure 4.17 shows the variation of BER with SNR for a non-coherent OOK receiver. The particular configuration of the wireless links established for optimal network performance by simulated annealing places the 24 links between specific subnets in the WiNoC. The SNR and hence the BER corresponding to those links are also marked with red circles in the plot. The SNR and hence the BER on those 24 links vary, as each link is of a different length resulting in different path loss and reflected radiation patterns. However, some of the 24 links have the same SNR and hence appear as the same point on the plot. As can be seen, the highest BER for the wireless links on the chip is around $4 \times 10^{-4}$. Since this is the highest BER among all the wireless links it can be referred to as the effective BER of the WiNoC. This effective BER of the WiNoC is much higher compared to the BER of wireline links, which is typically around $10^{-20}$ to $10^{-15}$ as noted in [9]. Hence, we propose using powerful multiple/burst error correction codes for the wireless channels. In the next subsection we describe the particular ECC proposed to enhance the reliability of the wireless links.

Figure 4.17. SNR vs. BER plot of the wireless channel with and without Coding.

## 4.2.2 Proposed Product Code for the Wireless Links

In order to achieve simultaneous random and burst-error correction on the wireless links simple Hamming code based Product Codes (H-PC) are proposed in this work. In [30] the authors have already shown that Product Codes designed from simple single error correcting Hamming codes in 2 dimensions can perform better than multiple error correction codes like Bose-Chaudhuri-Hocquenghem (BCH) codes or Reed-Solomon (RS) codes in terms of trade-offs between overall performance and overhead. In this work we propose to use a simple Product Code which achieves multiple error correction as well as burst error correction of data transmitted through the wireless links. Figure 4.18 shows the schematic structure of the Product Code encoder. Considering a flit size of $k_1$ bits, a $(n_1, k_1)$ Hamming coding is performed in the spatial dimension on the flits, and in a block of $k_2$ $(n_1, k_1)$ Hamming encoded flits a $(n_2, k_2)$ Hamming encoding is done in the time dimension to give a $(n_1 \times n_2, k_1 \times k_2)$ Product Code. In our work, we chose a (38, 32) Hamming code in the spatial dimension to encode a whole 32 bit flit at a time. In the time dimension a (7, 4) Hamming code is chosen to minimize the latency and buffering overheads of storing a block of bigger size. Any bigger code would cause higher buffering requirements at the wireless nodes in the WiNoC.

The Product Code decoder utilizes a row-column decoding technique where first the (38, 32) Hamming decoder operates on the received columns of the block and then the (7, 4) Hamming decoders decode the rows to give back the 4 received 32 bit flits. This decoding technique is referred to as the column-first decoding technique. In order to mask the latency penalty of the decoding, the decoder is designed such that the (38, 32) Hamming decoder operates on the received flits as they arrive and 32 parallel (7, 4) Hamming decoders then operate on the received 32 bit flits. This minimizes the latency overhead of the H-PC decoder.
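The encoding and column-first decoding described above can be sketched in Python as follows. This is a behavioural illustration only, not the RTL design of this work: the particular parity-check columns chosen for the shortened (38, 32) and the (7, 4) Hamming codes, the systematic bit ordering and all function names are assumptions made for the example.

```python
import random

def make_columns(n, k):
    """Parity-check columns of a systematic (possibly shortened) Hamming code."""
    r = n - k
    data_cols = [c for c in range(1, 1 << r) if c & (c - 1)][:k]  # not powers of two
    parity_cols = [1 << i for i in range(r)]
    return data_cols + parity_cols

def syndrome(word, cols):
    s = 0
    for bit, col in zip(word, cols):  # zip stops at the shorter sequence
        if bit:
            s ^= col
    return s

def encode(data, cols):
    """Append parity bits so that the syndrome of the codeword is zero."""
    r = len(cols) - len(data)
    s = syndrome(data, cols)
    return list(data) + [(s >> i) & 1 for i in range(r)]

def correct(word, cols):
    """Single error correction: flip the bit whose column equals the syndrome."""
    word = list(word)
    s = syndrome(word, cols)
    if s in cols:
        word[cols.index(s)] ^= 1
    return word

SPACE = make_columns(38, 32)  # spatial code, one codeword per flit
TIME = make_columns(7, 4)     # time code, across flits of a block

def pc_encode(flits):
    """4 flits of 32 bits -> block of 7 flits of 38 bits."""
    coded = [encode(f, SPACE) for f in flits]
    across = [encode([f[i] for f in coded], TIME) for i in range(38)]
    return [[across[i][t] for i in range(38)] for t in range(7)]

def pc_decode(block):
    """Column-first decoding: per-flit (38, 32), then 32 parallel (7, 4)."""
    flit_data = [correct(f, SPACE)[:32] for f in block]
    across = [correct([f[i] for f in flit_data], TIME)[:4] for i in range(32)]
    return [[across[i][t] for i in range(32)] for t in range(4)]

rng = random.Random(1)
flits = [[rng.randint(0, 1) for _ in range(32)] for _ in range(4)]
block = pc_encode(flits)
for i in range(3, 9):   # spatial burst: six adjacent bits of one flit
    block[2][i] ^= 1
block[0][10] ^= 1       # single random errors in two other flits
block[5][20] ^= 1
decoded = pc_decode(block)
print(decoded == flits)
```

In this example the per-flit decoder removes the isolated errors but cannot repair the burst; the time-dimension decoders then see at most one wrong bit per bit position and recover the original flits.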

## 4.2.3 Residual BER of the wireless channel with H-PC

Figure 4.18. Schematic structure of the proposed H-PC encoder

In order to estimate the effectiveness of the proposed coding scheme we perform a residual BER analysis after implementing the ECC. To do this we identify the error events that are uncorrectable by the Product Code. Figure 4.19 shows the various error patterns in a block of size $n_1 \times n_2$ for the row-column decoding technique. The scenario shown in figure 4.19(a) has a spatial burst along a particular flit represented by the shaded column, as well as single bit random errors in other flits. This can be corrected if column decoding is done first followed by row decoding, as in the adopted column-first decoding. The column decoding will correct all the random errors but not the burst, which, however, will be corrected by row decoding. For the scenario in figure 4.19(b) with a single burst in time and some random errors, this column-first decoding scheme will correct all errors if the uncorrectable error patterns on the columns produce correctable error patterns on the rows after column decoding. This case is, however, not likely to occur because each antenna is responsible for transmission of multiple bits in the same flit before transmitting bits of another flit. The case shown in figure 4.19(c) with a burst in each direction can be corrected completely, as column decoding will correct the burst in time except the top left bit. The resulting pattern is only a burst in space, which can be corrected by row decoding. A rectangular pattern of four errors, however, cannot be corrected by row-column decoding because double errors occur in both rows and columns. This is the most probable uncorrectable event, as only 4 erroneous bits in a whole block can lead to an uncorrectable error pattern. The probability of this event is given by

$$P_{\text{rectangle}} = N_{\text{rectangle}} \, \varepsilon^4 (1 - \varepsilon)^{n_1 n_2 - 4}$$
(4.31)

Where, $N_{\text{rectangle}}$ is the number of such rectangular patterns possible in a block of size $n_1 \times n_2$ as given by equation 4.32 and $\varepsilon$ is the BER without coding.



Figure 4.19. Different correctable error patterns

$$N_{\text{rectangle}} = \sum_{l=2}^{n_1} \sum_{b=2}^{n_2} (n_1 - l + 1)(n_2 - b + 1)$$
(4.32)

Hence, the residual BER with the Product Code is given by

$$\varepsilon_{PC} = \frac{1}{n_1 n_2} \left[ P_{\text{rectangle}} \right]$$
(4.33)
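Equations 4.31 to 4.33 can be evaluated with the short Python sketch below for the (38, 32) x (7, 4) Product Code. It is a sketch under one assumption: the second factor of equation 4.31 is read as the probability that the remaining $n_1 n_2 - 4$ bits are error free, i.e. $(1-\varepsilon)^{n_1 n_2 - 4}$. The function names are illustrative, and the input BER of $4 \times 10^{-4}$ is the approximate worst-case uncoded value quoted earlier.

```python
from math import comb

def n_rectangle(n1, n2):
    """Number of rectangular four-error patterns in an n1 x n2 block (equation 4.32)."""
    return sum(
        (n1 - l + 1) * (n2 - b + 1)
        for l in range(2, n1 + 1)
        for b in range(2, n2 + 1)
    )

def residual_ber(eps, n1=38, n2=7):
    """Residual BER of the Product Code (equations 4.31 and 4.33)."""
    p_rect = n_rectangle(n1, n2) * eps**4 * (1.0 - eps) ** (n1 * n2 - 4)
    return p_rect / (n1 * n2)

# The double sum equals the number of ways to pick two rows and two columns.
print(n_rectangle(38, 7), comb(38, 2) * comb(7, 2))
print(f"{residual_ber(4e-4):.3e}")
```

For an uncoded BER of about $4 \times 10^{-4}$ the result is on the order of $10^{-12}$, the same order as the worst-case coded BER reported for the wireless links; the exact figure depends on the uncoded BER of the specific link.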

The SNR vs. residual BER characteristics of the wireless links after implementing the Product Code are also shown in figure 4.17. The effect of the Product Code is a significantly lower residual BER. The worst case BER, on the link with the lowest SNR, is $1.99 \times 10^{-12}$. Clearly, the BER is lowered, achieving a higher level of reliability compared to the uncoded system. In addition, by using the H-PC scheme the worst case BER of the wireless channel becomes comparable with the BER of the wireline links.

#### 4.3 Experimental Results

In order to characterize the performance of the proposed coding schemes in a WiNoC, we consider a system consisting of 256 cores. As indicated in Chapter 3, to achieve the highest system throughput, the WiNoC is divided into 16 subnets each with 16 cores, and 24 wireless links are distributed among the subnets. As mentioned in Section 3.2 the locations of the antennas are chosen to maximize the corresponding SNR. We assume a die size of 20mm x 20mm. The switch architecture is adopted from [16]. The proposed H-PC code is used in the wireless links of the WiNoC. It is shown in section 4.1 that JTEC-SQED is the most energy efficient coding scheme in the wireline links. Hence, the JTEC-SQED code is used on the wireline links of the WiNoC. The H-PC and JTEC-SQED encoders are placed on each outgoing port of the hub or switch before the wireless or wireline links. The decoders are placed on each port after the links and before the hub or switch.

The network switches, hubs and codecs for the ECCs are synthesized from a RTL level design using 65nm standard cell libraries from CMP [15], using Synopsys Design Vision and assuming a clock frequency of 2.5 GHz. The energy dissipation on the wires was obtained from CADENCE Spectre. The energy dissipation on the wireless links was obtained from equation 3. The WiNoC is simulated using a cycle accurate simulator which models the progress of data flits accurately per clock cycle, accounting for flits that reach their destination as well as those that are dropped. Figure 4.20 shows the packet energy dissipation of the WiNoC with and without the proposed unified ECC scheme. Packet energy dissipation is the energy dissipated in the transfer of one packet from source to destination. For comparison, the packet energy dissipation in a completely wireline mesh NoC with and without ECC is also shown. The ECC used in the wireline links of the subnets is JTEC-SQED as it is the most energy efficient coding scheme as shown in figure 4.13 [31]. The effective BER on the communication links in the corresponding cases is also shown. For the WiNoCs the BER on the wireless channels is much higher than that of the wireline links and hence this represents the effective BER on the communication links. As shown in [31] the packet energy of the wireline mesh reduces due to JTEC-SQED as the voltage swing on the wireline links can be reduced significantly according to equation 15 and shown in figure 4.10. In addition, the crosstalk coupling capacitance on the wires is reduced. The energy savings due to these two factors exceed the overheads due to the codecs and redundant links [31]. The overall reliability or the BER on the wires is the same in both cases, as the reduction in energy dissipation is projected for the same level of reliability or BER on the wires. For the WiNoC without ECC the worst case BER is much higher due to low SNR. But with the powerful H-PC coding on the wireless channels the BER of the wireless channels is reduced to around $10^{-12}$. Hence, the overall reliability of the WiNoC is improved with the coding schemes. The overheads of the H-PC scheme and JTEC-SQED are considered and we find that there is an increase in the packet energy due to the codec overheads of the ECCs compared to the WiNoC without any coding. However, at the cost of this slight increase in packet energy we are able to achieve the same overall BER on the WiNoC as that of the completely wireline mesh NoC. The packet energy dissipation of the WiNoC with ECC, however, still remains several orders of magnitude lower than that of a completely wireline counterpart.

Figure 4.20. Packet energy dissipation and worst case channel BER for WiNoC and mesh architectures with and without ECC.

We also estimate the timing characteristics of the WiNoC in comparison to the wireline mesh. The latency is defined as the number of clock cycles required for the transfer of a data packet from source to destination. The latency of the wireline mesh with JTEC-SQED is higher than that of the mesh without ECC. This is because the ECC codecs add overheads to the critical paths in the NoC switches, since the encoder is added as the last stage and the decoder as the first stage of the switch. The WiNoC without any ECC achieves a much lower latency compared to the wireline network due to the advantages of the long-range wireless shortcuts introduced into the network as well as the hierarchical division of the NoC. Due to the wireless shortcuts in the WiNoC, the average hop-count between cores is much less compared to that of a mesh of the same size. Hence, as shown in [32] the performance of the WiNoC is much better compared to the wireline mesh NoC. However, similar to the JTEC-SQED scheme, the H-PC codec also adds timing overheads to the wireless links. The encoder design requires a (38, 32) encoding on each 32 bit flit; the encoded flits are then stored until 4 flits are received. These 4 flits are then encoded with 38 parallel (7, 4) Hamming encoders to give 7 coded flits. Hence, the overhead of the H-PC encoder is equal to the delay of one (38, 32) Hamming encoder and one (7, 4) Hamming encoder as all the (7, 4) encoders operate in parallel. The delays of the various stages of the H-PC and JTEC-SQED schemes are shown in table 4.2. The latency penalty due to the bandwidth expansion of (38x7)/(32x4)=2.078, the inverse of the code rate, is also taken into account. In the wireline links, however, extra cycles were not required due to the code rate as the redundant bits could be transferred in the same cycle with additional wires. Figure 4.21 shows the overall latency characteristics of the WiNoC and the wireline mesh with and without coding. Due to the high redundancy of the H-PC as well as the codec overheads the overall latency of the WiNoC with coding is higher than the WiNoC without coding. However, even with coding the latency of the WiNoC is much less compared to that of a wireline mesh NoC without any coding. Hence, with coding in WiNoC we can achieve a much lower latency compared to the wireline mesh architecture without ECC.

| Coding Scheme | Encoder Delay (ps) | Decoder Delay (ps) |
|---------------|--------------------|--------------------|
| H-PC          | 150                | 354                |
| JTEC-SQED     | 133                | 315                |

Table 4.2. Delay for Each Coding Scheme

Table 4.3. Area Overhead of the Codec for Each Coding Scheme

| Coding Scheme | Area (µm<sup>2</sup>) |
|---------------|-----------------------|
| H-PC          | 29710                 |
| JTEC-SQED     | 11055                 |

The codecs introduce additional hardware components and hence also require silicon area overheads. Table 4.3 summarizes the area overheads of each of the coding schemes. The reported area is the area of the codec required per port of the switches or hubs.



Figure 4.21. Latency characteristics of mesh and WiNoC architectures with and without ECC.

### 4.4 Conclusions

According to ITRS [20], signal integrity is expected to be an increasingly critical challenge in designing SoCs. The widespread adoption of the NoC paradigm will be possible if it addresses system level signal integrity and reliability issues in addition to easing the design process, and meeting all other constraints and objectives. With shrinking feature size and use of nano-photonic devices one of the major factors affecting signal integrity is transient errors, arising due to temporary conditions of the NoC and environmental factors. In this chapter it is shown that by using ECC schemes on both wireline and wireless links of the WiNoC the overall reliability of the wireless NoC can be improved while still dissipating orders of magnitude less energy in data transfer over the WiNoC fabric.

## 4.5 References

- [1] D. Bertozzi, L. Benini, G. De Micheli, "Error Control Schemes for On-Chip Communication Links: The Energy-Reliability Tradeoff", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 6, June 2005, pp. 818-831.
- [2] D. Sylvester and C. Hu, "Analytical modeling and characterization of deep-submicrometer interconnect," *Proc. IEEE*, vol. 89, no. 5, pp. 634–664, May 2001
- [3] E. Dupont, M. Nicolaidis, P. Rohr, "Embedded Robustness IPs for Transient-Error-Free ICs", IEEE Design and Test of Computers, Volume 19, Issue 3, May-June 2002 pp: 54 – 68.
- [4] N. R. Shanbhag, M. Zhang, "Soft-Error-Rate-Analysis (SERA) Methodology", IEEE Transactions on Computer Aided Design of Circuits and Systems, Vol. 25, Issue 10, Oct. 2006, pp. 2140-2155.
- [5] P. P. Pande, H. Zhu, A. Ganguly, C. Grecu, "Crosstalk-aware Energy Reduction in NoC Communication Fabrics", Proceedings of IEEE International SOC Conference, SOCC 2006, 24th-27th September, 2006, pp: 225-228.
- [6] D. Bertozzi, L. Benini, G. De Micheli, "Error Control Schemes for On-Chip Communication Links: The Energy-Reliability Tradeoff", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 6, June 2005, pp. 818-831.
- [7] S. Murali, G. De Micheli, L. Benini, T. Theocharides, N. Vijaykrishnan, and M. Irwin, "Analysis of Error Recovery Schemes for Networks on Chips," *IEEE Design & Test of Computers*, vol. 22, no. 5, 2005, pp. 434-442.
- [8] D. Rossi, C. Metra, A. K. Nieuwland and A. Katoch, "Exploiting ECC Redundancy to Minimize Crosstalk Impact", *IEEE Design & Test of Computers*, Volume 22, Issue 1, Jan 2005, pp. 59-70.
- [9] S. R. Sridhara, and N. R. Shanbhag, "Coding for System-on-Chip Networks: A Unified Framework", *IEEE Transactions on Very Large Scale Integration (TVLSI) Systems*, vol. 13, no. 6, June 2005, pp. 655-667.
- [10] S. Lin & D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.

- [11] D. Rossi, C. Metra, A. K. Nieuwland and A. Katoch, "Exploiting ECC Redundancy to Minimize Crosstalk Impact", *IEEE Design & Test of Computers*, Volume 22, Issue 1, Jan 2005, pp. 59-70.
- [12] P. P. Pande, A. Ganguly, B. Feero, B. Belzer, C. Grecu, "Design of Low power & Reliable Networks on Chip through joint crosstalk avoidance and forward error correction coding", Proceedings of 21<sup>st</sup> IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT 06), 4<sup>th</sup>-6<sup>th</sup> October, 2006.
- [13] Amlan Ganguly, Partha Pande, Benjamin Belzer, Cristian Grecu, "Design of Low power & Reliable Networks on Chip through joint Crosstalk Avoidance and Multiple Error Correction Coding", Journal of Electronic Testing: Theory and Applications (JETTA), Special Issue on Defect and Fault Tolerance, June 2008, pp. 67-81.
- [14] M. Y. Hsiao, "A Class of Optimal Minimum Odd-weight-column SEC-DED Codes", IBM J. Res. Dev., July 1970, pp. 395-401.
- [15] CMP 90 nm technology library, http://cmp.imag.fr/products/ic/?p=STCMOS090
- [16] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, R. Saleh, "Performance Evaluation and Design Trade-offs for Network on Chip Interconnect Architectures", *IEEE Transactions on Computers*, vol. 54, no. 8, August 2005, pp. 1025-1040
- [17] K. Park, W. Willinger, *Self-similar Network Traffic and Performance Evaluation*, John Wiley & Sons, 2000.
- [18] D. R. Avresky, V. Shubranov, R. Horst, P. Mehra, "Performance Evaluation of the ServerNet<sup>R</sup> SAN under Self-Similar Traffic" Proceedings of 13<sup>th</sup> International and 10<sup>th</sup> Symposium on Parallel and Distributed Processing, April 12-16<sup>th</sup>, 1999, pp. 143-147.
- [19] G. V. Varatkar and R. Marculescu, "On-chip traffic modeling and synthesis for MPEG-2 video applications", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 8, Issue 3, June 2000, pp. 335-339.
- [20] ITRS 2007, http://www.itrs.net/Links/2007ITRS/Home2007.htm
- [21] H. Zhang, V. George, J. Rabaey, "Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 8, Issue 3, June 2000, pp. 264-272.
- [22] S. Murali, G. De Micheli, L. Benini, T. Theocharides, N. Vijaykrishnan, and M. Irwin, "Analysis of Error Recovery Schemes for Networks on Chips," IEEE Design & Test of Computers, vol. 22, no. 5, 2005, pp. 434-442.
- [23] A. Kumar et al., "A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS", Proceedings of IEEE International Conference on Computer Design (ICCD), 7-10 October, 2007

- [24] D. Milojevic, I. Montperrus, and D. Verkest, "Power dissipation of the network-on-chip in a system-on-chip for MPEG-4 video encoding", Proceedings of IEEE Asian Solid-State Circuits Conference, 2007. ASSCC, pp. 392-395
- [25] R. I. Bahar et al. "Architectures for Silicon Nanoelectronics and Beyond," IEEE Computer, Vol. 40, Issue 1, January 2007, pp. 25-33.
- [26] T.S. Marinis, et. al., "Wafer level vacuum packaging of MEMS sensors," Proc. of Electronic Components and Technology Conference, 2005. 31 May-3 June 2005, Vol. 2, pp.1081 - 1088.
- [27] K. Kempa, et al., "Carbon Nanotubes as Optical Antennae," Advanced Materials, vol. 19, 2007, pp. 421-426.
- [28] Y. P. Zhang, "Bit-Error-Rate Performance of Intra-Chip Wireless Interconnect Systems", IEEE Communications Letters, Volume: 8, Issue: 1, 2004, pp. 39-41.
- [29] A. Ismail and A. Abidi, "A 3 to 10GHz LNA Using a Wideband LC-ladder Matching Network," Proceedings of IEEE International Solid-State Circuits Conference, 15-19 February, 2004, pp. 384-534.
- [30] B. Fu and P. Ampadu, "Error control combining Hamming and product codes for energy efficient nanoscale on-chip interconnects," IET Computers & Digital Techniques, vol. 4, no. 3, pp. 251-261, May 2010.
- [31] Amlan Ganguly, Partha Pande, Benjamin Belzer, "Crosstalk-Aware Channel Coding Schemes for Energy Efficient and Reliable NoC Interconnects", IEEE Transactions on VLSI (TVLSI) Vol. 17, No.11, November 2009, pp. 1626-1639.
- [32] Amlan Ganguly, Kevin Chang, Sujay Deb, Partha Pande, Benjamin Belzer, Christof Teuscher, "Scalable Hybrid Wireless Network-on-Chip Architectures for Multi-Core Systems", IEEE Transactions on Computers (**TC**), June, 2010, *accepted for publication*.

# **Chapter 5**

# **Conclusions and Future Work**

This chapter concludes the work undertaken in this thesis by summarizing the salient contributions. It also points towards various promising future directions emanating from this research endeavor.

### 5.1 Conclusions

Massive levels of integration are making modern multi-core chips all-pervasive in several domains, ranging from scientific applications such as weather forecasting, astronomical data analysis, and bioinformatics to consumer electronics. Design of multi-core integrated systems beyond the current CMOS era will present unprecedented advantages and challenges, the former related to very high device densities and the latter to soaring power dissipation. According to the 2007 International Technology Roadmap for Semiconductors (ITRS), the contribution of interconnects to chip power dissipation is expected to increase from 51% in the 0.13µm technology generation to up to 80% in the next five-year period. This clearly indicates the challenges that traditional scaling of conventional metal interconnects and material innovation pose for future chip designers. To enhance the performance of conventional metal interconnect-based multi-core chips, a few radically different interconnect technologies are currently being explored, such as 3D integration, photonic interconnects, and multi-band RF interconnects. All these new technologies have been predicted to be capable of enabling multi-core designs that significantly improve the speed and power dissipation of data transfer. However, these alternative interconnect paradigms are in their formative stage and need to overcome significant challenges pertaining to integration and reliability. This opens up new opportunities for multidisciplinary research in the domain of multi-core chip design. So far, all these interconnect technologies have been applied to existing multi-core platforms without significant architectural innovation, which undermines their adoptability in the face of challenges related to reliable manufacturability.

It is shown in this work that NoC architectures inspired by complex network theory and using wireless interconnects can achieve significant performance benefits compared to traditional wireline NoCs under both synthetic and real application-based network traffic. It is also shown that the wireless interconnect-enabled WiNoC can perform better than NoCs with other alternative interconnect technologies. In addition, it is demonstrated that by using efficient and simple error control codes specifically designed for on-chip links, it is possible to enhance the reliability of such wireless NoCs while still achieving considerable gains in performance and energy dissipation.

#### 5.2 Future Directions

The research performed for this thesis can be carried forward in several far-reaching directions, as discussed below:

#### 5.2.1 Wireless NoCs with millimeter-wave Interconnects

The small-world topology based NoC can alternatively be designed using on-chip millimeter (mm)-wave wireless links, implemented in traditional CMOS technology, as long-range communication channels between distant cores in a NoC. Recent investigations have established the characteristics of the silicon-integrated on-chip antenna operating in the mm-wave range of a few tens to a hundred GHz, and it is now a viable technology [1]. Coupled with significant advances in mm-wave transceiver design, this opens up new opportunities for detailed investigations of mm-wave wireless NoCs (mWNoC). Even though such mm-wave technology will provide a lower aggregate wireless bandwidth than the CNT-based technology, the former is readily CMOS compatible and is hence a more near-term realizable solution. Hence, the advantages of the architectural novelties outlined in this work can be demonstrated on real silicon test vehicles sooner than by using CNT antennas. Consequently, this technology may be adopted by mainstream industry faster than the CNT-based technology.

#### 5.2.2 Extension of the ECC schemes

The Hamming code based product coding used on the wireless links can be modified to make the product code stronger. Using multiple error correction codes such as Reed-Solomon or Bose-Chaudhuri-Hocquenghem (BCH) codes will enable correction of multiple bursts and handling of very high error rates [2]. However, the overheads of multiple error correcting codes need to be investigated, as such codes are typically more complex than SEC Hamming codes. Moreover, for wireless networks with mm-wave antennas and transceivers, the events causing transmission errors will be different and hence will require new investigations. Depending upon the particular coding schemes adopted, the error rates on the mm-wave channels may be different and may potentially require different coding schemes. Design of such coding schemes tailored for the mm-wave wireless NoC is also part of future research goals.
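The limitation of a single-error-correcting (SEC) component code that motivates stronger codes can be seen in a small sketch. It uses a (7, 4) Hamming code in an assumed systematic bit ordering with plain syndrome decoding; the helper names are hypothetical. A single bit error in a codeword is corrected, whereas two bit errors in the same codeword are silently miscorrected.

```python
# Illustrative sketch only: bit ordering [d1 d2 d3 d4 p1 p2 p3] is an assumption.

def encode(d):
    """Systematic (7, 4) Hamming encoder."""
    d1, d2, d3, d4 = d
    return [d1, d2, d3, d4, d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4]


# Syndrome (s1, s2, s3) -> index of the single bit assumed to be in error.
SYNDROME_TO_POS = {
    (1, 1, 0): 0, (1, 0, 1): 1, (0, 1, 1): 2, (1, 1, 1): 3,
    (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6,
}


def decode(received):
    """Syndrome decoder: flips the bit indicated by a nonzero syndrome."""
    r = list(received)
    syndrome = (
        r[0] ^ r[1] ^ r[3] ^ r[4],
        r[0] ^ r[2] ^ r[3] ^ r[5],
        r[1] ^ r[2] ^ r[3] ^ r[6],
    )
    if syndrome != (0, 0, 0):
        r[SYNDROME_TO_POS[syndrome]] ^= 1
    return r[:4]


def flip(word, positions):
    """Return a copy of `word` with the bits at `positions` inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


data = [1, 0, 1, 1]
codeword = encode(data)
print(decode(flip(codeword, [5])))     # [1, 0, 1, 1]  single error corrected
print(decode(flip(codeword, [0, 1])))  # [0, 1, 0, 1]  double error miscorrected
```

In the second case the decoder sees a valid single-error syndrome and introduces a third error, which is the kind of failure that multiple error correcting component codes are meant to avoid.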

### 5.2.3 Complex Network based WiNoC architectures

It is imperative that emerging interconnect paradigms replace or at least augment traditional metal interconnects for on-chip communication in future many-core chips. This will enable many-core chips to meet the power-performance requirements of extremely demanding target application areas. To achieve this paradigm shift in the design of interconnect infrastructures for massive multi-core chips, fault tolerance in inherently unreliable technology must be addressed with radical and effective techniques. A complex network theory based interconnection architecture is a step in this direction. Theoretical studies in complex networks show that certain types of network connectivity are inherently more resilient to faults and failures [3]. Adopting novel architectures inspired by complex network theory in conjunction with the emerging interconnection technologies will enable the design of high-performance, robust multi-core chips.

Each of the emerging interconnect technologies, viz., 3D integration, photonic NoCs, and wireless/RF NoCs, poses significant challenges related to reliable integration. The vertical Through-Silicon Via (TSV) is an enabling technology for 3D integration of chips. However, misalignment of layers during 3D stacking can result in TSV failure, impairing the performance benefits of 3D NoCs. High power densities in 3D chips lead to thermal issues which may aggravate metal via failure rates. The reliable integration of silicon nanophotonic devices and waveguides to make photonic NoCs a reality is a major challenge and is hence a subject of ongoing research. NoCs with RF interconnects require long on-chip transmission lines laid across the chip along with a bank of precision, high-frequency filters and oscillators. Design of such high-precision analog components is non-trivial. Wireless interconnects with either Carbon Nanotube (CNT) based on-chip antennas or mm-wave metal antennas may encounter significant failure rates pertaining to issues of integration and transceiver design, respectively. NoCs using these emerging interconnects demand high performance from inherently unreliable technology. As device geometries continue to shrink with technology scaling, this issue can be expected to become even more important. Hence, traditional fault-tolerance techniques like adaptive routing strategies [4] and error control coding (ECC) [5] will not be sufficient to address these issues, especially with technology scaling.

Such challenges in reliability and integration demand radically different approaches to make these emerging interconnect paradigms viable for large-scale adoption. Natural complex networks often demonstrate surprising robustness against a high degree of malfunction; for example, microbes are known to persist and reproduce even in the presence of harsh external interference. Large networks having a connectivity structure known as scale-free graphs are characterized by a few highly connected nodes and many peripheral nodes with very few connections. Under these conditions, random faults mostly result in failure of nodes that have very few connections, as such nodes form the large majority. However, failure of these nodes only marginally affects the entire network because of their relatively few connections. On the other hand, these networks are very vulnerable to preferential failures of the important, highly connected nodes. In contrast, a connectivity pattern known as small-world graphs is characterized by near-equal connectivity of all nodes. These networks consequently perform similarly under random and preferential failures. Hence, depending on the failure patterns of the emerging interconnect technologies, network architectures inspired by complex network theory can be designed that may provide inherent reliability against such faults.
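The contrast between failures of peripheral nodes and preferential failures of hubs can be made concrete with a toy example. The sketch below builds a small hub-and-spoke graph, which is only a hand-made stand-in for a hub-dominated topology and not a true scale-free generator, and measures the largest connected component after node failures. The graph sizes and function names are assumptions chosen for illustration.

```python
from collections import deque


def build_hub_graph(num_hubs=4, leaves_per_hub=15):
    """Toy hub-dominated graph: fully connected hubs, each with its own leaf nodes."""
    hubs = [f"h{i}" for i in range(num_hubs)]
    adj = {h: set() for h in hubs}
    for i, h in enumerate(hubs):
        for other in hubs[i + 1:]:
            adj[h].add(other)
            adj[other].add(h)
        for j in range(leaves_per_hub):
            leaf = f"h{i}_l{j}"
            adj[leaf] = {h}
            adj[h].add(leaf)
    return adj, hubs


def largest_component(adj, failed):
    """Size of the largest connected component after removing the `failed` nodes."""
    failed = set(failed)
    seen, best = set(), 0
    for start in adj:
        if start in failed or start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in failed and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best


adj, hubs = build_hub_graph()
print(largest_component(adj, ["h0_l0", "h1_l0"]))  # 62: two peripheral nodes fail
print(largest_component(adj, ["h0", "h1"]))        # 32: two hubs fail
```

With 64 nodes, losing two peripheral nodes removes only those two from the connected fabric, while losing two hubs strands all 30 nodes attached to them, mirroring the vulnerability to preferential failures described above.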

The goal of this research would be to explore reliable NoC architectures using emerging interconnect paradigms. Architectures inspired by scale-free or small-world graphs will have different performance characteristics in the presence of various failure patterns. Failure models for these emerging interconnects will be developed. Using these failure models, an extensive study of the impact of interconnect failures on the performance of interconnect infrastructures for multi-core chips will be undertaken. From this systematic exploration, inherently fault-resilient multi-core architectures can be identified, on which interconnect faults will have negligible or marginal effects. Furthermore, novel NoC architectures with high levels of error resilience, tailored to application-specific workloads, will be developed based on complex network theory. These novel fault-tolerant NoCs will be compared for performance with more traditional fault-tolerant techniques based on adaptive routing and ECCs to establish the relative advantages of the proposed complex network-based approach. As an extension, conventional fault-tolerance techniques will be implemented in this environment, which will further enhance the performance of the emerging NoCs in the face of inherent technology-related failures.

## 5.3 Summary

NoC has emerged as an enabling solution for integration of a huge number of embedded cores on a single die. However, the limitations of conventional metal/dielectric-based interconnects in a multi-hop NoC need to be addressed using an amalgamation of novel interconnect technology and efficient architecture design. The inherent unreliability of heterogeneous technologies on a single die needs to be addressed from a fundamental perspective of fault-tolerant architecture design and built-in error-recovery mechanisms. Through such careful design, future NoCs will be able to meet the target performance demands of on-chip interconnection infrastructures.

## 5.4 References

- 1. K.K.O et al., "The feasibility of on-chip interconnection using antennas," Proc. of IEEE/ACM International Conference on Computer-Aided Design, 2005. ICCAD-2005, pp. 979-984.
- 2. S. Lin & D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.
- 3. R. Albert, H. Jeong and A. Barabási, "Error and Attack Tolerance of Complex Networks", Nature, Vol. 406, July 2000, pp. 378-382.
- 4. H. Zhu, P. P. Pande, C. Grecu, "Performance Evaluation of Adaptive Routing Algorithms for achieving Fault Tolerance in NoC Fabrics," Proceedings of 18th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2007, July 9th-11th, 2007.
- 5. A. Ganguly, P. Pande and B. Belzer, "Crosstalk-Aware Channel Coding Schemes for Energy Efficient and Reliable NoC Interconnects", IEEE Transactions on VLSI (TVLSI) Vol. 17, No.11, November 2009, pp. 1626-1639.

# **Appendix A**

## **Publications**

The following publications appeared in reputed journals and conferences during the course of this research.

### **Book Chapters:**

- Partha Pratim Pande, Cristian Grecu, Amlan Ganguly, Andre Ivanov, and Resve Saleh, "Test and Fault Tolerance of NoC Infrastructures", In *Networks-on-Chips: Theory and Practice*, Fayez Gebali, Haytham Elmiligi, and M. Watheq El-Kharashi (eds.), Taylor & Francis Group LLC - CRC Press.

## Journals:

- Amlan Ganguly, Kevin Chang, Sujay Deb, Partha Pande, Benjamin Belzer, Christof Teuscher, "Scalable Hybrid Wireless Network-on-Chip Architectures for Multi-Core Systems", IEEE Transactions on Computers (TC), June, 2010, accepted for publication.
- Amlan Ganguly, Partha Pande, Benjamin Belzer, "Crosstalk-Aware Channel Coding Schemes for Energy Efficient and Reliable NoC Interconnects", IEEE Transactions on VLSI (TVLSI) Vol. 17, No.11, November 2009, pp. 1626-1639.
- Amlan Ganguly, Partha Pande, Benjamin Belzer, Cristian Grecu, "Design of Low power & Reliable Networks on Chip through joint Crosstalk Avoidance and Multiple Error Correction Coding", Journal of Electronic Testing: Theory and Applications (JETTA), Special Issue on Defect and Fault Tolerance, June 2008, pp. 67-81.

- Partha Pande, Amlan Ganguly, Haibo Zhu, Cristian Grecu, "Energy Reduction through Crosstalk Avoidance Coding in Networks on Chip", Journal of System Architecture (JSA), Vol. 54/3-4, March-April 2008, pp. 441-451.

#### **Conferences:**

- Sujay Deb, Kevin Chang, Amlan Ganguly and Partha Pande, "Comparative Performance Evaluation of Wireless and Optical NoC Architectures", Proceedings of IEEE International SOC Conference (SOCC), 27<sup>th</sup>-29<sup>th</sup> September 2010.
- Sujay Deb, Amlan Ganguly, Kevin Chang, Benjamin Belzer, Deuk Heo, "Enhancing Performance of Network-on-Chip Architectures with Millimeter-Wave Wireless Interconnects", Proceedings of IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2010.
- Partha Pande, Amlan Ganguly, Kevin Chang, Christof Teuscher, "Hybrid Wireless Network-on-Chip: A New Paradigm in Multi-Core Design", *invited paper*, Second International Workshop on Network-on-Chip Architectures (NoCArc), December 12, 2009.
- Amlan Ganguly, Kevin Chang, Partha Pratim Pande, Benjamin Belzer and Alireza Nojeh, "Performance Evaluation of Wireless Networks on Chip Architectures", Proceedings of the IEEE International Symposium on Quality Electronic Design (ISQED), 16<sup>th</sup>-18<sup>th</sup> March 2009.
- Partha Pande, Amlan Ganguly, Benjamin Belzer, Alireza Nojeh, Andre Ivanov, "Novel Interconnect Infrastructures for Massive Multicore Chips", Proceedings of IEEE Symposium on Circuits and Systems (ISCAS), May, 2008, pp. 2777 - 2780.

- A. Nojeh, P. Pande, A. Ganguly, S. Sheikhaei, B. Belzer and A. Ivanov, "Reliability of wireless on-chip interconnects based on carbon nanotube antennas," Proceedings of IEEE International Mixed-Signals, Sensors, and Systems Test Workshop (IMS3TW) June 2008, pp. 1-6.
- Amlan Ganguly, Partha Pande, Benjamin Belzer, Cristian Grecu, "Addressing Signal Integrity in Networks on Chip Interconnects through Crosstalk-Aware Double Error Correction Coding", Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2007, May, 2007, pp. 317 - 324.
- Partha Pande, Amlan Ganguly, Brett Feero, Cristian Grecu, "Applicability of Energy Efficient Coding Methodology to Address Signal Integrity in 3D NoC Fabrics", Proceedings of IEEE International ON-line Test Symposium (IOLTS), July, 2007, pp. 161-166.
- Partha Pande, Amlan Ganguly, Brett Feero, Benjamin Belzer, Cristian Grecu, "Design of Low Power & Reliable Networks on Chip through Joint Crosstalk Avoidance and Forward Error Correction Coding", Proceedings of IEEE Defect and Fault Tolerance in VLSI Systems (DFT), 2006, pp. 466 – 476.
- Partha Pande, Haibo Zhu, Amlan Ganguly, Cristian Grecu, "Energy Reduction through Crosstalk Avoidance Coding in NoC Paradigm", Proceedings of IEEE EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools (DSD) 2006, pp. 689 – 695.
- Partha Pratim Pande, Haibo Zhu, Amlan Ganguly, Cristian Grecu, "Crosstalk-aware Energy Reduction in NoC Communication Fabrics", Proceedings of IEEE International SOC Conference (SOCC), 2006, pp. 225 – 228.