# Evaluation of Multiple-Valued Packet Multiplexing Scheme for Network-on-Chip Architecture

Haque Mohammad Munirul<sup>†</sup>, Tomoaki Hasegawa<sup>†</sup> and Michitaka Kameyama<sup>‡</sup> Graduate School of Information Sciences, Tohoku University Aoba-yama 6-6-05, Aoba-ku, Sendai 980-8579, Japan <sup>†</sup>{topusumi,hase}@kameyama.ecei.tohoku.ac.jp, <sup>‡</sup>kameyama@ecei.tohoku.ac.jp

## Abstract

This paper presents an evaluation of a multiple-valued packet multiplexing scheme for a Network-on-Chip (NoC) architecture. In the NoC architecture, data is transferred from one Processing Element (PE) to another through routers in the form of packets. A router suitable for both binary and multiple-valued packets is constructed using Multiple-Valued Source-Coupled Logic circuits. A packet is composed of flag, destination PE address and data fields. Packets are generated by microprogram control. In the proposed scheme, two binary packets are multiplexed if their destination PE addresses are the same. Based on address matching, packets are transferred from a source PE to a destination PE autonomously. As a result, the total number of packets can be reduced. The router is designed using a 0.18 µm CMOS design rule. HSPICE simulation results show that the router delay is small enough for high-speed packet transfer. The reduction of microprogram control storage is remarkable in the proposed scheme because the data transfer is done autonomously. The advantage is evaluated by a simple analysis and compared with a conventional pipelined bus architecture.

## 1 Introduction

System-on-Chip (SoC) platforms, integrating a large number of computational logic and storage blocks on a single chip, already exist [1]. Because of the on-chip physical interconnection complexity of such systems, the on-chip bus architecture has evolved into a network architecture, namely the Network-on-Chip (NoC) [2]. Researchers are mainly focusing on the NoC to meet the distinctive challenges of providing functionally correct, reliable operation for the interacting SoC macro modules.

As the components, i.e., the macro modules of the SoC, are also becoming complex, the interconnection topology between Processing Elements (PEs) is likely to face similar challenges. In this paper, we present a multiple-valued packet multiplexing scheme for a NoC architecture consisting of a micronetwork. The micronetwork has double transmission lines and routers, where each PE is connected to a router. Two binary packets are multiplexed into a single multiple-valued (MV) packet if their destination PE addresses are the same. The multiplexed packet is transferred between the PEs using a single transmission line. Thus, the total number of packets in the micronetwork can be reduced and the throughput can be increased. A simple routing protocol is used to keep the router circuit simple. The router is designed using Multiple-Valued Source-Coupled Logic (MVSCL) circuits [3].

This paper describes the VLSI implementation and evaluation of the router using a $0.18\,\mu$m standard CMOS design rule. Based on the evaluation result, the contribution of the proposed scheme to reducing the size of the microprogram control storage is discussed. The condition for area reduction in comparison with a conventional pipelined bus architecture is derived mathematically, and different cases are analyzed. The comparison shows that the proposed scheme remarkably reduces the size of the microprogram control storage in all the cases when a large number of PEs is used.

## 2 NoC architecture

In the proposed NoC architecture [4] the micronetwork is constructed using double transmission lines (left $\rightarrow$ right and right $\rightarrow$ left) and routers, as shown in Fig. 1. Each router is directly connected to a Processing Element (PE). To achieve a simple router design, we introduce a linearly ordered node number according to the layout distance for each node on the micronetwork. Each PE and the directly connected router have the same address. Data is transferred between the PEs in the form of a packet.

The processing sequence is given by a Control/Data Flow Graph (CDFG) as shown in Fig. 1, where each node of the CDFG corresponds to an arithmetic operation and each edge corresponds to a node-to-node packet transfer. For a given CDFG, let us assume that scheduling and allocation are done in advance in order to avoid packet collision. Each node of the CDFG is allocated to a PE. Node-to-node, i.e., PE-to-PE, packet transfer is done through the micronetwork.

Figure 1. NoC architecture

Figure 2. Block diagram of the microprogram control unit

In a source PE, packets are generated by microprogram control. The microprogram control unit is composed of a Microprogram Memory (MM), which stores the control signals, and a control circuit, which generates them, as shown in Fig. 2. The packet information is stored in the MM. Once a PE sends a packet into the micronetwork, the direction of the packet transfer is determined autonomously by magnitude comparison of the addresses. If the addresses of the packet and the router match, the packet is received by the corresponding PE; otherwise the packet is transferred to an adjacent router in a pipelined manner.
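The address-based routing decision described above can be sketched as follows. This is only an illustrative model, not the router circuit itself: the function names `route` and `trace` are hypothetical, and the sketch assumes that node numbers increase from left to right, which the text does not state explicitly.

```python
from typing import List


def route(dest_address: int, router_address: int) -> str:
    """Decide what a router does with a packet by comparing addresses.

    Assumption: node numbers increase from left to right.
    """
    if dest_address == router_address:
        return "receive"  # addresses match: hand the packet to the local PE
    # Magnitude comparison picks the transmission line to use.
    return "right" if dest_address > router_address else "left"


def trace(source: int, dest: int) -> List[int]:
    """List the routers a packet visits, one hop per pipeline step."""
    path = [source]
    current = source
    while route(dest, current) != "receive":
        current += 1 if route(dest, current) == "right" else -1
        path.append(current)
    return path


print(trace(1, 4))  # routers visited on the left-to-right line
print(trace(5, 2))  # routers visited on the right-to-left line
```

Only the injection at the source needs timing control from the MM in this model; every later hop follows from the comparison alone.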

Packet fields:

| Field   | Group  | Width           |
|---------|--------|-----------------|
| Flag    | Header | 2 bits          |
| Address | Header | $\log_2 N$ bits |
| Data    | Data   | (not specified) |

Flag information:

| Value | Packet type        |
|-------|--------------------|
| 0     | Invalid Packet     |
| 1     | Component Packet1  |
| 2     | Component Packet2  |
| 3     | Multiplexed Packet |

Figure 3. Packet format

## 3 Multiple-valued packet multiplexing

In order to avoid a packet collision, two packets are scheduled such that they do not reach a router simultaneously. However, if the destination PE addresses of the packets are the same, they are multiplexed in the router. The packet that reaches the router earlier is scheduled to wait inside the router for the other packet. The multiplexing is done by linear summation of the packets, and as a result the multiplexed packet becomes a multiple-valued (MV) one. The MV packet is transferred using only one transmission line. In current-mode logic, the linear summation is implemented just by wiring, without any active devices. However, the linear summation of the two packets must preserve the arithmetic and logic information needed in the destination PE.

#### 3.1 Packet format

The packet consists of data and header fields. The header contains the flag and the destination address. The flag determines the type of the packet as shown in Fig. 3. Component Packet1, which has the flag value 1, is scheduled to wait inside the router to be multiplexed. Component Packet2, which has the flag value 2, is scheduled to be transferred to the router where Component Packet1 is waiting. If the packets are multiplexed, the flag value is changed to 3. The simple packet format leads to a simple router design. A total of $(\log_2 N + 2)$ bits are required for the header field, where $N$ is the total number of PEs.
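As a minimal sketch of the header layout, the snippet below computes the header width and packs a flag and an address into one integer. The names `header_bits` and `pack_header` are hypothetical. Two choices are assumptions not taken from the paper: the address width is rounded up when $N$ is not a power of two, and the flag is placed in the most significant bits.

```python
FLAG_BITS = 2
FLAG_NAMES = {
    0: "Invalid Packet",
    1: "Component Packet1",
    2: "Component Packet2",
    3: "Multiplexed Packet",
}


def address_bits(num_pes: int) -> int:
    """Bits needed to address num_pes PEs (rounded up; an assumption)."""
    return (num_pes - 1).bit_length()


def header_bits(num_pes: int) -> int:
    """Header width: log2(N) address bits plus 2 flag bits."""
    return address_bits(num_pes) + FLAG_BITS


def pack_header(flag: int, address: int, num_pes: int) -> int:
    """Pack flag and address into one integer, flag in the top bits (assumed order)."""
    if flag not in FLAG_NAMES:
        raise ValueError("flag must be 0, 1, 2 or 3")
    if not 0 <= address < num_pes:
        raise ValueError("address out of range")
    return (flag << address_bits(num_pes)) | address


print(header_bits(16))        # header width for a 16-PE micronetwork
print(pack_header(1, 5, 16))  # Component Packet1 addressed to PE 5
```

For 16 PEs this gives a 6-bit header, which matches the 2 flag bits and 4 address bits of the router implemented in Section 3.3.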

#### 3.2 Example of packet multiplexing for FIR filter

As an example of parallel processing, let us consider a parallel FIR filter operation. The CDFG of an FIR filter is shown in Fig. 4(a). Let us consider two operations, $O_1$ and $O_2$, that are to be done in parallel, where $O_1$ and $O_2$ are denoted by white nodes and black nodes, respectively. We assume that the PEs are arranged as shown in Fig. 4(b) and that each operation is scheduled to be performed within 10 steps. The left side of Fig. 4(b) shows a mapping example of the FIR filter onto the PEs under the time constraint for a non-multiplexing scheme. The dotted circles in the left figure represent simultaneous data transfers that are necessary to satisfy the time constraint. However, this kind of mapping is impossible without increasing the number of transmission lines. Any relaxation of the time constraint or increase in the number of transmission lines directly leads to an increase of the MM area. On the other hand, the right side of Fig. 4(b) shows a mapping example of the FIR filter onto the PEs under the time constraint for the packet multiplexing scheme. The dotted circles in the right figure show that two packets having the same destination are multiplexed, and thus the time constraint is not violated. As no extra transmission line is required, the MM area becomes smaller than that of a non-multiplexing scheme.

Figure 4. Example of parallel processing based on packet multiplexing

#### 3.3 Router implementation

A router suitable for both binary and MV packets is shown in the block diagram of Fig. 5. The circuit designs of the Multiple-Valued Latch (ML), the Multiple-Valued Pass Switch (MPS) and the Functional Pass Switch (FPS) of Fig. 5 are described in [3]. A comparison with a binary router based on HSPICE simulation is also discussed in [3]. The router is implemented using a $0.18\,\mu$m standard CMOS design rule. The packet header size is 6 bits (2 bits for the flag and 4 bits for the address field). Figure 6 shows the layout of the router. The router area is $95\,\mu\text{m} \times 130\,\mu\text{m}$, which is used for evaluation in the following section. As a packet is composed of data and header fields, both data- and header-related circuits are provided in the router. The Header-Related Circuits (HRC) are the address and flag comparators, latches and switches. From the designed layout, the HRC area is evaluated to be $102M$, where $M$ is the area of a binary SRAM. Figure 7 shows the simulated waveforms of the router. In the figure, the router input is one bit of the address and the output is the address comparison result.

Figure 5. Router circuit block diagram

## 4 Comparison and advantages

This section discusses the advantage of the NoC architecture based on multiple-valued packet multiplexing over a pipelined bus architecture using the above evaluation result.

#### 4.1 Pipelined bus architecture

Conventionally, a pipelined bus architecture is used to implement a parallel VLSI processor. Let us consider the architecture shown in Fig. 8. Each PE is directly connected to a Switch Block (SB), which consists of switches and pipeline latches. The SBs are arranged in a linear array. There are two transmission lines, which are used for left $\rightarrow$ right and right $\rightarrow$ left data transmission. Data is transferred according to microprogram control in a bit-parallel manner. All the pipeline latches are synchronized with a system clock. Scheduling and allocation are determined in advance so that data collision can be avoided.

Figure 6. Layout of the router

Figure 7. Simulated waveforms of the router

#### 4.2 Area comparison between the router and the SB

The block diagram of the SB, composed of 4 switches ($SW_1$ to $SW_4$) and 2 pipeline latches, is shown in Fig. 8. The switches are controlled by the control signals $C_1$, $C_2$, $C_3$ and $C_4$, which are stored in the Microprogram Memory (MM) as shown in Fig. 2.

Figure 8. Pipelined bus architecture

In the NoC architecture, a router is used instead of an SB. The router has switches, which are controlled by the address and flag comparators. Because of the HRC, the router is larger than an SB. As the total number of PEs increases, the HRC area also increases. For a total of $N$ PEs, the total area of the header-related circuits, $A_{HRC\_total}$, is given by the following equation:

$$A_{HRC\_total} = N \times \log_4 N \times 102M. \tag{1}$$

#### 4.3 MM area comparison

Let the total area of the MM in the pipelined bus architecture be $A_{data}$, and let the total area of the MM in the NoC architecture with packet multiplexing be $A_{PM}$. Overall area reduction is possible if the following condition is satisfied:

$$A_{HRC\_total} + A_{PM} \le A_{data} \tag{2}$$

Figure 9 shows a single packet transfer between the PEs. In step 1, a single packet from PE<sub>1</sub> is transferred to the router R<sub>1</sub>. Initialization timing control is required in this step. From the next step onwards, no further timing control is required. Once the packet is in the micronetwork, the direction of the packet transfer is controlled by magnitude comparison of the addresses. The packet is transmitted to the left or to the right in a pipelined manner. A $(\log_2 N + 2) \times N$-bit control signal must be stored in the MM for this purpose. We assume that a 1-bit control signal is stored in a binary SRAM. If the area of the SRAM is denoted as $M$, $A_{PM}$ is given by the following equation:

$$A_{PM} = MN(\log_2 N + 2) \tag{3}$$

Figure 9. Single packet transfer

On the other hand, timing control is required for every single step in the pipelined bus architecture. A 4-bit control signal must be stored in the MM for each PE to control the 4 switches of Fig. 8. In the worst case, data is transferred from PE<sub>1</sub> to PE<sub>N</sub> in a pipelined manner. Thus, $A_{data}$ is given by the following equation:

$$A_{data} = 4MN^2 \tag{4}$$
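To make Eqs. (1) to (4) concrete, the following sketch evaluates the three areas in units of $M$ and checks Eq. (2) for a few sample values of $N$. The function names and the sample sizes are hypothetical choices for illustration, and the check does not yet include the effect of multiplexing.

```python
import math


def hrc_total_area(n: int) -> float:
    """Eq. (1): total header-related circuit area, in units of M."""
    return n * (math.log2(n) / 2) * 102  # log4(N) = log2(N) / 2


def pm_area(n: int) -> float:
    """Eq. (3): MM area of the NoC architecture, in units of M."""
    return n * (math.log2(n) + 2)


def bus_area(n: int) -> float:
    """Eq. (4): MM area of the pipelined bus architecture, in units of M."""
    return 4 * n ** 2


for n in (16, 64, 128):  # sample sizes chosen only for illustration
    noc = hrc_total_area(n) + pm_area(n)
    print(n, noc, bus_area(n), noc <= bus_area(n))
```

With these sample values, Eq. (2) holds for $N = 128$ but not for $N = 16$ or $N = 64$, because $A_{data}$ grows with $N^2$ while the NoC terms grow only with $N \log N$.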

#### 4.4 Effect of packet multiplexing

In the packet multiplexing scheme, two binary packets can be multiplexed if their destination PE addresses are the same. Let $x$ be the ratio of such packets to all the packets in the micronetwork, where $0 \le x \le 1$. Multiplexing reduces the number of packets in the micronetwork and increases the throughput. In other words, the provided transmission lines are able to carry more packets. In the best case, when $x = 1$, the capacity of the transmission lines is doubled.

#### 4.5 Condition for area reduction

Let us assume that the throughputs of the micronetwork in the NoC architecture and of the pipelined bus are $P$ and $D$, respectively. In the Pipelined Bus (PB) architecture, the number of data items on the bus is $D$. If there is no packet to be multiplexed in the micronetwork ($x = 0$), then $P$ is equal to $D$. In the best case, when $x = 1$, $P$ is twice $D$. Thus, the relation between $P$ and $D$ is given by the following equation:

$$P = (1+x) \times D \tag{5}$$

Under normalized throughput, the condition for area reduction of the MM is given using Eqs. (1) to (5) as follows:

$$\Rightarrow 102MN\log_4 N + (\log_2 N + 2)MN \le 4MN^2 \times \frac{P}{D}$$

$$\Rightarrow 4(1+x)MN - 104M\log_4 N - 2M \ge 0; \quad N \ge 2$$

Figure 10. MM area comparisons

Figure 10 shows the MM area comparisons for different values of $x$. It is clear from the figure that significant area reduction is possible using the proposed scheme, especially when $N$ is large. Even if there is no packet to be multiplexed in the micronetwork ($x = 0$), significant area reduction is still possible, as shown in the figure. Table 1 shows the minimum number of PEs in each case in order to satisfy the above condition.

Table 1: Minimum number of the PEs to satisfy the area reduction condition

| Case  | x    | Minimum number of PEs |
|-------|------|-----------------------|
| Case1 | 0.25 | 70                    |
| Case2 | 0.50 | 55                    |
| Case3 | 0.75 | 45                    |
| Case4 | 1.00 | 40                    |
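As a rough numerical check, the sketch below evaluates the left-hand side of the area reduction condition, divided by $M$, and confirms that each $(x, N)$ pair listed in Table 1 satisfies it. The function names are hypothetical, and the sketch does not attempt to reproduce how the table values were obtained.

```python
import math


def area_margin(n: int, x: float) -> float:
    """4(1+x)N - 104*log4(N) - 2, i.e. the condition's left-hand side over M."""
    return 4 * (1 + x) * n - 104 * (math.log2(n) / 2) - 2


def satisfies_condition(n: int, x: float) -> bool:
    """True if the area reduction condition holds for n PEs and ratio x."""
    return n >= 2 and area_margin(n, x) >= 0


# (x, minimum number of PEs) pairs copied from Table 1
TABLE_1 = {0.25: 70, 0.50: 55, 0.75: 45, 1.00: 40}

for x, n in TABLE_1.items():
    print(x, n, satisfies_condition(n, x))
```

A positive margin means that the MM area saved relative to the pipelined bus exceeds the area added by the header-related circuits.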

## 5 Conclusion

Network-on-Chip (NoC) is emerging as a viable interconnection architecture for SoC platforms. Our target is to extend the NoC concept to a low granularity level, such as the functional unit level. We believe that in the near future the interconnection complexity within the functional units will impose a serious bottleneck. Therefore, as a starting point, we consider a very simple intra-chip micronetwork, where the PEs are horizontally arranged in a linear array. In this paper, the evaluation of a multiple-valued packet multiplexing scheme for the proposed NoC architecture is presented. The proposed scheme has a significant advantage in reducing the area of the microprogram control storage, thereby increasing parallelism. In future, we shall extend the proposed concept to relatively complex micronetworks such as mesh arrays and octagons. We believe that in the coming billion-transistor era, the packet multiplexing scheme will open up a new paradigm in SoC design.

## References

- [1] P. Magarshack and P. G. Paulin, "System-on-Chip beyond the Nanometer Wall", Proc. Design Automation Conf. (DAC), pp. 419-423 (2003).
- [2] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm", IEEE Computer, vol. 35, no. 1, pp. 70-80 (2002).
- [3] T. Hasegawa, Y. Homma and M. Kameyama, "Multiple-Valued VLSI Architecture for Intra-Chip Packet Data Transfer", Proc. of the 35th IEEE Intl. Symp. on Multiple-Valued Logic, pp. 114-119 (2005).
- [4] Y. Homma, M. Kameyama, Y. Fujioka and N. Tomabechi, "VLSI Architecture Based on Packet Data Transfer Scheme and Its Application", 2005 IEEE International Symposium on Circuits and Systems, pp. 1786-1789 (2005).