This paper presents an on-chip network for a runtime reconfigurable System-on-Chip. The network uses packet-switching with virtual channels. It can provide guaranteed services as well as best effort services. The guaranteed services are based on virtual channel allocation, in contrast to other on-chip networks where guarantees are provided by time-division multiplexing. The network is particularly suitable for systems in which the traffic is dominated by streams.
Introduction
Advances in silicon technology bring, among others, two problems that chip designers have to face: high design complexity and signal integrity [1], [2], [3]. The design complexity problem is the concern that the complexity of a system that fits on a single chip is becoming so high that the time needed to design a completely new system with current design methods and tools is impractically long. For that reason it is foreseen that future Systems-on-Chip (SoCs) will be based mostly on pre-designed IP blocks and will rely on extensive IP reuse. To be practical, such a design methodology needs to be complemented with a unified and simple solution for interconnecting and integrating IP blocks in a system. Currently on-chip buses offer such a solution, but since the bus bandwidth does not scale with the number of IP cores on the chip, the bus will soon become a system bottleneck.
The second problem, the signal integrity problem, arises because with technology scaling transistors get smaller and faster while wires get thinner and slower. Wire delay becomes proportional to the wire length, and a few long wires on a chip can degrade the performance of the entire chip. Thus, the on-chip interconnects become a limiting factor for SoC performance and their physical parameters must be taken into account at an early design stage. The current approach of ad-hoc on-chip wiring becomes questionable, especially for the global (inter-IP) on-chip interconnects, which are longer and slower.
A possible approach for coping with the two problems is to use a Network-on-Chip (NoC) [2], [3], which is a lightweight communication network built on the chip. This network replaces the traditional on-chip bus and provides a scalable high-bandwidth solution for interconnecting the increasing number of IP blocks. In addition, when a regular network topology is used, the global on-chip wires are short and well structured, which helps to cope with the signal integrity problem.
The performance of a network strongly depends on the characteristics of the data traffic generated by the system. Therefore a NoC cannot be evaluated separately, but has to be considered in the context of its operational environment. In this paper we present a network-on-chip solution for a run-time reconfigurable SoC used in mobile multimedia devices. The dynamic nature of such a system requires a flexible NoC solution. Since many of the system applications have real-time requirements, the system and the network have to be predictable. Network predictability is achieved by providing guaranteed network services. The proposed network provides guaranteed throughput (GT) and best effort (BE) services.
We discuss the system and the applications running on it and construct a model of the data traffic in such a system. The model is then used to evaluate the proposed network. Simulations of the network behaviour show that the network can handle the system traffic and can provide the guarantees requested by the applications. The paper is organized as follows: Section 2 gives an overview of the SoC where the proposed network is used. Section 3 presents the network-on-chip and explains how guaranteed services are provided. Section 4 presents simulation results showing that the network can handle the system traffic and can provide the required traffic guarantees. Section 5 presents related work.
System overview
The network proposed in this paper is intended for a Multiprocessor System-on-Chip (MPSoC) for mobile multimedia devices. In such a system streaming applications dominate the demand for computation and communication capacity, and high performance has to be achieved within a limited power budget. The operational environment requires a dynamic system capable of adapting itself at run-time.
The MPSoC consists of a number of processing elements (PEs) arranged in a matrix and connected in a grid by a NoC. A PE consists of a processor with its own code and data memory that can operate independently. Several PEs are general purpose processors shared between the control oriented tasks running in the system. Most of the PEs are domain specific: simplified processors designed to execute quickly and efficiently the computationally intensive algorithms of certain application domains, such as baseband processing, audio/video processing, etc. An example of a domain specific PE is the processing tile proposed by Heysters et al. [4]. In 0.13 μm technology the area of the tile is 2 mm². Nearly 100 such PEs can be fitted on a typical chip of 14 x 14 mm. The next technology generations will allow hundreds of domain specific PEs to be fitted on a single chip.
The system exploits the coarse-grain parallelism in streaming applications. The streaming applications are partitioned into a set of processes that run on separate PEs. The data exchanged between the processes are transported on communication channels handled by the NoC. After partitioning, the streaming applications typically have a structure like that shown in Figure 1. Such a structure is observed for the applications in the domain of baseband processing for wireless communications, like HiperLAN/2, UMTS, Bluetooth and DRM [5], [6]. A similar observation for media processing applications is made by Dally et al. [7]. Two parts can be distinguished in the structure in Figure 1: a processing part and a control part.
The processing part does the actual stream processing. It has a simple pipeline structure. The data from the stream flow through the pipe and each process applies some transformation to them. The processes constructing the pipe (denoted as $P_1, P_2, \ldots, P_n$) typically implement DSP algorithms, e.g. FFT, DCT, FIR, etc. These algorithms are small, computationally intensive and specific to the application domain. Therefore, they are suitable to run on domain specific PEs. The communications between the processes in the pipeline carry the main data stream and therefore require high throughput. Because the pipe usually works under real-time constraints, the communications there are also real-time and hence require GT services. Therefore the communication traffic generated by the processing part of the applications is high throughput GT traffic.
Figure 1. Generalized structure of a streaming application (processing part and control part).
The control part of the application comprises all the application tasks dealing with application organisation and control, run-time adaptation and reconfiguration. Because of their reactive nature these tasks run infrequently and require lighter computation, and are therefore more suitable to run on shared general purpose PEs. The data traffic between the control and the processing part consists of infrequently exchanged short control messages, like event notifications and exchanges of status and parameters. These are not real-time communications and hence require BE services. Therefore the communication traffic generated by the control part is low throughput BE traffic. As a rough estimate for the applications we consider (baseband processing), we observe that 90% of the traffic is GT traffic and 10% is BE traffic.
In summary, the traffic in the stream-processing system for mobile multimedia devices consists of: i) 90% high throughput GT traffic coming from streaming data, ii) 10% low throughput BE traffic coming from short control messages.
The system is dynamic and applications are started and stopped at run-time. When a new application is started, PEs are allocated to its processes and network resources are allocated to its communication channels. The system is centralized: a central authority, which we call the Configuration Manager (CM), starts and stops applications and manages the system resources (PEs, network, etc.) at run-time. The CM runs as a high priority task on one of the general purpose processors. By sending control messages on the network to the PEs, the CM can control and configure each PE separately while the rest of the system is running.
The configuration messages sent by the CM to the PEs in the system also contribute to the BE traffic in the system. The size of these configuration messages depends on the configuration space of the PEs. For example, the total configuration space of the processing tile proposed by Heysters et al. [4] is 2.6 KB. Although this size is not small, configuration messages are generated infrequently, only when a new application is started (at intervals from seconds to hours), and therefore do not contribute significantly to the system traffic: our estimate is less than 0.1% of the system traffic.
Network
Here we present the on-chip network we propose for interconnecting the PEs in the system described in the previous section. It is a packet switching virtual channel network that provides GT as well as BE services.
Each PE in the system is equipped with a network router which it uses for inter-processor communications. The routers are connected in a mesh by full-duplex channels built from two unidirectional channels, one in each direction. We refer to the unidirectional channels as physical channels. The network is a virtual channel network [8]. In a virtual channel network there are several virtual channels (VCs) on each physical channel and data are transported on the VCs. The physical channel is time shared between its VCs on a cycle-by-cycle basis in a round-robin fashion. Cycles are only used by a VC when data are transmitted on it; the idle VCs do not consume cycles. The VCs are separately buffered in FIFOs at the router inputs. In our network, each physical channel is shared between 4 VCs, this number being motivated by the trade-off between performance and buffer area of a virtual channel router studied by Dally [8].
The physical channel is shared in a round-robin fashion, which means that the VCs share the bandwidth of the physical channel equally. The bandwidth is shared only between the VCs that currently transport data; the VCs that are idle do not consume bandwidth. If b is the bandwidth of the physical channel and v is the number of its VCs that are in use, each of these VCs gets a throughput of at least

$TH_{min} = b / v$ (1)

This is the worst case throughput for the v VCs; whatever traffic load is applied to the v virtual channels, their throughput will not go below $TH_{min}$. Since in our network the number of VCs per network channel is 4, the values that $TH_{min}$ can take are b, b/2, b/3 or b/4.
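The sharing rule can be illustrated with a small cycle-level sketch. The function names and the traffic model (one transfer per cycle, every active VC always has data) are assumptions made for illustration and do not describe the router implementation.

```python
from collections import Counter


def round_robin_share(active_vcs, cycles):
    """Count the cycles granted to each active VC under round-robin sharing.

    Assumed model: every VC in active_vcs always has data to send;
    idle VCs are simply not listed and therefore consume no cycles.
    """
    grants = Counter()
    if not active_vcs:
        return grants
    for cycle in range(cycles):
        vc = active_vcs[cycle % len(active_vcs)]  # next VC in turn
        grants[vc] += 1
    return grants


def th_min(b, v):
    """Worst-case throughput per VC when v VCs share a channel of bandwidth b."""
    return b / v


if __name__ == "__main__":
    for v in range(1, 5):
        grants = round_robin_share(list(range(v)), cycles=1200)
        print(v, dict(grants), th_min(1.0, v))
```

With three active VCs each one receives one third of the cycles, which corresponds to a throughput of b/3.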
In traditional virtual channel networks VCs are allocated to packets dynamically by the routers [8]. The number of VCs on a network channel that are currently used depends on the current traffic and cannot be determined in advance. Therefore, no throughput bounds can be given for a VC. In contrast, in our network VCs are statically allocated to communication channels. The allocation is done centrally by the CM and the number of occupied VCs on a network channel is known. Therefore, we can give a lower bound on the VC throughput.
The network provides GT services on a connection basis in the following way. A route is found over VCs between the source and the destination of a GT communication. The chain of VCs on which the route traverses the network is called a connection. The VCs in the chain are reserved and used only by the connection. The minimal throughput of the connection is determined by the VC with the minimal throughput bound. Since we can guarantee a lower throughput bound to a VC, we can guarantee a lower throughput bound to a connection. To guarantee a minimal throughput bound of $TH_R$ to a connection, according to (1) the following must hold for all the VCs in the connection:

$v \le b / TH_R$ (2)
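A minimal sketch of how a central manager could check this condition when reserving a connection is given here. It assumes that all GT connections request the same bound, as in the simulation setup of this paper; the function `admit`, the channel identifiers and the dictionary bookkeeping are hypothetical and are not taken from the CM implementation.

```python
from fractions import Fraction

VCS_PER_CHANNEL = 4  # from the text: each physical channel carries 4 VCs


def admit(connection, th_r, b, vcs_in_use):
    """Try to reserve one VC on every channel of a connection.

    connection -- list of channel identifiers along the route
    th_r       -- requested throughput bound (int or Fraction)
    b          -- physical channel bandwidth (int or Fraction)
    vcs_in_use -- dict: channel id -> number of VCs already reserved

    Assumption: every GT connection requests the same th_r, so a single
    per-channel limit v <= b / th_r protects all reserved connections.
    Returns True and updates vcs_in_use, or False and changes nothing.
    """
    v_limit = min(VCS_PER_CHANNEL, Fraction(b) / Fraction(th_r))
    if any(vcs_in_use.get(ch, 0) + 1 > v_limit for ch in connection):
        return False
    for ch in connection:
        vcs_in_use[ch] = vcs_in_use.get(ch, 0) + 1
    return True


if __name__ == "__main__":
    in_use = {}
    for route in (["a", "b"], ["a", "b"], ["b", "c"], ["c", "b"]):
        print(route, admit(route, Fraction(1, 3), 1, in_use), in_use)
```

Exact fractions are used so that a request of b/3 compares cleanly against three reserved VCs without floating-point rounding.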
Given a lower bound $TH_{min}$ on the connection throughput, the maximal latency $T_{max}$ of a message of length L bits traversing the connection (in a wormhole manner) is [16]

$T_{max} = N \cdot t_r + L / TH_{min}$ (3)

where N is the hop count, i.e. the number of network channels traversed by the message, and $t_r$ is the delay of the message per router, i.e. the time that the head of the message spends waiting in a router. In our design $t_r$ is at most v clock cycles (v has the same meaning as in equation (1)). Let $T_c$ be the network clock period and w the network channel width. Taking into account equation (1) and that $b = w / T_c$, equation (3) becomes

$T_{max} = v \cdot T_c \cdot (N + L / w)$ (4)
This is the maximal latency that a message of length L bits travelling a distance of N hops experiences on a GT channel with $TH_{min} = b/v$.
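The bound can be evaluated numerically as a quick check. With a router delay of at most v clock cycles and a throughput of b/v, where b = w/T_c, the latency is at most v (N + L/w) clock cycles. The function names below are illustrative; the example values are the ones used in the simulation section (w = 16 bits, T_c = 3 ns, v = 3, N = 10 hops, L = 2048 bits).

```python
def max_latency_cycles(n_hops, length_bits, width_bits, v):
    """Worst-case message latency in clock cycles: v * (N + L / w)."""
    return v * (n_hops + length_bits / width_bits)


def max_latency_seconds(n_hops, length_bits, width_bits, v, t_clk):
    """Same bound in seconds, for a clock period t_clk given in seconds."""
    return max_latency_cycles(n_hops, length_bits, width_bits, v) * t_clk


if __name__ == "__main__":
    cycles = max_latency_cycles(n_hops=10, length_bits=2048, width_bits=16, v=3)
    print(cycles)  # 414.0 clock cycles
    print(max_latency_seconds(10, 2048, 16, 3, t_clk=3e-9))
```

The result of 414 clock cycles agrees with the latency bound quoted for the simulated 6-by-6 network.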
Our network uses source routing: a network address specifies not only the destination PE, but also the route to be followed. The route is described as a sequence of physical channels and their VCs to be taken. The destination address is carried by the packet header. When a header is sent by the source PE, it is handled by the network and transported to the destination PE following the route described by the destination address. Along its way through the network the header allocates the VCs it traverses. The data from the packet body follow the allocated VCs and reach the destination PE without the need for additional control information. At the end, the packet tail releases the VCs. When an application is started, the CM reserves VCs for the connections required by the application and gives the connection descriptions to the PEs in the form of network addresses. When the application starts, its PEs send packet headers and open the GT connections. When the application is stopped, packet tails are sent and the connections are closed.
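The header and tail behaviour described above can be sketched as follows. The port names, the `Router` class and the encoding of a route as a list of (port, VC) pairs are hypothetical; the paper does not specify the address format. The sketch also assumes that the CM has already ensured that no two connections are given the same VC.

```python
from dataclasses import dataclass, field

# Hypothetical port names and their (dx, dy) steps in the mesh.
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


@dataclass
class Router:
    # (output port, VC index) -> id of the connection holding that VC
    reserved: dict = field(default_factory=dict)


def walk(routers, src, route):
    """Yield (router, (port, vc)) for every hop of a source route."""
    x, y = src
    for port, vc in route:
        yield routers[(x, y)], (port, vc)
        dx, dy = STEP[port]
        x, y = x + dx, y + dy


def destination(src, route):
    """Coordinates of the PE reached by following the route from src."""
    x, y = src
    for port, _ in route:
        x, y = x + STEP[port][0], y + STEP[port][1]
    return (x, y)


def send_header(routers, src, route, conn):
    """Header: allocate every VC named in the route to connection conn."""
    for router, key in walk(routers, src, route):
        owner = router.reserved.setdefault(key, conn)
        if owner != conn:
            raise RuntimeError(f"VC {key} is already held by {owner}")


def send_tail(routers, src, route, conn):
    """Tail: release the VCs held by connection conn along the route."""
    for router, key in walk(routers, src, route):
        if router.reserved.get(key) == conn:
            del router.reserved[key]
```

Between header and tail the packet body needs no routing information of its own, because each router already knows which output VC belongs to the connection.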
The BE services are provided by using shared VCs. The VCs allocated by the CM to a BE communication channel can be shared between several BE communication channels. Sharing entails packet blocking, which makes throughput prediction difficult. Therefore no guarantees are given. The BE data are sent on a packet-by-packet basis. Since the VCs for BE connections are statically allocated and shared, a behaviour similar to that of a wormhole network should be expected. Wormhole networks do not perform well for intensive traffic consisting of long packets. But as we saw in the previous section, the BE traffic in our system is of low intensity and consists mainly of short messages (several bytes), therefore no performance problems are expected.
A network router has been implemented and synthesised for 0.13 μm technology. The router has an area of 0.18 mm² and can operate at a maximal frequency of 500 MHz [9]. Almost half of the router area is consumed by the buffers. More implementation details can be found in Kavaldjiev et al. [10] and results for the energy consumption are presented by Wolkotte et al. [11].
Simulation
To validate the network a cycle-accurate simulation is performed. A mesh network of size 6-by-6 is simulated using a traffic model with the characteristics presented in Section 2.
Setup
The communication traffic pattern used in the simulation is derived as follows. The nodes of a directed graph with ring topology are scattered over the nodes of a 6-by-6 mesh network. The graph contains 36 nodes and each graph node is placed on a separate network node. The edges between the placed graph nodes define the communication channels between the network nodes. The ring communication pattern is representative of the GT traffic of the streaming applications, because a large ring graph can be seen as a serial connection of many short pipeline graphs. We consider random scattering as a worst case strategy for mapping applications on the system. In a real system the application mapping will be done such that the communication locality is maximized, which leads to better network conditions. Each node in the network generates both types of traffic, GT and BE, transmitted on two separate communication channels following the ring communication pattern (a communication channel in the ring graph is substituted by one GT and one BE channel). Although the ring pattern is not the most realistic traffic pattern for BE traffic, we use it because the purpose of the BE traffic in this simulation is only to disturb the GT traffic and to create heavy traffic conditions. We shall see that whatever the BE traffic load is, the GT traffic cannot be disturbed. During the simulation the GT traffic is kept constant while the BE traffic is gradually increased to the point of network saturation, and statistics are collected for the message latencies. The aim is to show that the latency guarantees given by the network to the GT traffic are not violated under any traffic condition.
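The construction of the traffic pattern can be sketched as follows. The function names are illustrative, and the hop count assumes minimal (Manhattan) routes in the mesh, which is an assumption of this sketch; the random seed is arbitrary.

```python
import random


def scatter_ring(side, seed=None):
    """Place a directed ring of side*side nodes randomly on a side x side mesh.

    Returns a list of (source, destination) mesh coordinates, one per ring edge.
    """
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(side) for y in range(side)]
    rng.shuffle(nodes)  # random scattering: ring node i lands on nodes[i]
    n = len(nodes)
    return [(nodes[i], nodes[(i + 1) % n]) for i in range(n)]


def hops(src, dst):
    """Hop count of a minimal route between two mesh nodes (assumed routing)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


if __name__ == "__main__":
    channels = scatter_ring(6, seed=1)
    distances = [hops(s, d) for s, d in channels]
    print(len(channels), max(distances), sum(distances) / len(distances))
```

Every network node is the source of exactly one ring edge and the destination of exactly one, and no edge can be longer than the 10 hops between opposite corners of the 6-by-6 mesh.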
The GT traffic generated by a network node has the characteristics of the traffic generated by a HiperLAN/2 receiver [5], a typical high throughput baseband processing application. Every 4 μs a new message of size 256 bytes is generated, which equals an average throughput of 512 Mbit/s per node or a total aggregated throughput of 18.4 Gbit/s for the 36 nodes in the system. The BE traffic generated by a node consists of packets with a 10-byte payload. The packet generation period is gradually reduced in order to increase the BE traffic intensity.
We guarantee that the GT message latency is at most 1/3 of the message generation period of 4 μs, that is, below 1.3 μs. In this way, in a real system a PE will spend at most 1/3 of the time transmitting messages, at most 1/3 of the time receiving messages, and the rest of the time processing messages. With 16-bit network channels (w = 16) and a network clock period $T_c$ = 3 ns, a latency below 1.3 μs can be guaranteed by requesting GT connections of throughput $TH_R$ = b/3 (v = 3). The maximal message distance in a 6-by-6 mesh network is $N_{max}$ = 10 hops. The length of the GT messages is L = 8 × 256 = 2048 bits. According to equation (4) the maximal GT message latency is 414 clock cycles or 1.242 μs. To provide GT connections of throughput b/3, the routing is done such that at most 3 VCs are used per network channel (v = 3). Therefore, with this simulation setup we expect no GT message latency to exceed the given latency bound of 414 clock cycles.

Figure 2. Message delay of the GT and BE traffic vs. BE load for the 6-by-6 network. Curves: guarantee ($TH_{min}$), GT mean, GT max, BE mean.

When the offered BE load is low, the latency of the GT packets is smaller than the guaranteed latency. The reason is that the GT traffic utilizes the bandwidth unused by the BE traffic. The latency of the GT packets is higher than the latency of the BE packets because the GT packets are larger (256 bytes) than the BE packets (10 bytes). With the increase of the BE load the latency of the GT traffic increases too, and at some point it saturates. Further increase of the BE load increases the GT mean latency, but the GT maximum latency does not increase and never exceeds the guaranteed latency. The GT maximum latency never reaches the latency bound, because the guarantee given by equation (1) is for worst case conditions in which all v VCs constantly transmit data, while in our simulation setup the GT channels transmit data only 1/3 of the time.
Thus, even beyond the point of network saturation for the BE traffic, no GT packet experiences a latency higher than 414 cycles: the packet latency is bounded according to the given guarantees.
Simulation and results
The GT traffic offered to the network per PE is 512 Mbit/s or 0.09 of the channel capacity w/$T_c$ = 5.3 Gbit/s. In Section 2, the BE traffic in baseband processing applications is estimated to be 10% of the total traffic while the remaining 90% is GT traffic. Thus, the intensity of the BE traffic expected in a real system per PE is about 57 Mbit/s or 0.01 of the channel capacity, which means that the network will operate in the leftmost part of the graph. According to the simulation results, the network saturates when the BE load per PE reaches about 0.12 of the channel capacity or 640 Mbit/s, which is more than ten times the expected BE load.
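The load figures above follow from simple arithmetic, which the following sketch reproduces. The function names are illustrative; the inputs are the values stated in the text (w = 16 bits, T_c = 3 ns, 256-byte messages every 4 μs, and a 90%/10% split between GT and BE traffic).

```python
def channel_capacity_mbps(width_bits, clock_ns):
    """Channel capacity w / T_c in Mbit/s (bits per ns equals Gbit/s)."""
    return width_bits / clock_ns * 1000.0


def stream_rate_mbps(message_bytes, period_us):
    """Average rate of a periodic message stream in Mbit/s (bits per us)."""
    return message_bytes * 8 / period_us


def be_rate_for_share(gt_mbps, be_share):
    """BE rate that makes BE traffic the given share of the total traffic."""
    return gt_mbps * be_share / (1.0 - be_share)


if __name__ == "__main__":
    capacity = channel_capacity_mbps(16, 3)
    gt = stream_rate_mbps(256, 4)
    be = be_rate_for_share(gt, 0.10)
    print(f"capacity   {capacity:.0f} Mbit/s")
    print(f"GT per PE  {gt:.0f} Mbit/s, total {36 * gt / 1000:.1f} Gbit/s")
    print(f"BE per PE  {be:.1f} Mbit/s ({be / capacity:.3f} of capacity)")
    print(f"saturation {0.12 * capacity:.0f} Mbit/s")
```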
Related work
The RAW processor is a parallel architecture that exploits the instruction-level parallelism of applications [12]. For interconnecting its processing components it incorporates two types of networks, dynamic and static, that handle the different classes of traffic. The dynamic network is a dimension-ordered wormhole network, while the static network implements time-division multiplexing. In our solution both types of traffic are handled by a single virtual channel network.
ETHEREAL is a packet switching NoC solution based on time-division multiplexing (TDM) [13] . It provides both guaranteed and best-effort services and is targeted at general multimedia SoCs.
aSOC is a framework for on-chip communications in heterogeneous tiled architectures [14]. The proposed network implements a kind of advanced time-division multiplexing that can efficiently handle more complex communication patterns than traditional TDM does. Instead of a simple timetable, each router has a sequencer that allows more irregular switching behaviour. TDM (as used in aSOC and ETHEREAL) requires recomputation of a schedule for all the communications in the network even when only one communication link changes, which makes the system rather static. We avoid this by using virtual channels instead of TDM. We can add links incrementally without affecting the performance of already allocated links.
Wolkotte et al. [9] propose a circuit switching network which benefits from small area and low energy consumption, while providing more flexibility than traditional circuit switching solutions. Each network channel is divided into four lanes. Switching can be done at different granularity, from a single lane to a whole channel. A network interface hides the real channel size from the application. A disadvantage of the circuit switching solution is that it does not support channel sharing and BE traffic. An additional network is required for configuring the switches and for carrying the BE traffic.
The SPIN network is a wormhole network that uses adaptive routing and is based on a fat tree topology [15]. It is targeted at general multiprocessor systems. While good in performance, the topology used for this network does not map naturally onto a planar layout. Thus it is not clear whether it can help in structuring the global on-chip wiring.
Conclusion
This paper presented a network-on-chip for a runtime reconfigurable multi-processor system-on-chip. The system is used in mobile multimedia devices where most of the intensive applications are stream-processing applications. A model of the data traffic in the system is constructed and the network is simulated with this traffic model. Considering the communication locality, the simulated network conditions are worst case conditions. Nevertheless, the network manages to transport the system traffic and is able to provide the requested guaranteed services. The maximum message latency in the network never exceeds the guaranteed latency bound.
The proposed network is suitable for carrying the traffic in a real-time stream-processing system. The network provides guaranteed as well as best effort services. Data streams requiring guaranteed services are efficiently handled by network connections which reserve network resources, while at the same time best effort traffic is handled by allowing network resource sharing. Thus the same network handles both types of traffic. The network requires minimal configuration, which is done partially and only where and when it is required. Therefore the network is suitable for dynamic systems where fast reconfiguration is required.
