An Optimized Network-on-Chip Design for Data Parallel FFT, by Xu, Thomas Canhao et al.
Procedia Engineering 30 (2012) 311 – 318
1877-7058 © 2011 Published by Elsevier Ltd.
doi:10.1016/j.proeng.2012.01.866









International Conference on Communication Technology and System Design 2011 
An Optimized Network-on-Chip Design for Data Parallel FFT
Thomas Canhao Xu (a,b,*), Pasi Liljeberg (b), and Hannu Tenhunen (a)
(a) Department of Information Technology, University of Turku, 20014, Turku, Finland
(b) Turku Center for Computer Science, Joukahaisenkatu 3-5 B, 20520, Turku, Finland
Abstract 
In this paper, we propose an optimized Network-on-Chip (NoC) design for data parallel FFT applications. NoC-based
architectures have been proposed for future multicore processors due to their scalability. FFT is widely used in digital systems.
The implementation of FFT on conventional architectures has been studied. However, the evaluation of data parallel
FFT on a NoC platform has not been well addressed. We analyse data parallel FFT in terms of traffic patterns and
propose an optimized NoC design. Experiments show that the execution time of our optimized design is 12.13%
shorter than that of the original.
 
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of  ICCTSD 2011 
 
Keywords: FFT; Network-on-Chip; Data Parallel; Multicore; Optimization. 
1. Introduction 
Network-on-Chip (NoC) has been proposed by both academia and industry as a promising approach
for future multi-core systems with hundreds or even thousands of cores on a chip [1]. In the NoC-based design
approach, the communication infrastructure is created beforehand, and computational resources are then
mapped to it through resource-dependent interfaces. Processing Elements (PE) in a NoC are connected by
routers (R) and network links, and data are transferred via network interfaces (NI) in the form of network
packets. Figure 1 shows a mesh-based NoC with 16 nodes/tiles (N). Each PE contains an NI and a core
with private L1 cache and shared L2 cache. The router includes a Routing Computation Unit (RCU), a
Virtual Channel Allocator (VCA), a Switch Allocator (SA), a Crossbar Switch (CS), several Virtual
Channels (VC) and input buffers. This modular approach also provides more efficient communication and
higher bandwidth [1]. Intel (Intel is a trademark or registered trademark of Intel or its subsidiaries. Other
names and brands may be claimed as the property of others.) has demonstrated an experimental x86
microprocessor containing 48 cores on a chip. The chip implements a 4×6 2D mesh network with 2 cores
per tile [2]. Tile-Gx, the latest generation of NoC from Tilera, brings 16 to 100 processor cores
interconnected with a mesh on-chip network [3].

* Thomas Canhao Xu. Tel.: +358-2-333-8646; fax: +358-2-333-6950.
E-mail address: {canxu, pasi.liljeberg, hannu.tenhunen}@utu.fi.
Open access under CC BY-NC-ND license.
 
                                              Fig. 1. A 4×4 NoC using mesh topology.









The Fast Fourier Transform (FFT) is a fast algorithm for computing the discrete Fourier transform. FFT is
widely used in digital signal processing, in solving partial differential equations and in the multiplication of large
integers. Broadband wireless communication is a prominent application field that relies heavily on FFT. In
modern wireless communication, Orthogonal Frequency-Division Multiplexing (OFDM) has been developed
and is widely used, because OFDM is capable of coping with bad channel conditions (e.g. IEEE
802.11a/g/n WLAN, IEEE 802.16 WiMAX, 3GPP Long Term Evolution and DVB-H for mobile TV) [4]
[5] [6] [7] [8]. In OFDM, FFT is implemented on the receiver side and inverse FFT on the sender side to
achieve efficient modulation and demodulation. Previous generations of wireless standards, e.g. IEEE
802.11a, use an FFT of 64 points. More recent standards, e.g. 802.16, scale the FFT to the channel bandwidth.
The allowed numbers of FFT subcarriers are up to 2,048 in 802.16 and up to 8,192 in DVB-H.
There are many FFT algorithm implementations; the most common is the Cooley-Tukey
algorithm [9]. Other algorithms have been proposed to reduce the complexity of FFT, including reducing the
number of required multiplications and additions. A well-known algorithm is the split-radix FFT, which achieves the
lowest arithmetic operation count [10]. Implementing FFTs on multi-processor systems has been studied
in [11] and [12]. However, the implementation and optimization of data parallel FFT on a NoC platform
have not been well addressed. In this paper, we analyse a data parallel FFT algorithm using on-chip traffic
trace data, and we propose and discuss a novel optimized NoC architecture which aims to reduce the latency of
long-distance communication and improve the efficiency of data parallel FFT. To confirm our study, we
model a NoC with a 4×4 mesh and present the performance of the data parallel FFT with different NoC designs
using a full system simulator.
2. An Optimized Network-on-Chip Design 
We select a one-dimensional, radix-n, six-step FFT algorithm from [13]. There are two input data sets:
one with n² complex data points that is to be transformed, and another with n² complex data points that is
referred to as the roots of unity. The two data sets are organized and partitioned as n×n matrices; a partition
consisting of a contiguous set of rows is assigned to each processor and distributed to its local cache. The six steps are:
(1) transpose the input data set matrix; (2) perform one-dimensional FFTs on the resulting matrix; (3)
multiply the resulting matrix by the roots of unity; (4) transpose the resulting matrix; (5) perform
one-dimensional FFTs on the resulting matrix; (6) transpose the resulting matrix. Communication among
processors can become a bottleneck in the three matrix transpose steps. During a matrix transpose step, a
processor transposes a contiguous sub-matrix locally, and a sub-matrix from every other processor. The
transpose step therefore requires communication among all processors. It is shown in [14] that fast data transfer
between processors is the dominant performance factor for this application (steps (1), (4) and (6)). Traffic
hotspots and contention can occur in an unoptimized system, and thus the overall performance is
degraded.
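The six steps listed above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the parallel implementation evaluated in this paper: the helper names (`dft`, `transpose`, `six_step_fft`) are hypothetical, a naive O(n²) DFT stands in for the row FFTs, and the roots of unity are computed on the fly rather than read from a second n×n matrix.

```python
import cmath


def dft(row):
    """Naive DFT of one row; a stand-in for any one-dimensional FFT routine."""
    n = len(row)
    return [
        sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Return the transpose of a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def six_step_fft(x, n):
    """Transform n*n points, viewed as an n-by-n row-major matrix."""
    total = n * n
    if len(x) != total:
        raise ValueError("expected n*n data points")
    a = [list(x[r * n:(r + 1) * n]) for r in range(n)]
    b = transpose(a)                      # step (1): transpose the input
    c = [dft(row) for row in b]           # step (2): 1-D FFTs on each row
    d = [                                 # step (3): multiply by roots of unity
        [c[r][k] * cmath.exp(-2j * cmath.pi * r * k / total) for k in range(n)]
        for r in range(n)
    ]
    e = transpose(d)                      # step (4): transpose
    f = [dft(row) for row in e]           # step (5): 1-D FFTs on each row
    g = transpose(f)                      # step (6): transpose
    return [value for row in g for value in row]
```

In a data parallel setting each processor would own a contiguous block of rows, so steps (2), (3) and (5) are local, while each call to `transpose` corresponds to an all-to-all exchange between processors.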
 
2.1. FFT Traffic Pattern 
First, we need a detailed overview of the traffic. Figure 2 shows the network request rate of each PE
when running FFT on a 16-core NoC in the GEMS/Simics simulation environment. The detailed system
configuration can be found in Section 3. In Figure 2, the horizontal axis is time, divided into 216K-cycle
fragments that each correspond to 1% of the execution. The traffic trace has 1.64M packets, with 21.6M cycles executed. The traffic is
shown for all 16 nodes. The trace reveals that 63.9% of the data traffic is concentrated on five nodes (N0
29.6%, N8 6.7%, N11 10.0%, N13 8.7% and N15 8.8%). There is a traffic peak for all nodes during the
last stage of execution (around 80% of the time). Three nodes (N0, N8 and N15) have hotspot traffic at
the beginning. The top point-to-point traffic flows are listed in Table 1. A small portion of source-destination
pairs generates a sizable portion of the traffic, e.g. 3.13% of the pairs (8/256) generate 32.07% of the traffic.
Notice that traffic between N0 and N11, in both directions, contributes 10.97% of the total volume.
 
Fig. 2. Network request rate for 16-core NoC running FFT. Time is segmented into 216K-cycle fragments (1% of the execution each).
 
Table 1. Top point-to-point traffic for 16-core NoC running FFT.
Source Node    Destination Node    Traffic Percentage
0              11                  7.43%
0              4                   4.11%
0              3                   3.94%
15             11                  3.66%
13             6                   3.63%
11             0                   3.54%
0              12                  3.49%
8              11                  2.27%
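
The aggregate figures quoted for Table 1 can be reproduced from its rows. The sketch below only re-derives numbers from the table; the variable and function names are illustrative and not part of the original tool flow.

```python
# Rows of Table 1: (source node, destination node, traffic percentage).
top_flows = [
    (0, 11, 7.43),
    (0, 4, 4.11),
    (0, 3, 3.94),
    (15, 11, 3.66),
    (13, 6, 3.63),
    (11, 0, 3.54),
    (0, 12, 3.49),
    (8, 11, 2.27),
]

NUM_NODES = 16

# Share of all source-destination pairs covered by the table (8 of 256).
pair_share = 100 * len(top_flows) / NUM_NODES ** 2

# Share of total traffic generated by those pairs.
traffic_share = sum(percent for _, _, percent in top_flows)


def bidirectional_share(a, b):
    """Traffic between two nodes, summed over both directions."""
    return sum(p for src, dst, p in top_flows if {src, dst} == {a, b})
```

Here `pair_share` evaluates to 3.125 (quoted as 3.13% in the text), `traffic_share` rounds to 32.07, and `bidirectional_share(0, 11)` rounds to 10.97.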
 
2.2. Network Latency 
Assuming XY deterministic routing, Equation 1 gives the access time (latency) required for a core-to-core
communication. The latency depends on the delay of the in-tile links (between NI and router, $L_{Link\_delay1}$), of the router
($L_{Router\_delay}$) and of the tile-tile links ($L_{Link\_delay2}$), and on the number of hops required to reach the destination ($n_{hop}$).
$$L = 2 \times L_{Link\_delay1} + (n_{hop} + 1) \times L_{Router\_delay} + n_{hop} \times L_{Link\_delay2} \qquad (1)$$
To evaluate the number of cycles required for each of these terms, we model the
NoC according to Sun SPARC [15]. Each SPARC core with private L1 cache has an area of 14.45 mm²
in 65 nm fabrication technology. We simulate the characteristics of a 16 MB, 16-bank, 64-bit line size,
16-way associative, 65 nm L2 cache with HP CACTI [16]. Results show that each cache bank, including
data and tag, occupies 12.09 mm². We estimate the area of a 5-port router to be 0.23 mm² in
65 nm. Hence the area of a tile of the NoC is around 14.45+12.09+0.23=26.77 mm². For a NoC
with 16 tiles, the total area is about 428.32 mm², comparable with modern chip multiprocessors such as
Sun SPARC and IBM Power 7. In this research, we assume that each tile and each router is
square, so the edge length is 5.17 mm for a tile and 0.48 mm for a router.
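
The area arithmetic above is easy to check. The following sketch recomputes the tile area, the chip area and the two edge lengths from the three component areas stated in the text; the constant names are illustrative.

```python
import math

# Component areas in mm^2 at 65 nm, as stated in the text.
CORE_L1_AREA_MM2 = 14.45   # SPARC core with private L1 cache
L2_BANK_AREA_MM2 = 12.09   # one L2 cache bank (data and tag)
ROUTER_AREA_MM2 = 0.23     # 5-port router
NUM_TILES = 16

tile_area = CORE_L1_AREA_MM2 + L2_BANK_AREA_MM2 + ROUTER_AREA_MM2
chip_area = NUM_TILES * tile_area

# Tiles and routers are assumed square, so the edge is the square root of the area.
tile_edge_mm = math.sqrt(tile_area)
router_edge_mm = math.sqrt(ROUTER_AREA_MM2)

print(f"tile: {tile_area:.2f} mm^2, chip: {chip_area:.2f} mm^2")
print(f"tile edge: {tile_edge_mm:.2f} mm, router edge: {router_edge_mm:.2f} mm")
```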
We calculate the delay of the links between routers, NIs, cores and caches using Cadence Spectre,
since the latency is determined by the physical length of the link. For long inter-router links in
voltage-mode transmission, the wire delay is significant. Repeaters are inserted to reduce the wire delay
in long links over 0.5 mm. The delays are calculated at 2.5 GHz, with a cycle time of 400 ps. Notice that the
in-tile links between NI and router are very short, i.e. less than 0.5 mm, so the transmission can be
completed within one cycle. For a router in the NoC, several parts (e.g. RCU, VCA, SA and CS)
affect latency, depending on the number of pipeline stages. In this paper, we use a standard router
with four pipeline stages, i.e. a router delay of 4 cycles. The tile-tile links are much longer than the in-tile links. For example, the length of a
tile-tile link connecting two routers is 4.69 mm. Taking the synchronization of these pipelined
long links into account, a data transfer requires 6 cycles. A packet sent from N0 to N11 goes
through N1, N2, N3 and N7, resulting in 5 hops. Hence the latency is calculated as 2×1+(5+1)×4+5×6=56
cycles.
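Equation 1 with the cycle counts derived above (1 cycle per in-tile link, 4 cycles per router, 6 cycles per tile-tile link) can be written as a small function. This sketch assumes row-major node numbering in the 4×4 mesh (N0 to N3 in the first row), which is consistent with the N0 → N1 → N2 → N3 → N7 → N11 path in the text; the function names are illustrative.

```python
MESH_WIDTH = 4
IN_TILE_LINK_CYCLES = 1   # NI <-> router link
ROUTER_CYCLES = 4         # four-stage router pipeline
TILE_LINK_CYCLES = 6      # 4.69 mm router-to-router link


def xy_hops(src, dst, width=MESH_WIDTH):
    """Number of hops between two nodes under XY routing (Manhattan distance)."""
    src_x, src_y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    return abs(src_x - dst_x) + abs(src_y - dst_y)


def mesh_latency(src, dst):
    """Core-to-core latency in cycles according to Equation 1."""
    n_hop = xy_hops(src, dst)
    return (2 * IN_TILE_LINK_CYCLES
            + (n_hop + 1) * ROUTER_CYCLES
            + n_hop * TILE_LINK_CYCLES)


print(mesh_latency(0, 11))  # 5 hops: 2*1 + 6*4 + 5*6 = 56
```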
Since some source-destination pairs generate a significant amount of traffic, we propose direct
long links as an optimization method, which eliminate the delays of intermediate routers. For instance, a
long link can be placed directly between N0 and N11. In this case, the $L_{Router\_delay}$ latencies of N1, N2,
N3 and N7 are eliminated. However, the number of links that can be routed in a NoC is limited by the
router size and the area of the links. The limitation of router area is more significant than that of the long links
themselves. A typical router in a 2D mesh NoC has five ports to connect to five directions, namely North, East,
West, South and Local PE. This requires a 5×5 crossbar. Research has shown [17] that the crossbar
occupies over 50% of the router area. A 7×7 crossbar doubles the area compared with a 5×5 crossbar. Therefore,
adding too many long links is undesirable. We note that the router of N0 has only 3 of its 5 ports utilized
(North, East and Local PE), which leaves 2 free ports. Other routers have free ports as well, e.g. N3 and
N11 have 2 and 1 free ports, respectively. In our optimized design, we connect N0-N11 and N0-N3 with
long links. Other pairs are not practical candidates for long links: although communication
between N0 and N4 is more frequent than between N0 and N3, N0 and N4 are already directly connected. Connecting N6 with
N13 would require expanding the crossbar of the router in N6, which is not favorable. Equation 2 gives
the latency of a core-to-core communication over a long link.
$$L_L = 2 \times (L_{Link\_delay1} + L_{Router\_delay}) + L_{Longlink\_delay} \qquad (2)$$
The latency of the long links ($L_{Longlink\_delay}$) between N0-N3 and N0-N11 is much higher than
$L_{Link\_delay2}$. We calculate that the length of the link between N0 and N3 is 15.03 mm. Based on the
aforementioned wire delay model, a data transfer requires 18 cycles at 2.5 GHz. Compared with the
original communication delay (2×1+(3+1)×4+3×6=36 cycles), the delay over the N0-N3 long link is reduced to
28 cycles (2×(1+4)+18). The savings come mainly from the two bypassed routers. The length of the N0-N11 long link is 24.89 mm,
resulting in 32 cycles for a data transfer and a total latency of 42 cycles (2×(1+4)+32). The relative reduction of the communication latency is larger for the long link between
N0 and N11 than for the one between N0 and N3 (42/56=0.75 and 28/36=0.78 of the original latency, respectively). Taking into account
the 10.97% of communication volume between N0 and N11, system performance can improve with our
optimization. It is noteworthy that placing long links all over the NoC would incur higher design complexity
in both hardware and software.
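
The long-link reasoning can be checked with a short sketch: one function evaluates Equation 2, and another counts the unused ports of a 5-port router at a given mesh position, which is the constraint used above to decide where long links fit without enlarging the crossbar. The long-link cycle counts (18 and 32) are taken from the text; the function names and the row-major node numbering are assumptions of this illustration.

```python
MESH_WIDTH = 4
IN_TILE_LINK_CYCLES = 1
ROUTER_CYCLES = 4


def long_link_latency(long_link_cycles):
    """Core-to-core latency in cycles over a direct long link (Equation 2)."""
    return 2 * (IN_TILE_LINK_CYCLES + ROUTER_CYCLES) + long_link_cycles


def free_ports(node, width=MESH_WIDTH):
    """Unused ports of a 5-port mesh router: 4 directions minus existing neighbours."""
    x, y = node % width, node // width
    neighbours = sum([x > 0, x < width - 1, y > 0, y < width - 1])
    return 4 - neighbours


# (long-link transfer cycles, original mesh latency) for the two chosen pairs.
chosen = {"N0-N3": (18, 36), "N0-N11": (32, 56)}
for name, (link_cycles, mesh_cycles) in chosen.items():
    latency = long_link_latency(link_cycles)
    print(name, latency, round(latency / mesh_cycles, 2))
```

Under these assumptions `free_ports` returns 2 for N0 and N3, 1 for N11 and 0 for N6, which matches the observation that an N6-N13 long link would need a larger crossbar in N6.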
2.3. Routing Algorithm 
Adaptive routing is widely used in off-chip networks; however, deterministic routing is favorable for
on-chip networks because it is easier to implement. We implement a modified XY deterministic
routing algorithm to avoid deadlocks. When a PE P1 generates a request to another PE P2, the router of P1
checks whether a long link between the two PEs exists. If it does, the routing path is computed as P1 → P2 with 1 hop
over the long link, and the same path is used for communication P2 → P1. If no long link between the
two PEs exists, the request follows the XY routing algorithm, i.e. the request first travels
along the X direction and is then routed in the Y direction. As mentioned above, we do not increase the
number of ports in a router; therefore the router components are not modified.
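
The routing decision described above can be sketched as follows. This is an illustrative model only: it assumes row-major node numbering, checks for a long link at the source router only (as the text describes), and uses hypothetical names (`LONG_LINKS`, `route`).

```python
MESH_WIDTH = 4

# Long links of the optimized design, stored as unordered node pairs.
LONG_LINKS = {frozenset((0, 11)), frozenset((0, 3))}


def route(src, dst, width=MESH_WIDTH):
    """Return the list of nodes visited from src to dst, both included."""
    if src == dst:
        return [src]
    # A direct long link between source and destination gives a 1-hop path.
    if frozenset((src, dst)) in LONG_LINKS:
        return [src, dst]
    # Otherwise fall back to XY routing: X direction first, then Y direction.
    path = [src]
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path


print(route(0, 11))   # long link: [0, 11]
print(route(13, 6))   # XY routing: [13, 14, 10, 6]
```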
3. Experimental Evaluation 
3.1. Experiment Setup 
The simulation platform is based on a cycle-accurate NoC simulator which is able to produce detailed
evaluation results. The platform models the routers and links accurately. We use a 16-core network which
models a single-chip NoC for our experiments. A full system simulation environment with 16 nodes, each
with a core and related cache, has been implemented. The simulations are run on the Solaris 9 operating
system, based on the UltraSPARC III+ instruction set with an in-order issue structure. Each processor core
runs at 2 GHz, is attached to a wormhole router and has a private write-back L1 cache (split I+D, each
32 KB, 4-way, 64-bit line, 3-cycle). The 16 MB L2 cache shared by all processors is split into banks (16
banks, each 1 MB, 64-bit line, 6-cycle). We set up a system with 4 GB of main memory, and the latency
from the main memory to the L2 cache is 260 cycles. The simulated memory/cache architecture mimics
SNUCA [18]. A two-level distributed directory cache coherence protocol, MOESI, based on MESI
[19], has been implemented in our memory hierarchy, in which each L2 bank has its own directory. The
protocol has five cache line states: Modified (M), Owned (O), Exclusive (E), Shared (S) and
Invalid (I). We use the Simics [20] full system simulator as our simulation platform.
3.2. Result Analysis 
We evaluate performance in terms of Average Network Latency (ANL), Average Link Utilization
(ALU), Execution Time (ET) and Cache Hit Latencies (CHL). ANL represents the average number of
cycles required for the transmission of all network messages. The number of cycles of each message is
measured from the injection of the message header into the network at the source node to the
reception of the tail flit at the destination node. ALU is defined as the number of flits transferred between
NoC resources per cycle. Under the same configuration and workload, lower values of these metrics are
favorable.
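As a rough illustration of the two network metrics, the sketch below computes ANL and ALU from a toy message log. The log format, the sample numbers and the function names are hypothetical and only mirror the definitions given above; they are not output of the simulator.

```python
from statistics import mean


def average_network_latency(messages):
    """ANL: mean cycles from header injection to tail-flit reception.

    `messages` is an iterable of (header_injected_cycle, tail_received_cycle).
    """
    return mean(tail - header for header, tail in messages)


def average_link_utilization(flits_transferred, cycles):
    """ALU: flits transferred between NoC resources per cycle."""
    return flits_transferred / cycles


# Hypothetical log: three messages with latencies of 56, 36 and 28 cycles.
sample_messages = [(0, 56), (10, 46), (20, 48)]
print(average_network_latency(sample_messages))
print(average_link_utilization(flits_transferred=1200, cycles=4000))
```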
Fig. 3. Normalized performance metrics with original and optimized NoC for FFT. 
 
The results are depicted in Figure 3. Our optimized NoC architecture outperforms the original design
in all metrics. For example, the ANL of the optimized NoC is 9.89% lower than that of the original. This is
primarily due to the lower latencies between hotspot nodes, e.g. N0-N11 and N0-N3, compared with the
original design. As mentioned above, the transpose steps in FFT require communication among all processors
(especially in the last stage, see Figure 2). We note that the communication is not evenly distributed over all
processors. In this case, reducing the delay between the hotspot nodes by adding long links is a feasible method.
The ALU of FFT for our optimized design is 2.15% lower than the original as well. The
improvement is not as significant as for ANL, because only two additional links are available to
alleviate the overall link load. The CHL in our design is 4.66% lower than the original, because of the
reduced latencies. Overall, in terms of ET, our design uses 12.13% less time than the original. This
reflects the savings in ANL, ALU and CHL.
4. Conclusion 
In this paper, we proposed an optimized Network-on-Chip architecture for data parallel FFT. A
one-dimensional, radix-n, six-step FFT algorithm was selected. We analysed the low-level traffic pattern of FFT
and found several hotspots. To evaluate the network latencies, we modeled an on-chip network based on
a modern multicore processor. An optimization method, namely long links between hotspot nodes, was
introduced. Results show that the reduced latencies have a strong impact on system performance. The
execution time of our optimized design was 12.13% shorter than that of the original design. The results of this
paper give a guideline for designing Networks-on-Chip optimized for data parallel FFT.
References 
[1] L. Benini and G. D. Micheli, "Networks on chips: A new SoC paradigm", IEEE Computer, January 2002, 35(1):70–78.
[2] Intel, Single-chip cloud computer, July 2011. http://techresearch.intel.com/articles/Tera-Scale/1826.htm.
[3] Tilera Corporation, http://www.tilera.com, July 2011.
[4] A. R. Bahai and B. R. Saltzberg, "Multi-Carrier Digital Communications: Theory and Applications of OFDM", Plenum
Publishing Co., 1999.
[5] A. Yarali and B. Ahsant, "802.11n: the new wave in WLAN technology", in Proceedings of the 4th International Conference on
Mobile Technology, Applications, and Systems and the 1st International Symposium on Computer Human Interaction in
Mobile Technology, Mobility '07, New York, NY, USA, 2007, ACM, pages 310–316.
[6] C. Eklund, R. Marks, K. Stanwood, and S. Wang, "IEEE standard 802.16: a technical overview of the WirelessMAN air
interface for broadband wireless access", IEEE Communications Magazine, June 2002, 40(6):98–107.
[7] D. Astely, E. Dahlman, A. Furuskar, Y. Jading, M. Lindstrom and S. Parkvall, "LTE: the evolution of mobile broadband",
IEEE Communications Magazine, 2009, 47(4):44–51.
[8] G. Faria, J. Henriksson, E. Stare, and P. Talmola, "DVB-H: Digital broadcast services to handheld devices",
Proceedings of the IEEE, 2006, 94(1):194–209.
[9] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series", Math. Comp., 1965,
19:297–301.
[10] P. Duhamel and H. Hollmann, "'Split radix' FFT algorithm", Electronics Letters, 1984, 20(1):14–16.
[11] J. H. Bahn, J. Yang and N. Bagherzadeh, "Parallel FFT algorithms on network-on-chips", in Information Technology: New
Generations, 2008, ITNG 2008, Fifth International Conference, 2008, pages 1087–1093.
[12] R. Airoldi, F. Garzia, and J. Nurmi, "FFT algorithms evaluation on a homogeneous multi-processor system-on-chip", in
Parallel Processing Workshops (ICPPW), 39th International Conference, 2010, pages 58–64.
[13] D. H. Bailey, "FFTs in external or hierarchical memory", The Journal of Supercomputing, 1990, 4:23–35.
doi:10.1007/BF00162341.
[14] S. C. Woo, J. P. Singh, and J. L. Hennessy, "The performance advantages of integrating block data transfer in cache-coherent
multiprocessors", in ASPLOS-VI, New York, NY, USA, 1994, ACM, pages 219–229.
[15] M. Tremblay and S. Chaudhry, "A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor", in
ISSCC 2008, February 2008, pages 82–83.
[16] T. Shyamkumar, M. Naveen, A. J. Ho, and J. N. P., "CACTI 5.1", HP Labs, Tech. Rep. HPL-2008-20.
[17] D. Park, S. Eachempati, R. Das, A. K. Mishra, Y. Xi, N. Vijaykrishnan, and C. R. Das, "MIRA: A Multi-layered
On-Chip Interconnect Router Architecture", in ISCA '08, 35th International Symposium, 21–25 June 2008, pages 251–261.
[18] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches", in
ACM SIGPLAN, October 2002, pages 211–222.
[19] A. Patel and K. Ghose, "Energy-efficient MESI cache coherence with pro-active snoop filtering for multicore microprocessors", in
Proceedings of the Thirteenth International Symposium on Low Power Electronics and Design, August 2008, pages 247–252.
[20] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner,
"Simics: A full system simulation platform", Computer, February 2002, 35(2):50–58.
