Parallelizing Network Coding on Manycore GPU-Accelerated System with Optimization  by Gan, Xinbiao et al.
Procedia Engineering 15 (2011) 3063 – 3067
1877-7058 © 2011 Published by Elsevier Ltd.
doi:10.1016/j.proeng.2011.08.574
Available online at www.sciencedirect.com
Advanced in Control Engineering and Information Science 
Parallelizing Network Coding on Manycore GPU-Accelerated System 
with Optimization
Xinbiao Gan *  Li Shen   Qi Zhu   Zhiying Wang 
School of Computer, National University of Defense Technology, Changsha, 410073, China
Abstract 
Network coding has emerged as a promising technique for improving network throughput and
available bandwidth. However, because of its high computational complexity, the practicality of network
coding remains a challenge. At the same time, GPU-accelerated applications are confined to the traditional
approach, in which the GPU is used as a coprocessor that consumes datasets transferred from the CPU.
Therefore, an aggressive parallel network coding framework with optimizations is customized for the GPU,
in which an appropriate granularity of parallelism for network coding is presented and the GPU acts not
only as a data consumer but also as a data producer. Moreover, random linear network coding is parallelized
and optimized on a CUDA-enabled GPU to validate the proposed techniques. Experimental results demonstrate
that the proposed techniques are effective for parallelizing network coding on a manycore GPU-accelerated
system.
Selection and/or peer-review under responsibility of [CEIS 2011] 
keywords: GPU; network coding; parallelizing; CUDA; optimization 
1. Introduction
Network coding, which integrates information coding and network routing, is a technique for exchanging
information in which packets are coded before being forwarded.
Owing to its advantages in improving network throughput, balancing load, reducing transmission
delay and node energy consumption, and enhancing network robustness [1], network coding has been
widely used in distributed file storage [2] and wireless networks [3]. However, the non-deterministic
network environment [4] and the high computational complexity of network coding [5] degrade the
performance of network coding systems, so its practicality is still a challenge. Therefore, it is worthwhile
to optimize network coding [6]. Such optimizations aim to reduce the computation and communication
overhead of network coding, and include improvements to network coding algorithms [7] and acceleration
of network coding based on hardware or architecture [5] [8-9].
Network coding accelerated by GPU architecture has achieved remarkable results, and work is still ongoing.
However, previous work was dedicated to maximizing the consumption of GPU computing resources for the
efficiency of the parallel network coding system, and the results were still less than desirable. At the
same time, there is little work on optimizing the memory hierarchy for network coding systems. Therefore,
an effective parallel network coding framework including memory optimization is proposed to improve the
utilization, rather than merely the consumption, of GPU computing resources.
* Corresponding author. Tel.: +8613407318723. E-mail address: xinbiaogan@163.com
Open access under CC BY-NC-ND license.
2. Architecture and CUDA Programming Model
Figure 1 shows an overview of the architecture of the collaborative system composed of a GPU and a CPU, in
which data are transferred between the CPU and the GPU over the PCIe channel on demand. The GPU architecture
consists of a scalable number of streaming multiprocessors (SMs), each containing eight streaming
processor (SP) cores, together with a read-only constant cache and a read-only texture cache; in the NVIDIA
GTX 280, every three SMs constitute a thread processing cluster (TPC). Additionally, each SM has 16 KB of
read-write shared memory, which is common to all eight SPs inside it.
On a CUDA-enabled GPU, instructions are structured in SPMD (Single Program, Multiple Data) fashion, and
the CUDA (Compute Unified Device Architecture) execution model provides three key abstractions: a
hierarchy of thread groups, shared memories, and barrier synchronization. Threads have a three-level
hierarchy. A grid is a set of thread blocks that execute a kernel function, and each block is composed of
hundreds of threads. Threads within one block can share data through shared memory and can be
synchronized at a barrier. The threads of a block are executed in groups called warps, each composed of
32 parallel threads, and instructions are scheduled and managed per warp in the SIMT (Single Instruction,
Multiple Threads) architecture [10]. Consequently, thread-level parallelism is readily exploited under the
CUDA execution model shown in Figure 1, but both data-level and thread-level parallelism can be exploited
as long as slight modifications are made to the traditional algorithm framework, especially for
compute-intensive applications.
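The thread hierarchy described above can be illustrated with a small index calculation. The following is a minimal Python sketch, not CUDA code: it assumes a one-dimensional grid of one-dimensional blocks, and the function names are hypothetical. Only the warp size of 32 is taken from the text.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def global_thread_id(block_idx, block_dim, thread_idx):
    """Linear index of a thread in a 1-D grid of 1-D blocks (assumed layout)."""
    return block_idx * block_dim + thread_idx


def warp_of(thread_idx):
    """Index, within its block, of the warp that a thread belongs to."""
    return thread_idx // WARP_SIZE


def warps_per_block(block_dim):
    """Number of warps needed to cover a block; the last may be partly filled."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE
```

For example, a block of 256 threads is covered by 8 warps, and thread 63 of a block falls into warp 1 of that block.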
CUDA threads may access data from multiple memory spaces during their execution, such as registers,
shared memory, the constant cache, the texture cache, and global (video) memory, but data must first be
loaded from the CPU over the PCIe channel and then stored into the GPU memory hierarchy. Although the
loading time can be hidden and amortized among thousands of CUDA threads, transferring data from the
CPU to the GPU can still be a performance bottleneck, especially for data-intensive applications. Therefore,
memory optimizations are proposed in the hierarchical parallel framework.
Figure 1. Collaborative system architecture
Figure 2. Parallelizing network coding
3. Parallelizing Network Coding on CUDA-enabled GPU 
3.1. Random Linear Network Coding 
Linear network coding maps packets linearly into a finite field; encoding and decoding are both
linear transformations [7]. Traditionally, the data to be distributed is divided into n blocks
$(b_1, b_2, \ldots, b_n)$, where each block consists of m words to be coded. Random coding coefficients
$(c_{j1}, c_{j2}, \ldots, c_{jn})$ are then selected in vector form to generate a coded fragment $\chi_j$
to be transmitted, as in Equation (1).
$$\chi_j = \sum_{i=1}^{n} c_{ji} b_i \qquad (1)$$
Since each coded fragment $\chi_j$ is a linear combination of the raw data fragments, the raw fragments can be
decoded as long as the node receives $[\chi_1, \chi_2, \ldots, \chi_n]^T$.
Define the coefficient matrix C whose rows are the vectors $(c_{j1}, c_{j2}, \ldots, c_{jn})$, so that each
row holds the coding coefficients of one coded fragment; the original fragments can then be decoded as in
Equation (2).
$$b_i = C^{-1} \chi_j \qquad (2)$$
Here $b_i$ and $\chi_j$ stand for the stacked vectors $[b_1, b_2, \ldots, b_n]^T$ and
$[\chi_1, \chi_2, \ldots, \chi_n]^T$. Equation (2) requires the inverse of C, which exists when the rows
of C are linearly independent.
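Equations (1) and (2) can be illustrated with a short sequential sketch. The Python code below is a reference illustration under stated assumptions, not the parallel GPU implementation evaluated in this paper: it assumes one-byte words in GF(2^8) with the reduction polynomial 0x11B (the paper does not name one), and all function names are hypothetical.

```python
import random

POLY = 0x11B  # assumed reduction polynomial for GF(2^8); not specified in the paper


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with polynomial reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r


def gf_inv(a):
    """Multiplicative inverse as a^254, since a^255 = 1 for every nonzero a."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def encode(C, blocks):
    """Equation (1): chi_j = sum_i c_ji * b_i, applied word by word."""
    out = []
    for row in C:
        chi = [0] * len(blocks[0])
        for c, b in zip(row, blocks):
            for w, word in enumerate(b):
                chi[w] ^= gf_mul(c, word)  # addition in GF(2^8) is XOR
        out.append(chi)
    return out


def invert(C):
    """Gauss-Jordan inversion over GF(2^8); returns None if C is singular."""
    n = len(C)
    A = [list(row) + [int(i == j) for j in range(n)] for i, row in enumerate(C)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col]), None)
        if pivot is None:
            return None  # rows are linearly dependent
        A[col], A[pivot] = A[pivot], A[col]
        inv = gf_inv(A[col][col])
        A[col] = [gf_mul(inv, v) for v in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [v ^ gf_mul(f, p) for v, p in zip(A[r], A[col])]
    return [row[n:] for row in A]


def decode(C, coded):
    """Equation (2): recover the raw blocks by applying C^{-1} to the coded fragments."""
    C_inv = invert(C)
    if C_inv is None:
        raise ValueError("coefficient matrix is singular")
    return encode(C_inv, coded)


def random_coefficients(n, rng):
    """Draw an n x n coefficient matrix uniformly at random from GF(2^8)."""
    return [[rng.randrange(256) for _ in range(n)] for _ in range(n)]
```

A receiver would call `decode(C, coded)` once it holds n coded fragments whose coefficient rows are linearly independent; if `invert` returns `None`, more fragments are needed.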
3.2. Parallelizing Coding Based on Fragment 
As summarized in [5] and [8-9], linear encoding suffers from two major performance bottlenecks.
First, multiplication in a finite field is a costly operation. Second, the multiplication and addition
operations are performed in tight loops over rows of coefficients and coded fragments, of n and k
bytes each, respectively. Therefore, multiplication in the finite field is converted into parallel table
look-ups in the above literature, as shown in Equation (3).
$$xy = \exp(\log x + \log y) \qquad (3)$$
In Equation (3), log x and log y are pre-calculated and stored in memory, so a multiplication
in the finite field is converted into two look-ups and one addition.
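A table-based multiplication in the sense of Equation (3) might look as follows. This is a minimal Python sketch rather than the implementation of [5] or [8-9]; it assumes GF(2^8) with the reduction polynomial 0x11B and the primitive element 0x03, neither of which is specified in the text. Note that, besides the two logarithm look-ups and the addition, the sketch performs one further look-up in the exponential table.

```python
POLY = 0x11B      # assumed reduction polynomial for GF(2^8)
GENERATOR = 0x03  # a primitive element for this polynomial (assumption)


def build_tables():
    """Pre-compute exp and log tables for GF(2^8)."""
    exp = [0] * 510  # doubled so that log x + log y never needs a modulo
    log = [0] * 256  # log[0] is unused: zero has no logarithm
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        # multiply x by the generator 0x03: (x * 2) XOR x, reduced by POLY
        doubled = x << 1
        if doubled & 0x100:
            doubled ^= POLY
        x = doubled ^ x
    for i in range(255, 510):
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_mul_table(x, y):
    """Equation (3): xy = exp(log x + log y), with zero handled separately."""
    if x == 0 or y == 0:
        return 0
    return EXP[LOG[x] + LOG[y]]
```

The two tables together occupy well under 1 KB when stored as bytes, which is why this scheme is attractive wherever look-ups are cheap.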
However, it is not always advisable to convert a costly multiplication into inexpensive table look-up
and addition operations. On a compute-bound architecture such as a general-purpose CPU, this conversion
improves system performance; but on a memory-bound architecture such as a GPU, it damages rather than
improves system performance, because a table look-up there is much more costly than a multiplication.
At the same time, the size of the table is determined by the finite field, and the table takes up costly
memory resources on a memory-bound architecture. Even where the conversion is advisable, it can be made
more aggressive: one multiplication can be converted into a single table look-up in the finite field.
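One way to read the single-look-up conversion mentioned above is a full product table indexed by both operands. The Python sketch below is an illustration under assumptions (GF(2^8), reduction polynomial 0x11B, hypothetical names), not a description of the authors' implementation. It also makes the memory cost concrete: the table has 256 x 256 one-byte entries, that is 64 KB, which is larger than the 16 KB of shared memory per SM cited in Section 2.

```python
POLY = 0x11B  # assumed reduction polynomial for GF(2^8)


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8), used only to fill the table."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r


# Full product table: entry (x << 8) | y holds x * y. 65536 bytes in total.
MUL = bytes(gf_mul(x, y) for x in range(256) for y in range(256))


def gf_mul_single_lookup(x, y):
    """One multiplication becomes exactly one table look-up."""
    return MUL[(x << 8) | y]
```

The trade-off is the one the text describes: fewer operations per multiplication, paid for with a table that competes for scarce on-chip memory.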
On a CUDA-enabled GPU architecture, the naive and traditional parallel approach requires that the volume of
data to be processed be large enough, or the partitioned fragments small enough, that millions of
threads run simultaneously to hide and share the multiplication overhead. Accordingly, the fragment size is
one byte, matching the finite field $GF(2^8)$, in [5] and [8-9]. This starts enough threads but
increases computation and communication among threads, which wastes computing resources; worse,
communication between threads in the GPU architecture is rather costly. Therefore, a 2-byte
fragment size is proposed. Larger fragment sizes are not selected because the word length of the
GPU is 32 bits; too many bytes per fragment might decrease computation but would also increase
communication among threads.
Consequently, packets can be divided into n fragments of 2 bytes each, and each
fragment is then coded by a separate thread. Figure 2 shows the parallel loading of data and encoding of
packets among the threads of a thread block. The CUDA threads of a thread block are scheduled onto one SM
and share its 16 KB of shared memory. This makes inter-thread communication within one
thread block effective and convenient, and avoids frequent accesses to global memory.
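The fragment-per-thread partitioning described above can be sketched as follows. This Python sketch only illustrates the index mapping; the zero-padding of the last fragment, the parameter `threads_per_block`, and the function names are assumptions introduced for illustration and are not taken from the paper.

```python
FRAGMENT_BYTES = 2  # fragment size proposed in the text


def split_into_fragments(packet, size=FRAGMENT_BYTES):
    """Cut a packet into fixed-size fragments, zero-padding the last (assumption)."""
    padded = packet + b"\x00" * (-len(packet) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def thread_for_fragment(index, threads_per_block):
    """(block index, thread index) of the thread that codes fragment `index`."""
    return divmod(index, threads_per_block)
```

With 256 threads per block, for instance, fragment 600 would be handled by thread 88 of block 2, so consecutive fragments of a packet land in the same block and can be staged in that block's shared memory.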
4. Architecture-oriented Optimizations on Network Coding 
4.1. Optimization Using Runtime Component for GPU
In [5] and [8-9], the costly multiplication is converted into table look-up and addition operations to
improve system performance. In the framework proposed above, an appropriate granularity
of parallelism for network coding is instead presented to hide and share the overhead of multiplication
operations. Furthermore, runtime components supported on the GPU device are used to replace standard math
functions, because the math functions in the device runtime component map directly onto
one or a few GPU instructions and are architecturally optimized, albeit generally with lower precision.
Fortunately, the addition and multiplication supported in the device runtime component conform to the
IEEE standard [10], so they offer the same precision as the standard library functions but are faster.
Therefore, __fmul(x, y) and __fadd(x, y), supported in the device runtime component, are applied to the
network coding calculations to further improve system performance.
4.2. Parallelizing Decoding Based on Texture Cache 
Texture memory resides in global memory but is cached; in addition, the texture cache is
optimized in the architecture for two-dimensional data structures [10].
During network decoding, after the inverse of the coefficient matrix has been calculated on the GPU device
and stored in memory, the inverse $C^{-1}$ is read many times but never updated, so
$C^{-1}$ should be cached. Shared memory is one place to cache $C^{-1}$, but its size is limited, and
caching $C^{-1}$ there would increase the memory pressure on inter-thread communication within a thread
block. The constant cache is another candidate, but it is smaller than the texture cache and is not
optimized in the architecture for two-dimensional data structures, so random access to specific rows
or columns of a matrix is not facilitated [10]. Accordingly, $C^{-1}$ is cached in the texture cache for
the best utilization of the memory hierarchy.
5. Experiments and Analysis 
To validate the efficiency of parallel network coding, the method used in [5] and [8-9] is tagged as
TB (Table-Based), the proposed network coding without optimization is tagged as Frag, and the proposed
network coding with optimization is tagged as Frag + Opt.
Figure 3. Comparison on coding speed
Figure 4. Comparison on decoding
5.1. Comparison on Coding Speed 
Figure 3 shows that the performance of the proposed network coding framework Frag is much better than
that of TB. However, the gain from optimizing the proposed framework does not meet expectations. This is
because, for coding speed, an appropriate granularity of parallelism matters much more than optimization
applied to an inappropriate granularity of parallelism. Nevertheless, the presented optimization
techniques could be applied to other applications accelerated on the GPU.
5.2. Comparison on Decoding Bandwidth   
Figure 4 shows that the texture cache is very important for improving the bandwidth utilization of the
memory hierarchy. On the NVIDIA GTX 280, bandwidth utilization is the same as that reported in [8] under
the same configuration. However, when the texture cache is introduced, bandwidth utilization is much higher
than in the unoptimized version without the texture cache.
6. Conclusion 
Since it was proposed, network coding has attracted much attention from both industry and academia, but
its practicality is still a challenge, largely because of the computational overhead of network
coding. Introducing a many-core GPU to accelerate network coding can achieve a high speed-up, which
would eliminate, or at least alleviate, this difficulty. The GPU is a promising high-performance
architecture; to maximize performance and minimize overhead, it is important to understand the
architecture of its computing resources, the hierarchy of available memory bandwidth, and the
collaboration between the GPU and the CPU.
Acknowledgements 
This work is partly supported by the National Grand Fundamental Research Foundation of China under
Grant No. 2007CB310901, the National Natural Science Foundation of China under Grant No. 60803041,
the Innovation Program for Excellent Graduates Foundation of National University of Defense
Technology of China under Grant No. B090603, and the Hunan Provincial Innovation Foundation for
Postgraduates under Grant No. CX2010B031.
References 
[1] Yang L, Zheng G, Hu XH. Research on Network Coding: A Survey. Journal of Computer Research and Development [J]. 2008,
45(8):400-407. (in Chinese)
[2] A. G. Dimakis, P. B. Godfrey, M. Wainwright, et al. Network coding for distributed storage systems. // Proceedings of the
26th Annual IEEE Conf on Computer Communications (INFOCOM 2007), Anchorage, USA, 2007:2000-2008.
[3] Wang Y, Lin C, Li Q, et al. Non-Cooperative Game Based Research on Routing Schemes for Wireless Networks. Chinese
Journal of Computers [J]. 2009, 32(1):54-68. (in Chinese)
[4] P. Chou, Y. Wu, K. Jain. Practical Network Coding. // Proceedings of the Allerton Conference on Communication, Control, and
Computing, Monticello, IL, 2003:63-68.
[5] H. Shojania, Baochun Li. Pushing the Envelope: Extreme Network Coding on the GPU. // Proceedings of the 29th IEEE
International Conference on Distributed Computing Systems, Montreal, QC, 2009: 1063-6927.
[6] Huang Z, Wang X. Research on the Optimization Problems in Network Coding. Journal of Software [J]. 2009, 20(5):1349-1360.
(in Chinese)
[7] T. Ho, M. Medard, R. Koetter, D. R. Karger, et al. A random linear network coding approach to multicast. IEEE Transactions on
Information Theory, 2006, 52(10):4413-4430.
[8] H. Shojania, Baochun Li, Xin W. Nuclei: GPU-Accelerated Many-Core Network Coding. // Proceedings of the
INFOCOM, Rio de Janeiro, Brazil, 2009:459-467.
[9] X.-W. Chu, K.-Y. Zhao, M. Wang. Massively Parallel Network Coding on GPUs. // Proceedings of the IEEE Performance,
Computing and Communications Conference, Austin, Texas, 2008:144-151.
[10] NVIDIA. CUDA programming guide 2.0. NVIDIA Corporation, 2008. 
