Abstract-In this paper, techniques of Reed-Solomon (RS) codes to recover lost packets in digital video/audio broadcasting and packet switched network communications are reviewed. Usually, different RS codes and their corresponding encoders/decoders are designed and utilized to meet different requirements for different systems and applications. We incorporate these techniques into a variable RS code and present an encoding and decoding algorithm suitable for the variable RS code. A mother RS code can be used to produce a variety of RS codes and the same encoder/decoder can be used for all the derivative codes with adding/detecting zeros, removing a part of parity symbols, and adding erasures. A VLSI implementation for erasure decoding of the proposed variable RS code is described and the achievable performance is quantitatively analyzed. A typical example shows that the signal processing speed is up to 2.5 Gbits/second and the processing delay is less than one millisecond, when integrating the decoder on a single chip. Therefore, the proposed algorithm and the encoder/decoder can universally be utilized for different applications with various requirements, such as transmission data rate, packet length, packet loss protection capacity, as well as layered protection and adaptive redundancy protection in DVB/DAB broadcasting, Internet and mobile Internet communications.
I. INTRODUCTION

ERROR correcting codes are traditionally applied to correct erroneous bits or symbols [2], but have recently been proposed to recover packets lost due to channel fading, interference and network congestion in video/audio broadcasting, multicasting and real-time Internet communications [9]-[11], [16]. Layered video was first proposed in the 1980s [3] and is now widely accepted as a means to transfer real-time video and audio over heterogeneous networks [1], [4], [7]. The application of unequal packet loss protection (UPLP) to layered video has been investigated, and it has been shown that UPLP can achieve more efficient bandwidth utilization and much better picture quality than equal protection [17].
The Reed-Solomon (RS) codes, as maximum distance separable (MDS) codes, have been suggested for packet loss protection in many papers, such as [16], [17]. In this paper we present a design technique of variable shortened-and-punctured RS codes for packet loss protection that can be used in many different applications with different protection requirements; that is, the same VLSI chips can be applied to a variety of codes with different code lengths, different code rates and variable redundancy, as well as to different transmission data rates and different packet lengths.
The paper is structured as follows. Section II gives a brief introduction to why variable RS codes are desired and how the RS codes are used to recover the lost packets. Section III describes the encoding and decoding algorithms of RS codes for packet loss protection and shows that the same encoder and decoder can simply be applied to achieve different code parameters (length, dimension and minimum distance), i.e., variable shortened-and-punctured RS codes. Section IV describes the algorithm for inversion of the Vandermonde matrix, which is required for the implementation of RS erasure decoding. In Section V, we outline the circuits for the implementation of the decoders. In Section VI, the decoder performance, characterized by achievable data rate, decoding delay and implementation complexity, is analyzed. Finally, a conclusion is given in Section VII.
II. APPLICATIONS OF VARIABLE RS CODES
Multicasting is a common scenario in various communication systems, such as DAB/DVB, Internet and mobile Internet. The packet loss rate due to traffic congestion, bad link and interference is usually higher than the loss rate required by the applications. The transport protocol must make up for the difference in packet loss rate.
To recover lost packets, two well-known techniques exist: automatic repeat request (ARQ), which retransmits the lost packets, and forward error correction (FEC) at the packet level, which transmits redundant packets. Fig. 1 illustrates a very simple multicast system. Suppose a transmitter sends five information packets to four receivers, 1, 2, 3, 4. In the transmission, the first packet (at the bottom of the block) is lost at , and , the second at and , the third at , the fourth at , and the fifth at . As a result, all five packets must be retransmitted if ARQ is applied. If, instead of resending the original information packets, MDS coded redundancy packets are used to recover the lost packets, then the number of required redundancy packets is three in this example. In general, if the total number of received information packets and redundancy packets is equal to or greater than the number of information packets, all the lost information packets can be recovered, regardless of which original packets are lost. Many papers have shown that FEC can be used to produce inherently scalable reliable multicast transport protocols, to meet hard real-time deadlines, to reduce or avoid feedback implosion, and to reduce bandwidth requirements compared to ARQ. Now we describe a coding scheme for the construction of redundancy packets.
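The comparison between ARQ and MDS-coded redundancy in the multicast example can be sketched as a small calculation. The loss pattern below is hypothetical (the receiver labels of Fig. 1 are not reproduced here); it is only chosen to match the counts stated in the text, and the sketch assumes that every redundancy packet reaches every receiver.

```python
def arq_retransmissions(lost_by_receiver):
    """Packets the sender must resend under ARQ: the union of all losses."""
    return len(set().union(*lost_by_receiver.values()))


def mds_redundancy_needed(lost_by_receiver):
    """Redundancy packets that suffice with an MDS code: the largest number
    of packets lost by any single receiver (assuming the redundancy packets
    themselves all arrive)."""
    return max(len(lost) for lost in lost_by_receiver.values())


# Hypothetical loss pattern: receiver -> set of lost information packets.
losses = {1: {1, 2, 3}, 2: {1, 4}, 3: {1, 5}, 4: {2}}

print(arq_retransmissions(losses))    # 5: every packet was lost somewhere
print(mds_redundancy_needed(losses))  # 3: worst receiver lost three packets
```

With an MDS code each receiver only needs as many redundancy packets as it lost information packets, so the sender's cost is set by the worst receiver rather than by the union of all losses.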
Suppose that a block of information has bits and is set into a two-dimensional array. Further let and every bits form a symbol in finite field GF . The encoding is performed in two stages: outer (column) encoding and inner (row) encoding, as shown in Fig. 2.
Typically, the outer codes are Reed-Solomon (RS) codes over GF . Each column of information symbols in GF is encoded into a codeword of , where (number of redundancy symbols). In total there are outer codewords in a block. Then, add a packet header of length bits to each row, encode each row into a codeword of code having redundancy of bits, and form a packet of length . If the code dimension includes the header according to the particular standards, we have , and . In total there are information packets and redundancy packets in a block. A binary BCH code can possibly be used as an inner code for simultaneous error correction and error detection.
The decoding is also performed in two stages: inner (row) decoding and outer (column) decoding. If the number of errors in a packet is less than or equal to a predetermined value, the information data can be correctly found and fed to the corresponding row of the information array; otherwise, the packet is declared erased. Then it is up to the outer decoding to recover the erasures. As a maximum distance separable code, the code is able to recover all erased information symbols if symbols of are received [2]. We now describe a general approach for the coding scheme. We can take symbols ( bits) from every information packet as the message data and encode the symbols into a codeword of the outer code, then take symbols from the redundancy symbols of every outer codeword as the payload of a redundancy packet and format the packet to be transmitted.
Multicasting provides an efficient way of disseminating data from a sender to a group of receivers. However, the heterogeneity and scale of the networks make multicast more problematic. In a typical scenario, some users participate over high-speed networks, some interact over low-speed networks, and still others join from home using ISDN telephone lines. The difference in network capacity can be several orders of magnitude [7]. Moreover, in a future digital HDTV broadcasting system it may be advantageous to accommodate a variety of heterogeneous receivers with different requirements of service quality, such as portable TV, laptops, SDTV and HDTV [12].
An efficient and widely accepted solution to the problem of heterogeneous networks and heterogeneous terminals is to combine layered cumulative compression algorithms with layered packet transmission schemes. In the approach, a signal is encoded into a number of layers that can be incrementally combined to provide progressive refinement. The local networks and receivers may dynamically select the number of layers, according to their capacities [7] .
It has been demonstrated that layered video is more sensitive to packet loss than unlayered video and that unequal loss protection is more efficient than equal protection. Such unequal protection can be achieved by choosing different numbers of redundancy packets: lower layers use more redundancy packets, and higher layers use fewer redundancy packets [17]. Fig. 3 shows an example of unequal packet loss protection for layered data transmission.
Let represent the number of information packets in a block at layer , and the number of redundancy packets. For the unequal protection, we have and , where is the number of layers. The block length is . If we design an RS code such that , ,
. Then all the RS codes are shortened-and-punctured codes of the mother code [2] , which is described in the next section.
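As a rough illustration of how per-layer codes relate to a single mother code, the sketch below computes how many data symbols would be shortened and how many parity symbols would be punctured for each layer. The mother code RS(255, 223), the layer sizes, and the names `DerivedCode` and `derive` are hypothetical and chosen only for illustration. The sketch assumes one outer-code symbol per packet per codeword, so the code dimension equals the number of information packets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedCode:
    n: int          # derived code length
    k: int          # derived code dimension
    shortened: int  # data symbols fixed to zero and not transmitted
    punctured: int  # parity symbols computed but not transmitted


def derive(mother_n, mother_k, info_packets, redundancy_packets):
    """Parameters of the shortened-and-punctured code for one layer."""
    mother_r = mother_n - mother_k
    if info_packets > mother_k or redundancy_packets > mother_r:
        raise ValueError("layer does not fit into the mother code")
    return DerivedCode(
        n=info_packets + redundancy_packets,
        k=info_packets,
        shortened=mother_k - info_packets,
        punctured=mother_r - redundancy_packets,
    )


# Hypothetical layers: (information packets, redundancy packets).
# Lower layers carry more redundancy than higher layers.
layers = [(40, 16), (40, 8), (40, 4)]
for info, red in layers:
    print(derive(255, 223, info, red))
```

Every layer then reuses the same encoder and decoder; only the counts of inserted zeros and of added erasures change from layer to layer.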
III. VARIABLE RS CODES AND ERASURE DECODING
A Reed-Solomon code of length , dimension , minimum distance over GF is defined in its simple form:
Let be distinct nonzero elements from GF . Define the matrix according to (1) A Reed-Solomon , , ) code [5] can be defined as GF (2) In principle, can be arbitrary ordering of distinct nonzero elements of GF . To simplify the computations, we choose the ordering to be for where is an element of GF of order greater than . Therefore the parity-check matrix has the form (3) where . Suppose this code is used to correct erasures. Any pattern of erasures can be corrected iff . To achieve a universal hardware implementation of erasure decoders, we can simply set received redundancy symbols as erasures. In the following we always assume . Now we give the following definitions and notations:
, sent codeword Define the sub-matrix of the parity check matrix , corresponding to the erasure position index (4) where ; , . The symbols of are called erasure locations. Corresponding to the observed positions , define also (5) where ; , . Finally define the syndrome by (6) If the erasures in the received vectors are replaced with zeros, that is (7) from (6) the syndromes can also be obtained by using the formula (8) For each codeword , we have . Thus the erased vector can be solved by (9) The parity check matrix in (1) can be converted into a matrix of the form: (10) where is an identity matrix. Therefore the generator matrix of the RS code is (11) where is a identity matrix, is a matrix. Given the message vector (12) the codeword is (13) Given an RS code as a mother code, a variety of RS codes can be derived from the mother code to fit different applications and a universal encoder and decoder on chips can be used for different applications, if and . The derived codes are called shortened-and-punctured RS codes, which are MDS codes, i.e.,
. The derived codes can be represented as RS (14) where is the number of removed data symbols and the number of deleted redundant symbols, and , , .
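For orientation, the erasure-decoding relation and the derived-code parameters described above can be summarized as follows. The notation is assumed, not taken from the original equations: $x_j$ are the code locators, $E$ and $U$ are the erased and observed position sets, $X$ and $Y$ are the corresponding column sub-matrices of $H$, and $s$ and $p$ denote the numbers of shortened data symbols and punctured parity symbols. The row exponent range $0,\dots,n-k-1$ is likewise an assumption, and $X$ is square because exactly $n-k$ positions are treated as erasures.

```latex
\begin{aligned}
H &= \bigl[\,x_j^{\,i}\,\bigr]_{\,i=0,\dots,n-k-1;\ j=0,\dots,n-1},
   \qquad H\,\mathbf{c}^{T} = \mathbf{0},\\[2pt]
H\,\mathbf{c}^{T} &= X\,\mathbf{c}_E^{T} + Y\,\mathbf{c}_U^{T} = \mathbf{0},
   \qquad \mathbf{S} = Y\,\mathbf{c}_U^{T},\\[2pt]
\mathbf{c}_E^{T} &= -\,X^{-1}\,\mathbf{S},\\[2pt]
\mathrm{RS}(n,\,k,\,d) &\;\longrightarrow\; \mathrm{RS}(n-s-p,\;k-s,\;d-p),
   \qquad d = n-k+1 .
\end{aligned}
```

Here $\mathbf{S}$ equals the syndrome of the received word with erasures replaced by zeros, and in a field of characteristic 2 the minus sign vanishes. As a check, $s = p = 1$ applied to RS(7,4,4) gives the (5,3,3) code used in the example.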
Encoding of RS Codes: Insert zeros at the beginning of each message vector of length , i.e., , then encode as the mother code RS by the use of (13), remove both the first zeros and the last parity symbols and transmit the remaining symbols.
Decoding of RS Codes: Insert zeros at the beginning and the erasures at the end of the received vector, decode as the mother code RS using (9), finally discard the inserted symbols from the decoded vector.
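The encoding and decoding procedure for a shortened-and-punctured code can be sketched in software for a toy mother code RS(7,4) over GF(8) and the derived (5,3) code. Several details are assumptions made for the sketch: the primitive polynomial x^3 + x + 1, a parity-check matrix with entries alpha^(i*j) for i = 0, 1, 2, and all function names. The sketch also departs from the text in one respect: instead of multiplying by the generator matrix (13), it encodes by treating the parity positions as erasures, and it solves the linear system by Gaussian elimination rather than by the matrix inversion of Section IV.

```python
# GF(8) tables, assuming the primitive polynomial x^3 + x + 1.
EXP = [1]
for _ in range(6):
    v = EXP[-1] << 1
    EXP.append(v ^ 0b1011 if v & 0b1000 else v)
LOG = {v: i for i, v in enumerate(EXP)}


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 7]


def inv(a):
    return EXP[(7 - LOG[a]) % 7]


N, K = 7, 4          # mother code RS(7,4), minimum distance 4
R = N - K
H = [[EXP[(i * j) % 7] for j in range(N)] for i in range(R)]  # assumed form


def fill_erasures(word, erased):
    """Return the mother codeword that agrees with `word` outside `erased`."""
    erased = list(erased)
    assert len(erased) <= R, "too many erasures for the mother code"
    # Pad with known positions so that exactly R symbols count as erasures.
    erased += [j for j in range(N) if j not in erased][: R - len(erased)]
    y = [0 if j in erased else s for j, s in enumerate(word)]
    # Syndromes of the zero-filled word.
    S = [0] * R
    for i in range(R):
        for j in range(N):
            S[i] ^= mul(H[i][j], y[j])
    # Solve X e = S, where X holds the columns of H at the erased positions.
    A = [[H[i][j] for j in erased] + [S[i]] for i in range(R)]
    for c in range(R):
        p = next(r for r in range(c, R) if A[r][c])
        A[c], A[p] = A[p], A[c]
        f = inv(A[c][c])
        A[c] = [mul(f, v) for v in A[c]]
        for r in range(R):
            if r != c and A[r][c]:
                g = A[r][c]
                A[r] = [v ^ mul(g, w) for v, w in zip(A[r], A[c])]
    for idx, j in enumerate(erased):
        y[j] = A[idx][R]
    return y


SHORTENED, PUNCTURED = 1, 1   # derived (5,3) code, minimum distance 3


def encode(msg):
    """Insert zeros, encode as the mother code, drop zeros and last parity."""
    word = [0] * SHORTENED + list(msg) + [0] * R
    cw = fill_erasures(word, range(K, N))
    return cw[SHORTENED: N - PUNCTURED]


def decode(received):
    """`received` has 5 symbols; lost symbols are given as None."""
    word = [0] * SHORTENED + [0 if s is None else s for s in received]
    word += [0] * PUNCTURED
    erased = [SHORTENED + i for i, s in enumerate(received) if s is None]
    erased += range(N - PUNCTURED, N)   # punctured parity counts as erased
    return fill_erasures(word, erased)[SHORTENED:K]


sent = encode([3, 5, 6])
damaged = [sent[0], None, sent[2], None, sent[4]]
assert decode(damaged) == [3, 5, 6]
```

Because the punctured parity symbol already uses one of the three erasures the mother code can fill, the derived code tolerates up to two lost symbols per codeword, in line with its minimum distance of 3.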
Example: A primitive RS(7,4,4) code over GF(8) generated by can be defined by the parity matrix or Therefore the generator matrix is A shortened-and-punctured (5,3,3) code can be derived from the RS(7,4,4) code.
Given the message vector . Then the codeword should be , where the symbols in [ ] will not be transmitted. Thus the transmitted vector should be . Suppose the received vector is , where " " is an erasure. Insert the untransmitted symbol, then the input vector of the decoder will be . Therefore we have unknown
IV. INVERSION OF THE VANDERMONDE MATRICES
To recover the erased symbols, matrix inversion is required, as shown in (9). Inversion of a large matrix is time-consuming and results in a long delay. Inversion of the Vandermonde matrices is described in this section and the circuit design is introduced in the next section, which shows that the decoding delay is minimal. This is especially useful for real-time communications.
Let the field under consideration be the finite field GF and let be distinct nonzero elements from GF . A Vandermonde matrix can be defined as follows: (15) i.e., the matrix with ; . It is obvious that the matrix defined in (6) is a Vandermonde matrix.
Now define a set of polynomials (16) Let the matrix be determined by the coefficients of the polynomial . Consider the matrix product . We have [5] , [13] .
Therefore the inverse of the Vandermonde matrix is given by
To simplify the implementation, we define the polynomial
and denote
Thus can be obtained from the formula
Given a matrix , the inverted matrix can be obtained using (19)-(21). Polynomial multiplication and division are easy to implement with digital shift-register logic.
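The polynomial route to the inverse of a Vandermonde matrix can be illustrated with a short sketch over GF(8). It follows the standard Lagrange-type construction: form the product polynomial of all factors (z + x_j), divide it by (z + x_i), and scale by the inverse of the product of the differences (x_i + x_j). Whether this matches (19)-(21) term by term is an assumption, as are the primitive polynomial x^3 + x + 1, the convention V[i][j] = x_j^i, and all function names.

```python
# GF(8) tables, assuming the primitive polynomial x^3 + x + 1.
EXP = [1]
for _ in range(6):
    v = EXP[-1] << 1
    EXP.append(v ^ 0b1011 if v & 0b1000 else v)
LOG = {v: i for i, v in enumerate(EXP)}


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 7]


def inv(a):
    return EXP[(7 - LOG[a]) % 7]


def gf_pow(a, e):
    """a**e for a nonzero field element a."""
    return EXP[(LOG[a] * e) % 7]


def poly_mul_linear(p, x):
    """p(z) * (z + x); coefficients are stored low order first."""
    out = [0] * (len(p) + 1)
    for k, c in enumerate(p):
        out[k + 1] ^= c
        out[k] ^= mul(c, x)
    return out


def poly_div_linear(p, x):
    """Quotient of p(z) / (z + x) by synthetic division (x must be a root)."""
    q = [0] * (len(p) - 1)
    carry = 0
    for k in range(len(p) - 1, 0, -1):
        carry = p[k] ^ mul(carry, x)
        q[k - 1] = carry
    return q


def vandermonde_inverse(xs):
    """Inverse W of V[i][j] = xs[j]**i for distinct nonzero xs.

    Row i of W holds the coefficients of the polynomial that equals 1 at
    xs[i] and 0 at every other location.
    """
    F = [1]
    for x in xs:                      # F(z) = product of (z + x_j)
        F = poly_mul_linear(F, x)
    rows = []
    for xi in xs:
        c = 1
        for xj in xs:                 # c_i = product of (x_i + x_j), j != i
            if xj != xi:
                c = mul(c, xi ^ xj)
        scale = inv(c)
        rows.append([mul(scale, a) for a in poly_div_linear(F, xi)])
    return rows


xs = [EXP[1], EXP[3], EXP[6]]         # three hypothetical erasure locations
W = vandermonde_inverse(xs)
```

Both polynomial steps reduce to multiply-accumulate-and-shift operations, which is why they map naturally onto shift-register circuits.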
Example: Given a Vandermonde matrix shown in the previous example then we have
V. CIRCUIT DESIGN OF DECODERS
We now study the circuit implementation of these formulas and erasure decoders. All elements, such as flip-flops, adders and multipliers, work in GF GF , which also implies that for every element in the field. A shift register implementation for the sum representation of the polynomial in (19) is outlined in Fig. 4. Each stage of the shift register consists of a multiplier, an adder and a flip-flop, with the exception of the first stage and the last. The register is initialized with 1000 0. Any intermediate result is stored in the shift register. When an erasure comes, the location , which is a nonzero element in GF , is imported into the circuit. By multiplication, addition and shifting, the product of the previous polynomial and is computed. Finally, all the coefficients of polynomial are computed and stored in the shift register. Fig. 5 illustrates a circuit for calculating using (20). The one-stage shift register SR2-i is initialized by 1. Any intermediate result of the element product is stored in this register. When a new erasure location is fed into the input, the is added to the erasure location , then the sum is multiplied with the previous intermediate result, and the product is shifted into the register. Fig. 5 also shows how all the are calculated in parallel. At the beginning, are fed in parallel into their calculators via switches to the memory. Then, switch is closed, and all of are connected to the adders. After shifts and calculations, all are produced, and closed, then is inverted. A circuit for calculating the set of polynomials by (21) is shown in Fig. 6. The shift register of stages is initialized by the vector . All the one-stage shift registers are initialized by . are produced serially for and in parallel for . Fig. 7 shows a circuit for the calculation of syndromes , using (8).
Since shift-register SRI is initialized by , the output of Mul1 will be the sequence of , which will be multiplied with respectively at Mul2 and the products will then be accumulated.
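The multiply-and-accumulate behaviour described for the syndrome circuit can be mimicked in software. In the sketch below a running power plays the role of the register feeding the first multiplier, and an accumulator collects the products with the incoming symbols. The field GF(8) with primitive polynomial x^3 + x + 1, the syndrome definition S_i = sum over j of y_j * (alpha^i)^j, and the function names are assumptions made for illustration and are not read off Fig. 7.

```python
# GF(8) tables, assuming the primitive polynomial x^3 + x + 1.
EXP = [1]
for _ in range(6):
    v = EXP[-1] << 1
    EXP.append(v ^ 0b1011 if v & 0b1000 else v)
LOG = {v: i for i, v in enumerate(EXP)}


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 7]


def syndromes(received, r):
    """Syndromes S_0..S_{r-1} of a word whose erasures are already zeroed."""
    out = []
    for i in range(r):
        step = EXP[i % 7]        # alpha^i, the fixed multiplier of this unit
        power, acc = 1, 0        # register contents before the first symbol
        for y in received:       # one iteration per received symbol
            acc ^= mul(power, y)         # multiply and accumulate
            power = mul(power, step)     # advance to the next power
        out.append(acc)
    return out


# A codeword of the assumed code: coefficients of (z+1)(z+alpha)(z+alpha^2).
codeword = [3, 5, 7, 1, 0, 0, 0]
print(syndromes(codeword, 3))            # all zero for a valid codeword
print(syndromes([3, 5, 7, 0, 0, 0, 0], 3))  # nonzero once a symbol is zeroed
```

In hardware the outer loop over i disappears, since all syndromes are accumulated by parallel units during reception of the codeword.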
A circuit for recovering the erased symbols for using (9) is illustrated in Fig. 8 . This is a simple implementation for the element multiplication and accumulation. Since is produced before , as shown in Fig. 6 , the output of shift-register SR1 in Fig. 8 starts from and finally . The block diagram of a single erasure decoder is shown in Fig. 9 . As the received signal is demodulated, the vectors of erasure locations , syndromes and coefficients of are simultaneously generated. After receiving a codeword and the calculations, those vectors are fed into the corresponding buffers , , and in parallel for the second stage calculations, and then the third stage.
VI. ANALYSIS OF THE DECODER PERFORMANCE
The performance of decoders can be characterized by several measures, such as achievable data rate, decoding delay and implementation complexity. The achievable data rate and decoding delay are functions of complexity, as analyzed later.
The complexity of the decoding algorithm is usually measured by the number of multiplications. However, the decoding speed and the circuit size may change significantly in different implementations. Therefore, implementations should be judged in more detail by their time and space complexity.
The time complexity is the number of time units required to process one input. A similar definition is used to define space complexity. We take a two-input and one-output XOR logic gate as the space unit, and the propagation delay of a two-input and one-output XOR logic gate as the time unit. We will deal exclusively with the fields GF , where is any integer. The time complexity and space complexity (size) of some basic operation units in GF have been investigated in [6], [14]. Assume that all units have bit-parallel inputs and outputs. Their complexity is summarized in Table I. Now we investigate the time and space complexity of each component in Fig. 9. 1) Calculators : As shown in Fig. 7, it needs iterations to perform a syndrome calculation. Each iteration needs one clock cycle to perform two multiplications, one addition and register shifting. Thus, the time complexity is (22)
The figure also shows that two multipliers, one adder and two flip-flops are needed. As is a fixed symbol, the space complexity can be set to zero. There are syndromes to be calculated in parallel. Thus the space complexity (size) is (23) 2) Calculator : As shown in Fig. 4, there are iterations. Each iteration performs one multiplication, one addition and register shifting. However, the iterations are synchronized with the arrivals of erasures and the time period required for each iteration in Calculator is shorter than that in Calculator . We have (24) (25)
3) Calculator and with Buffer-X1: As shown in Fig. 5, it needs iterations to perform the calculations of . In addition, element inversion is required. Thus, we have (26) (27) Buffer-X1 in Fig. 9 is also shown as SRI in Fig. 5.
4) Matrix inverter and erasure recovery with Buffer-F2 and Buffer-S2: As shown in Figs. 6 and 8, for are serially generated and fed into the calculator , so we consider them as a unit. Buffer-F2 and Buffer-S2 in Fig. 9 are also shown in Figs. 6 and 8 respectively. There are iterations and each iteration performs two multiplications, one addition and register shifting in Fig. 6. Each iteration performs one multiplication, one addition and shifting in Fig. 8, which is less than that in Fig. 6. The time for one iteration must be designed to satisfy the requirement of the longest one. The complexity of the unit can be estimated by (28) (29) Fig. 9 shows a pipeline structure to implement the erasure decoding algorithm. The decoding of a received word is accomplished in 3 stages: 1) Calculation of syndromes and erasure location polynomials, and . 2) Calculation of erasure factors and their inverses, and . 3) Finding the inverted matrix and erased symbols, and . For the pipeline structure, the required time of each stage should be equal and depends on the maximum value among the , and , that is By inspecting Fig. 2, a block of packets may consist of many codewords. If a block of received packets is decoded using one single decoder, it might not be suitable for real-time communication. Therefore, a set of decoders can be operated in parallel to find the erased packets. A structure of the set of decoders is shown in Fig. 10.
The received codewords with erasures are fed into the input logic, in which shortened and punctured symbols are replaced with zeros and erasures respectively if the applied code is a shortened-and-punctured code of the mother code. The programmable logic controls code parameters, synchronization, data distribution, and so on. The output of the decoders is stored in the output RAM and codewords are assembled.
Let the number of decoders in the set be . Since each decoder processes three received codewords in the three-stage pipeline, a total of received codewords must be stored in the output RAM. The space complexity of the RAM is about (35) Let the space complexity of the logic control and other auxiliary circuits be , then the time and space complexity of the set of decoders are respectively (36) (37) Suppose that the number of RS codewords in a block of packets is and the set of decoders is utilized for a specific application. Given that the number of decoders is less than the number of parallel decoders, , then the set of decoders must be used repeatedly for a complete erasure decoding of the block. The number of received codewords in a block, which a single decoder must handle, is (38) Now we estimate the performance of the decoders: processing speed, decoding delay and implementation complexity. 1) Processing speed: Processing speed is defined as the number of input bits which can be processed per second. Each decoder in Fig. 10 processes three code vectors in a pipeline, the current input vector and the previous input vectors, and an input vector is processed in three stages. The number of time units designed for processing one vector at each of the three stages should be . Each vector has symbols in GF , therefore the achievable processing speed or acceptable input bit rate is (39) where represents the gate delay of a two-input and one-output XOR gate, which is about 1 ns nowadays. If a variable RS code is utilized for a specific application, the processing speed will be (40) 2) Decoding delay: In this paper the decoding delay is defined as the time difference between the moment a block of packets has been received and the moment all the lost data of the block have been found, without consideration of packet transmission and reassembly. The undecoded vectors from a block are sequentially fed into a pipeline decoder of three stages.
Three codewords are sequentially processed in the stages, therefore the decoding delay of a block of packets is (41) 3) Implementation complexity: Implementation complexity is defined as the number of transistors which are required in the decoders. Given that the number of transistors in an XOR gate is 6, then the implementation complexity is about c) Suppose such a VLSI chip is utilized for a specific application. Assume 48 information packets and 12 redundancy packets constitute a block, and the data length of each packet is 1024 bytes. We can now use the shortened-and-punctured RS(240, 192) code. Take 4 bytes from every information packet as the message data in a codeword and encode the 192 bytes into a codeword of 240 symbols over GF . Take 4 symbols from the redundancy symbols of every RS(240, 192) codeword as the payload of a redundancy packet and format the packet to be transmitted. There are RS codewords in a block of packets. Each of the four decoders must process received codewords, therefore, the achievable processing speed from (40) is bits/second, and the decoding delay from (41) is ms, which is expected for real-time communications.
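The block layout of this example can be checked with a few lines of arithmetic. The sketch below assumes that one code symbol is one byte and that codewords are shared evenly among the four decoders; the function name `block_layout` is hypothetical. It does not evaluate (40) or (41), whose terms are not reproduced here.

```python
import math


def block_layout(info_packets, redundancy_packets, packet_bytes,
                 symbols_per_packet, decoders):
    """Code parameters and decoder workload for one block of packets."""
    k = info_packets * symbols_per_packet          # message symbols
    r = redundancy_packets * symbols_per_packet    # redundancy symbols
    codewords = packet_bytes // symbols_per_packet # codewords per block
    return {
        "n": k + r,
        "k": k,
        "codewords_per_block": codewords,
        "codewords_per_decoder": math.ceil(codewords / decoders),
        "max_packet_loss": redundancy_packets / (info_packets + redundancy_packets),
    }


layout = block_layout(info_packets=48, redundancy_packets=12,
                      packet_bytes=1024, symbols_per_packet=4, decoders=4)
print(layout)
```

Under these assumptions the block maps onto RS(240, 192) codewords, and up to 12 of the 60 packets (20 percent) can be lost without losing information.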
VII. CONCLUSION
In this paper we have presented a universal encoding and decoding algorithm of variable Reed-Solomon codes for packet loss recovery. By use of the algorithm, the same encoder and decoder can simply be applied to a variety of RS codes with different code parameters (code length, code rate and minimum distance) by adding/detecting zero symbols, removing a part of the parity symbols and adding erasures. The code is based on Vandermonde matrices. We revise the algorithm for inversion of Vandermonde matrices in order to simplify the implementation of erasure decoding. The circuits for erasure decoding are described in detail. It is shown that the decoding algorithm can be implemented with simple, regular and modular circuits, naturally suitable for VLSI design. The achievable performance, measured by signal processing speed, delay and implementation complexity, is mathematically analyzed. A typical example shows that a processing speed of 2.5 Gbits/s and a delay of 0.2 ms are achievable by a single chip decoder. The suggested algorithm and VLSI implementation can universally be utilized because of the following features. 1) Given a mother code of RS over GF , a variety of codes , , over GF can simply be obtained, and the same VLSI chips can be used for the variable codes.
2) The number of packets in a block usually is less than 100 because of the restriction of concatenation delay and memory size. The maximum packet loss rate is around 10-30 percent. The code length and the minimum distance of the suggested mother code in the example can cover those requirements.
3) The processing speed is several gigabits per second, which is greater than the data transmission rate required in practice.
4) The processing delay is less than one millisecond even for a large-sized concatenation (50 × 1024 bytes), which is negligible even for real-time communications.
5) Consequently, the suggested codes and chips can flexibly be utilized for the implementation of different forward error correction (FEC), hybrid ARQ, unequal packet loss protection, unequal data protection, adaptive redundancy protection in point-to-point communications, multicast and broadcasting, over DVB/DAB networks, Internet and wireless Internet.
