Abstract-This paper presents a new trace-back memory structure for Viterbi decoders that reduces power consumption by 63% compared to a conventional RAM-based design. Instead of the intensive read and write operations required in RAM-based designs, the new memory is based on an array of registers connected with trace-back signals that decode the output bits on the fly. The structure is used together with appropriate clock and power-aware control signals. Based on a 0.35 µm CMOS implementation, the trace-back memory consumes an energy of 182 pJ.
Index Terms-Viterbi decoder, trace-back memory, low power, channel coding, convolutional code
I. INTRODUCTION
Convolutional coding is widely used in modern digital communication systems such as mobile or satellite communications to achieve low-error-rate data transmission. The Viterbi algorithm [1], in particular, is known to be an efficient method for the realisation of maximum-likelihood (ML) decoding of convolutional codes. Today the Viterbi decoder is widely used in established systems such as GSM mobile or the IEEE 802.11a wireless LAN standard. With emerging applications such as Digital Audio Broadcasting (DAB), Digital Video Broadcasting (DVB) or wearable personal entertainment devices, wireless communication is also becoming increasingly pervasive. These applications require devices with ultra-low power consumption. It has already been shown that the Viterbi decoder can account for more than one third of the power consumption during baseband processing in second-generation cellular telephones [2]. Power consumption is therefore the critical design criterion to be tackled.
The example in Figure 1 shows a 4-state (K=3) convolutional system with a coding rate (number of input bits / output bits) R of ½. The corresponding trellis diagram is shown in Figure 2. The states are presented on the Y-axis, with time on the X-axis. Each branch between a pair of states is labelled with the output values for that transition. A convolutional code is often represented as an (n, k, K) code, where n is the number of input bits, k the number of output bits, and K the encoder constraint length.
To decode data, a conventional Viterbi decoder consists of three major parts: the branch metric unit (BMU), the add-compare-select unit (ACSU) and the survivor memory unit (SMU). The BMU computes the branch metrics, which measure the likelihood of the states of the decoder transferring to a next state under a particular set of input symbols. The ACSU compares the results to find the maximum likelihood for each state, updates its path metrics, and generates a decision bit that uniquely identifies the previous, or surviving, state. These decision bits are stored in the SMU and used to reconstruct the most likely state sequence.
The Viterbi decoder has been a subject of ongoing research; the most recent work, such as [3] [4] [5], concerns high-speed implementations. Approaches for power reduction have also been proposed [6] [7] [8] [9] [10]. In this paper we introduce low-power design techniques that specifically target the key power-hungry block within the Viterbi decoder, which is the SMU.
An SMU usually employs either register-exchange or trace-back techniques. In register-exchange designs [11], the decoded bits for each state under the maximum-likelihood path are stored directly in the state registers. The approach consumes significant power and is suitable only for designs with a short decoding length, because the stored decoded data, which needs constant read/write updates, grows with the length of the decoding process. For systems that require a longer decoding length for a better bit error rate, such as those in GSM or CDMA, the trace-back technique [12] is usually employed.
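The cost of register exchange described above can be illustrated with a small behavioural sketch. It is not taken from [11]: the state numbering (newest input bit in the least significant bit of the state), the function names and the list-based registers are all assumptions made for illustration. On every trellis stage each state copies the whole survivor register of its chosen predecessor and appends one bit, so the number of bits rewritten per stage grows with the decoding length, whereas a trace-back memory stores only one decision bit per state.

```python
K = 3                       # constraint length of the 4-state example
N_STATES = 1 << (K - 1)     # 4 states


def predecessor(state, decision):
    """Assumed numbering: the newest input bit sits in the LSB of the state.

    The decision bit is the bit that was shifted out of the predecessor.
    """
    return (state >> 1) | (decision << (K - 2))


def register_exchange_step(registers, decisions, length):
    """One stage of register exchange.

    registers[s] holds the decoded bits of the survivor path ending in state s,
    decisions[s] is the decision bit produced by the ACSU for state s.
    Every register is rewritten in full on every stage.
    """
    updated = []
    for s in range(N_STATES):
        p = predecessor(s, decisions[s])
        # Copy the predecessor's history and append the bit that leads into s.
        updated.append((registers[p] + [s & 1])[-length:])
    return updated


def bits_written_per_stage(n_states, length, scheme):
    """Rough count of storage bits updated per trellis stage."""
    if scheme == "register_exchange":
        return n_states * length    # every survivor register is rewritten
    if scheme == "trace_back":
        return n_states             # one decision bit per state
    raise ValueError(scheme)
```

For a 64-state decoder with a decoding length of 35, this count gives 2240 bits per stage for register exchange against 64 for trace-back; it is a count of storage updates only, not a power estimate.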
Using the trace-back algorithm, only the information that uniquely identifies the previous state is kept in memory, meaning that only one bit per state per stage is required. After a certain decoding length, the information is read out and decoded to find the most likely incoming bit sequence received during that period. The trace-back SMU is conventionally realised using RAM, and as a result it has been shown to contribute more than half of the power consumption in a conventional Viterbi decoder due to the expensive memory accesses [8].
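As a minimal sketch of the trace-back idea, the following hard-decision decoder stores one decision bit per state per stage and recovers the input bits by walking those bits backwards. The generator polynomials (7, 5 in octal), the state numbering and all function names are assumptions chosen for illustration; the paper's Figure 1 encoder is not reproduced here and may differ.

```python
K = 3
GENERATORS = (0b111, 0b101)     # assumed generators (7, 5 octal), rate 1/2
N_STATES = 1 << (K - 1)
MASK = N_STATES - 1


def parity(x):
    return bin(x).count("1") & 1


def branch(state, bit):
    """Return (next_state, output_bits) for one encoder transition."""
    reg = (state << 1) | bit
    return reg & MASK, [parity(reg & g) for g in GENERATORS]


def encode(bits):
    state, out = 0, []
    for b in bits:
        state, symbol = branch(state, b)
        out.extend(symbol)
    return out


def predecessor(state, decision):
    # The decision bit is the bit shifted out of the previous state.
    return (state >> 1) | (decision << (K - 2))


def viterbi_decode(received):
    n = len(GENERATORS)
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)    # encoder starts in state 0
    decisions = []                            # one column of bits per stage
    for t in range(0, len(received), n):
        symbol = received[t:t + n]
        new_metrics, column = [], []
        for s in range(N_STATES):
            candidates = []
            for d in (0, 1):                  # add
                p = predecessor(s, d)
                _, expected = branch(p, s & 1)
                dist = sum(a != b for a, b in zip(expected, symbol))
                candidates.append(metrics[p] + dist)
            d = 0 if candidates[0] <= candidates[1] else 1   # compare
            new_metrics.append(candidates[d])                # select
            column.append(d)
        metrics = new_metrics
        decisions.append(column)
    # Trace back from the best final state using only the decision bits.
    state = min(range(N_STATES), key=metrics.__getitem__)
    decoded = []
    for column in reversed(decisions):
        decoded.append(state & 1)             # input bit that led into state
        state = predecessor(state, column[state])
    return decoded[::-1]
```

In this sketch the `decisions` list plays the role of the survivor memory: it is written once per stage and read only during the final backward walk, which is the access pattern a RAM-based SMU has to serve.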
This paper proposes a new memory structure in which the decoded bits are presented without having to be read out. The new memory is based on an array of registers connected with trace-back signals that decode the output bits on the fly. The structure is used together with appropriate clock and power-aware control signals. The detailed implementation is given in the next section.
II. PROPOSED TRACE-BACK MEMORY ARCHITECTURES
The new memory structure is built from interconnected cell elements, each with the structure shown in Figure 3. A cell consists of a register with additional logic that allows the stored trace-back information to be read. The logic is also connected to other cells such that, based on the stored bit, it can uniquely identify the previous, or surviving, state so that the decoded bit can be found. An example of the memory structure for a 64-state Viterbi decoder with a decoding length of 35 is shown in Figure 4. Power consumption is potentially minimised because, once the bits are stored in the registers, the result is already presented and does not have to be read out again, as is the case with RAM. The memory structure is also designed such that the signals are propagated through the array to decode the output bits only once the winner has been decided. Outside that period, no switching of the decoding logic occurs.
For writing the data into the registers, two approaches are considered. In the first approach, termed systolic (Figure 5), the data propagates through the stage (or column) registers at the same clock rate. In the second approach, termed parallel (Figure 6), the data is connected to all the column registers in parallel, each clocked at a rate of clk/35 and appropriately skewed. The results are discussed next.
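One plausible behavioural reading of the register array is sketched below; it is not the circuit of Figure 3 or Figure 4. It assumes that a one-hot trace signal enters the newest column at the winning state and that each active cell forwards the signal to the predecessor selected by its stored decision bit. The state numbering, the function names and the interpretation of clk as the stage rate are assumptions. The second function only counts clocked register loads per trellis stage for the two write schemes and says nothing about actual power.

```python
K = 7                       # constraint length of the 64-state example
N_STATES = 1 << (K - 1)     # 64 cells per column
DEPTH = 35                  # number of columns (decoding length)


def predecessor(state, decision):
    """Assumed state numbering: newest input bit in the LSB of the state."""
    return (state >> 1) | (decision << (K - 2))


def propagate_trace(columns, winner):
    """Model the trace-back signal rippling through the register array.

    columns[0] is the oldest column and columns[-1] the newest; each column
    holds one stored decision bit per state. Returns decoded bits, oldest first.
    """
    trace = [s == winner for s in range(N_STATES)]    # one-hot trace input
    decoded = []
    for column in reversed(columns):
        next_trace = [False] * N_STATES
        for s in range(N_STATES):
            if trace[s]:                  # only the selected cell is active
                decoded.append(s & 1)     # input bit that led into state s
                next_trace[predecessor(s, column[s])] = True
        trace = next_trace
    return decoded[::-1]


def register_loads_per_stage(scheme, depth=DEPTH, n_states=N_STATES):
    """Count clocked register loads per trellis stage (not a power figure)."""
    if scheme == "systolic":
        return depth * n_states     # every column shifts on each stage
    if scheme == "parallel":
        return n_states             # one column is enabled per stage
    raise ValueError(scheme)
```

Under these assumptions only one cell per column is active during decoding, and the parallel scheme clocks 64 registers per stage where the systolic scheme clocks 2240.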
III. RESULTS AND DISCUSSIONS
To test the effectiveness of the new memory structure, a Viterbi decoder that follows the DAB specifications is constructed. The DAB specifications use a constraint length K of 7 (hence 64 states) and a rate R of ¼, with the encoder as shown in Figure 7. Although the memories considered can easily be adapted to both single and multiple ACS unit approaches, for simplicity the Viterbi structure implemented here has a single-state ACS unit, requiring 64 clock cycles per stage. Also, as it is generally considered that the decoding length should be about five times the constraint length, the trace-back memory has a depth of 35. The block diagram of the decoder is shown in Figure 8.
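The sizing figures quoted above follow from the constraint length. The short calculation below reproduces them; the function and key names are illustrative only.

```python
def smu_dimensions(constraint_length, depth_factor=5):
    """Derive the trace-back memory dimensions from the constraint length."""
    n_states = 2 ** (constraint_length - 1)       # one state per register value
    depth = depth_factor * constraint_length      # rule-of-thumb decoding length
    return {
        "states": n_states,
        "depth": depth,
        "decision_bits": n_states * depth,        # one bit per state per stage
        "clocks_per_stage_single_acs": n_states,  # one state processed per clock
    }


dims = smu_dimensions(7)
print(dims)
```

For a constraint length of 7 this gives 64 states, a depth of 35, 2240 stored decision bits and 64 clock cycles per stage with a single ACS unit.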
The Viterbi decoders based on the three SMU architectures (RAM-based, and the proposed systolic and parallel forms) are implemented in Verilog HDL and functionally verified both by simulation in Mentor Graphics' ModelSim and by implementation on a Xilinx FPGA. Examples of the simulation results, showing the decoders correctly decoding the sample data (hex 713922f2b), are shown in Figure 9. The SMU designs are synthesised for ASIC implementation with Mentor Graphics' LeonardoSpectrum using a 0.35 µm CMOS technology. Power dissipation is estimated based on switching activity observations and information provided by [13]. The results are given in Tables 1 and 2.
Table 1 shows the results of the FPGA implementations of the Viterbi decoder employing the different types of memory. It can be seen that the RAM-based design uses significantly fewer resources. This is because the FPGA is a RAM-based technology, so the resources required to realise the arrays of flip-flops (FFs) needed by the systolic and parallel structures are considerable. This is particularly true for a heavily routed structure such as the parallel design, whose delay is also significantly increased. The FPGA is, however, an inherently power-hungry technology that needs a large static current, so the power aspect is not considered and the FPGA implementations serve only for verification.
For the low-power ASIC implementation, the SMUs based on the systolic and parallel designs are implemented and compared to the 2 Mbit macro block RAM given in [13], which can be employed for the same purpose. It can be seen that the parallel design consumes much less power (energy). This is mainly due to the lower switching activity of the FFs, given the slower clock rate of each column register. In the systolic design, by contrast, the FFs switch, depending on the particular stage register, at a rate much closer to the decode rate.
The reduction in register power is slightly offset by the additional routing, which increases the power dissipated in the interconnects and gate capacitances. Compared to RAM, the new trace-back memory based on the parallel design consumes only 36.4% of the power. The result also compares favourably with previously reported designs such as [14], which uses a similar CMOS technology. Given that the SMU contributes more than 50% of the total power consumed by a Viterbi decoder, using the new memory structure potentially reduces the total power by more than 30%. Where an increase in area is not a priority, a register-based design is also more appropriate for a soft IP approach, as the code is more easily portable than designs using RAM, where a macro RAM block usually needs to be provided by the vendor. In terms of speed, Table 2 also provides the delays of both new memory designs. Although the values are adequate for most applications, implementation in a deeper submicron technology will also allow the memory to be used in more demanding, ultra-high-speed applications. A modification of the structure for high-speed applications is also potentially possible and is the subject of our ongoing work.
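The percentages above can be checked with a short calculation. It assumes that the rest of the decoder is unchanged and that the SMU share is exactly 50%, the lower bound quoted in the text; the function names are illustrative.

```python
def power_reduction(relative_power):
    """Saving of the new SMU relative to the RAM-based SMU."""
    return 1.0 - relative_power


def decoder_level_reduction(smu_share, smu_reduction):
    """Saving at decoder level, assuming only the SMU power changes."""
    return smu_share * smu_reduction


smu_saving = power_reduction(0.364)                       # parallel design vs RAM
total_saving = decoder_level_reduction(0.5, smu_saving)   # SMU share of 50%
print(f"SMU saving: {smu_saving:.1%}, decoder saving: {total_saving:.1%}")
```

With these inputs the SMU saving is 63.6% and the decoder-level saving is 31.8%, consistent with the figures of 63% and more than 30% stated in the text; a larger SMU share would increase the decoder-level figure.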
IV. CONCLUSIONS
A new trace-back memory structure for Viterbi decoders that reduces power consumption by 63% compared to the conventional RAM-based design is proposed. The new memory is based on an array of registers connected with trace-back signals that decode the output bits on the fly. The structure is used together with appropriate clock gating and power-aware control signals. Based on a 0.35 µm CMOS implementation, the trace-back memory consumes an energy of 182 nJ.
