I. Introduction
Ultra-wideband (UWB) systems have received much attention in recent years, mainly because they offer high data rates at low transmit power [1]-[3]. In particular, the multiband orthogonal frequency-division multiplexing (MB-OFDM) based UWB system, which supports a maximum data rate of 480 Mbps, is widely considered [1]. The MB-OFDM UWB system uses a rate-1/3 convolutional code, and the Viterbi algorithm is generally used to achieve near-optimal decoding performance.
Recently, there have been many studies on high-speed Viterbi decoders (VDs). A systolic solution of the M-step parallel processing architecture was proposed in [4]. Because a k-fold increase in throughput requires a $2^k$-fold increase in hardware complexity, M is usually limited to less than 3. A bit-level pipelined structure was studied in [5]. It solves the timing constraints by retiming the critical path; however, both power consumption and hardware complexity increase to support data rates of hundreds of Mbps. A minimized method and sliding block methods have also been studied for high-speed processing [6]. To greatly increase the data rate, a systolic design has been employed. Because each of the above methods has its own limits in speed and/or hardware complexity, an architecture appropriate to the target system should be selected.
II. Two-Stage Radix-4 Viterbi Decoder
In this letter, we present a power-efficient two-stage radix-4 VD architecture that can support the maximum data rate of 480 Mbps in the MB-OFDM UWB system.
BM Unit
Branch metrics (BMs) for the radix-4 trellis are generated by combining the branch metrics of successive iterations of the underlying radix-2 trellis. Three 4-bit soft decision input signals are used to calculate eight sub-branch metrics corresponding to the eight possible encoder outputs in the radix-2 trellis. Because the radix-4 VD uses six 4-bit soft decision inputs for two decoded outputs, the maximum branch metric is 90 (6 × 15). The branch metrics are calculated using a uniform distance measure that equals the symbol itself when compared to logic-0 and its one's complement when compared to logic-1 [6].
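The branch metric calculation described above can be illustrated with a minimal Python sketch. It assumes unsigned 4-bit soft decisions in the range 0 to 15 and indexes the eight sub-branch metrics by the 3-bit encoder output with the first symbol as the most significant bit; the function names and this indexing are illustrative assumptions, not taken from the letter.

```python
from itertools import product

SOFT_BITS = 4
SOFT_MAX = (1 << SOFT_BITS) - 1  # 15 for 4-bit soft decisions


def bit_distance(symbol: int, expected_bit: int) -> int:
    """Uniform distance: the symbol itself against logic-0,
    its one's complement against logic-1."""
    if not 0 <= symbol <= SOFT_MAX:
        raise ValueError("symbol must fit in 4 bits")
    return symbol if expected_bit == 0 else SOFT_MAX - symbol


def radix2_sub_metrics(symbols):
    """Eight sub-branch metrics for three soft inputs.

    Entry k is the distance to the encoder output whose bits are the
    3-bit binary expansion of k (assumed ordering, first symbol = MSB).
    """
    return [
        sum(bit_distance(s, b) for s, b in zip(symbols, bits))
        for bits in product((0, 1), repeat=3)
    ]


def radix4_branch_metric(first, second, out_first, out_second):
    """Radix-4 BM: sum of the sub-branch metrics of two successive
    radix-2 trellis steps (six soft inputs in total)."""
    return (radix2_sub_metrics(first)[out_first]
            + radix2_sub_metrics(second)[out_second])
```

With all six inputs at 15 and both expected outputs equal to 000, `radix4_branch_metric` returns 6 × 15 = 90, which matches the maximum branch metric stated in the text.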
ACS Unit
An example of a four-way ACS unit for state 0 is given in 
where i is a predecessor state of s, and $\lambda_{n}^{i,s}$ denotes the branch metric on the transition from state i to state s at time n.
The two-way add-compare circuit of the ACS unit is described in detail in Fig. 4. By employing the modulo normalization algorithm from [7], we can avoid errors due to overflow during the updating of the state metrics and simplify the comparator circuit of the ACS unit, as shown in Fig. 4. Note that the output of the two-way add-compare logic is 0 if A is larger than B; otherwise, the output is 1.
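The following Python sketch illustrates how modulo normalization lets the state metrics wrap around without corrupting the comparison. The word length `WIDTH`, the selection of the smallest candidate (consistent with a distance-type metric), and the sequential pairwise selection are assumptions made for illustration; the actual circuit is the one in Fig. 4, and a hardware four-way ACS would typically perform its comparisons in parallel.

```python
WIDTH = 10                 # assumed state-metric word length (not given in the text)
MASK = (1 << WIDTH) - 1
HALF = 1 << (WIDTH - 1)


def add_mod(state_metric: int, branch_metric: int) -> int:
    """Add with wrap-around; overflow is deliberately ignored."""
    return (state_metric + branch_metric) & MASK


def compare_bit(a: int, b: int) -> int:
    """Return 0 if A is larger than B in the modulo sense, otherwise 1.

    Only valid while the true difference between the two metrics
    stays below 2**(WIDTH - 1).
    """
    diff = (a - b) & MASK
    return 0 if 0 < diff < HALF else 1


def four_way_acs(state_metrics, branch_metrics):
    """Add four branch metrics to four predecessor metrics and select
    the smallest candidate (assumed min-selection).

    Returns (new_state_metric, index_of_surviving_predecessor).
    """
    candidates = [add_mod(g, l) for g, l in zip(state_metrics, branch_metrics)]
    best = 0
    for i in range(1, 4):
        # A compare bit of 0 means the current best is larger, so i wins.
        if compare_bit(candidates[best], candidates[i]) == 0:
            best = i
    return candidates[best], best
```

For example, with predecessor metrics `[1020, 1000, 1010, 1015]` and branch metrics `[10, 20, 5, 30]`, the first and last sums wrap past 1023, yet the sketch still selects the third candidate (1015), which is the true minimum.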
TB Unit
The proposed VD architecture employs the 3-pointer even algorithm for trace-back recursion [8], which is more hardware-efficient than the register exchange algorithm. Figure 5 shows the memory banks, the last-in first-out (LIFO) buffers, and the block decoding method of the proposed VD.
III. Implementation Results
Table 1 compares the hardware complexity, the maximum operating speed, and the power consumption in the ASIC implementation of a class of ACS architectures. The ACS architectures are implemented and tested using the TSMC 0.13-µm and 0.18-µm CMOS libraries with the operation condition of slow mode. Because the sampling frequency of the MB-OFDM UWB system is 528 MHz [1], the radix-2, radix-4 (or two-stage radix-2), and two-stage radix-4 ACS architectures should be operated at clock speeds of 528 MHz, 264 MHz, and 132 MHz, respectively [6]. Although the pipelined radix-4 architecture satisfies the timing constraints, it requires much greater hardware complexity and power consumption than the radix-4 or two-stage radix-4 architecture [5]. For the radix-4 architecture, only the 0.13-µm technology can support the required operating speed. Moreover, the power consumption of the radix-4 architecture is approximately 49% higher than that of the two-stage radix-4 architecture. Table 1 indicates that the proposed two-stage radix-4 VD is the most power-efficient architecture for MB-OFDM UWB systems. Table 2 summarizes the hardware complexity of the two-stage radix-4 VD implemented using 0.13-µm CMOS technology.
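The clock speeds quoted above follow from dividing the 528 MHz sampling frequency by the number of trellis steps each architecture processes per clock cycle. The short Python sketch below reproduces that arithmetic; the steps-per-cycle values are assumptions inferred from the quoted clock speeds, and the names are illustrative.

```python
SAMPLING_MHZ = 528  # MB-OFDM UWB sampling frequency stated in the text

# Assumed number of radix-2 trellis steps consumed per clock cycle.
STEPS_PER_CYCLE = {
    "radix-2": 1,
    "radix-4": 2,
    "two-stage radix-2": 2,
    "two-stage radix-4": 4,
}


def required_clock_mhz(architecture: str) -> float:
    """Clock frequency needed to keep up with the sampling rate."""
    return SAMPLING_MHZ / STEPS_PER_CYCLE[architecture]


for name in STEPS_PER_CYCLE:
    print(f"{name}: {required_clock_mhz(name):.0f} MHz")
```

Under these assumptions, the two-stage radix-4 architecture can run at one quarter of the radix-2 clock, which is what relaxes its timing constraint.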
IV. Conclusion
We have proposed a power-efficient two-stage 64-state radix-4 VD architecture. Implementation results showed that the proposed VD, with relatively low hardware complexity and power consumption, can support the various data rates of MB-OFDM UWB transmission. As ASIC technology evolves, the proposed VD architecture is expected to support data rates of more than 1 Gbps and is therefore suitable for next-generation high-speed communication systems.
