). Research results are described in the areas of architectures for product turbo coders (based on component codes such as BCH codes, extended Hamming codes, and single parity check codes), space-time block codes, low-density parity check (LDPC) codes, and long BCH codes. Efficient implementations of AES cryptosystems are described. Architectures for ultra wideband communication systems are summarized. Erasure decoding in Reed-Solomon codes and some preliminary results on soft-decision Reed-Solomon decoders are outlined.
High-Speed BCH Turbo Product Decoder
In this work, a sub-optimal algorithm for decoding BCH (t >= 2) turbo codes has been developed. High-speed VLSI decoder architectures have been proposed for codes constructed over extended GF(2^5). While the algorithm applies to higher-order BCH product codes, it is shown that this particular block turbo code, when decoded using the proposed algorithm, gives the best performance among all two-dimensional turbo product codes, achieving a bit error rate of 10^-6 at a signal-to-noise ratio of 2.4 dB. Following an analysis of the impact of finite word-length effects on the performance of the SISO decoder, high-level architectures for fully parallel decoding have been developed. Lower-level high-speed implementation strategies have also been developed, such as fast finite field operations and the application of the look-ahead technique to reduce the critical path of the merge sort circuit. Area and timing estimates obtained by logic synthesis (0.18 micron, 1.5 V CMOS technology) from VHDL descriptions show that a throughput of more than 32 Mbit/s can be achieved.
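To make the phrase "fast finite field operations" concrete, the following is a minimal sketch of GF(2^5) multiplication done two ways: bit-serial shift-and-add, and log/antilog table lookup. The primitive polynomial x^5 + x^2 + 1 and all function names are assumptions for illustration; the source does not state which polynomial or multiplier structure the decoder uses.

```python
# Hypothetical sketch: GF(2^5) arithmetic with an assumed primitive polynomial.
PRIM_POLY = 0b100101  # x^5 + x^2 + 1 (assumption; not named in the source)
ORDER = 31            # number of nonzero field elements

def gf32_mul_shift_add(a, b):
    """Bit-serial multiply of two field elements given as 5-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a & 0b100000:
            a ^= PRIM_POLY       # reduce modulo the primitive polynomial
    return result

# Antilog table: EXP[i] = alpha^i, where alpha is the element x (integer 2).
EXP = [1] * ORDER
for i in range(1, ORDER):
    EXP[i] = gf32_mul_shift_add(EXP[i - 1], 2)
LOG = {value: i for i, value in enumerate(EXP)}

def gf32_mul_table(a, b):
    """Table-based multiply: add discrete logarithms modulo 31."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % ORDER]
```

The table form trades a small memory for a multiply that needs only one modular addition, which is one common way such operations are sped up; whether the proposed architecture does this is not stated in the source.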
Extended Hamming Block Turbo Coder
To exploit time diversity to combat channel noise, redundancy can be introduced into the transmitted information. The block turbo code, introduced in 1994, is one of the most powerful error control codes, with performance approaching the Shannon limit. Unfortunately, its decoding complexity is very high. A very low complexity block turbo decoder composed of extended Hamming codes has been proposed. New efficient complexity reduction algorithms are proposed, including simplifications of the extrinsic information computation and of the soft-input updating algorithm. For performance evaluation, [eHamming(32,26,4)]^2 and [eHamming(64,57,4)]^2 block turbo codes transmitted over an AWGN channel using BPSK modulation are considered. An extra 0.3 dB to 0.4 dB of coding gain is obtained compared with the most recent schemes, and the hardware overhead is negligible. The complexity of the new block turbo decoder is about one tenth that of the near-optimum block turbo decoder, with a performance degradation of only 0.5 dB. Other schemes, such as reducing the number of test patterns in the Chase algorithm and memory-saving techniques, have also been considered.
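For readers unfamiliar with the Chase algorithm mentioned above, the sketch below shows the textbook Chase-II test pattern generation that such reduction schemes start from: the p least reliable positions are located and all 2^p flip combinations are applied to the hard decision. The BPSK mapping (positive sample means bit 1) and the function name are assumptions; the reduced test pattern set developed in this work is not reproduced here.

```python
from itertools import product

def chase_test_patterns(soft, p):
    """Return (hard decision, least reliable positions, 2**p test sequences).

    Assumed convention: a positive soft value is decided as bit 1.
    """
    hard = [1 if r > 0 else 0 for r in soft]
    # Reliability is the magnitude of the soft value; pick the p smallest.
    weakest = sorted(range(len(soft)), key=lambda i: abs(soft[i]))[:p]
    patterns = []
    for flips in product((0, 1), repeat=p):
        candidate = list(hard)
        for pos, flip in zip(weakest, flips):
            candidate[pos] ^= flip
        patterns.append(candidate)
    return hard, weakest, patterns
```

Each test sequence would then be passed to an algebraic decoder of the component code. Since the number of sequences grows as 2^p, reducing the set directly reduces decoder work.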
Turbo Product Codes with Single Parity Check Component Codes
Both the complexity and the performance of serially concatenated 2-D single parity check turbo product codes were investigated. An alternative derivation of the extremely simple Max-Log-MAP decoding is given, in which only three additions are needed to compute each bit's extrinsic information. A parallel decoding structure has been proposed to increase the decoding throughput, while a new helical interleaver is constructed to further improve the coding gain. For performance evaluation, (16,14,2)^2 single parity check turbo product codes with code rate 0.766 over an AWGN channel using QPSK are considered. The simulation results using Max-Log-MAP decoding show that a BER of 10^-5 can be achieved at an SNR of 3.8 dB with 8 iterations. Compared to a turbo product code of the same rate and codeword length composed of extended Hamming codes, the proposed scheme can achieve similar performance with much less complexity. Other implementation issues, such as finite precision analysis and efficient sorting circuit design, have been addressed.
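As an illustration of why Max-Log-MAP decoding of a single parity check code is so cheap, the sketch below uses the well-known min-sum form: each bit's extrinsic value is the product of the signs of the other bits times the smallest magnitude among them, so only the two smallest magnitudes per codeword need to be tracked. This is the generic textbook form, not the three-addition derivation of this work, and the LLR sign convention (positive favors bit 0, even parity) is an assumption.

```python
def spc_extrinsic(llrs):
    """Min-sum (Max-Log-MAP) extrinsic LLRs for one even-parity SPC codeword.

    Assumed convention: a positive LLR favors bit 0.
    """
    signs = [1 if value >= 0 else -1 for value in llrs]
    mags = [abs(value) for value in llrs]
    total_sign = 1
    for s in signs:
        total_sign *= s
    # Only the smallest and second-smallest magnitudes are ever needed.
    order = sorted(range(len(mags)), key=mags.__getitem__)
    idx_min, min1, min2 = order[0], mags[order[0]], mags[order[1]]
    extrinsic = []
    for j in range(len(llrs)):
        mag = min2 if j == idx_min else min1
        # total_sign * signs[j] equals the product of all the other signs.
        extrinsic.append(total_sign * signs[j] * mag)
    return extrinsic
```

For the input `[2.0, -0.5, 1.5, 3.0]`, the least reliable bit (index 1) receives the second-smallest magnitude, and every other bit receives the smallest one.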
Reduced Complexity Space-Time Block Codes
A computationally efficient algorithm has been developed for the soft decoding of space-time block codes. Compared to the original maximum likelihood algorithm, the proposed algorithm saves up to 80% of the hardware operations. Simulation results using a space-time block turbo coded modulation scheme show that the proposed algorithm achieves the same decoding performance as maximum likelihood decoding with much lower complexity. Two approaches to combining block turbo codes with antenna diversity have been compared in terms of bit error rate (BER) performance under various configurations. One is a block turbo code with multiple transmit and receive antennas (the BTC-Diversity system); the other is the serial concatenation of a block turbo code with a space-time block code (the BTC-STBC system). The latter achieves better BER performance even under the constraint of equal spectral efficiency. Implementation issues for the concatenated system have been investigated, including an algorithm complexity reduction scheme without performance loss and its corresponding structure, a new early stopping criterion, and the proper choice of interleaver for different fading channels.
Regular LDPC code and decoder design and Architectures
In the past few years, Gallager's Low-Density Parity-Check (LDPC) codes have received a lot of attention, and much effort has been devoted to analyzing and improving their error-correcting performance. However, little consideration has been given to the VLSI implementation of LDPC decoders. In this work, a joint code and decoder design approach was proposed to construct a class of (3, k)-regular LDPC codes that exactly fit a partly parallel decoder implementation and have very good performance. In addition, a high-speed partly parallel decoder architecture for (3, k)-regular LDPC codes has been developed. Based on this, a 9216-bit, rate-1/2 (3, 6)-regular LDPC code decoder has been implemented on a Xilinx FPGA device. When performing a maximum of 18 iterations per code block, this partly parallel decoder supports a maximum symbol throughput of 54 Mbps and achieves a BER of 10^-6 at 2 dB over an AWGN channel. To the best of our knowledge, this was the first LDPC decoder FPGA implementation reported in the open literature at that time.
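To show what (3, k)-regular means structurally, the sketch below builds a small parity-check matrix as a 3 x k array of L x L circulant permutation blocks and checks that every column has weight 3 and every row has weight k. This block-circulant construction and the shift values are assumptions chosen for illustration; they are not the joint code and decoder construction proposed in this work. With k = 6 the design rate is 1 - 3/6 = 1/2.

```python
def build_regular_h(block_size, shifts):
    """Build H from a J x K table of circulant shifts (hypothetical helper).

    Block (i, j) is the identity matrix cyclically shifted by shifts[i][j].
    """
    rows, cols = len(shifts), len(shifts[0])
    h = [[0] * (cols * block_size) for _ in range(rows * block_size)]
    for i in range(rows):
        for j in range(cols):
            for r in range(block_size):
                c = (r + shifts[i][j]) % block_size
                h[i * block_size + r][j * block_size + c] = 1
    return h

def row_and_column_weights(h):
    """Return the sets of distinct row weights and column weights of H."""
    row_weights = {sum(row) for row in h}
    col_weights = {sum(col) for col in zip(*h)}
    return row_weights, col_weights

# Toy example: L = 7, shifts chosen as (i * j) mod 7 (illustrative only).
L = 7
SHIFTS = [[(i * j) % L for j in range(6)] for i in range(3)]
H = build_regular_h(L, SHIFTS)
```

Because each block contributes exactly one 1 per row and per column, regularity holds for any choice of shifts; the shifts only influence properties such as short cycles, which is where real code design effort goes.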
Novel overlapped scheduling techniques have been developed for quasi-cyclic LDPC codes, which can reduce the number of clock cycles by up to a factor of 2. Novel hardware sharing techniques have also been developed to reduce the data path hardware complexity of LDPC decoders.
Architectures for Long BCH Encoders
The speed in long BCH encoders is limited by the feedback loop of the linear feedback shift register. Several approaches were developed to circumvent the speed problem and the fanout problem simultaneously in long BCH encoders.
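The feedback loop in question can be seen in a bit-serial model of the encoder. The sketch below computes the parity bits of a systematic cyclic code with an LFSR that divides by the generator polynomial; every register update depends on the feedback bit from the previous step, which is what bounds the clock rate in hardware. The short BCH(15,7) code with g(x) = x^8 + x^7 + x^6 + x^4 + 1 is used only to keep the example small, and the function names are hypothetical; the parallel and fanout-reducing architectures of this work are not shown.

```python
G_FULL = 0b111010001          # g(x) = x^8 + x^7 + x^6 + x^4 + 1, BCH(15,7), t = 2
R = 8                         # number of parity bits = degree of g(x)
G_TAPS = G_FULL & ((1 << R) - 1)   # g(x) without its leading term

def lfsr_parity(msg_bits):
    """Serial LFSR division: returns (m(x) * x^R) mod g(x) as an integer.

    msg_bits lists the message coefficients, highest degree first.
    """
    reg = 0
    for bit in msg_bits:
        feedback = bit ^ ((reg >> (R - 1)) & 1)   # the critical feedback loop
        reg = (reg << 1) & ((1 << R) - 1)
        if feedback:
            reg ^= G_TAPS
    return reg

def gf2_poly_mod(a, g):
    """Reference polynomial remainder over GF(2), used to check the LFSR."""
    deg_g = g.bit_length() - 1
    while a.bit_length() - 1 >= deg_g:
        a ^= g << (a.bit_length() - 1 - deg_g)
    return a
```

The systematic codeword is the message followed by the R parity bits; for the message polynomial m(x) = 1 the parity is simply the low-order part of g(x).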
For decoders, novel substructure sharing approaches were developed to reduce the hardware complexity of the Chien search, which is the most computationally complex part of the decoder.
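For context, a Chien search evaluates the error locator polynomial at every nonzero field element and reports the positions where it vanishes. The sketch below is a plain serial version over the small field GF(2^4) with primitive polynomial x^4 + x + 1; the field, the helper names, and the position convention are assumptions for illustration, and the substructure sharing developed in this work is not reproduced. The constant multiplications by powers of alpha in the update step are the kind of repeated hardware that sharing techniques target.

```python
N = 15                      # code length = number of nonzero elements of GF(2^4)
EXP = []
_x = 1
for _ in range(N):
    EXP.append(_x)
    _x <<= 1
    if _x & 0b10000:
        _x ^= 0b10011       # reduce modulo x^4 + x + 1 (assumed field polynomial)
LOG = {value: i for i, value in enumerate(EXP)}

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % N]

def locator_from_positions(positions):
    """Build Lambda(x) = prod (1 + alpha^pos * x), coefficients low degree first."""
    poly = [1]
    for pos in positions:
        root_inv = EXP[pos % N]
        new = poly + [0]
        for i, coeff in enumerate(poly):
            new[i + 1] ^= gf_mul(coeff, root_inv)
        poly = new
    return poly

def chien_search(locator):
    """Return sorted error positions by testing Lambda(alpha^i) for all i."""
    errors = []
    terms = list(locator)   # terms[j] holds Lambda_j * alpha^(i*j) at step i
    for i in range(N):
        total = 0
        for t in terms:
            total ^= t
        if total == 0:
            errors.append((N - i) % N)   # root alpha^i means position -i mod N
        terms = [gf_mul(t, EXP[j % N]) for j, t in enumerate(terms)]
    return sorted(errors)
```

Building a locator for errors at positions 3 and 10 and running the search recovers exactly those positions.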
Architectures for AES
Various approaches for efficient hardware implementation of the Advanced Encryption Standard (AES) algorithm have been studied. The optimization methods can be divided into two classes: architectural optimization and algorithmic optimization. Architectural optimization exploits the strengths of pipelining, loop unrolling, and sub-pipelining. Speed is increased by processing multiple rounds simultaneously at the cost of increased area. Architectural optimization is not an effective solution in feedback modes, where loop unrolling is the only architecture that can achieve even a slight speedup, and only with significantly increased area. In non-feedback modes, sub-pipelining can achieve the maximum speedup and the best speed/area ratio. Algorithmic optimization exploits algorithmic strengths inside each round unit. Various methods to reduce the critical path and area of each round unit are exploited. Resource sharing between the encryptor and the decryptor becomes important when both need to be implemented in a small area.
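As one small example of optimization inside a round unit, the sketch below computes the AES MixColumns transformation for a single column using only XORs and the `xtime` operation (multiplication by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, as defined in the AES standard). The shared term `t` is a common way to reuse XORs across the four output bytes; it is shown as a generic illustration, not as the specific round unit design studied in this work.

```python
def xtime(a):
    """Multiply a GF(2^8) element by x, reducing by the AES polynomial 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a

def gf256_mul(a, b):
    """General GF(2^8) multiply built from xtime (shift-and-add)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def mix_single_column(col):
    """MixColumns on one 4-byte column, sharing the XOR of all four bytes."""
    a0, a1, a2, a3 = col
    t = a0 ^ a1 ^ a2 ^ a3
    return [
        a0 ^ t ^ xtime(a0 ^ a1),   # 2*a0 + 3*a1 + a2 + a3
        a1 ^ t ^ xtime(a1 ^ a2),   # a0 + 2*a1 + 3*a2 + a3
        a2 ^ t ^ xtime(a2 ^ a3),   # a0 + a1 + 2*a2 + 3*a3
        a3 ^ t ^ xtime(a3 ^ a0),   # 3*a0 + a1 + a2 + 2*a3
    ]
```

The multiplication example from the AES standard, {57} times {83} = {c1}, can be reproduced with `gf256_mul(0x57, 0x83)`.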
Ultra Wideband Systems
In [24], the power spectrum of fast-frequency hopping (FFH) multi-carrier (MC) ultra wideband (UWB) communication systems was analyzed in detail and compared with that of single carrier (SC) UWB systems. According to the analysis, the MB UWB system is easier to fit into the power spectrum mask for UWB systems defined by the FCC and thus achieves a higher transmitting efficiency. The analysis also makes it possible to design the parameters of a practical MC-UWB system. In [25] and [26], a novel algorithm, the SD-KB algorithm, was proposed that combines the sphere decoding and K-best algorithms to solve the maximum likelihood detection (MLD) problem suboptimally. The new algorithm dramatically reduces the computational complexity compared with the sphere decoding algorithm in the low SNR range. Specifically, this new algorithm was applied to the simultaneously operating piconets (SOP) problem in multiband OFDM (MB-OFDM) systems. Compared with the baseline FFT and equalization detection algorithm, the new algorithm can enhance the performance of the system in SOP by up to 4 dB. In [26], the algorithm was further improved by dynamically choosing the division points between the sphere decoding and K-best parts of the SD-KB algorithm. In [27], a new pulsed-OFDM (P-OFDM) system was proposed for UWB communication based on the baseline MB-OFDM system. The new P-OFDM system can achieve similar or better performance than the MB-OFDM system by balancing coding gain and diversity gain. Meanwhile, it has a lower implementation complexity, using a smaller FFT/IFFT processor, up-sampling, and time-multiplexing of the FFT processor.
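The K-best component referred to above can be sketched as a breadth-first tree search. The example below assumes a real-valued model y = R s + n with R upper triangular (for instance after a QR decomposition), detects layers from last to first, and keeps only the K candidates with the smallest partial Euclidean distance at each layer. This illustrates the generic K-best idea only; the sphere decoding part, the division point selection, and the MB-OFDM system model of [25] and [26] are not reproduced, and all names are hypothetical.

```python
def k_best_detect(r_matrix, y, symbols, k):
    """Generic K-best detection for y = R s + n with upper-triangular R.

    Returns (best symbol vector, its squared Euclidean distance).
    """
    n = len(y)
    survivors = [(0.0, [])]          # (partial distance, decisions for layers i+1..n-1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for dist, tail in survivors:
            # Interference from the layers that are already decided.
            interference = sum(r_matrix[i][j] * tail[j - i - 1]
                               for j in range(i + 1, n))
            for s in symbols:
                err = y[i] - r_matrix[i][i] * s - interference
                candidates.append((dist + err * err, [s] + tail))
        candidates.sort(key=lambda item: item[0])
        survivors = candidates[:k]   # keep only the K best partial paths
    best_dist, best_symbols = survivors[0]
    return best_symbols, best_dist
```

When K is at least the total number of candidate vectors, no path is ever pruned and the search returns the maximum likelihood solution; smaller K lowers the work per layer at the risk of discarding the true path.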
