Among the various tasks performed by software radios is the reconfiguration of the error control coding algorithm to match the requirements of the radio personality. In the digital radio processor, proper assignment of tasks between DSPs and FPGAs provides performance improvements over the use of DSPs alone. Error control coding functions are good candidates to reside on the FPGA side of this functional partition. Unfortunately, good VLSI designs for BCH or Reed-Solomon codes do not map well to FPGAs. Good FPGA designs must parallelize at every opportunity, minimize timing delays through intelligent floor planning, and use each logic block to its fullest. We demonstrate the merits of these concepts by comparing the performance of popular finite field multiplier designs.
the regional standard. This allows regional and global roaming (especially in the United States), and reduces the cost of introducing new technology over existing infrastructure. For example, it allows mobile users to maintain serviceability in all regions during uneven Global System for Mobile Communications (GSM) to Universal Mobile Telecommunications System (UMTS) conversion [1]. When the radio personality may be totally redefined in software, we have a software defined radio [2]. Base station flexibility provides an economically attractive solution to infrastructure evolution. With a common radio frequency (RF) front-end for multiple channels and inexpensive digital hardware for each channel, the ability to use software reprogrammability for upgrades or performance improvements without physically removing hardware offers a tremendous economic advantage. In both the handset and base station cases, the ability to assume multiple personalities via a software change overcomes many of the economic hurdles of standards evolution using high-performance point solution hardware.
The canonical software defined radio consists of a power supply, an antenna, a multiband/broadband RF front-end, and a digital radio processor [3, 4], as illustrated in Fig. 1. The digital radio processor, which consists of a wideband analog-digital-analog (A/D/A) converter and field programmable gate array (FPGA) and/or digital signal processing (DSP) hardware, performs the radio functions and interfaces required by the radio personality (e.g., modulation/demodulation, synchronization, equalization, channel coding, source coding). In practice, the digital radio processor consists of a wideband A/D converter to digitize the entire service band at intermediate frequency (IF), programmable digital filters for channel isolation, and a DSP together with a reconfigurable FPGA [2]. The radio functions associated with the radio personality are controlled through software. In this way, the desired flexibility is realized by creating new software which redefines the parameters and functions of the digital radio processor.
The United States Air Force has approved this article for public release.
The key feature is the complete software-reconfigurability of the digital radio processor. This reconfigurability may be achieved either through software (different DSP subroutines) or electronically reconfigurable hardware (FPGA configuration files) [2]. The trade-offs between DSPs and FPGAs are fluid and technology-dependent. Many radio functions are best performed by DSPs, while others are best performed by FPGAs. A judicious combination of FPGAs and DSPs, each programmable in its own way, offers a potent hardware solution for the digital radio processor, as illustrated in Fig. 1. FPGAs have been successfully applied to several traditional DSP applications such as beamforming [5] and filtering [6].
This article explores the applicability of FPGAs to the error control coding chore required by the digital radio processor. Both convolutional and block codes are popular in the various wireless standards. Block codes, such as Bose, Chaudhuri and Hocquenghem (BCH) and Reed-Solomon codes [7], require a decoder capable of performing arithmetic operations in finite fields. A comparison between application-specific integrated circuit (ASIC), FPGA, and DSP implementations of the decoders shows that the performance of FPGA-based designs leans more toward that of ASICs but retains flexibility more like that of DSPs [8]. This illustrates an important example of a high-performance partitioning between the DSP and FPGA in the digital radio processor.
Experience with FPGA-based finite field arithmetic shows that design procedures that work well in ASIC layouts do not always produce desirable FPGA realizations. We explore these differences and suggest some general rules that produce efficient FPGA designs.
Possible Technology Solutions
The basic building block of an ASIC is the transistor. This leads to designs that realize the absolute minimum required area for the application. Additionally, this transistor basis also eliminates any unnecessary data path and makes ASICs the fastest possible solution as well. The major ASIC disadvantage is inflexibility. Once implemented, ASICs are very difficult to modify. Thus, if a designer wishes to support multiple error control applications through ASICs, he or she must include a number of different designs. This is inefficient because it results in an unacceptable amount of area allotted to idle ASIC devices. Some ASICs achieve a limited amount of flexibility through parameterization (e.g., the ASIC performs the encoding/decoding functions required by the family of length 255 Reed-Solomon codes). Nevertheless, each added parameter adds both area and time delay to the ASIC device. Furthermore, parameterization limits the applicability of the radio only to those future standards incorporating error control using a member of the code family. Thus, one cannot significantly incorporate flexibility in an ASIC design without decreasing the ASIC's advantages of speed and area [6].
In contrast to ASICs, DSPs provide a great deal of flexibility to the error control standard designer, and examples abound of error control applications implemented in DSPs. The chief advantage of this approach is that the DSP can easily change from one error control standard to another through a change in programming. Unfortunately, the DSP pays for this flexibility with additional data paths and logic that are underutilized or inefficient for the specific task at hand. Memory accesses, instruction fetches, complex addressing schemes, and muxable data paths bring flexibility to the DSP, but add cost in terms of area and time. As a result, a DSP implementation of a given error control standard is much slower, takes more area, and draws more power than a corresponding ASIC implementation [6, 8].
Figure 2. A simplified block diagram of a matrix-based FPGA.
Between the two extremes of ASICs and DSPs reside FPGAs. Through programming, FPGAs approach the flexibility of DSPs. Nevertheless, their unique architecture gives them speed and area performance leaning more toward the ASIC side of the performance spectrum. Currently, a host of FPGAs in a variety of designs exist. One of the most popular designs, however, is a matrix of logic blocks (LBs) arranged in columns and rows. These blocks are interconnected by sets of wires running horizontally between rows and vertically between columns. Collectively called interconnect, these wires allow for local, regional, and global communication across the FPGA device. Switch boxes embedded in the interconnect allow for maximum flexibility in routing signals between logic blocks. Figure 2 illustrates a simplified block diagram of a matrix-based FPGA.
The logic blocks provide the FPGA's computational power. These blocks normally consist of two to eight lookup tables (LUTs), which form the core of the FPGA's logic processing capability. Each LUT normally has between four and six inputs, and one output. This output can be routed directly to another logic unit or into a register. The combination of an n-input LUT and an associated register is often treated as the basic architectural unit of the logic block. Figure 3 shows a stylized block diagram of this LUT-register pair.
The four-input LUT illustrated in Fig. 3 can realize any four-input logic function. In general, an n-input LUT is guaranteed to realize any logic function of n Boolean variables. This result is routed immediately outside the LB for combinatorial logic or to a register to effect sequential designs. Connecting either output to the LUT inputs of other LBs realizes even larger logic functions.
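The claim that an n-input LUT can realize any function of n Boolean variables can be made concrete with a small software model. The sketch below is only an illustration, not a description of any vendor's LUT: the names make_lut, lut_eval, and example_func are hypothetical, and the LUT is modeled simply as a 2^n-entry truth table addressed by its input bits.

```python
from itertools import product

def make_lut(func, n):
    """Return the 2**n-entry truth table an n-input LUT would hold for func."""
    # product() enumerates inputs in counting order, first argument as MSB.
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n)]

def lut_eval(table, *bits):
    """Read the LUT: the input bits, first argument as MSB, form the address."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | (b & 1)
    return table[addr]

def example_func(a, b, c, d):
    """Hypothetical 4-input function: (a AND b) XOR (c AND d)."""
    return (a & b) ^ (c & d)

# "Configuring" the LUT amounts to storing the 16 truth-table bits.
lut4 = make_lut(example_func, 4)
print(lut4)
print(lut_eval(lut4, 1, 1, 0, 0))
```

Because the table has one stored bit per input combination, any of the 2^16 possible four-input functions fits in the same structure; only the stored contents change.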
The contents of each LUT as well as the interconnect and switch box configurations that control routing are specified by the contents of a configuration file. This file is normally generated by a set of programs provided by the FPGA manufacturer. These programs take a textual description of the intended circuit, called a net-list, and convert it to the appropriate data to implement the circuit on the FPGA. This data consists of both commands to the switch boxes and muxes as well as the contents of each LUT involved in the design. Changing the configuration file effectively changes the circuit implemented on the FPGA. A small detractor is that the loading of a new design, called reconfiguration, is not without cost. It can take hundreds of milliseconds to reconfigure an FPGA with a new design. This reconfiguration time can be reduced through buffering, partial reconfiguration, and other strategies, but designers must take it into account. Overall, however, the flexibility associated with the configuration file, and its control of the interconnect and LUTs, usually outweighs the cost of reconfiguration.
Over the past 15 years, FPGAs have enjoyed phenomenal growth in both size and speed. Figure 4 shows the growth of the Xilinx 4K FPGA family [9].
Today's chips realize 20,000 LUTs with the approximate processing power of 250,000 gates. This is more than sufficient to implement even the most demanding error control application. Already FPGA designs exist that implement codecs for the length 255 Reed-Solomon codes. Our own in-house work has successfully implemented a Reed-Solomon decoder for the RS(15,9) code used in military radios. This application used 2800 of the 4600 LUTs available on a Xilinx XC4062 chip and ran at a rate of 30 Msymbols/s. These size and performance numbers are extremely competitive with current ASIC and DSP performance standards reported in the literature today [10]. Thus, we know that FPGAs can implement error control circuits. Nevertheless, little is known about how to optimize these circuits to get the best performance.
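The figures quoted for the RS(15,9) decoder can be unpacked with a few lines of arithmetic. The sketch below is an illustration only: the function name rs_parameters is hypothetical, and it relies on the standard properties of a Reed-Solomon code of length n = 2^m - 1 (m-bit symbols, n - k parity symbols, and correction of up to (n - k)/2 symbol errors). The LUT counts are the ones stated in the text.

```python
def rs_parameters(n, k):
    """Basic parameters of an RS(n, k) code whose length is n = 2**m - 1."""
    m = (n + 1).bit_length() - 1          # bits per symbol, field is GF(2**m)
    if n != 2 ** m - 1 or not 0 < k < n:
        raise ValueError("expected n = 2**m - 1 and 0 < k < n")
    parity = n - k
    return {
        "bits_per_symbol": m,
        "parity_symbols": parity,
        "correctable_symbol_errors": parity // 2,
        "code_rate": k / n,
    }

params = rs_parameters(15, 9)

# Resource figures quoted in the text for the XC4062 implementation.
luts_used, luts_available = 2800, 4600
utilization = luts_used / luts_available

print(params)
print(f"LUT utilization: {utilization:.1%}")
```

Under these assumptions, the RS(15,9) code works on 4-bit symbols from GF(16), which is why the multiplier comparison later in the article concentrates on GF(16).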
The key advantage of FPGAs over ASICs and DSPs is their unique combination of performance and flexibility. Research has shown that FPGAs can significantly outperform DSPs in terms of speed, size, and power requirements in certain applications [5, 8, 11, 12]. In addition, FPGAs also enjoy flexibility far in excess of ASICs. Figure 5 graphically illustrates the merits of FPGAs relative to ASICs and DSPs.
We see from this chart that FPGAs capture in large measure the best of both worlds. They possess DSP-like flexibility far in excess of ASICs, but they also achieve ASIC-like speed, area, and power performance, far outperforming DSPs for certain applications.
Finite Field Devices in FPGAs
Decoders for currently popular block codes such as BCH and Reed-Solomon codes perform arithmetic operations in finite fields. GF(2^m) is the finite field or Galois field containing 2^m elements, each of which is an m-dimensional vector over the binary field. Thus, each element can be represented as a polynomial with binary coefficients. Addition is easy: an exclusive-OR operation coefficient by coefficient. Multiplication, on the other hand, is much more difficult. It is most convenient to think of the product of two finite field elements as the product of the polynomials which represent them, reduced modulo a degree-m primitive polynomial which generates the field. Two realizations of this concept are illustrated for the case of GF(16) in Figs. 6 and 7. In Fig. 6 the direct application of this concept is shown in the form of a linear feedback shift register (LFSR) [13]. The alternative is to express each of the four coefficients in the product as a Boolean function of the eight input coefficients, as shown in Fig. 7. Here the four Boolean expressions are represented as a matrix [...] this design is a gate count reduction relative to the first two. The Wang multiplier uses an alternative polynomial representation to produce a combinatorial realization that is more regular than the Mastrovito and Paar-Rosner multipliers [17]. Table 1 compares the relative merits of the above implementations in the Xilinx XC4062 FPGA.
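The two views of GF(16) multiplication described above, a bit-serial shift-and-reduce process and a parallel matrix of XOR/AND terms, can be sketched in software. This is a minimal illustration rather than a model of any of the cited designs: the function names are hypothetical, field elements are packed into 4-bit integers (bit i is the coefficient of x^i), and the primitive polynomial x^4 + x + 1 is an assumption, since the text does not state which polynomial the figures use.

```python
M = 4
PRIM_POLY = 0b10011   # x^4 + x + 1 (assumed; the text does not name the polynomial)

def gf16_add(a, b):
    """Addition is a coefficient-by-coefficient exclusive-OR."""
    return a ^ b

def gf16_mul_serial(a, b):
    """Bit-serial multiply: one bit of b per step, in the spirit of an LFSR."""
    acc = 0
    for i in range(M - 1, -1, -1):      # take the bits of b MSB first
        acc <<= 1                       # multiply the accumulator by x
        if acc & (1 << M):              # an x^4 term appeared: reduce it
            acc ^= PRIM_POLY
        if (b >> i) & 1:
            acc ^= a                    # add a when the current bit of b is 1
    return acc

def gf16_mul_matrix(a, b):
    """Parallel view: the product is a GF(2) matrix (built from a) times b."""
    columns = []
    col = a
    for _ in range(M):                  # column j holds a * x^j mod PRIM_POLY
        columns.append(col)
        col <<= 1
        if col & (1 << M):
            col ^= PRIM_POLY
    product = 0
    for j in range(M):                  # XOR together the columns selected by b
        if (b >> j) & 1:
            product ^= columns[j]
    return product

print(gf16_mul_serial(0b0010, 0b1000))  # x * x^3 = x^4, reduced to x + 1
```

The serial form needs m steps per product, while in the matrix form each output bit is an XOR of AND terms of the eight input bits and can be computed at once, which is the distinction the multiplier comparison turns on.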
In the table, the time column represents the total time necessary to compute the product of two GF(16) elements. The last two columns represent the area-time product (in LB-ns) and the area-time-squared product (in LB-ns^2), respectively. Ideally, we would like our finite field multipliers to take zero time and use zero resources. Thus, the smaller numbers in the last two columns indicate better performance taking both speed and area into account. The last column squares the time value to place special emphasis on speed. For comparison, we include in the last row of Table 1 a direct, two-level combinatorial realization of the matrix equation.
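The two figures of merit are simple products, and a short sketch shows how squaring the time can change a ranking. The design names and numbers below are hypothetical and chosen only for illustration; they are not values from Table 1.

```python
def area_time(area_lb, time_ns):
    """Area-time product in LB-ns (smaller is better)."""
    return area_lb * time_ns

def area_time_squared(area_lb, time_ns):
    """Area-time-squared product in LB-ns^2, which weights speed more heavily."""
    return area_lb * time_ns ** 2

# Hypothetical designs: (logic blocks used, total time in ns). Not Table 1 data.
designs = {
    "small_but_slow": (10, 30),
    "larger_but_fast": (20, 20),
}

best_by_at = min(designs, key=lambda name: area_time(*designs[name]))
best_by_at2 = min(designs, key=lambda name: area_time_squared(*designs[name]))

for name, (area, time) in designs.items():
    print(name, area_time(area, time), area_time_squared(area, time))
print("best AT:", best_by_at, "| best AT^2:", best_by_at2)
```

With these made-up numbers the smaller design wins on area-time, while the faster design wins once time is squared, which is the emphasis on speed that the last column is meant to capture.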
In general, designs optimized for VLSI performed worse than the two-level combinatorial design. This is explained as follows. Each VLSI design possessed features that hindered its performance on an FPGA. The LFSR design, although small and fast, accepted data only one bit at a time. This had the immediate effect of increasing the number of clock cycles and, hence, the time required to produce a result. Mastrovito's design consisted of four levels of logic due to the exclusive use of two-input logic gates. This number of logic levels translated into FPGA delays as signals flowed from LB to LB. This, in turn, decreased the multiplier clock rate. The Hasan-Bhargava design used a number of small computation units and delay lines. Each delay and computation unit required an LB to implement, but underutilized the block's logic processing capability. As a result, this multiplier used more LBs than were necessary to solve the problem. The Paar-Rosner multiplier suffered from the same detriment as Mastrovito's, requiring multiple layers of unclocked logic. This increased wire delay through the FPGA structure and decreased the clock rate. Wang's multiplier required a number of shift registers and barrel shifts to work properly. As a result, Wang's multiplier suffered from the same disadvantages as the Hasan-Bhargava multiplier. Specifically, Wang's design used more LBs than necessary to solve the problem due to LB underutilization. In contrast, the direct two-level combinatorial design suffered none of these flaws.
As we analyzed the performance of these multipliers, three general trends emerged:
- Parallelism results in faster performance.
- Levels of unclocked logic must be minimized to achieve the fastest speed.
- The use of each available logic unit should be optimized.
These observed trends echo those reported by others [6], and optimal designs for FPGAs must keep the above trends in mind.
Incorporating the above trends, we modified the LFSR, Mastrovito, Paar-Rosner, and Wang multipliers. These results are summarized in Table 2.
A quick comparison between tables shows that incorporating our general trends into a design can substantially increase the performance of the device. There is a caveat. We note that the modified Mastrovito multiplier did not outperform the standard Mastrovito design in terms of the area-time product. This result emphasizes one of the key trade-offs we observed in implementing finite field multipliers in FPGAs. In order to increase the clock rate of the Mastrovito design, we were forced to add LBs. Our approach was to ensure that no signal traveled more than one LB before being registered, and thus maximized the clock rate. Unfortunately, the added LBs served as mere registers or very simple gates. Thus, the sum total of LBs in this Mastrovito-II multiplier represents more processing resources than are strictly needed to solve the problem. The key trade-off in FPGA design is, thus, that the quest to maximize clock rate often adds resources; and conversely, the quest to minimize resources often decreases clock rate. The optimal design must balance these two variables.
Conclusions and Future Work
As a result of our investigations, we conclude that FPGAs are attractive platforms for implementing error correction applications where the exact application must change over time. Through our own work and that of others, we know that FPGAs possess the processing capability and speed necessary to implement competitive error control standards. Nevertheless, we have shown that performance in FPGAs is dependent on design. Optimal designs in FPGAs cannot be realized by merely porting designs optimized for VLSI. Careful consideration must be given to the underlying architecture and capabilities of the target FPGA: any optimal FPGA design must be parallel in nature, optimized in use of each logic block, and limited in levels of unclocked logic. This reasoning extends to other devices hosted on FPGAs as well. As GF(16) multipliers are relatively small, these observations may be even more pertinent to other logic circuit implementations. Future work needs to concentrate on general rules for mapping algorithms to FPGAs that take into account the FPGA architecture and granularity. In addition, FPGAs suffer from a current lack of mature and robust computer-aided design (CAD) tools. Most FPGA designs are currently done at the hardware level using a hardware description language such as VHDL. Algorithm designers who use high-level languages such as C, Fortran, or signal processing packages do not think in terms of VHDL.
This mismatch must be addressed if FPGAs are to be incorporated into the signal processing functions required by software radios. Ultimately, we hope to lay the foundation for incorporating FPGAs as the preferred implementation media for error correction in software radios.
