In future radio systems, flexible coding and decoding architectures will be required. For the latter, implementing architectural flexibility under low-power constraints is a challenging task. The flexible encoding platform in this paper is a first step toward this envisioned decoder. It generates a wide class of codes, starting with convolutional codes. As an extension, turbo codes will be included by adding an interleaver. At this prototyping stage, the system is implemented on an FPGA. The decision to choose the observer canonical form is defended by a thorough investigation of its critical path properties. Proper configuration allows code rates of b/c, b = 1 ... 15, c = 2 ... 16, b < c. Power can be saved by shutting down unused system modules.
INTRODUCTION
Wireless personal area networks (WPANs), which provide short-range ad hoc connectivity among different communication devices, will be both multi-standard and multi-rate. A device in this envisioned environment must be flexible enough to establish communication for a wide variety of applications, ranging from low bit-rate sensor networks to high bit-rate WLAN communication, as depicted in Figure 1. Since these devices will be mostly battery-operated, efficient power and energy management is essential. However, there are several orders of magnitude in power efficiency between a flexible software solution and a fully hardware-mapped one. Clearly, there will always be a trade-off between flexibility, processing performance, and power consumption.
Channel coding and decoding algorithms are key components in a communication system and will vary according to the mentioned application variety. Although these algorithms have different properties, there are still common blocks that can be identified. Hence, one has to find an efficient way to reuse hardware for different types of tasks in these algorithms. This project aims at a decoder that can handle a predefined set of algorithms solely by configuring its hardware with suitable parameters. These parameters include, for example, precision, speed, and type of code. As a result, the degree of flexibility achieved could be measured by the limits in performance and power consumption, indicating a range of feasible solutions. This paper presents an FPGA-based flexible convolutional encoder which supports power saving by shutting down unused parts through specific enable logic. Care has been taken that the design can be easily upgraded for future purposes: for example, by adding an interleaver block, this design turns into a turbo encoder.
In Section 2, several possible architectures are introduced with a careful look at critical path properties. Section 3 presents the system specification, and implementation aspects are addressed in Section 4. Synthesis and performance results are then outlined in Section 5. The paper ends with a conclusion and a look at future research tasks.
ARCHITECTURAL ISSUES
Figure 3 visualizes $T_{\min}$ as a function of the respective coding polynomials, that is, $T_{\min}(f, q)$. According to (4) and (3), the x- and y-axes show the indices of the feedback and feedforward polynomials resulting from the maximum and minimum functions, respectively. The minimum clock period $T_{\min}$ is then drawn on the z-axis. By looking at these graphs it is obvious that this architecture is very sensitive to setting coefficient $f_0$ to 1. In this case, the complete feedforward path contributes to the critical path. Hence, the longest path occurs when $f_0$ and either $q_{m-1}$ or $q_m$ are set to 1, resulting in a minimum clock period of 2m unit times. Reordering the cells in Figure 2 according to the law of associativity to form the architecture depicted in Figure 4

(7)
Since addition and subtraction are equivalent operations in $\mathbb{F}_2$, the realization of (7) is represented in Figure 7 and is called the observer canonical form. In this case, the delay elements no longer form a shift register since they are separated by modulo-2 adders.
Contrary to the preceding canonical forms, the worst-case critical path of the observer canonical form does not depend on the coding polynomial, since this path, given a recursive function, always consists of two XOR-cells in series. However, the drawback is that a single XOR-cell in this architecture will have a larger propagation delay since the operation has to be performed on two logic levels. If the propagation delay of this 3-input cell is two times the delay of the mentioned basic XOR-cell, the minimum clock period will always be 4 unit times.
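The structure described above can be illustrated with a minimal bit-level sketch of an encoder in observer canonical form. This is only a behavioral model under stated assumptions, not the implementation of this paper: the function name `encode_ocf` and the coefficient lists `num` (feedforward) and `den` (feedback) are hypothetical names chosen for illustration, and the feedback polynomial is assumed to have a constant term of 1. Each state update is a 3-input XOR of the previous stage, the switched input bit, and the switched output bit, so the delay elements are separated by modulo-2 adders.

```python
def encode_ocf(bits, num, den):
    """Encode bits with the rational function num(D)/den(D) over GF(2).

    num, den: coefficient lists [c0, c1, ..., cm] (hypothetical naming).
    den[0] must be 1; den = [1, 0, ..., 0] yields a non-recursive code.
    """
    if len(num) != len(den) or den[0] != 1:
        raise ValueError("num and den need equal length and den[0] == 1")
    m = len(num) - 1          # encoder memory
    state = [0] * m           # state[0] is the register next to the output
    out = []
    for u in bits:
        # Output: one basic XOR of the first register and the switched input.
        y = (num[0] & u) ^ (state[0] if m else 0)
        nxt = []
        for i in range(1, m + 1):
            prev = state[i] if i < m else 0   # last cell has no predecessor
            # 3-input XOR: previous stage, feedforward term, feedback term.
            nxt.append(prev ^ (num[i] & u) ^ (den[i] & y))
        state = nxt
        out.append(y)
    return out


# Impulse response of (1 + D^2) / (1 + D + D^2), a common recursive example.
print(encode_ocf([1, 0, 0, 0, 0, 0], [1, 0, 1], [1, 1, 1]))  # [1, 1, 1, 0, 1, 1]
```

Whatever coefficients are chosen, the combinational path in this model never grows beyond the output XOR followed by one 3-input XOR, which mirrors the polynomial-independent critical path argued above.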
Concluding these considerations, it is clear that the observer canonical form guarantees a critical path that is independent of 
SYSTEM SPECIFICATION
In order to cover a wide range of convolutional codes, a certain number of polynomials should be generated simultaneously. The memory of a single encoder will be 10, and we have chosen up to 16 data streams that can be emitted in parallel or, by proper configuration, can be joined to form code rates of 1/c, c = 2 ... 16.
Besides, using more than one input stream broadens the range to b/c, b = 1 ... 15, b < c. Another question is whether to include recursive convolutional codes or not. In [3] it is shown that these codes give better performance at rate 1/2 and low $E_b/N_0$ compared to non-recursive ones, so it makes sense to realize these functions as well. The use of recursive codes can also be motivated by the fact that they are a basic part of turbo codes [4], which will be realized on this platform in the future by adding an interleaver. Since turbo codes use less memory, usually m = 2 ... 4, it becomes obvious that the whole model has to be fully parameterizable.
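As a small illustration of the rate range stated above, the following sketch enumerates all (b, c) configurations with b = 1 ... 15, c = 2 ... 16 and b < c. The function name `code_rates` is a hypothetical helper, and the sketch only counts configurations; it says nothing about which polynomials are good codes.

```python
from fractions import Fraction


def code_rates(max_outputs=16):
    """Return all (b, c) pairs with 1 <= b < c <= max_outputs."""
    return [(b, c) for c in range(2, max_outputs + 1) for b in range(1, c)]


pairs = code_rates()
distinct = {Fraction(b, c) for b, c in pairs}
print(len(pairs))     # number of (b, c) configurations
print(len(distinct))  # number of distinct rate values after reduction
```

Several configurations reduce to the same rate value (for example 1/2 and 2/4), so the number of distinct rates is smaller than the number of configurations.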
IMPLEMENTATION
The choice of a proper implementation platform for this initial project stage is motivated by different reasons. When pursuing low power, it is advisable to develop dedicated hardware, since this is the approach with the highest efficiency per computation. On the other hand, the design cycle for an FPGA is usually much shorter than for an ASIC, where chip fabrication time has to be taken into account. Thus, an FPGA is better suited for prototyping because functional behavior can be verified much faster. Since this aspect is more important in this initial stage, the system is realized on a XILINX Spartan-II FPGA [6]. However, modules are still developed with regard to reusability in a future ASIC, suitable for low power solutions. Figure 8 depicts the flexible encoder architecture, which basically consists of 16 parallel encoders of the type shown in Figure 7.
The dashed line connecting the encoders acts as a single bit line used for configuration purposes. Flexibility is incorporated by having total control over the coding polynomials, which requires switches representing these coefficients. A switch uses a register and an AND-gate. Since there are 16 encoders, each having 2m + 1 = 21 coefficients, there will be 336 switches in total.
Considering how to configure these switches, it becomes clear that a serial approach has to be pursued, since it is virtually impossible to efficiently route such a number of registers in this FPGA to the required gates. By using a shift register chain to set up these coefficients, the routing effort is moved to a local level. On this level one can gain a lot by using a so-called Configurable Logic Block (CLB) [6]. A CLB in a Spartan-II device consists of two identical slices, with each slice having two configurable registers, two lookup tables realizing combinational functions, and dedicated carry and control logic.
The gray box in Figure 7 enclosing a register, two switches, and an XOR-gate is the basic building block of the design and is shown in detail in Figure 9. The coefficients of a polynomial are represented by the output of the registers on the left side. A multiplication in $\mathbb{F}_2$ is simply performed by an AND-gate incorporating the respective switch. Again, the dashed line in this block represents the smallest part of the shift register chain. In configuration mode all the coefficient bits ripple through this chain. Since 10 serially connected blocks form a basic encoder, 21 shifts are needed to completely set up one coding polynomial. Then, the bits are shifted to the next block and so on until the desired number of polynomials is established. Notice that on every hierarchy level there is always just one shift input and one shift output, simplifying the whole routing process and increasing modularity.
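The serial configuration described above can be sketched as a plain shift register model. This is a simplified illustration under assumptions: the chain is modeled as one flat list of coefficient registers, the ordering of bits within an encoder is not taken from the paper, and the names `shift_in`, `configure`, and `split_per_encoder` are hypothetical.

```python
M = 10                            # encoder memory, as stated in the text
COEFFS_PER_ENCODER = 2 * M + 1    # 21 coefficient switches per encoder
NUM_ENCODERS = 16
CHAIN_LENGTH = NUM_ENCODERS * COEFFS_PER_ENCODER   # 336 switches in total


def shift_in(chain, bit):
    """One configuration clock: the new bit enters, all others move one place."""
    return [bit] + chain[:-1]


def configure(bitstream, length=CHAIN_LENGTH):
    """Shift a whole bitstream into an initially cleared chain."""
    chain = [0] * length
    for bit in bitstream:
        chain = shift_in(chain, bit)
    return chain


def split_per_encoder(chain):
    """Group the flat chain into one coefficient list per encoder."""
    n = COEFFS_PER_ENCODER
    return [chain[i:i + n] for i in range(0, len(chain), n)]


bitstream = [(i * 7) % 3 == 0 for i in range(CHAIN_LENGTH)]
bitstream = [int(b) for b in bitstream]
chain = configure(bitstream)
encoders = split_per_encoder(chain)
print(len(chain), len(encoders), len(encoders[0]))  # 336 16 21
```

In this model the first bit shifted in ends up at the far end of the chain, so a configuration tool would have to send the coefficients of the last encoder first.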
Clearly, the configuration process can be utilized at the same time to support power saving, since only the programmed blocks should be running. By keeping track of the number of shifts, every block can be enabled separately as shown in Figure 10. An overflow event should occur when a single block is programmed, that is, after 21 clock cycles. The overflow counter then emits a clock enable pulse and the shift register is set to its next state, enabling one block after another. Contrary to an ASIC, it is highly recommended to avoid clock gating in an FPGA [7], since it can introduce glitches and increase clock delay and clock skew due to unfavorable routing. However, since a CLB already has a separate clock enable input, it can be used to explicitly prevent its flip-flops from changing states and thus consuming switching power.
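The bookkeeping behind this enable logic can be sketched as follows. The model is an assumption for illustration only: a counter overflows every 21 configuration shifts, and each overflow shifts a 1 into an enable register so that one more block becomes active. The function name `enable_pattern` and the all-zero initial state are hypothetical, and clock-enable timing details of Figure 10 are not modeled.

```python
def enable_pattern(num_shifts, coeffs_per_block=21, num_blocks=16):
    """Return the per-block enable bits after num_shifts configuration clocks."""
    counter = 0
    enable = [0] * num_blocks
    for _ in range(num_shifts):
        counter += 1
        if counter == coeffs_per_block:    # overflow: one block fully programmed
            counter = 0
            enable = [1] + enable[:-1]     # enable shift register advances
    return enable


print(sum(enable_pattern(20)))    # 0 blocks enabled, first block not complete
print(sum(enable_pattern(42)))    # 2 blocks enabled
print(sum(enable_pattern(336)))   # 16 blocks enabled
```

Blocks whose enable bit stays 0 would keep their clock enable input deasserted, which is how unused encoders avoid switching activity without gating the clock itself.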
RESULTS
After thoroughly looking at architectural issues, there is still the question of how these approaches are mapped into the FPGA. This is important since it basically determines the performance of the system. The whole design is based upon the block in Figure 9, and improvement on this level therefore has a positive effect on the system level. Consequently, a bottom-up design style was applied. According to the reports from the synthesis and routing tools, this logic is mapped into one CLB. As mentioned, the three-input XOR-operation will be executed in two steps. First, the result from the two coefficient paths is evaluated in a lookup table with four inputs, namely the two actual inputs from the feedback and the feedforward path and the two respective switch settings that are stored in registers. The 1-bit output of this table is then merged with the input from the previous stage in a basic XOR-cell which is inherent in a CLB. Finally, this result is saved in a register. The hardware utilization of this building block is therefore three registers, one look-up table, and a basic XOR-cell.
Clearly, this approach efficiently uses the given resources. The maximum estimated clock frequency for this building block is 157 MHz. However, introducing input and output pads that add delay to the design decreases the speed to 110 MHz.
Routing reports of the complete system show that 422 slices out of 1200 are used, which is equivalent to a CLB usage of 35%. Furthermore, 517 flip-flops and 235 4-input look-up tables are utilized in those CLBs, which corresponds to 21% and 9% usage, respectively. These numbers match well with the preceding considerations, where it was shown that a basic building block uses three out of four registers in a CLB and one out of four look-up tables.
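The utilization figures can be checked with a short calculation. The sketch assumes two registers and two look-up tables per slice, as stated in the CLB description earlier, which gives 2400 of each for 1200 slices; the helper name `usage_percent` is hypothetical.

```python
SLICES_TOTAL = 1200
REGS_PER_SLICE = 2     # assumption taken from the CLB description in the text
LUTS_PER_SLICE = 2     # assumption taken from the CLB description in the text


def usage_percent(used, available):
    """Utilization in percent as a floating point number."""
    return 100.0 * used / available


slice_pct = usage_percent(422, SLICES_TOTAL)
ff_pct = usage_percent(517, SLICES_TOTAL * REGS_PER_SLICE)
lut_pct = usage_percent(235, SLICES_TOTAL * LUTS_PER_SLICE)

# Truncating to whole percent reproduces the reported figures.
print(int(slice_pct), int(ff_pct), int(lut_pct))  # 35 21 9
```

Note that the reported values agree with truncation rather than rounding: the unrounded flip-flop and look-up table figures are about 21.5% and 9.8%.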
The speed requirements are supported by applying suitable synthesis constraints, for example, using an attribute called LOC which advises the routing tool to place logic in defined areas. This can increase the design density, which at the same time positively affects the wiring delay. The correctness of the system was verified at all stages of the design process, from functional simulation of the component descriptions to simulation after routing and final testing on the actual FPGA board. The flexible encoder can be safely clocked at up to 50 MHz. However, post-synthesis reports and simulations even verified functional correctness at clock frequencies up to 95 MHz, which could not be tested on the board due to limitations of the clock source.
CONCLUSIONS
This paper presented a flexible convolutional encoder that fully exploits the range of coding polynomials for a given memory. A thorough investigation of critical path properties for different architectures is the basis for this observation. Power saving issues are already supported with regard to implementation in a future flexible decoder architecture. Since the system performance relies mainly on the basic building block shown in Figure 9, optimization effort on this level has a positive effect on the whole design. It is shown that the presented approach efficiently uses the resources provided by the FPGA. When developing future dedicated hardware, all modules can be reused, and an implementation of this basic block in a custom cell will be considered. In order to broaden the application range of the platform, future work will address a flexible interleaver that has to be added to the design to generate turbo codes.
ACKNOWLEDGMENTS
This project is supported by the Swedish Socware program, the EU Pacwoman project, and the Competence Center for Circuit Design (CCCD) at Lund University.
