Abstract - Encoders based on generator polynomials and linear-feedback shift registers (LFSRs) are key parts of the communication technologies used in most of today's integrated and field systems. This paper presents a detailed comparison of three ways of implementing configurable encoders arranged in a PENCA array and implemented in Xilinx and Altera FPGAs.
II. MOTIVATION
Field programmable gate arrays (FPGAs) have enjoyed continuous improvement in all their metrics since the first commercial release in 1985, thanks to technology scaling, new micro- and nano-circuits, and architectural advances. Implementing FEC (Forward Error Correction) algorithms and LFSRs in FPGAs is a standard approach. PENCA and its architectural advantages enhance it further, creating a programmable, universal and dependable platform for baseband applications, and not only in communication systems. Since both leading FPGA platforms are used in our project, the question is: which implementation is optimal on Xilinx and Altera FPGA hardware with respect to resource usage and to the system latency caused by updates of the encoder itself?
III. PENCA ARCHITECTURE AND ENCODERS
The PENCA architecture is based on an array of runtime-programmable LFSRs (Figure 1). It supports a programmable mixture of independent data sources, combines them flexibly into the desired areas of the final virtual pool of bits, and adds special data or parity bits by supporting many block codes and error detection or correction algorithms. Each channel can perform fully programmable polynomial multiplication and division. Together with the programmable counter that belongs to each programmable LFSR, each PENCA channel performs general data encoding as well as selected testing tasks. Very low system latency, testability and dependability are key in dependable industrial systems.
For the implementation of forward error correction, we can distinguish a few basic schemes and implement or use Hamming [6], Hsiao [7], Reed-Solomon [8], BCH [9], extended [10, 11], and other codes [12, 13, 14]. Hence the test, configuration, error detection and correction, and design of such systems in general are extremely complex tasks. In addition, the newly developed system requires a very high level of flexibility that standard architectures and implementations do not offer. PENCA is also used for on-line testing of the whole baseband system chain: because it consists of many independent programmable paths and LFSRs, it can issue tests and process data used for internal test procedures even while primarily performing the BCH encoding task. PENCA details can be found in [4, 5].
The encoder used in the industrial system supports all Hamming and BCH codes with packet lengths from 7 up to 1023 data bits and with an error detection and correction capability of up to 8 bits, as shown in Table I. This means that the generator polynomial can be of order up to 80, hence 80 DFFs must be reserved for each unit in the FPGA.
The code rate is k/n: for every k bits of useful payload, the encoder generates n bits in total, of which n-k are parity bits. Shortening of all the supported block codes is available as well.
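The relation between the supported code range and the polynomial order of 80 can be illustrated with a short sketch. It assumes that the quoted packet lengths correspond to primitive binary BCH block lengths n = 2^m - 1 and uses the standard bound that such a code correcting t errors needs at most m*t parity bits; the function names are hypothetical and not part of PENCA.

```python
def bch_block_length(m: int) -> int:
    """Block length n = 2^m - 1 of a primitive binary BCH code (assumed here)."""
    return 2 ** m - 1


def max_parity_bits(m: int, t: int) -> int:
    """Upper bound m*t on the parity bits n-k of a t-error-correcting BCH code."""
    return m * t


def code_rate(n: int, k: int) -> float:
    """Code rate k/n: payload bits divided by total transmitted bits."""
    return k / n


if __name__ == "__main__":
    # Largest case in the supported range: n = 1023 (m = 10), t = 8.
    m, t = 10, 8
    n = bch_block_length(m)
    r = max_parity_bits(m, t)   # also the largest generator polynomial order
    k = n - r                   # smallest payload length for this n and t
    print(f"n={n}, parity<={r}, k>={k}, rate>={code_rate(n, k):.4f}")

    # Smallest case: the Hamming(7,4) code, m = 3, t = 1.
    print(f"Hamming(7,4) rate = {code_rate(7, 4):.4f}")
```

For m = 10 and t = 8 the bound evaluates to 80 parity bits, which agrees with the 80 DFFs reserved per unit.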
IV. PENCA ARRAYS AND CONFIGURABLE LFSRS: THE THREE WAYS OF IMPLEMENTATION
The idea of a fully programmable architecture is not completely new. Many patents and papers have already been published, see e.g. [15, 16, 17]. Area-efficient encoders are typically based on linear-feedback shift registers. The PENCA architecture (used in the encoder as well as in the decoder) is based on an array of configurable encoders, where each configurable LFSR and counter is configured to perform the desired data encoding algorithm and function. In the following experiments, all units contain the same circuit of the selected type of configurable LFSR. To make the comparison as fair as possible, all PENCA units also have the same parameters. This means that the honeycomb architecture introduced in [4], which also utilizes neighbouring units, was not used in our experiments. The PENCA array is prepared for 255 units, using 8 bits for the address buses. The configuration and working data paths are separated, hence one unit can perform the desired task while another unit is being reconfigured at the same time, even using completely different clock frequencies.
In general, there are the following basic ways of implementing configurable LFSRs in today's FPGAs: a) the classic, conventional way using programmable logic elements and muxes, b) using reconfigurable LUTs in SLICEMs and configuring them directly instead of using auxiliary configuration elements, and c) using partial reconfiguration.
A. Conventional configurable encoders
The classic and widely used way of implementing a configurable encoder (Fig. 3) is based on two paths: a series of configuration storage elements (shift registers) that hold the setting of each configurable element, and the standard LFSR itself, which realizes the desired generator polynomial. In general, the generator polynomial has two basic parameters: its order (the number of parity bits) and its coefficients. The order of the generator polynomial is selected by a multiplexer, which sets the desired length of the series of flip-flops and XOR gates. The hardware must also contain XOR gates and mux selectors for each coefficient of the generator polynomial. All these configuration points must be controlled by the storage elements. In our case, the configuration chain has 97 flip-flops: 10 are dedicated to the payload counter, 7 to the generator polynomial order mux selector, and 80 are control bits, one per coefficient, that control the XOR gates and form the desired generator polynomial. The circuit area overhead in this case comes mainly from the additional configuration chain of DFFs. A programmable XOR gate is created by a single LUT, which is only slightly slower than a single XOR gate. The main delay and performance penalty is caused by the multiplexer that implements the programmable order of the polynomial.
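The behaviour of such a conventionally configured encoder can be sketched in software. The following is a minimal bit-level model, not the PENCA VHDL: it assumes systematic encoding by polynomial division with the payload shifted in most significant bit first, and the names `LfsrConfig`, `chain_bits` and `encode`, as well as the order of the fields in the configuration chain, are assumptions made for illustration. Only the field widths (10, 7 and 80 bits) are taken from the text.

```python
from __future__ import annotations

from dataclasses import dataclass

COUNTER_BITS = 10   # payload counter width (from the text)
ORDER_BITS = 7      # order mux selector width (from the text)
MAX_ORDER = 80      # one control bit per coefficient (from the text)


@dataclass(frozen=True)
class LfsrConfig:
    payload_len: int   # number of payload bits k, loaded into the counter
    order: int         # order r of g(x), i.e. the number of parity bits
    coeffs: int        # bit i holds coefficient g_i, i = 0..r-1 (g_r = 1 implicit)

    def chain_bits(self) -> list[int]:
        """Serialize the configuration; the field order is an assumption."""
        fields = (
            (self.payload_len, COUNTER_BITS),
            (self.order, ORDER_BITS),
            (self.coeffs, MAX_ORDER),
        )
        return [(value >> i) & 1 for value, width in fields for i in range(width)]


def encode(cfg: LfsrConfig, payload: list[int]) -> list[int]:
    """Return payload followed by the remainder of x^r * d(x) divided by g(x)."""
    if not 1 <= cfg.order <= MAX_ORDER:
        raise ValueError("order out of range")
    if len(payload) != cfg.payload_len or cfg.payload_len >= 1 << COUNTER_BITS:
        raise ValueError("payload length does not match the configuration")
    mask = (1 << cfg.order) - 1
    taps = cfg.coeffs & mask
    state = 0
    for bit in payload:
        # Feedback is the incoming bit XORed with the last active flip-flop.
        feedback = bit ^ ((state >> (cfg.order - 1)) & 1)
        state = (state << 1) & mask
        if feedback:
            state ^= taps   # the enabled XOR gates inject the feedback
    parity = [(state >> i) & 1 for i in range(cfg.order - 1, -1, -1)]
    return list(payload) + parity


if __name__ == "__main__":
    # Hamming(7,4) with g(x) = x^3 + x + 1, so g_0 = g_1 = 1 and g_2 = 0.
    hamming = LfsrConfig(payload_len=4, order=3, coeffs=0b011)
    print(len(hamming.chain_bits()), encode(hamming, [1, 0, 0, 0]))
```

In this model the `order` field plays the role of the order multiplexer and `coeffs` the role of the per-coefficient XOR enables; changing the code only requires loading a new 97-bit configuration.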
B. Using RLUTs in SLICEMs
Xilinx FPGAs contain SLICEL and SLICEM configuration units [18]. A SLICEL contains plain LUTs only, while a SLICEM contains reconfigurable LUTs that enable the implementation of distributed memory. SLICEMs can significantly reduce the FPGA resources required to implement shift registers [19]. In the SLICEM circuits (Fig. 3), the content, and therefore the logic function, of these 6-input LUTs is not fixed by the FPGA configuration bitstream; it can also be controlled from the FPGA fabric logic at run time. There are also key architectural changes in the Xilinx Ultrascale FPGA generation [20], especially in the connection of flip-flops to the outputs of LUTs. This SLICEM approach has some architectural advantages, is sometimes discussed under the name RLUT (Reconfigurable LUT), and can result in much more efficient utilization of the configurable logic block resources, as shown in Figure 4. Nevertheless, it is nearly impossible to find any reference on this topic with respect to LFSRs. The circuit area overhead, which previously consisted mainly of the additional configuration chain of DFFs, is replaced by a 9-bit wide configuration path plus LUT configuration write-enable decoders (compressed to a single-bit input with clock in the final PENCA block). A programmable XOR gate is created directly by the programmable LUT of the SLICEM. Unfortunately, one additional LUT at the top of each SLICEM is typically left unused because of the data and address write signals shared with the first LUT. In our case, most of these LUTs are utilized as well and form the configuration decoders. The main delay is again caused by the multiplexer that implements the programmable order of the generator polynomial. Since the level and ease of configurability of this circuit are much higher than in the previous case A), it may open the door to configuration faults or unwanted functionality, including hardware Trojans [21].
C. Using partial reconfiguration
FPGAs can change the functionality of selected parts [22].
This is achieved by changing the configuration bitstream locally, i.e. by partial reconfiguration [23], especially in Xilinx FPGAs [24]. Selected physical areas of the FPGA are reserved for the reconfiguration task, and PENCA designs fit well into such areas. However, some FPGA resources are required to perform the reconfiguration task itself. This area overhead may be significant for small designs and must be taken into account. On the other hand, the encoder's LFSR itself does not contain any of the configuration interfaces used before, since the desired function of the encoder is simply given and fixed by the reconfiguration bitstream (Figure 5). This means that a larger number of partial bitstreams, related to the individual block code encoders, is required.
[28]. Some experiments were also performed on the KCU105 development kit [29], which contains the Ultrascale XCKU040-2FFVA1156E FPGA [30] and has clear architectural advantages. The source code was generated in VHDL by our PENCA generator. The classical version of the configurable LFSR requires 91 core DFFs plus 97 DFFs holding the configuration. Even though the BCH codes in Table I do not require all 80 XOR gates to be programmable, the entire LFSR is programmable in all its parts. Hence, the sum of DFFs is 188 per unit, or 2820 in total for 15 units, a typical size and configuration. All 255 PENCA units utilize 93% of the Xilinx FPGA (nearly the full device), while the Altera FPGA is at 54%. The Altera design is clearly slower than the Xilinx one; in particular, the configuration clock can reach up to 1169 MHz in the Xilinx FPGA but only about 600 MHz in the Altera FPGA. Ultrascale required about 31 CLBs per unit. The Altera FPGA uses its resources in a more efficient and predictable way.
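The flip-flop counts quoted for the classical variant can be cross-checked with a few lines of arithmetic. This sketch only reproduces the numbers given in the text; the constant and function names are hypothetical, and the 255-unit figure is a plain extrapolation of the per-unit count rather than a reported measurement.

```python
CORE_DFFS = 91     # core flip-flops per unit (from the text)
CONFIG_DFFS = 97   # configuration chain flip-flops per unit (from the text)


def dffs_per_unit() -> int:
    """Flip-flops needed by one classical configurable LFSR unit."""
    return CORE_DFFS + CONFIG_DFFS


def dffs_total(units: int) -> int:
    """Flip-flops needed by an array of identical units."""
    return units * dffs_per_unit()


if __name__ == "__main__":
    print(dffs_per_unit())    # per-unit sum
    print(dffs_total(15))     # typical 15-unit configuration
    print(dffs_total(255))    # extrapolation to the full 255-unit array
```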
