ABSTRACT: This paper introduces reconfigurable computing (RC) and focuses on one of the prototypes in this field, MorphoSys (M1) [1-5]. It reports the results obtained when mapping digital coding algorithms onto an RC system, in relation to previous research [6-10]. The chosen algorithms are cyclic coding techniques, namely the CCITT CRC-16 and the CRC-16. A performance analysis of the M1 RC system is also presented to evaluate the efficiency of algorithm execution on the M1 system. For comparison purposes, three other systems were used to map the same algorithms, showing the advantages and disadvantages of each compared with the M1 system. The algorithms were run on the 8x8 RC (reconfigurable) array of the M1 (MorphoSys) system; numerical examples were simulated to validate the results using the MorphoSys mULATE program, which simulates MorphoSys operations.
INTRODUCTION
Reconfigurable computing (RC) is becoming more popular, and increasing research efforts are being invested in it [1-10]. It employs reconfigurable hardware together with programmable processors. The user designs the program so that the workload is divided between the general-purpose processor and the reconfigurable device. RC offers higher speed than general-purpose processors and wider functionality than Application Specific Integrated Circuits (ASICs), which makes it a solution for applications requiring both a wide range of functionality and speed [1]. RC systems address the inflexibility of ASICs at one end of the computing spectrum and the inefficiency of General Purpose Processors (GPPs) at the other. Reconfigurable computers offer the potential to greatly accelerate the execution of a wide variety of applications. Their key feature is the ability to perform computations in hardware to increase performance while retaining much of the flexibility of a software solution.
The TinyRISC processor controls, through the DMA controller, the loading of the context words into context memory. These context words define the function and connectivity of the cells in the RC array. The processor also initiates the loading of application data, such as image frames, from main memory into the frame buffer. This is also done through the DMA controller. Once both configuration and application data are ready, the TinyRISC processor instructs the RC array to start execution. The RC array performs the needed operation on the application data and writes the results back to the frame buffer. The RC array then loads new application data from the frame buffer and possibly new configuration data from context memory. Since the frame buffer is divided into two sets, new application data can be loaded into it without interrupting the operation of the RC array. Configuration data is also loaded into context memory without interrupting RC array operation. This overlap of loading and computation allows MorphoSys to achieve high speeds of execution [3].
RECONFIGURABLE DEVICE
As stated earlier, the reconfigurable device in MorphoSys is the RC array, which is divided into four quadrants. It has the design and interconnection shown in Figure 2 [2]. The interconnection network is built on three hierarchical levels. The first is a nearest-neighbor layer that connects the reconfigurable cells (RCs) in a 2-D mesh. The second is an intra-quadrant connection that connects a specific RC to any other RC in its row or column in the same quadrant. The third is an inter-quadrant connection that carries data from any one cell (out of four) in a row (or column) of a quadrant to other cells in an adjacent quadrant but in the same row (or column) [4].
The context words loaded into context memory configure the function of the RCs as well as the interconnection, thus specifying where their input comes from and where their output is written [5]. MorphoSys is designed so that all the cells in the same row perform the same function and have the same connection scheme (in row context broadcast mode), or all the cells in the same column perform the same function and have the same connection scheme (in column context broadcast mode). In either mode, all the cells of a row or of a column share the same configuration [5].
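The two broadcast modes can be pictured with a minimal sketch. The function name `broadcast_contexts`, the string labels used as context words, and the idea of one context word per row or per column are illustrative assumptions, not the actual MorphoSys context format.

```python
ARRAY_SIZE = 8  # the M1 RC array is 8x8


def broadcast_contexts(context_words, mode):
    """Return an 8x8 grid holding the context word each cell would receive.

    context_words: hypothetical list with one entry per row (row mode)
    or one entry per column (column mode).
    """
    if len(context_words) != ARRAY_SIZE:
        raise ValueError("expected one context word per row or column")
    if mode == "row":
        # every cell in row r shares context_words[r]
        return [[context_words[r]] * ARRAY_SIZE for r in range(ARRAY_SIZE)]
    if mode == "column":
        # every cell in column c shares context_words[c]
        return [list(context_words) for _ in range(ARRAY_SIZE)]
    raise ValueError("mode must be 'row' or 'column'")
```

In row mode `grid[r][c]` depends only on `r`; in column mode it depends only on `c`, which is the sharing described above.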
CODING ALGORITHMS UNDER MORPHOSYS
Reconfigurable hardware implementation of digital coding algorithms has been an active area of research [6-10]. Many coding algorithms were mapped onto the M1. Research done to date includes a performance study of coding algorithms (checksum), pipelined implementation of various non-standard linear sequential coding circuits, and other algorithms [8, 10]. The linear sequential circuits considered here are finite state machines with a finite number of inputs and outputs. The inputs, outputs and state transitions occur at discrete intervals of time. The elements used are adders (EX-OR) and delays (D) that delay input words. A sequence of 0s and 1s can be expressed by a polynomial in which the 0s and 1s are coefficients of the powers of a dummy variable. Hence, the sequence 11001 can be written as:
This representation is the basis of the Feed Forward Binary Circuits, which are very useful in coding techniques. A generalized form of these circuits, shown in Figure 3, is represented by the polynomial T(D) (Equation 1).
The link to the formal publication is via https://doi.org/10.1016/S0965-9978(02)00101-1
This circuit is used to code any input stream X and yields a set of outputs Y. The input X is a vector of 0s and 1s, and the output is the coded vector Y, which is the result of multiplying the input polynomial (vector) X by the polynomial represented by T(D). This could be generalized to take the form:
Where, N is the number of stages of the circuit, and X is of the form
with xk being the first bit to enter the multiplier circuit. This paper focuses on the hardware implementation of cyclic redundancy code checkers (CRCCs) with their standard circuits, namely the CRC-16 circuits.
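The multiplication of the input polynomial by T(D) can be sketched as a delay line with EX-OR taps. This is a minimal illustration, assuming bits are listed in order of arrival and `taps[j]` is the coefficient of the j-th delay; the name `feed_forward` and the example polynomials are hypothetical and not taken from Figure 3.

```python
def feed_forward(x_bits, taps):
    """Multiply X(D) by T(D) over GF(2) using a delay line and XOR adders.

    x_bits: input bits in order of arrival.
    taps:   taps[j] is the coefficient of D^j in T(D).
    Returns the output bits in order of departure.
    """
    delays = [0] * (len(taps) - 1)           # contents of the D elements
    flushed = list(x_bits) + [0] * len(delays)  # trailing zeros empty the line
    output = []
    for bit in flushed:
        window = [bit] + delays              # current input plus delayed inputs
        y = 0
        for coeff, value in zip(taps, window):
            y ^= coeff & value               # EX-OR of the tapped positions
        output.append(y)
        delays = window[:-1]                 # shift the delay line by one
    return output
```

For example, `feed_forward([1, 1], [1, 1])` returns `[1, 0, 1]`, which is (1 + D)(1 + D) = 1 + D^2 in modulo-2 arithmetic.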
CYCLIC REDUNDANCY CODES
Redundant encoding is a method of error detection that spreads the information across more bits than the original data. The more redundant bits used, the greater the chance of detecting errors. CRCCs check for differences between transmitted data and the original data. CRCCs are effective for two reasons. Firstly, they provide excellent protection against common errors, such as burst errors where consecutive bits in a data stream are corrupted during transmission. Secondly, systems that use CRCCs are easy to implement [11]. When a CRCC is used to verify a frame of data, the frame is treated as one very large binary number, which is then divided by a generator number. This division produces a remainder, which is transmitted along with the data. At the receiving end, the data is divided by the same generator number and the remainder is compared with the one sent at the end of the data frame. If the two remainders are different, then an error occurred during data transmission. The types of errors that a CRCC detects depend on the generator polynomial. Table 1 shows the most common generator polynomials.
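The divide-and-compare procedure can be sketched with modulo-2 long division on Python integers. This is only an illustration under stated assumptions: the generator 0x11021 encodes x^16 + x^12 + x^5 + 1 (the CCITT polynomial), the frame is read most significant bit first, and the frame is shifted left by 16 bits before division; the function names are hypothetical.

```python
CCITT_GENERATOR = 0x11021  # x^16 + x^12 + x^5 + 1 (assumed generator)


def mod2_remainder(frame: bytes, generator: int = CCITT_GENERATOR) -> int:
    """Remainder of the frame (times x^degree) divided by the generator, modulo 2."""
    degree = generator.bit_length() - 1
    value = int.from_bytes(frame, "big") << degree
    # cancel the leading 1 bits from the top down, as in long division
    for shift in range(value.bit_length() - 1, degree - 1, -1):
        if (value >> shift) & 1:
            value ^= generator << (shift - degree)
    return value


def frame_is_intact(frame: bytes, received_remainder: int) -> bool:
    """Receiver side: recompute the remainder and compare with the one sent."""
    return mod2_remainder(frame) == received_remainder
```

A sender would transmit `frame` followed by `mod2_remainder(frame)`; the receiver calls `frame_is_intact` on what arrives, and a mismatch signals a transmission error.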
CRC Serial Implementation
CRC implementation is usually done with linear-feedback shift registers (LFSRs). Figures 4 and 5 show the CCITT CRC-16 and CRC-16 generators with their serial implementation using LFSR. This serial method works well when the data is available in bitstream form.
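A bit-serial LFSR of this kind can be modelled in a few lines. The sketch below assumes a right-shifting register whose bit 0 is the feedback end and data bits that arrive least significant bit first, with a zero initial state; the figures may use a different orientation. The tap masks 0xA001 and 0x8408 are the bit-reversed forms of x^16 + x^15 + x^2 + 1 and x^16 + x^12 + x^5 + 1, and all names are illustrative.

```python
CRC16_TAPS = 0xA001        # x^16 + x^15 + x^2 + 1, bit-reversed
CCITT_CRC16_TAPS = 0x8408  # x^16 + x^12 + x^5 + 1, bit-reversed


def lfsr_shift(register: int, data_bit: int, taps: int) -> int:
    """Advance the 16-bit LFSR by one clock for one incoming data bit."""
    feedback = (register & 1) ^ data_bit
    register >>= 1
    if feedback:
        register ^= taps
    return register


def lsb_first_bits(data: bytes):
    """Yield the bits of each byte, least significant bit first."""
    for byte in data:
        for i in range(8):
            yield (byte >> i) & 1


def crc_serial(data: bytes, taps: int, register: int = 0) -> int:
    """Feed a byte string through the LFSR one bit per clock."""
    for bit in lsb_first_bits(data):
        register = lfsr_shift(register, bit, taps)
    return register
```

Each call to `lfsr_shift` corresponds to one clock of the serial circuit, so a byte costs eight clocks.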
CRC Parallel Implementation
With the currently available high-speed digital signal processing (DSP) systems, data is processed in byte, word, double word, or larger widths rather than serially. Even with serial telecommunication systems, data is buffered in chips responsible for synchronization and framing. For parallel implementation the data is available in 8-bit frames with manageable speed [11]. A one-channel parallel CRC algorithm with the LFSR approach is obtained by considering the state of the circuit on an 8-shift basis [11-12]. Tables 2-3 show two different implementations of the CCITT CRC-16 and CRC-16, respectively. The term Registeri represents the LFSR internal register numbered "i", while XORj represents the output of the XOR gate number "j", and ⊕ indicates the XOR operation. With the emergence of highly scalable reconfigurable circuits, more implementation capabilities are present. Along with the byte-wise or word-wise CRC implementation, it is possible to implement parallel channels, each with a byte-wise CRC implementation.
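One way to obtain the state "on an 8-shift basis" is to run the serial LFSR symbolically, so that each register ends up as an XOR of named inputs. The sketch below is an assumption-laden illustration rather than the authors' derivation: it uses a right-shifting CCITT register with feedback taps at bits 3, 10 and 15, feeds DataIn0 first, and represents XOR as set symmetric difference; the function names are hypothetical.

```python
CCITT_TAP_BITS = (3, 10, 15)  # assumed tap positions for x^16 + x^12 + x^5 + 1


def derive_byte_update(tap_bits, width=16, shifts=8):
    """Return, for each register, the set of symbols XORed together after `shifts` clocks."""
    # start with Register k holding just its own symbol
    state = [frozenset({f"Register{k}"}) for k in range(width)]
    for i in range(shifts):
        # feedback is the outgoing bit XOR the incoming data bit
        feedback = state[0] ^ frozenset({f"DataIn{i}"})
        state = state[1:] + [frozenset()]      # shift right by one position
        for k in tap_bits:
            state[k] = state[k] ^ feedback      # symbolic XOR at each tap
    return state


def as_equation(index, terms):
    """Format one derived register as a readable XOR equation."""
    return f"Register{index} = " + " xor ".join(sorted(terms))
```

Under these assumptions `derive_byte_update(CCITT_TAP_BITS)[11]` contains only `Register3` and `DataIn3`, that is, the single term XOR3.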
ALGORITHMS MAPPING
From the underlying architecture point of view, the mapping of any algorithm onto the proposed reconfigurable system requires in-depth knowledge of all the available interconnection topologies. Moreover, the designer should take into consideration the possibility of dynamically changing the shape of the interconnection. From the algorithmic point of view, the design of a parallel version of the addressed algorithms requires the best use of resources with the least possible redundancy in computations.
Three sets of data are required to map any algorithm onto the M1 system. The first set specifies the intended shapes of the interconnections that are going to be used. The second set is the manipulated data. The third set is the TinyRISC code that orchestrates the load/save operations, the parallel computations, and the changes of interconnection pattern through the context words.
The CCITT CRC-16 Algorithm Mapping
The mapping of the parallel CCITT CRC-16 algorithm exploits the computations that are repeated in several steps of the algorithm. Firstly, the values of XORi for all values of i from 0 to 7 are calculated. Table 2 shows the computations that are used more than once. In particular, the redundant values are (XORi+4 ⊕ XORi) for i from 0 to 3. Thus, the second computation step involves registers 4, 5, 6, 11, 12, 13, 14, and 15, depending on the results found in the first step. In the final step the computations for registers 0, 1, 2, 3, 7, 8, 9 and 10 are carried out, depending on the results calculated in the first and second steps.
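The three steps can be sketched in software as follows. This is only a sequential illustration of the data dependencies, not the TinyRISC code of Table 4; it assumes Register0 and DataIn0 are the least significant bits, a zero initial register, and hypothetical helper names.

```python
def to_bits(value, width):
    """Bit i of the result is bit i of value (least significant first)."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ccitt_three_step(regs, data):
    """One byte-wise update of the 16 registers using the shared XOR pairs."""
    # Step 1: XORi = Registeri xor DataIni for i = 0..7
    x = [regs[i] ^ data[i] for i in range(8)]
    # Shared values XOR(i+4) xor XORi for i = 0..3, each computed once
    shared = [x[i + 4] ^ x[i] for i in range(4)]
    new = [0] * 16
    # Step 2: registers 4, 5, 6, 11, 12, 13, 14, 15
    new[4], new[5], new[6] = regs[12] ^ x[1], regs[13] ^ x[2], regs[14] ^ x[3]
    new[11] = x[3]
    new[12:16] = shared
    # Step 3: registers 0, 1, 2, 3, 7, 8, 9, 10 reuse the shared values
    new[0], new[1], new[2] = regs[8] ^ shared[0], regs[9] ^ shared[1], regs[10] ^ shared[2]
    new[3] = regs[11] ^ x[0] ^ shared[3]
    new[7] = regs[15] ^ shared[0]
    new[8], new[9], new[10] = x[0] ^ shared[1], x[1] ^ shared[2], x[2] ^ shared[3]
    return new


def crc_ccitt_bytewise(frame: bytes, crc: int = 0) -> int:
    """Apply the three-step update once per input byte."""
    regs = to_bits(crc, 16)
    for byte in frame:
        regs = ccitt_three_step(regs, to_bits(byte, 8))
    return from_bits(regs)
```

Computing each shared pair once means the step 3 registers need one or two further XORs instead of recomputing the pair.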
The algorithm mapping will be explained by introducing the three needed sets of code. The first set is the interconnection context words. The context word used in this algorithm is that for XOR with column broadcast, where each cell XORs two inputs from frame buffers A and B. This context word is stored at address 30000hex.
The second set of code is the input data and the initial data in the circuit's registers. These two sets of data are stored at main memory addresses 10000hex and 20000hex.
The third set is the TinyRISC code, which is the main code. This code and its discussion are shown in Table 4. The main steps of the addressed algorithm are shown in Figures 6 and 7. The final contents of frame buffer A are shown in Figure 8.
The CRC-16 Algorithm Mapping
The mapping of this algorithm also depends on eliminating redundant computations, in addition to computing the required values in parallel. The mapping has three steps. Firstly, the values of XORi for all values of i from 0 to 7 are calculated. In Table 3 the computations used more than once are shaded, in particular the redundant value X (XOR0 ⊕ … ⊕ XOR7). Thus, the second computation step is for X. Thirdly, the rest of the values are calculated in parallel.
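A sequential sketch of these three steps is given below. It is an illustration under assumptions, not the TinyRISC code of Table 5: the XOR outputs are taken to be indexed 0 to 7 throughout (so Register8 to Register13 combine the neighbouring pairs XOR2 ⊕ XOR1 up to XOR7 ⊕ XOR6, and Register14 uses XOR7), Register0 and DataIn0 are the least significant bits, the initial register is zero, and the function names are hypothetical.

```python
def to_bits(value, width):
    """Bit i of the result is bit i of value (least significant first)."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def crc16_three_step(regs, data):
    """One byte-wise CRC-16 update: XORi first, then X, then the rest."""
    # Step 1: XORi = Registeri xor DataIni for i = 0..7
    x = [regs[i] ^ data[i] for i in range(8)]
    # Step 2: X = XOR0 xor ... xor XOR7, accumulated cell by cell
    big_x = 0
    for bit in x:
        big_x ^= bit
    # Step 3: every remaining value depends only on regs, x and X
    new = [0] * 16
    new[0] = regs[8] ^ big_x
    new[1:6] = regs[9:14]
    new[6] = regs[14] ^ x[0]
    new[7] = regs[15] ^ x[1] ^ x[0]
    for k in range(8, 14):
        new[k] = x[k - 6] ^ x[k - 7]
    new[14] = x[7] ^ big_x
    new[15] = big_x
    return new


def crc16_bytewise(frame: bytes, crc: int = 0) -> int:
    """Apply the three-step update once per input byte."""
    regs = to_bits(crc, 16)
    for byte in frame:
        regs = crc16_three_step(regs, to_bits(byte, 8))
    return from_bits(regs)
```

The running XOR in step 2 mirrors a chain in which each cell combines its own value with the output of its left neighbour, which is why X is isolated as a separate step before the independent step 3 values.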
The algorithm mapping will be explained by introducing the three needed sets of code. The first set is the interconnection context words. Two context words are used in this algorithm. The first is that for XOR with column broadcast, where each cell XORs two inputs from frame buffers A and B. The second uses the same cell operation but takes one input from the frame buffer and the other from the output of the left adjacent cell. The context words are stored at address 30000hex. The second set of code is the input data and the initial data in the circuit's registers. These two sets of data are stored at main memory addresses 10000hex and 20000hex. The third set is the TinyRISC code, which is the main code; this code and its discussion are shown in Table 5. The main steps of the addressed algorithm are shown in Figures 9 and 10. The final contents of frame buffer A are shown in Figure 11.
PERFORMANCE EVALUATION AND ANALYSIS
The performance is based on the execution speed of the algorithms presented in sections 6.1 and 6.2 corresponding to Tables 2 and 3 respectively, which show the states of the registers after 8 shifts. The MorphoSys system is considered to be operational at a frequency of 100 MHz.
The algorithm in Table 4 (CRC-CCITT-16 Parallel Algorithm for a single channel) takes 30 cycles to complete. Since each run processes 8 bits, the speed of the algorithm of Table 4 is 0.267 bits/cycle, i.e. 3.75 cycles for each bit. At the 100 MHz clock, the time for the algorithm to terminate is 0.3 µsec, and the data rate is 26.67 Mbps.
The algorithm in Table 5 (CRC-16 Parallel Algorithm for a single channel) takes 26 cycles to terminate. The cycle time of MorphoSys is 10 nsec. Thus, the speed of the algorithm of Table 5 is 0.307 bits/cycle, i.e. 3.25 cycles for each bit. The time for the algorithm to terminate is 0.26 µsec, and the rate in megabits per second (Mbps) is 30.76 Mbps.
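The single-channel figures above follow from simple arithmetic, sketched here for checking. The function name `channel_metrics` is hypothetical; the inputs are the values stated in the text (8 bits per run, 30 or 26 cycles, 100 MHz clock).

```python
def channel_metrics(bits, cycles, clock_hz=100e6):
    """Derive throughput figures from a cycle count at a given clock rate."""
    seconds = cycles / clock_hz
    return {
        "bits_per_cycle": bits / cycles,
        "cycles_per_bit": cycles / bits,
        "time_us": seconds * 1e6,          # run time in microseconds
        "rate_mbps": bits / seconds / 1e6, # data rate in megabits per second
    }


ccitt = channel_metrics(bits=8, cycles=30)  # Table 4 mapping
crc16 = channel_metrics(bits=8, cycles=26)  # Table 5 mapping
```

With these inputs the run times come out in fractions of a microsecond, consistent with a 10 nsec cycle.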
Furthermore, a comparison is made with the same algorithms mapped onto some Intel microprocessor systems. In this research the chosen processors are the Intel 80486 and the Pentium. Note that the instructions used are upward compatible with newer Intel processors. The code and discussion of the same algorithms in Tables 4 and 5 are shown in Tables 6 and 7, respectively. Note that the chosen systems have comparable frequencies of 100 to 133 MHz.
In addition to the Intel systems, the RC-1000 FPGA is used for performance comparisons. The Celoxica RC1000 board provides high-performance, real-time processing capabilities and is optimized for the Celoxica DK1 design suite. The RC1000 is a standard PCI bus card equipped with a Xilinx® Virtex™ family BG560 part with up to 2 million system gates. It has 8MB of SRAM directly connected to the FPGA in four 32-bit wide memory banks. The memory is also visible to the host CPU across the PCI bus as if it were normal memory. Each of the 4 banks may be granted to either the host CPU or the FPGA. Data placed in the SRAM on the board is then accessible to the FPGA directly and to the host CPU either by DMA transfers across the PCI bus or simply as a virtual address. Comparisons among those systems are shown in Tables 8 and 9.
For maximum exploitation of the M1 capabilities, it should be noted that the M1 data items are byte-wise (8 bits). Thus, the M1 can process the input of up to 8 channels in parallel. This is also shown in Tables 8 and 9. Note that the FPGA RC-1000 findings are the same for a single channel or an 8-channel input because of its scalability. The speedup factors, along with the other chosen metrics, show the superiority of the reconfigurable computing systems used. The speedup factor is taken to be the ratio between the cycle counts of the compared systems.
CONCLUSION
New mappings of coding algorithms onto MorphoSys are introduced, and their performance under MorphoSys is analyzed. Several metrics besides the speed of these mappings are calculated, and the results are compared with other processing systems. The cyclic coding algorithms are presented with their mapping onto the M1. Accordingly, speeds of 213.13 Mbps for the CRC-CCITT-16 and 246.15 Mbps for the CRC-16 were achieved. The speedup factors (ratio of number of clocks) ranged from 4.26 to 58.46 between the M1 and the Intel processing systems. Moreover, the speedup factors between the RC-1000 and the M1 were up to 3.75. Future efforts could be invested in mapping other, more advanced digital coding algorithms that build on the ones already mapped. Current research includes work with other cyclic redundancy check algorithms along with other state-of-the-art coding methods. Also, comparisons could be made with results available on other parallel processors.
[6] Damaj I, Diab H. Performance evaluation of linear algebraic functions using reconfigurable computing. International Journal of Super Computing (Accepted).
Figures:
Figure 1. MorphoSys Block Diagram.
Figure 2. RC Array Interconnection.
Figure 5. LFSR implementation of the CRC-16.
Figure 11. Contents of Frame Buffer A after the algorithm terminates. After one computation step, the new register values are shown at the specified locations.
Tables:
Table 1. Common generator polynomials.
Table 9. Comparisons with RC-1000 FPGA.
Table 2. The states of the registers after 8-shifts for the CCITT CRC-16 Algorithm.
New Values After 8-shifts of the registers and the output of the XOR-gates
XORi = Registeri ⊕ DataIni, i = 0, 1, …, 7
Register0 = Register8 ⊕ XOR4 ⊕ XOR0
Register1 = Register9 ⊕ XOR5 ⊕ XOR1
Register2 = Register10 ⊕ XOR6 ⊕ XOR2
Register3 = Register11 ⊕ XOR0 ⊕ XOR7 ⊕ XOR3
Register4 = Register12 ⊕ XOR1
Register5 = Register13 ⊕ XOR2
Register6 = Register14 ⊕ XOR3
Register7 = Register15 ⊕ XOR4 ⊕ XOR0
Register8 = XOR0 ⊕ XOR5 ⊕ XOR1
Register9 = XOR1 ⊕ XOR6 ⊕ XOR2
Register10 = XOR2 ⊕ XOR7 ⊕ XOR3
Register11 = XOR3
Register12 = XOR4 ⊕ XOR0
Register13 = XOR5 ⊕ XOR1
Register14 = XOR6 ⊕ XOR2
Register15 = XOR7 ⊕ XOR3
Table 3. The states of the registers after 8-shifts for the CRC-16 Algorithm.
XORi = Registeri ⊕ DataIni, i = 0, 1, …, 7
X = XOR0 ⊕ XOR1 ⊕ … ⊕ XOR7
Register0 = Register8 ⊕ X
Register1 = Register9
Register2 = Register10
Register3 = Register11
Register4 = Register12
Register5 = Register13
Register6 = Register14 ⊕ XOR0
Register7 = Register15 ⊕ XOR1 ⊕ XOR0
Register8 = XOR2 ⊕ XOR1
Register9 = XOR3 ⊕ XOR2
Register10 = XOR4 ⊕ XOR3
Register11 = XOR5 ⊕ XOR4
Register12 = XOR6 ⊕ XOR5
Register13 = XOR7 ⊕ XOR6
Register14 = XOR7 ⊕ X
Register15 = X
