ABSTRACT: This paper introduces a new mapping of geometrical transformations onto the MorphoSys (M1) reconfigurable computing (RC) system. New mapping techniques for some linear algebraic functions are recalled. A new mapping for geometrical transformation operations is introduced, and its performance on the M1 system is evaluated. The translation and scaling transformations addressed in this mapping employ vector-vector and vector-scalar operations [6] [7]. A performance analysis of the M1 RC system is also presented to evaluate the efficiency of the algorithm execution. Numerical examples were simulated to validate our results using the MorphoSys mULATE program, which emulates M1 operations.
INTRODUCTION
Reconfigurable computing (RC) is becoming more popular, and increasing research effort is being invested in it [1]. It employs reconfigurable hardware and programmable processors. An application is mapped such that the workload is divided between the general-purpose processor (GPP) and the reconfigurable device. RC paves the way for higher speed than general-purpose processors and wider functionality than application-specific integrated circuits (ASICs). It is a good solution for applications requiring a wide range of functionality and speed at the same time [1]. RC systems thus address the inflexibility of ASICs at one end of the computing spectrum and the inefficiency of GPPs at the other.
MORPHOSYS DESIGN
One of the emerging RC systems is MorphoSys, designed and implemented at the University of California, Irvine. Its block diagram is shown in Figure 1 [2]. It is composed of: 1) an array of reconfigurable cells called the RC array, 2) its configuration data memory called context memory, 3) a control processor (TinyRISC), 4) a data buffer called the frame buffer, and 5) a DMA controller [2]. A program runs on MorphoSys in the following manner: general-purpose operations are handled by the TinyRISC processor, while operations that have a certain degree of parallelism, regularity, or intensive computation are mapped to the RC array. The TinyRISC processor controls, through the DMA controller, the loading of the context words into context memory. These context words define the function and connectivity of the cells in the RC array. The processor also initiates the loading of the application data, such as image frames, from main memory to the frame buffer. This is also done through the DMA controller. Once both configuration and application data are ready, the TinyRISC processor instructs the RC array to start execution. The RC array performs the needed operation on the application data and writes the results back to the frame buffer. The RC array then loads new application data from the frame buffer and possibly new configuration data from context memory. Since the frame buffer is divided into two sets, new application data can be loaded into it without interrupting the operation of the RC array. Configuration data is also loaded into context memory without interrupting RC array operation. This overlap of data loading with computation lets MorphoSys achieve high execution speeds [3].
The link to the formal publication is via https://doi.org/10.1023/A:1020993510939
RECONFIGURABLE DEVICE
As stated earlier, the reconfigurable device in MorphoSys is the RC array, which is divided into four quadrants. Its design and interconnection are shown in Figure 2 [2]. The RC interconnection network comprises three hierarchical levels. The first level provides nearest-neighbor connectivity that connects the RCs in a 2-D mesh. The second level is an intra-quadrant connection that connects a specific RC to any other RC in its row or column in the same quadrant (a quadrant is a 4 by 4 group of cells in the RC array). The third level is an inter-quadrant (or express lane) connection that carries data from any one cell (out of four) in a row (or column) of a quadrant to other cells in an adjacent quadrant but in the same row (or column) [4]. The context words held in context memory configure the function of the RCs as well as the interconnection, thus specifying where their input comes from and where their output is written [5]. MorphoSys is designed so that all the cells in the same row perform the same function and have the same connection scheme (in row context broadcast mode), or all the cells in the same column perform the same function and have the same connection scheme (in column context broadcast mode). All the cells of a row or of a column share the same configuration word [5].
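The first two interconnection levels can be illustrated with a small model. This is only a sketch: it assumes an 8x8 array indexed from 0 by (row, column), split into four 4x4 quadrants, with no wrap-around at the array edges. All function names are hypothetical and are not part of any MorphoSys tool.

```python
# Illustrative model of the first two interconnect levels (names are hypothetical).
N = 8  # the RC array is 8x8
Q = 4  # each quadrant is 4x4

def quadrant(row, col):
    """Return the quadrant index (0..3) that holds cell (row, col)."""
    return (row // Q) * 2 + (col // Q)

def mesh_neighbors(row, col):
    """Level 1: nearest neighbors in the 2-D mesh (no wrap-around assumed)."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [(row + dr, col + dc) for dr, dc in steps
            if 0 <= row + dr < N and 0 <= col + dc < N]

def intra_quadrant_peers(row, col):
    """Level 2: every other cell in the same row or column of the same quadrant."""
    q = quadrant(row, col)
    return [(r, c) for r in range(N) for c in range(N)
            if (r, c) != (row, col)
            and quadrant(r, c) == q
            and (r == row or c == col)]
```

Under these assumptions each cell has six intra-quadrant peers: three in its row and three in its column.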
The reconfigurable cell is the basic programmable element in MorphoSys. Each reconfigurable cell (Figure 3) comprises five components: the ALU/Multiplier, the shift unit, the input multiplexers, a register file with four 16-bit registers, and the context register. There are 64 reconfigurable cells arranged as an 8x8 matrix called the RC array. The ALU/Multiplier has four data input ports: two 16-bit ports receive data from the input multiplexers, one 32-bit port takes data from the output register, and a 12-bit port takes an immediate value from the context word. In addition to standard arithmetic and logical operations, the ALU/Multiplier can perform a multiply-accumulate operation in a single cycle. The shift unit is also 32 bits wide. In the current MorphoSys prototype, the ALU/Multiplier operates only on signed numbers. However, several important applications, such as data encryption/decryption, involve multiplication of unsigned numbers. Therefore, the ALU/Multiplier will be extended to operate on both signed and unsigned values in the next implementation of MorphoSys.
GEOMETRICAL TRANSFORMATIONS IN COMPUTER GRAPHICS
Transformations are a fundamental part of computer graphics. Transformations are used to position, shape, and change viewing positions of objects, as well as change how they are viewed (e.g. the type of projection used). There are many types of two-dimensional transformations that one can perform; however, in this paper we will address: Translation and Scaling. These basic transformations can also be combined to obtain more complex transformations. Figure 4 shows the effects of some 2D transformations on an image. A point p in 2D is represented by p(x, y) where x is the x-coordinate and y is the y-coordinate of p. 2D objects are often represented as a set of points (vertices), {P1,P2,...,Pn}, and an associated set of edges {e1,e2,...,em}. An edge is defined as a pair of points, e{Pi, Pj}. We can also represent points in vector/matrix notation as:
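Assuming the usual column-vector convention, this notation reads:

```latex
p = \begin{bmatrix} x \\ y \end{bmatrix}
```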
Translations
A translation can also be represented by a pair of numbers, t=(tx,ty) where tx is the change in the x-coordinate and ty is the change in y-coordinate. To translate the point p by t, we simply add to obtain the new (translated) point q(x',y') = p(x,y) + t(tx,ty). 
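As a minimal illustration of the translation rule q = p + t, the following sketch applies one translation to every vertex of a 2D object. The function name and the sample triangle are hypothetical and chosen only for illustration.

```python
def translate(points, t):
    """Translate each (x, y) point by t = (tx, ty)."""
    tx, ty = t
    return [(x + tx, y + ty) for x, y in points]

# Example: a triangle moved 3 units along x and -2 units along y.
triangle = [(0, 0), (4, 0), (0, 3)]
moved = translate(triangle, (3, -2))
```

Because every vertex receives the same offset, the edges of the object are unchanged; only its position moves.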
ALGORITHM MAPPING
The main use of MorphoSys, as with any parallel processing system, is the fast execution of computationally demanding algorithms. Computer graphics algorithms are one such family, and computer graphics accelerators are the subject of much research. A basic part of computer graphics is geometrical transformations, which require fast vector-vector and vector-scalar operations. The emphasis in this paper is the mapping of vector and scalar operations onto MorphoSys, e.g. vector addition, subtraction, and multiplication by a scalar. These operations are at the core of many algorithms, especially those used for translation and scaling in geometrical transformations.
TRANSFORMATION WITH VECTORS OPERATIONS
Generally, a one-dimensional n-element vector has the form:
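A conventional way to write such a vector, with the element names u_i assumed here for illustration, is:

```latex
U = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}
```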
In our case of geometrical transformation, a vector U can be considered as the original coordinates, while a vector V can be considered as the corresponding translation values. The algorithm for adding the two vectors, or for applying any other vector-vector operation, is mapped by first storing them in Frame Buffer set "0" and set "1". We then exploit the properties of the interconnection: contents of Frame Buffer set "0" are added to contents of Frame Buffer set "1", and the result appears in columns 0-7 of the RC array. Figure 7 shows the final output in the RC array after running the algorithm for adding two 64-element vectors.
Figure 7. RC array contents after matrix addition
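The vector addition can be pictured as an element-wise sum laid out on the 8x8 array. The sketch below is a software model only. It assumes that each 8-element chunk of a vector occupies one column of the array, which may differ from the actual placement in Figure 7, and its function names are hypothetical.

```python
ROWS, COLS = 8, 8

def to_rc_array(vec):
    """Lay a 64-element vector onto an 8x8 grid, one 8-element chunk per column (assumed layout)."""
    assert len(vec) == ROWS * COLS
    return [[vec[c * ROWS + r] for c in range(COLS)] for r in range(ROWS)]

def rc_add(u, v):
    """Model Out = A + B: each cell adds its element of u (set 0) to its element of v (set 1)."""
    a, b = to_rc_array(u), to_rc_array(v)
    return [[a[r][c] + b[r][c] for c in range(COLS)] for r in range(ROWS)]

u = list(range(64))  # original coordinates
v = [10] * 64        # translation values
result = rc_add(u, v)
```

In this model all 64 additions are independent of one another, which is the property that allows them to be spread across the cells of the array.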
For MorphoSys to perform the required calculations, three sets of data must first be loaded into the M1 chip. The first is the TinyRISC program that controls the functionality of the whole system. This code is placed in main memory (Figure 1) and handles all the operations that are not mapped onto the RC array, such as data transfer. It also tells the RC array which contexts to run, what data to access as input, and where the output data is written. The second is the code for the RC array operation, called the context code. It is written for column mode, row mode, or both, and defines what operation each row or column carries out, what input it takes, and where the output is stored. The third is the data required for the computations, which is stored in the Frame Buffer for later retrieval and use by the TinyRISC program.
Let the desired function of the interconnection be: Out = A + B. The corresponding context word is 0000F400, which must be loaded into the column block of the context memory. Assume that vector U is stored at address 10,000hex of main memory, vector V at address 20,000hex, and the context word at address 30,000hex. The result is then stored back to address 40,000hex. The M1 code and its discussion are provided in Table 1.
96: stfb r1, 1, 0, 10hex; Store data from frame buffer set 1, address 0 into main memory starting at address stored in reg1.
TRANSFORMATION WITH VECTORS & SCALARS OPERATIONS
Consider an 8-element vector U. The algorithm for multiplying a vector by a scalar, or for any other arithmetic or logical vector-scalar operation, is mapped by first storing the vector in Frame Buffer set "0". We then exploit the properties of the interconnection: contents of Frame Buffer set "0" are multiplied by a constant stored in the context word. Figure 8 shows the final output in the RC array after running the algorithm on a 64-element vector.
Figure 8. RC array contents after multiplication of a 64-element vector by a scalar
The desired function of the interconnection is: Out (t+1) = c x A. Let us assume that the constant "c" has a value of 5hex. According to the internal bit assignment of the ALU control lines, the context word is: 00009005. This word must be loaded into the column block of the context memory. Assume that vector U is stored at address 30,000hex of main memory and the context word at address 40,000hex. The result is then stored back to address 50,000hex. The M1 code and its discussion are provided in Table 2.
55: stfb r1, 1, 0, 10hex; Store data from frame buffer set 1, address 0 into main memory starting at address stored in reg1.
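The vector-scalar operation Out (t+1) = c x A can be modeled in a few lines. This is a hedged software sketch, not M1 code: the function name is hypothetical, and the check that reads the constant back from the low-order 12 bits of the context word 00009005 rests on an assumed field width.

```python
C = 0x5                    # scaling constant from the text
CONTEXT_WORD = 0x00009005  # context word quoted in the text

def rc_scale(vec, c=C):
    """Model Out(t+1) = c x A: each element is multiplied by the constant c."""
    return [c * a for a in vec]

# Assumption: the immediate constant occupies the low 12 bits of the context word.
constant_field = CONTEXT_WORD & 0xFFF

u = [1, 2, 3, 4, 5, 6, 7, 8]  # an 8-element vector U
scaled = rc_scale(u)
```

Changing the scale factor in this model only changes the constant, which mirrors the idea that the scalar travels in the context word while the vector data stays in the frame buffer.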
ROTATION TRANSFORMATIONS
Rotation and composite transformations represent two other kinds of geometrical transformations in graphics. Both are mapped onto MorphoSys as a matrix multiplication operation [8]. Results are presented in the following section. Multiplying matrix A by matrix B means multiplying the elements of row one (r1) of A by the corresponding elements of column one of B and summing the products, which yields element (c11) of the result matrix C. The multiplication with (r1) is repeated for all remaining columns of B, yielding (c12 .. c1n). Then row two (r2) of A is multiplied in the same way with all columns of B. This procedure is repeated until the last row of A.
Matrices A, B, and C are dense matrices. The matrix-matrix multiplication involves O(n^3) operations on a single processing platform, since for each element Cij of C we must compute

$$ C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} $$

This simple algorithm is mapped onto the M1 RC array as follows. The contents of matrix A are passed row by row through the context words, and are thus stored in the context memory for later retrieval and manipulation by the reconfigurable cells. The contents of matrix B are also broadcast row by row to the columns of the RC array. The multiplication stage (row x column) is done using the CMUL (constant-multiply) ALU operation, where Out (t) = A x B, which is the required computation. Note that CMUL is a vector-scalar operation discussed in [7].
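The mapping can be sketched in software as a matrix product built only from scalar-times-row steps. This is an illustrative model under stated assumptions, not the authors' M1 code: each element of A plays the role of the constant supplied through the context word, each row of B plays the role of the broadcast data, and the function name is hypothetical.

```python
def cmul_matmul(a, b):
    """Multiply a (n x m) by b (m x p) using only scalar-times-row steps.

    Each a[i][k] stands in for the constant carried by the context word,
    and row k of b stands in for the data broadcast to the array.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            const = a[i][k]
            for j in range(p):  # independent products, one per cell in this model
                c[i][j] += const * b[k][j]
    return c

# Example: rotate the point (1, 0) by 90 degrees counter-clockwise.
rotation_90 = [[0, -1],
               [1,  0]]
point = [[1],
         [0]]
rotated = cmul_matmul(rotation_90, point)
```

Ordering the loops this way makes the innermost step a constant-multiply over a whole row, which is the shape of the CMUL operation described above.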
PERFORMANCE ANALYSIS
Performance evaluation and comparisons are made among the mappings already suggested. Performance is measured by the speed of execution of the algorithms. The MorphoSys M1 system operates at a frequency of 100 MHz. After obtaining the results of the algorithms mapped onto the M1 system, a comparison is made with the mapping of the same algorithms onto single-processor systems, namely the Intel 80386 and 80486. These processors were chosen because they are comparable in CPU speed to the M1 system. Note that the instructions used are upward compatible with later versions of the Intel processors.
PERFORMANCE OF THE TRANSLATION ALGORITHMS
The code using the Intel instruction set is shown in Table 3. The comparisons between the three different processing systems are shown in Figures 9-12. These figures clearly show the superiority of MorphoSys over the other processors. For the 8-element vector-vector translation algorithm, the throughput (in elements/cycle) on the M1, 80386, and 80486 is 0.38, 0.036, and 0.088 respectively. For 64-element vectors it is 0.667, 0.036, and 0.083 respectively. The M1 therefore outperforms both the 80386 and the 80486 in the number of elements that can be processed per cycle.
PERFORMANCE OF THE SCALING ALGORITHMS
The code using the Intel instruction set is shown in Table 4. The comparisons between the three different processing systems are shown in Figures 13-16. These figures also clearly show the superiority of MorphoSys over the other processors. For the 8-element vector-scalar scaling algorithm, the throughput (in elements/cycle) on the M1, 80386, and 80486 is 0.57, 0.046, and 0.108 respectively. For 64-element vectors it is 1.16, 0.046, and 0.11 respectively. Accordingly, the M1 outperforms both the 80386 and the 80486 in the number of elements that can be processed per cycle.
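The elements-per-cycle figures reported for the 64-element translation and scaling algorithms can be turned into speedup factors. The sketch below assumes that, for the same number of elements, the ratio of throughputs equals the ratio of cycle counts. The dictionary layout and function name are hypothetical, and because the inputs are rounded the results are approximate.

```python
# Reported throughput in elements per cycle for 64-element vectors.
throughput = {
    "translation": {"M1": 0.667, "80386": 0.036, "80486": 0.083},
    "scaling":     {"M1": 1.16,  "80386": 0.046, "80486": 0.11},
}

def speedup(algorithm, baseline):
    """Speedup of the M1 over a baseline, as the ratio of elements per cycle."""
    row = throughput[algorithm]
    return row["M1"] / row[baseline]

for name in throughput:
    for cpu in ("80386", "80486"):
        print(f"{name:12s} vs {cpu}: {speedup(name, cpu):.1f}x")
```

For the 80486, these ratios come out close to 8 for translation and 10.5 for scaling.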
ANALYSIS OF RESULTS
Results in Table 5 show the superiority of the addressed RC system, the M1, over the other suggested systems. This is clearly indicated by the speedup factor, which reached 24 in some cases. However, the major issue is not only comparing the suggested algorithms with other systems running the same algorithm, but also finding and tuning the algorithm mapping that gives the most favourable performance on this state-of-the-art reconfigurable M1 system. The speedup factors shown in Table 5 are ratios of execution cycle counts: the cycles needed on each of the other systems divided by the cycles needed on the M1. The discussed findings are part of a complete graphics acceleration library using the M1 reconfigurable system.
CONCLUSION
The MorphoSys RC system has been utilized in different areas of application. For computationally intensive computer graphics algorithms, the M1 system has been used in image processing, graphics acceleration for animation, and currently geometrical transformations. In this paper, new mapping techniques for some linear algebraic functions are recalled [6] [7]. A new mapping for geometrical transformation operations is introduced, and its performance under MorphoSys is analyzed. These operations can be addition, subtraction, or any other operation supported by the MorphoSys processing elements (RC array). The speed of this mapping is calculated for different vectors and for different numbers of elements, and the results are compared with other processing systems. The M1 system yielded a speedup (with respect to the number of cycles needed) of 8 and 10.5 for the 64-element translation and scaling algorithms respectively, as well as a speedup of around 37 for the 64-element rotation algorithm, compared with the 80486. Finally, further effort could be invested in mapping more advanced computer graphics algorithms that build on the ones mapped here.
