Abstract: This paper introduces MorphoSys, a reconfigurable computing system developed to investigate the effectiveness of combining reconfigurable hardware with general-purpose processors for word-level, computation-intensive applications. MorphoSys is a coarse-grain, integrated reconfigurable system-on-chip targeted at high-throughput and data-parallel applications. It comprises a reconfigurable array of processing cells, a modified RISC processor core and an efficient memory interface unit. This paper describes the MorphoSys architecture, including the reconfigurable processing array, the control processor, and the data and configuration memories. The suitability of MorphoSys for the target application domain is then illustrated with examples such as video compression, data encryption and target recognition. Performance evaluation of these applications indicates improvements of up to an order of magnitude on MorphoSys in comparison with other systems.
Introduction
Reconfigurable systems are computing systems that combine a reconfigurable hardware processing unit with a software-programmable processor. These systems allow customization of the reconfigurable processing unit, in order to meet the specific computational requirements of different applications.
Reconfigurable computing represents an intermediate approach between the extremes of Application Specific Integrated Circuits (ASICs) and general-purpose processors. A reconfigurable system generally has wider applicability than an ASIC. In addition, the combination of a reconfigurable component with a general-purpose processor results in better performance (for many application classes) than the general-purpose processor alone.
The significance of reconfigurable systems can be illustrated through the following example. Many applications have a heterogeneous nature and comprise several sub-tasks with different characteristics. For instance, a multimedia application may include a data-parallel task, a bit-level task, irregular computations, high-precision word operations and a real-time component. For such complex applications with wide-ranging sub-tasks, the ASIC approach would lead to an uneconomical die size or a large number of separate chips. Also, most general-purpose processors would very likely not satisfy the performance constraints for the entire application. However, a reconfigurable system (one that combines a reconfigurable component with a mainstream microprocessor) may be optimally reconfigured for each sub-task, meeting the application constraints within the same chip. Moreover, it would be useful for more general-purpose applications, too.
This paper describes MorphoSys, a novel model for reconfigurable computing systems, targeted at applications with inherent data-parallelism, high regularity, and high throughput requirements. Some examples of these applications are video compression (discrete cosine transforms, motion estimation), graphics and image processing, data encryption, and DSP transforms.
Reconfiguration is the process of reloading configuration programs (contexts). This process is either static (execution is interrupted) or dynamic (reconfiguration proceeds in parallel with execution). A single-context reconfigurable processing unit (RPU) typically has static reconfiguration. Dynamic reconfiguration is more relevant for a multi-context RPU: such an RPU can execute one part of its configuration program while another part is being changed. This feature significantly reduces the reconfiguration overhead.
(d) Interface:
A reconfigurable system has a remote interface if the system's host processor is not on the same chip/die as the RPU. A local interface implies that the host processor and the co-processor RPU reside on the same chip, or that the RPU is unified into the datapath of the host processor.
(e) Computation model: Many reconfigurable systems follow the uniprocessor computation model. However, there are several others that follow SIMD or MIMD computation models ([4], [7], [8] and [11]). Some systems may also follow the VLIW model [2].
Conventionally, the most common devices used for reconfigurable computing are field programmable gate arrays (FPGAs) [1]. FPGAs allow designers to manipulate gate-level devices such as flip-flops, memory and other logic gates. This makes FPGAs quite useful for complex bit-oriented computations. Examples of reconfigurable systems using FPGAs are [9], [10], [27], and [29]. However, FPGAs have some disadvantages, too. They are slower than ASICs, and perform inefficiently on coarse-grained (8 bits or more) datapath operations. Hence, many researchers have proposed other models of reconfigurable systems targeting different applications. PADDI [2], MATRIX [4], RaPiD [6], and REMARC [7] are some of the coarse-grain prototype reconfigurable computing systems. Research prototypes with fine granularity (but not based on FPGAs) include DPGA [3] and Garp [5]. Table 1 summarizes the characteristics of various reconfigurable systems according to the criteria introduced above.
The correspondence between this figure and the architecture in Figure 1 is as follows: the RC Array with its Context Memory corresponds to the reconfigurable processor array (SIMD co-processor), TinyRISC corresponds to the Main Processor, and the high-bandwidth memory interface is implemented through the Frame Buffer and DMA Controller. Typically, the core processor, TinyRISC, executes sequential tasks of the application, while the reconfigurable component (RC Array) is dedicated to the exploitation of parallelism available in an application's algorithm.
Array, shown in Figure 3. Each RC has an ALU-multiplier and a register file and is configured through a 32-
TinyRISC, based on the design of a RISC processor in [12]. TinyRISC handles general-purpose operations and controls operation of the RC Array through special instructions added to its ISA. It also initiates all data transfers to or from the Frame Buffer and the loading of configuration programs into the Context Memory.
System Components
Frame Buffer: An important component is the two-set Frame Buffer, which is analogous to a data cache. It makes memory accesses transparent to the RC Array by overlapping computation with data load and store, alternately using the two sets. MorphoSys performance benefits greatly from this data buffer. A dedicated data buffer has been missing in most contemporary reconfigurable systems, with consequent degradation of performance.
Features of MorphoSys
The RC Array is configured through context words. 
TinyRISC Instructions for MorphoSys
Several new instructions were introduced in the TinyRISC instruction set for effective control of the MorphoSys RC Array execution. These instructions are summarized in Table 2. They perform the following functions:
• data transfer between main memory (SDRAM) and Frame Buffer,
• loading of context words from main memory into Context Memory, and
• control of execution of the RC Array.
There are two categories of these instructions: DMA instructions and RC instructions. The DMA instruction fields specify load/store, memory address, number of bytes to be transferred and Frame Buffer or Context Memory address. The RC instruction fields specify the context to be executed, Frame Buffer address and broadcast mode (row/column, broadcast versus selective).
Once again, TinyRISC issues one of the CBCAST, SBCB or DBCB instructions (Table 2) each cycle to enable execution of the RC Array in row/column mode for as long as computation requires. Within this time, TinyRISC also issues instructions (STFB and LDFB) to store data from the first FB set into main memory and load new computation data into the first set of the FB.
(e) Continue execution and data/context transfers till completion
The steps (c) and (d) are repeated till the application kernel concludes.
Design of MorphoSys Components
In this section, we describe the major components of MorphoSys: the Reconfigurable Cell, the Context Memory, the Frame Buffer and the three-level interconnection network of the RC Array. We also briefly mention some aspects of the ongoing physical implementation of MorphoSys components.
Architecture of Reconfigurable Cell
The Reconfigurable Cell (RC) array is the programmable core of MorphoSys. It consists of an 8x8 array of
an operation is to be stored (REG #), and the direction (RS_LS) and amount of shift (ALU_SFT) applied at the ALU output. The context word also specifies whether a particular RC writes to its row/column express lane (WR_Exp). The context word field WR_BUS specifies whether the RC output will be written on the data bus to the Frame Buffer. The field named Constant is used to supply immediate operands to the ALU-multiplier unit in each RC. This is useful for operations that involve constants (such as multiplication by a constant over several computations), in which case this operand can be provided through the context word.
Selective context enabling:
This feature implies that it is possible to enable one specific row or column for operation in the RC Array. One benefit of this feature is that it enables transfer of data to/from the RC Array, using only one context plane. Otherwise eight context planes (out of the 32 available) would have been required just to read or write data.
Interconnection Network
The RC interconnection network comprises three hierarchical levels.
RC Array mesh:
The underlying network throughout the array (Figure 3) is a 2-D mesh. It provides nearest neighbor connectivity.
Intra-quadrant (complete row/column) connectivity:
The second layer of connectivity is at the quadrant level (a quadrant is a 4x4 RC group). The RC Array has four quadrants (Figure 3). Within each quadrant, each cell can access the output of any other cell in its row/column (Figure 3).
Inter-quadrant (express lane) connectivity:
At the global level, there are buses between adjacent quadrants. These buses (express lanes) run across rows and columns. Figure 6 shows express lanes for one row of the RC Array. These lanes provide data from any one cell (out of four) in a row (column) of a quadrant to other cells in an adjacent quadrant but in the same row (column). Thus, up to four cells in a row (column) can access the output value of one of four cells in the same row (column) of an adjacent quadrant.
Figure 6: Express lane connectivity (between cells in same row, but adjacent quadrants)
The express lanes greatly enhance global connectivity. Irregular communication patterns can be handled quite efficiently. For example, an eight-point butterfly operation is accomplished in only three cycles.
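To make the three-cycle figure concrete, the following is a minimal sketch (not the MorphoSys mapping itself) showing that an eight-point butterfly network has log2(8) = 3 stages, with pairing distances 4, 2 and 1. The function names are hypothetical, and the sketch assumes that each stage could complete in one cycle when every cell can reach its partner directly. In an eight-cell row, the distance-4 stage pairs cells that lie in different quadrants, which is presumably where the express lanes are needed.

```python
def butterfly_stages(n):
    """Return, per stage, the (i, partner) index pairs of an n-point butterfly.

    n must be a power of two. Each stage pairs index i with i XOR distance,
    where distance halves from n/2 down to 1.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    stages = []
    distance = n // 2
    while distance >= 1:
        # Keep each pair once, with the lower index first.
        pairs = [(i, i ^ distance) for i in range(n) if i < (i ^ distance)]
        stages.append(pairs)
        distance //= 2
    return stages


def walsh_hadamard(values):
    """Run the butterfly network as an unnormalized Walsh-Hadamard transform."""
    data = list(values)
    for pairs in butterfly_stages(len(data)):
        for i, j in pairs:
            a, b = data[i], data[j]
            data[i], data[j] = a + b, a - b  # one butterfly: sum and difference
    return data


if __name__ == "__main__":
    stages = butterfly_stages(8)
    print("stages:", len(stages))
    print("pairing distance per stage:", [j - i for i, j in (s[0] for s in stages)])
```

The Walsh-Hadamard transform is used here only because it is the simplest computation with a butterfly data flow; an FFT or a fast DCT uses the same pairing pattern with different arithmetic per butterfly.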
Frame Buffer
The Frame Buffer (FB) is an internal data memory logically organized into two sets, called Set 0 and Set 1. This two-set organization allows computation to overlap with data transfers. One of the two sets provides computation data to the RC Array (and also stores processed data from the RC Array) while the other set stores processed data into the external memory through the DMA controller and reloads data for the next round of computations. These operations proceed concurrently, thus preventing the latency of data I/O from adversely affecting system performance. Each set has 128 rows of 8 bytes each, therefore the FB has 128 × 16 bytes (2048 bytes in total).
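The alternation between the two sets can be sketched as a simple ping-pong schedule. The code below is an illustrative model only: the constants come from the text, while the function names and the one-block-per-step granularity are assumptions made for the example.

```python
ROWS_PER_SET = 128   # rows per Frame Buffer set (from the text)
BYTES_PER_ROW = 8    # bytes per row (from the text)
NUM_SETS = 2         # Set 0 and Set 1


def frame_buffer_bytes():
    """Total Frame Buffer capacity in bytes."""
    return NUM_SETS * ROWS_PER_SET * BYTES_PER_ROW


def ping_pong_schedule(num_steps):
    """For each step, report which set feeds the RC Array and which set the DMA uses.

    Assumption: the roles swap after every step, so that data loaded by the
    DMA in one step is consumed by the RC Array in the next.
    """
    schedule = []
    for step in range(num_steps):
        compute_set = step % NUM_SETS
        dma_set = 1 - compute_set
        schedule.append({"step": step, "compute_set": compute_set, "dma_set": dma_set})
    return schedule


if __name__ == "__main__":
    print("Frame Buffer size:", frame_buffer_bytes(), "bytes")
    for entry in ping_pong_schedule(4):
        print(entry)
```

In this model the set used by the DMA controller in one step is always the set that supplies the RC Array in the following step, which is the property that hides the data transfer latency.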
Physical Implementation
MorphoSys M1 is being implemented using both custom and standard cell design methodologies for 0.35 micron, four metal layer CMOS (3.3V) technology. The main constraint for this implementation is a clock period of 10 ns (100 MHz clock frequency). The total area of the chip is estimated to be 180 sq. mm. The layout for the Reconfigurable Cell (20000 transistors, area 1.5 sq. mm) is now complete. It has been simulated at the transistor level using an electrical simulator (HSPICE) with appropriate output loads due to fanout and interconnect lengths to obtain accurate delay values. The multiplier (approx. 10000 transistors) delay is 4 ns and the ALU (approx. 6500 transistors) delay is 3 ns. The critical path delay in an RC (which corresponds to a single-cycle multiply-accumulate operation) is less than 9 ns. Similarly, the TinyRISC, Frame Buffer, Context Memory and DMA controller are also being designed to perform within the 10 ns clock constraint.
Preliminary estimates for area/delay are: TinyRISC (100,000 transistors, delay: 10 ns), Frame Buffer (150,000 transistors, access time: 10 ns), and the Context Memory (100,000 transistors, access time: 5 ns).
The three-level interconnection network is made feasible by the four metal layer technology. Simulations of the network indicate that with use of appropriate buffers at RC outputs, interconnect delays can be limited to 1 ns. Thus, it is reasonable to expect that M1 will perform at its anticipated clock rate of 100 MHz.
Comparison with Related Research
Since the MorphoSys architecture falls into the category of coarse-grain reconfigurable systems, it is meaningful to compare it with other coarse-grain systems (PADDI [2], MATRIX [4], RaPiD [6], REMARC [7] and RAW [8]). Many of the designs mentioned above have not actually been implemented, whereas MorphoSys has been developed down to the physical layout level.
The MATRIX [4] approach proposes the design of a basic unit (BFU) for a reconfigurable system. This 8-bit BFU unifies resources used for instruction storage with resources needed for computation and data storage, assumes a three-level interconnection network and may be configured for operation in VLIW or SIMD fashion. However, this approach is too generic, and may potentially increase the control complexity. Further, a complete system organization based on the BFU is not presented, while MorphoSys is a well-
The REMARC system [7] is similar in design to the MorphoSys architecture and targets the same class of data-parallel and high-throughput applications. Like MorphoSys, it also consists of 64 programmable units (organized in a SIMD manner) that are tightly coupled to a RISC processor. REMARC also uses a modified
As evident, all the above systems vary greatly. However, the MorphoSys architecture puts together, in a cohesive structure, the prominent features of previous reconfigurable systems (coarse-grain granularity, SIMD organization, depth of programmability, multi-level configurable interconnection network, and dynamic reconfiguration). This architecture then adds some innovative features (control processor with modified ISA, streaming buffer that allows overlap of computation with data transfer, row/column broadcast, selective context enabling) while avoiding many of the pitfalls (single contexts, I/O bottlenecks, static reconfiguration, remote interface) of previous systems. In this sense, MorphoSys is a unique implementation.
In summary, the important features of the MorphoSys architecture are:
• Integrated model: Except for main memory, MorphoSys is a complete system-on-chip.
• Innovative memory interface: In contrast to other prototype reconfigurable systems, MorphoSys employs an efficient scheme for high data throughput using a two-set data buffer.
• Multiple contexts on-chip: Having multiple contexts enables fast single-cycle reconfiguration.
• On-chip general-purpose processor: This processor, which also serves as the system controller, allows efficient execution of complex applications that include both serial and parallel tasks.
To the best of our knowledge, the MorphoSys architecture, as described earlier, is unique with respect to other published reconfigurable systems.
Programming and Simulation Environment
Behavioral VHDL Model
The MorphoSys reconfigurable system has been specified in behavioral VHDL. The system components, namely the 8x8 Reconfigurable Array, the 32-bit TinyRISC host processor, the Context Memory, the Frame Buffer and the DMA controller, have been modeled for complete functionality. The unified model has been subjected to simulation for various applications using the QuickVHDL simulation environment. These simulations utilize several test-benches, real-world input data sets, a simple assembler-like parser for generating the context/configuration instructions, and assembly code for TinyRISC. Figure 7 depicts the simulation environment with its different components.
GUI for MorphoSys: mView
A graphical user interface, mView, takes user input for each application (specification of operations and data sources/destinations for each RC) and generates assembly code for the MorphoSys RC Array. It is also used for studying MorphoSys simulation behavior. This GUI, based on Tcl/Tk [13], displays graphical information about the functions being executed at each RC, the active interconnections, the sources and destinations of operands, usage of data buses and the express lanes, and values of RC outputs. It has several built-in features that allow visualization of RC execution, interconnect usage patterns for different applications, and single-step simulation runs with backward, forward and continuous execution. It operates in one of two modes: programming mode or simulation mode.
Figure 7: Simulation Environment for MorphoSys, with mView display
In the programming mode, the user sets functions and interconnections for each row/column of the RC Array corresponding to each context (row/column broadcasting) for the application. mView then generates a context file that represents the user-specified application.
In the simulation mode, mView takes a context file, or a simulation output file as input. For either of these, it provides a graphical display of the state of each RC as it executes the application represented by the context/simulation file. mView is a valuable aid to the designer in mapping algorithms to the RC Array. Not only does mView significantly reduce the programming time, but it also provides low-level information about the actual execution of applications in the RC Array. This feature, coupled with its graphical nature, makes it a convenient tool for verifying and debugging simulation runs.
Context Generation
For system simulation, each application has to be coded into context words and TinyRISC instructions. For the former, an assembler-parser, mLoad, generates contexts from programs written in the RC instruction set by the user or generated through mView. The next step is to determine the sequence of TinyRISC instructions for appropriate operation of the RC Array and timely data input and output, and to provide sample data. Once a sequence is determined, and data procured, test-benches are used to simulate the system.
Code Generation for MorphoSys
An important aspect of our research is an ongoing effort to develop a programming environment for automatic mapping and code generation for MorphoSys. A prototype compiler that compiles hybrid code for MorphoSys M1 (from C source code, serial as well as parallel) has been developed using the SUIF compiler environment [14]. The compilation is done after partitioning the code between the TinyRISC processor and the RC Array. Currently, this partitioning is accomplished manually by inserting a particular prefix to functions that are to be mapped to the RC Array. The compiler generates the instructions for TinyRISC
Mapping Applications to MorphoSys
In this section, we discuss the mapping of video compression, an important target recognition application (Automatic Target Recognition) and data encryption/decryption algorithms to the MorphoSys architecture.
Video compression has a high degree of data-parallelism and tight real-time constraints. Automatic Target Recognition (ATR) is one of the most computation-intensive applications. The International Data Encryption Algorithm (IDEA) [30] for data encryption is typical of data-intensive applications. We also provide performance estimates for these applications based on VHDL simulations. Pending the development of an automatic mapping tool, all these applications were mapped to MorphoSys either manually or by using the GUI, mView.
Video Compression (MPEG)
Video compression is an integral part of many multimedia applications. In this context, MPEG standards [15] for video compression are important for realization of digital video services, such as video conferencing, video-on-demand, HDTV and digital TV.
As depicted in Figure 8, the functions required of a typical MPEG encoder are:
• Preprocessing: for example, color conversion to YCbCr, prefiltering and subsampling.
• Motion Estimation and Compensation: After preprocessing, motion estimation of image pixels is done to remove temporal redundancies in successive frames (predictive coding) of P type and B type. Algorithms such as Full Search Block Matching (FSBM) may be used for motion estimation.
• Transformation and Quantization: Each macroblock (typically consisting of 6 blocks of size 8x8 pixels) is then transformed using the Discrete Cosine Transform (DCT). The resulting DCT coefficients are quantized to enable compression.
• Zigzag scan and VLC: The quantized coefficients are rearranged in a zigzag manner (in order of low to high spatial frequency) and compressed using variable length encoding.
• Inverse Quantization and Inverse Transformation: The quantized blocks of I and P type frames are inverse quantized and transformed back by the Inverse Discrete Cosine Transform (IDCT). This operation yields a copy of the picture, which is used for future predictive coding.
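As an illustration of the zigzag scan step in the list above, the following sketch generates the conventional zigzag visiting order for an n x n block (low to high spatial frequency, alternating direction along each anti-diagonal). It is a plain software reference, not the RC Array mapping; the function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order of the conventional zigzag scan.

    Coefficients are grouped by anti-diagonal (row + col). Odd diagonals are
    walked with the row index increasing, even diagonals with it decreasing.
    """
    coords = [(r, c) for r in range(n) for c in range(n)]

    def key(rc):
        r, c = rc
        diagonal = r + c
        return (diagonal, r if diagonal % 2 else -r)

    return sorted(coords, key=key)


def zigzag_scan(block):
    """Flatten a square block (list of rows) into zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


if __name__ == "__main__":
    print(zigzag_order(8)[:10])
```

The irregular access pattern is visible in the output: consecutive entries rarely share a row or a column, which is why this step does not map naturally onto row/column broadcast.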
Figure 8: Block diagram of an MPEG Encoder
Next, we discuss two major functions (motion estimation using FSBM and transformation using DCT) of the MPEG video encoder, as mapped to MorphoSys. Finally, we discuss the overall performance of MorphoSys for the entire compression encoder sequence. (Note: VLC operations are not mapped to MorphoSys, but Section 7.1.3 shows that adequate time is available to execute VLC after finishing the other computations involved in MPEG encoding.)
Video Compression: Motion Estimation for MPEG
Motion estimation is widely adopted in video compression to identify redundancy between frames. The most popular technique for motion estimation is the block-matching algorithm because of its simple hardware implementation [17]. Some standards also recommend this algorithm. Among the different block-matching methods, Full Search Block Matching (FSBM) involves the maximum computations. However, FSBM gives an optimal solution with low control overhead.
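A minimal software sketch of full-search block matching is given below. It is a generic reference implementation under stated assumptions, not the MorphoSys mapping: it uses the sum of absolute differences as the cost (the mean absolute difference is this sum divided by the block area, so both select the same displacement), and the function names and argument layout are hypothetical.

```python
def sad(reference, frame, top, left):
    """Sum of absolute differences between the reference block and the
    same-sized candidate block of `frame` whose corner is (top, left)."""
    n = len(reference)
    return sum(
        abs(reference[i][j] - frame[top + i][left + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(reference, frame, origin, max_disp):
    """Try every displacement in [-max_disp, max_disp]^2 around `origin`.

    Returns (best_cost, (dy, dx)); candidates that fall outside the frame
    are skipped. Ties keep the first displacement found in scan order.
    """
    n = len(reference)
    oy, ox = origin
    best = None
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            top, left = oy + dy, ox + dx
            if top < 0 or left < 0 or top + n > len(frame) or left + n > len(frame[0]):
                continue
            cost = sad(reference, frame, top, left)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


if __name__ == "__main__":
    # Synthetic 12x12 frame; the reference is the 4x4 block at (5, 6).
    frame = [[(y * 31 + x * 17) % 251 for x in range(12)] for y in range(12)]
    reference = [row[6:10] for row in frame[5:9]]
    print(full_search(reference, frame, origin=(4, 4), max_disp=2))
```

The exhaustive double loop is what makes the method regular and control-light: every candidate is evaluated with the same operations, which suits a SIMD array.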
Typically, FSBM is formulated using the mean absolute difference (MAD) criterion as follows:
show that a total of 5304 cycles are required to match the search area. For an image frame size of 352x288 pixels at 30 frames per second (MPEG-2 main profile, low level), the processing time is 2.1 x 10^6 cycles.
The computation time for MorphoSys (@100 MHz) is 21 ms. This is smaller than the frame period of 33.33 ms. The context loading time is only 71 cycles, and since there are a large number of computation cycles before the configuration is changed, the overhead is negligible.
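The figures quoted above can be checked with a short calculation. The sketch assumes that the 5304-cycle figure applies to each 16x16 macroblock and that a 352x288 frame contains 396 such macroblocks (the count the paper states later); under those assumptions the product reproduces the 2.1 x 10^6 cycles and 21 ms reported. The function name is hypothetical.

```python
CYCLES_PER_MATCH = 5304        # cycles to match one search area (from the text)
FRAME_WIDTH, FRAME_HEIGHT = 352, 288
MACROBLOCK_SIZE = 16           # assumption: 16x16 macroblocks
CLOCK_HZ = 100e6               # 100 MHz
FRAME_RATE = 30                # frames per second
CONTEXT_LOAD_CYCLES = 71       # from the text


def motion_estimation_budget():
    """Return (macroblocks, cycles, time_ms, frame_period_ms, context_overhead)."""
    macroblocks = (FRAME_WIDTH // MACROBLOCK_SIZE) * (FRAME_HEIGHT // MACROBLOCK_SIZE)
    cycles = macroblocks * CYCLES_PER_MATCH
    time_ms = cycles / CLOCK_HZ * 1e3
    frame_period_ms = 1e3 / FRAME_RATE
    context_overhead = CONTEXT_LOAD_CYCLES / cycles
    return macroblocks, cycles, time_ms, frame_period_ms, context_overhead


if __name__ == "__main__":
    mbs, cycles, time_ms, period_ms, overhead = motion_estimation_budget()
    print(f"{mbs} macroblocks, {cycles} cycles, {time_ms:.1f} ms of {period_ms:.2f} ms")
    print(f"context load overhead: {overhead:.4%}")
```

The context load overhead comes out at a few thousandths of a percent of the per-frame cycle count, which supports the statement that it is negligible.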
Performance Analysis: MorphoSys performance is compared with three ASIC architectures implemented in [17], [18], and [19] for matching one 8x8 reference block against its search area of 8 pixels displacement. The result is shown in Figure 10. The ASIC architectures have the same processing power (in terms of processing elements) as MorphoSys, though they employ customized hardware units such as parallel adders to enhance performance. The number of processing cycles for MorphoSys is comparable to the cycles required by the ASIC designs. Since MorphoSys is not an ASIC, achieving performance comparable to these ASICs is significant. In Section 7.1.3, it is shown that this performance level enables implementation of an MPEG-2 encoder on MorphoSys.
Using the same parameters above, Pentium MMX [20] takes almost 29000 cycles for the same task. When scaled for clock speed and same technology (the fastest Pentium MMX fabricated with 0.35 micron technology operates at 233 MHz), this amounts to more than 10X difference in performance.
Mapping to RC Array: The standard block size for DCT in most image and video compression standards is 8x8 pixels. Since the RC Array has the same size, each pixel of the image block may be directly mapped to each RC. Each pixel of the input block is stored in one RC.
Sequence of steps:
Load input data: An 8x8 block is loaded from the Frame Buffer to the RC Array. The data bus between the Frame Buffer and the RC Array allows concurrent loading of eight pixels. An entire block is loaded in 8 cycles. The same number of cycles is required to write out the processed data to the Frame Buffer.
Row-column approach:
[20]. Processors, such as the TMS320 series [16], also expend some cycle time on transposing data.
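The row-column approach named above is commonly understood as follows: apply a 1-D DCT to every row of the block, then apply the same 1-D DCT to every column of the result. The sketch below is a floating-point software reference under that assumption, using the orthonormal DCT-II; it does not reproduce the fixed-point RC Array mapping, and all function names are hypothetical. The explicit transposes show where a conventional processor spends the extra cycles mentioned in the text.

```python
import math


def dct_1d(vector):
    """Orthonormal 1-D DCT-II of a sequence of numbers."""
    n = len(vector)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(vector)
        )
        out.append(scale * total)
    return out


def transpose(block):
    """Swap rows and columns of a block given as a list of rows."""
    return [list(col) for col in zip(*block)]


def dct_2d_row_column(block):
    """2-D DCT computed as 1-D DCTs over rows, then over columns."""
    rows_done = [dct_1d(row) for row in block]
    cols_done = [dct_1d(col) for col in transpose(rows_done)]
    return transpose(cols_done)  # restore the original row/column orientation


if __name__ == "__main__":
    flat = [[1.0] * 8 for _ in range(8)]
    coeffs = dct_2d_row_column(flat)
    print(round(coeffs[0][0], 6))  # DC coefficient of a constant block
```

For a constant block only the DC coefficient is non-zero, which gives a quick sanity check of the implementation.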
Precision analysis for IDCT:
Experiments were conducted for measuring the precision of MorphoSys IDCT output values as specified in the IEEE Standard [23] . Considering that MorphoSys is not an ASIC, and performs fixed-point operations, the results were impressive. The worst case pixel error was satisfied and the Overall Mean Square Error (OMSE) was within 15% of the reference value. The majority of pixel locations also satisfied the worst case reference values for mean error and mean square error.
Zigzag Scan:
The zigzag scan function has also been implemented, even though MorphoSys is not designed for applications that comprise irregular accesses. However, the selective context enabling feature of the RC Array was used to generate an efficient mapping. The fact that an application quite diverse from the targeted applications could still be mapped to the MorphoSys architecture provides evidence of the flexibility of the MorphoSys model.
Mapping MPEG-2 Video Encoder to MorphoSys
It is remarkable that because of the computation-intensive nature of motion estimation, only dedicated processors or ASICs have been used to implement MPEG video encoders. Most reconfigurable systems, DSP processors or multimedia processors (e.g. [16]) consider only MPEG decoding or a sub-task (e.g. IDCT). Our mapping of the complete MPEG encoder to MorphoSys is perhaps the first time that a reconfigurable system has been able to meet the high throughput requirements of the MPEG video encoder.
We mapped all the functions for the MPEG-2 video encoder, except VLC encoding, to MorphoSys. We assume that the Main profile (low level) is being used. The maximum resolution at this level is 352x288 pixels per frame at 30 frames per second. The group of pictures consists of a sequence of four frames in the order IBBP (a typical choice for broadcasting applications). The number of cycles required for each task of the MPEG encoder, for each macroblock type, is listed in Table 3. Besides actual computation, the number of cycles required for loading configuration and data from memory is also included in the calculations.
All macroblocks in each P frame and B frame are first subjected to motion estimation. Then motion compensation, DCT and quantization are performed on each macroblock in a frame. The processed macroblocks are sent to frame storage in main memory. Finally, we perform inverse quantization, inverse DCT and reverse motion prediction for each macroblock of I frames and P frames. Each frame has 396 macroblocks, and the clock cycles required for encoding each frame type are depicted in Figure 14. It may be noted that motion estimation takes up almost 90% of the computation time for P and B type frames.
From the data in Figure 14, and assuming an IBBP frame sequence, the total encoding time is 117.3 ms. This is 88% of the available time (134 ms). From empirical data values in [22], the remaining 12% of the available time is sufficient to compute VLC. Table 4 shows that the figures for the MorphoSys MPEG encoder (without VLC) are two orders of magnitude less than the corresponding figures for REMARC [7]. The algorithm (FSBM) for motion estimation, which is the major computation, is the same for REMARC and MorphoSys.
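The time budget for one group of pictures can be reproduced with a short calculation. The sketch uses only the numbers quoted in the text (four frames, 30 frames per second, 117.3 ms of encoding time); the function name is hypothetical. The exact available time is 133.3 ms, which the text rounds to 134 ms.

```python
FRAME_RATE = 30.0          # frames per second (from the text)
GOP_FRAMES = 4             # IBBP sequence (from the text)
ENCODE_TIME_MS = 117.3     # total encoding time without VLC (from the text)


def gop_budget():
    """Return (available_ms, used_fraction, remaining_fraction) for one IBBP group."""
    available_ms = GOP_FRAMES / FRAME_RATE * 1e3
    used_fraction = ENCODE_TIME_MS / available_ms
    return available_ms, used_fraction, 1.0 - used_fraction


if __name__ == "__main__":
    available_ms, used, remaining = gop_budget()
    print(f"available: {available_ms:.1f} ms")
    print(f"used: {used:.0%}, left for VLC: {remaining:.0%}")
```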
Automatic Target Recognition (ATR)
Automatic Target Recognition (ATR) is the machine function of automatically detecting, classifying, recognizing, and identifying an object. The ACS Surveillance challenge has been quantified as the ability to search 40,000 square nautical miles per day with one-meter resolution [24]. The computation levels when targets are partially obscured reach the hundreds-of-teraops range. There are many algorithmic choices available to implement an ATR system. The ATR processing model developed at Sandia National Laboratory is shown in Fig. 15 [25, 26].
This model was designed to detect partially obscured targets in Synthetic Aperture Radar (SAR) images generated by the radar imager in real time. SAR images (8-bit pixels) are input to a focus-of-attention processor to identify regions of interest (called chips). These chips are thresholded to generate binary images and the binary images are then matched against binary target templates. Target templates appear in pairs of a bright and a surround template. The bright template identifies locations where a strong radar return is expected, while the surround template identifies locations where strong radar absorption is expected.
Based on [25], the sequence of steps is as follows. First, the 128 x 128 x 8 bit chip is sliced into eight bitplanes to compute the shapesum, which is a weighted sum of the eight results obtained by correlating each bitplane with the bright template. Once the shapesum is generated, the next step is correlating the actual target templates with the chip. The correlation is performed on eight different binary images that are generated by applying eight threshold values to the chip. The binary images are correlated with both the bright and surround template pairs to generate eight pairs of correlation results, and the shapesum is used to select one of the eight results. The selected pair of results is subsequently forwarded to the peak detector.
Both shapesum computation and target template matching, which are the most computation-intensive steps in the ATR processing model, require bit correlation. Figure 16 illustrates the operation of the bit correlator implemented in MorphoSys M1. Each row of the 8x8 target template is packed as an 8-bit number and loaded in the RC Array. All the candidate blocks in the chip are correlated with the target template.
Each column of the RC Array performs correlation of one target template with one candidate block, hence eight templates are correlated concurrently in the RC Array. In order to perform bit-level correlation, two bytes (16 bits) of image data are input to each RC. In the first step, the eight most significant bits of the image data are ANDed with the template data and a special adder tree (implemented as custom hardware in each RC) is used to count the number of ones in the ANDed output to generate the correlation result. Then, the image data is shifted left by one bit and the process is repeated to perform matching of the second block. After the image data has been shifted eight times, a new 16-bit data word is loaded and the RC starts another correlation of eight consecutive candidate blocks. For this processing model, it takes four clock cycles to correlate one 8x8 binary image with an 8x8 target template.
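The AND, count-ones and shift-left sequence described above can be sketched in software as follows. This is an illustrative model, not the hardware: the adder tree is replaced by a bit count, one list entry stands for the 16-bit word held by one RC, and the function names are hypothetical.

```python
def popcount(value):
    """Number of set bits in a non-negative integer (stands in for the adder tree)."""
    return bin(value).count("1")


def correlate_eight_blocks(image_words, template_rows):
    """Correlate one 8x8 binary template with eight consecutive candidate blocks.

    image_words:   eight 16-bit integers, one per image row.
    template_rows: eight 8-bit integers, one per template row.
    For shift k, the eight most significant bits of (word << k) are ANDed
    with the template row and the ones are counted; the per-row counts are
    summed to give the correlation value of candidate block k.
    """
    if len(image_words) != len(template_rows):
        raise ValueError("image and template must have the same number of rows")
    results = []
    for shift in range(8):
        total = 0
        for word, template in zip(image_words, template_rows):
            window = ((word << shift) & 0xFFFF) >> 8  # top 8 bits after shifting
            total += popcount(window & template)
        results.append(total)
    return results


if __name__ == "__main__":
    image = [0xFF00] * 8      # left half of every row set, right half clear
    template = [0xFF] * 8     # template with every bit set
    print(correlate_eight_blocks(image, template))
```

With this input, each left shift moves one set bit out of the 8-bit window per row, so the correlation value drops by eight for every successive candidate block.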
Performance analysis: For analysis, we choose system parameters implemented in [25]. The ATR systems from [25] and [26]
This processing time is about an order of magnitude lower than the 210 ms required for the FPGA system in [25], and 195 ms for the Splash 2 system [26]. Even though MorphoSys M1 is a coarse-grained system, it achieves similar performance to FPGA-based systems (after accounting for clock rate scaling) for the bit-level ATR operations. FPGAs are, however, limited in applicability to mostly bit-level operations, and are inefficient for coarse-grain operations. These results are shown in Figure 17. According to [25], 100 chips have to be processed each second for a given target. Each target has a pair of bright and surround templates for every five-degree rotation (72 pairs for a full 360-degree rotation). Considering these requirements, nine chips of MorphoSys M1 (64 RCs) would be needed to satisfy this specification, as compared to 90 sets of the system described in [25] and 84 sets of the Splash 2 system [26].
Data Encryption/Decryption (IDEA Algorithm)
Data security is a key application domain. The International Data Encryption Algorithm (IDEA) [30] is a typical example of this application class. IDEA involves processing of plaintext data (data to be 
Conclusions and Future Work
This paper has presented a new reconfigurable architecture, MorphoSys. Its performance has been evaluated for many of the target applications with impressive results that validate this architectural model.
Work on the physical implementation of MorphoSys on a custom-designed chip is in progress.
Extensions for MorphoSys model:
It may be noted that the MorphoSys architecture is not limited to using a simple RISC processor as the main processor. TinyRISC is used in the current implementation only to evaluate the design model. There are many options for the main processor. One would be to use an advanced general-purpose processor in conjunction with TinyRISC (which would then function as an I/O processor for the RC Array). Also, an advanced processor (with multi-threading) may be used as the main processor to enable concurrent processing of application programs by the RC Array and the main processor.
Another potential focus is the RC Array. For this implementation, the array has been designed for data-parallel, computation-intensive tasks. However, the design model allows other versions, too. For example, a suitably designed RC Array may be used for a different application class, such as high-precision signal processing, bit-level computations, control-intensive applications, or dynamic stream processing.
Based on this, we visualize that MorphoSys may be the precursor of a generation of systems that integrate advanced general-purpose processors with a specialized reconfigurable component, in order to meet the constraints of mainstream, high-throughput and computation-intensive applications. 
Acknowledgments
