INTRODUCTION
Reconfigurable computing architectures are gaining popularity as replacements for general purpose architectures in many high performance applications. Reconfigurable systems can take advantage of deep computational pipelines, perform concurrent execution, and are inherently data flow in nature. Many applications can exploit these systems, such as genomic sequence scanning, Fast Fourier Transform, text searching, and computer vision. Current research efforts are applying reconfigurable computing to automatic target recognition, real-time image processing, and hardware implementation of neural networks. However, when used to support large computations, these architectures suffer from a trade-off between slow reconfiguration times and low logic gate density. This problem arises because configuration memory typically resides off-chip and reconfiguration is performed serially. Recent approaches [4] address this problem by adding an on-chip configuration cache that provides faster reconfiguration at the cost of die area. That is, the area overhead of the configuration cache lowers the total logic gate density of the architecture. These disadvantages limit the performance, and therefore the applicability, of current reconfigurable systems.
In this paper, a reconfigurable processor architecture is proposed that overcomes the limitations discussed above by using high bandwidth optical channels. The optical channels allow fast parallel loading of the reconfiguration control word as well as the migration of the configuration cache off-chip. Moving the configuration cache off-chip allows better utilization of the die area for reconfigurable processing elements. Further, it is possible to implement the optical detectors directly in silicon, which does not require significant alteration of the fabrication processes. These advantages make the optically reconfigurable architecture competitive for high performance applications.
PROPOSED ARCHITECTURE
Field Programmable Gate Array (FPGA) reconfigurable systems are implemented using arrays of reconfigurable processing elements such as the one shown in Figure 1. Each element can realize the functionality of a logical operation under the control of a configuration buffer, which is capable of storing a single configuration for the processing element. The complexity of the reconfigurable element varies across FPGA technologies. We use the term gate equivalent to represent such a generic processing element. Configuration buffers are also used to configure the interconnect between processing elements, so that groups of processing elements can realize more complex functions, which are called microoperations.
Reconfiguration of the processing elements is performed by reading the data for all the configuration buffers from an off-chip Read Only Memory (ROM) module. The reconfiguration is slow due to two factors. First, because the configuration data resides off-chip, reconfiguration of the processing elements requires long time delays to access the data from the ROM. Second, reconfiguration in these systems is performed serially. That is, after the control word is fetched from the ROM it is shifted serially to the configuration buffers of the reconfigurable processing elements. Due to long reconfiguration times, these systems are typically used as configure-once application-specific compute systems. Therefore, they do not fully exploit the flexibility offered by reconfigurable hardware. To take full advantage of the reconfigurable hardware and perform different computations in real time, we must perform "on the fly" reconfiguration, or run time reconfiguration. Current architectures overcome the slow reconfiguration time by adding an on-chip configuration cache to the reconfigurable processing module. The on-chip cache alleviates the problem of slow off-chip memory access to fetch the needed configuration data. Hence, the reconfiguration time is reduced at the cost of die area. Current designs that implement this approach estimate the overhead of the on-chip configuration cache to approach 50% of the total die area. This overhead is very high and limits the capability of such systems to represent complex functionality in hardware.
In order to provide maximum utilization of the die area for computation, we propose a reconfigurable processing system that has the configuration cache off-chip and high-speed parallel optical channels for loading configurations. An array of optical detectors is added to the die of the reconfigurable processing unit. Each detector can receive configuration data for a large group of processing elements, since each optical channel offers a high bandwidth connection to memory. The configuration data is transmitted optically in a 2D fashion to each of the on-chip detectors, achieving very fast parallel reconfiguration of the processing elements. This architecture is illustrated in Figure 2. It overcomes the limitations of current reconfigurable systems: it enjoys fast run time reconfiguration as well as full utilization of the processing resources.
PERFORMANCE ANALYSIS
In this section we develop a performance measure to compare architectures that employ on-chip configuration cache and a single channel to off-chip configuration memory versus architectures that do not utilize any on-chip cache but use multiple optical channels to load off-chip configurations. We focus our model on the performance parameters that affect the total execution time, and specifically the configuration time, for the architectures mentioned. In other words, we focus on studying the effectiveness of both architectures at lowering the run time reconfiguration time. A more detailed analysis, in which a general reconfigurable processor performance model was developed, has been presented in previous work.
I. Performance Modeling
We define the performance of a processor as the time a processor needs to complete the execution of an application. In general, the execution time can be defined as the total time needed per configuration of the hardware times the number of configurations required to execute the application. The time needed per hardware configuration consists of three parameters. First is the time to load the configuration data that defines the functionality represented in the logic, $T_C$. Second is the time to read/write the data needed to perform the computation, $T_D$. Third is the time needed to execute the operations defined on the available data, $T_E$. In this summary we focus only on the effects of $T_C$ and $T_E$ on total execution time. Therefore, for this study we define total execution time as: $T = (T_C + T_E) \times C$. The number of configurations needed, $C$, is the total number of microoperations, $M$, that define the application divided by the average number of microoperations that can be represented in a single hardware configuration, $M_{config}$. This is a function of both the processing area available on the chip and the structure of the application being executed, as explained below.
The total processing area or die capacity of the processor can be represented as the total number of gate equivalents available on-chip, $G$. However, this die capacity is not fully used as processing area, since some of the die area is used as configuration cache for some architectures and as photo detectors for other architectures. Hence, the total available processing area is $G_{proc} = G - G_{cache}$, where $G_{cache}$ is the die area needed for configuration cache measured in gate equivalent area. For architectures employing optical channels, $G_{proc} = G - G_{opt}$, where $G_{opt}$ is the area required to implement the photo detectors measured in gate equivalents.
We define $M_{proc}$ as the maximum number of microoperations that can be realized simultaneously in a single hardware configuration: $M_{proc} = G_{proc} / G_m$, where $G_m$ is the number of gate equivalents needed to realize a single microoperation.
However, in general, the actual number of microoperations that can be realized simultaneously in a single hardware configuration is a function of the parallelism in the application, the partitioning efficiency, and the average amount of hardware reuse that takes place from one configuration to the next. Therefore, $M_{config} = r \times M_{proc}$, where $r$ captures these effects. Hence, $C = M / M_{config}$. The number of gate equivalents required to represent $M_{config}$ microoperations in a single hardware configuration is $G_{config} = M_{config} \times G_m$. We define $B_g$ as the number of configuration bits needed to configure a single gate equivalent. Hence, to configure $G_{config}$ gate equivalents, a total of $B_g \times G_{config}$ configuration bits are needed.
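The chain of definitions above can be sketched in a few lines of Python. This is only an illustration of the model: the function names are ours, and the values used for $G_m$, $r$, and $B_g$ in the usage example are hypothetical placeholders, since this summary does not state them. Only $M$ = 10,000,000, $G$ = 100,000, and the roughly 50% cache overhead come from the text.

```python
def configurations_needed(M, G, G_overhead, G_m, r):
    """Return (M_config, C) for a die of G gate equivalents.

    G_overhead is the area not available for processing
    (G_cache for the cache-based design, G_opt for the optical one).
    """
    G_proc = G - G_overhead      # processing area left on the die
    M_proc = G_proc / G_m        # max microoperations per configuration
    M_config = r * M_proc        # actual microoperations per configuration
    C = M / M_config             # number of configurations for the application
    return M_config, C


def config_bits(M_config, G_m, B_g):
    """Return the number of configuration bits for one hardware configuration."""
    G_config = M_config * G_m    # gate equivalents used by the configuration
    return B_g * G_config


if __name__ == "__main__":
    # G_m=10, r=0.5 and B_g=8 are hypothetical values chosen for illustration.
    M_config, C = configurations_needed(
        M=10_000_000, G=100_000, G_overhead=50_000, G_m=10, r=0.5
    )
    print(M_config, C, config_bits(M_config, G_m=10, B_g=8))
```

Note that $C$ is kept as a ratio, as in the definition above, rather than rounded up to a whole number of configurations.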
$T_C$ is the time it takes to load the configuration data that is required to realize the $M_{config}$ microoperations used in a single hardware configuration. For the architecture that employs on-chip configuration cache, when reconfiguration is required, the needed configuration bits could either reside in the on-chip cache or off-chip in configuration memory. The probability that a needed configuration can be found in the cache, or the cache hit rate, is denoted as $P_{hit}$ and the miss rate as $P_{miss}$. Hence, the time required to reconfigure the processor is $T_C = P_{hit} \times T_{onchip} + P_{miss} \times T_{offchip}$, where $T_{onchip}$ is the time to access a single configuration from on-chip configuration cache and $T_{offchip}$ is the time to access a single configuration from off-chip configuration memory. We assume that the access to on-chip cache to retrieve the bits that constitute a single configuration is performed in parallel. Hence, $T_{onchip}$ is defined as the time to access a single configuration bit from on-chip cache.
Furthermore, since this architecture uses a single port to memory, $T_{offchip} = (B_g \times G_{config}) / S_{elec}$, where $S_{elec}$ is the bandwidth of the electronic serial channel to memory.
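As a minimal sketch, the reconfiguration time of the cache-based architecture can be computed as follows. The function names are ours, and the code assumes $P_{miss} = 1 - P_{hit}$. In the usage example, the 2 ns on-chip access time and the 50 MHz serial channel come from the comparison later in this section; interpreting 50 MHz as 50e6 bits per second, and the values of $B_g$, $G_{config}$, and $P_{hit}$, are assumptions made for illustration.

```python
def offchip_time_serial(B_g, G_config, S_elec):
    """Time to shift one configuration over a single serial channel.

    S_elec is the channel bandwidth in bits per second.
    """
    return (B_g * G_config) / S_elec


def cache_config_time(P_hit, T_onchip, T_offchip):
    """Expected reconfiguration time T_C with an on-chip configuration cache."""
    P_miss = 1.0 - P_hit  # a configuration is either in the cache or not
    return P_hit * T_onchip + P_miss * T_offchip


if __name__ == "__main__":
    # B_g=8, G_config=25_000 and P_hit=0.9 are hypothetical values.
    T_off = offchip_time_serial(B_g=8, G_config=25_000, S_elec=50e6)
    print(T_off, cache_config_time(P_hit=0.9, T_onchip=2e-9, T_offchip=T_off))
```

Because the off-chip term is several orders of magnitude larger than the on-chip term for values like these, the expected time is dominated by the miss rate.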
For the second architecture that employs optical channels, all configuration bits must be loaded from off-chip memory, so $T_C = T_{offchip}$. In this architecture, $N$ optical channels connect the reconfigurable processor to configuration memory. Therefore, $T_{offchip} = (B_g \times G_{config}) / (S_{opt} \times N)$, where $S_{opt}$ is the bandwidth of a single optical channel. With $N$ channels accessing configuration memory in parallel, $T_{offchip}$ is the time of the channel that needs to retrieve the most bits from off-chip memory. In this model, we assume the worst case number of bits that are needed per optical channel.
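The optical case can be sketched in the same way. The first function below applies the closed-form expression for $N$ channels. The second, which is our own illustrative addition, computes the load time from an explicit per-channel bit count and takes the slowest channel, which is the worst-case view described above. The channel bandwidth used in the usage example is a hypothetical value, as are $B_g$ and $G_{config}$.

```python
def optical_config_time(B_g, G_config, S_opt, N):
    """Reconfiguration time when the bits are split evenly over N optical channels."""
    if N < 1:
        raise ValueError("at least one optical channel is required")
    return (B_g * G_config) / (S_opt * N)


def worst_channel_time(bits_per_channel, S_opt):
    """Reconfiguration time set by the channel that must load the most bits."""
    return max(bits_per_channel) / S_opt


if __name__ == "__main__":
    # S_opt=1e9 bits/s, B_g=8 and G_config=25_000 are hypothetical values.
    print(optical_config_time(B_g=8, G_config=25_000, S_opt=1e9, N=16))
    # An even split of the same 200,000 bits over 16 channels gives the same time.
    print(worst_channel_time([12_500] * 16, S_opt=1e9))
```

When the bits are not evenly distributed, `worst_channel_time` returns a longer time than the closed-form expression, which is why the model is stated in terms of the most heavily loaded channel.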
Performance Comparison
In order to compare these two reconfigurable processors we use the execution time function ($T$) defined above. Since we assume that the processors are fabricated using the same technology, we can assume that $T_E$ is the same across processors and hence can be treated as a constant. We compare the performance of two processor architectures executing a single application consisting of 10,000,000 microoperations ($M$ = 10,000,000). Both processors have the same die area, which is $G$ = 100,000 logic gate equivalents. We compare the effect on total execution time of adding more communication channels to configuration memory with the effect of adding more configuration cache.
The first architecture employs on-chip configuration cache, $G_{cache}$, with an on-chip access time ($T_{onchip}$) of 2 ns and a single electrical channel to off-chip configuration memory with a bandwidth ($S_{elec}$) of 50 MHz. The cache hit rate ($P_{hit}$) for the first architecture is modeled as a function of cache size using the well known parachor curve.
Also, in Figure 3, for the optically reconfigurable architecture, we plot the total execution time, $T$, versus the number of channels, $N$, to configuration memory. The number of optical channels is varied from 1 to 20. The more channels used to access configuration memory, the shorter the execution time.
The analysis demonstrates that even with a few optical channels a significant improvement in performance can be achieved in comparison to systems using part of the die area for configuration cache. Further, when the number of channels is increased to 16-20, still a relatively small number, a substantial advantage over corresponding cache based systems can be achieved even for applications that exhibit very high degrees of locality.
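The comparison can be explored with a short script that evaluates $T = (T_C + T_E) \times C$ for both architectures. This is a sketch under stated assumptions, not a reproduction of Figure 3: the hit rate is passed in directly because the hit-rate curve is not given here, the detector area is assumed to grow linearly with the number of channels (`G_det` per channel), and the values of $G_m$, $r$, $B_g$, $T_E$, $S_{opt}$, `G_det`, $G_{cache}$, and $P_{hit}$ in the usage example are hypothetical. Only $M$, $G$, the 2 ns on-chip access time, and the 50 MHz serial channel (taken as 50e6 bits per second) come from the text.

```python
def execution_time_cache(M, G, G_cache, G_m, r, B_g, P_hit, T_onchip, S_elec, T_E):
    """Total execution time for the cache-based architecture."""
    M_config = r * (G - G_cache) / G_m          # microoperations per configuration
    C = M / M_config                            # number of configurations
    T_offchip = B_g * M_config * G_m / S_elec   # serial load of one configuration
    T_C = P_hit * T_onchip + (1.0 - P_hit) * T_offchip
    return (T_C + T_E) * C


def execution_time_optical(M, G, G_det, G_m, r, B_g, S_opt, N, T_E):
    """Total execution time for the optically reconfigured architecture."""
    G_opt = N * G_det                           # assumption: area linear in N
    M_config = r * (G - G_opt) / G_m
    C = M / M_config
    T_C = B_g * M_config * G_m / (S_opt * N)    # parallel load over N channels
    return (T_C + T_E) * C


if __name__ == "__main__":
    # Hypothetical parameter values, shared by both architectures.
    common = dict(M=10_000_000, G=100_000, G_m=10, r=0.5, B_g=8, T_E=1e-6)
    t_cache = execution_time_cache(
        G_cache=50_000, P_hit=0.95, T_onchip=2e-9, S_elec=50e6, **common
    )
    print("cache-based:", t_cache)
    for n in (1, 4, 16, 20):
        t_opt = execution_time_optical(G_det=100, S_opt=1e9, N=n, **common)
        print("optical, N =", n, ":", t_opt)
```

In this formulation the total configuration-loading time of the optical design, $T_C \times C$, reduces to $B_g \times G_m \times M / (S_{opt} \times N)$, so it falls in proportion to the number of channels, while the detector area only affects the $T_E \times C$ term.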
