Abstract: An embedded architecture for an optical vector matrix multiplier (OVMM) is presented. The architecture aims to optimise the data flow of the vector matrix multiplier (VMM) and thereby improve its performance. Data dependence is discussed for the case in which the OVMM is connected to a cluster system. A simulator is built to analyse the performance of the architecture. Based on the simulation, Amdahl's law is used to analyse the hybrid opto-electronic system. It is found that the electronic part and its interaction with the optical part form the bottleneck of the system.
Introduction
Optical computing has attracted much attention since the 1970s, especially optical matrix algebra processors [1, 2]. The optical vector matrix multiplier (OVMM) is one of the most important matrix algebra processors, because the vector matrix multiplier (VMM) operation is widely used in scientific calculation. The OVMM is also an important optical interconnection technology for implementing cross-bar networks [3]. It has been demonstrated using free-space optics [1, 4-6] and waveguide optics [7]. Either implementation is in fact an electronic-optical hybrid system, so the optical part, the electronic part and the interface between them are all crucial to the whole system. In this paper, we focus on how to use mature computer and electronic technology to improve the performance of the OVMM. Fig. 1 shows the basic schematic diagram of the reflecting-type OVMM. A light emitting diode or vertical-cavity surface emitting laser (VCSEL) array, driven by a circuit, represents the input vector and projects light of different power onto the corresponding columns of a spatial light modulator (SLM). The absorption rate of each pixel of the programmable SLM, which acts as a matrix element, can be changed by the driving circuit. Each detector in the array collects the signals from its corresponding row, and the digital result is obtained through an analogue/digital (A/D) conversion circuit.
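The signal path described for Fig. 1 can be sketched numerically. The following is a minimal model, not the authors' simulator: the function name `ovmm`, the normalisation of emitter powers and pixel weights to the range [0, 1], and the choice of A/D full scale are all assumptions made for illustration.

```python
def ovmm(vector, matrix, adc_bits=8):
    """Model one optical vector-matrix multiplication (illustrative only).

    vector: emitter powers, one per SLM column, assumed normalised to [0, 1].
    matrix: SLM pixel weights, matrix[i][j] in [0, 1]; row i feeds detector i.
    Returns the analogue row sums and their quantised A/D codes.
    """
    n_cols = len(vector)
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("every SLM row needs one pixel per emitter")

    # Each detector integrates the light passed on by the pixels of its row.
    analogue = [sum(w * x for w, x in zip(row, vector)) for row in matrix]

    # Hypothetical A/D stage: full scale is the largest possible row sum.
    full_scale = float(n_cols)
    levels = (1 << adc_bits) - 1
    digital = [round(a / full_scale * levels) for a in analogue]
    return analogue, digital


analogue, digital = ovmm([1.0, 0.5], [[1.0, 0.0], [0.5, 1.0]])
print(analogue, digital)
```

All multiply-accumulate work happens in the single list comprehension that forms the row sums, which is the part the optics performs in parallel. Everything else in the function stands in for the electronic driving and conversion circuits.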
The development of high-speed, large-size vertical-cavity multiple quantum well electro-absorption SLMs [8] and high-speed VCSELs [9] makes it possible to build an OVMM that carries out a large number of multiply-accumulate operations. However, to handle a complex task (an example is given in Section 4), the OVMM must be built into a system that can also perform other operations, such as memory access and communication with other processors. For these reasons, the system is designed as an embedded general-purpose processing system, named the optical digital signal processor (ODSP). To obtain better performance, the ODSP is designed so that it can be connected to other processing systems, such as a distributed shared memory cluster system (detailed in Section 2). When this or a similar architecture is adopted, the coherence problem has to be avoided, as discussed in Section 3.
(CU), instruction pool and bus supervisor. The optical system shown in Fig. 1 and its driving electronic circuit form the OVMM, which is the computation core of the ODSP and carries out the VMM operation. For reasons of bandwidth and compatibility, an aligned bit data bus is adopted for the VMM.
The registers and caches form the memory hierarchy of the ODSP. The design of the memory hierarchy must balance price against speed. The VMM operation often involves transferring large amounts of matrix data, especially when the SLM needs to be refreshed, and the transfer of SLM data incurs a long latency. One way to speed it up is to widen the data bus, but this is restricted by the practical size of the device. Another is to increase the speed of the SLM driving circuit, which usually requires high-speed external memory. However, large high-speed memory is expensive and impractical, so caches are necessary.
In our architecture, the CU is the control core. It is mainly composed of an instruction fetch unit, an instruction decoder and an instruction execution part. ODSP instructions sometimes need many more than one operation cycle, so a three-stage CU benefits both pipelining and long-latency instructions. The CU fetches instructions from the instruction pool, which is a first-in-first-out queue.
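To illustrate why a three-stage CU suits long-latency instructions, the sketch below counts clock cycles for an in-order fetch/decode/execute pipeline fed from a first-in-first-out pool. It is a toy model under stated assumptions: fetch and decode each take one cycle, the execute stage stalls the pipeline until it finishes, and the instruction names and latencies are hypothetical rather than taken from the ODSP instruction set.

```python
from collections import deque


def run_pipeline(pool):
    """Simulate a three-stage, in-order pipeline.

    pool: deque of (name, execute_cycles) tuples, consumed front first.
    Returns (total_cycles, names_in_retirement_order).
    """
    if any(latency < 1 for _, latency in pool):
        raise ValueError("execute latency must be at least one cycle")

    fetch = decode = execute = None  # instruction held by each stage
    remaining = 0                    # execute cycles left for `execute`
    cycles = 0
    retired = []

    while pool or fetch or decode or execute:
        cycles += 1
        # Clock edge: advance instructions into free stages, last stage first.
        if execute is None and decode is not None:
            execute, decode = decode, None
            remaining = execute[1]
        if decode is None and fetch is not None:
            decode, fetch = fetch, None
        if fetch is None and pool:
            fetch = pool.popleft()
        # The execute stage works for one cycle on its current instruction.
        if execute is not None:
            remaining -= 1
            if remaining == 0:
                retired.append(execute[0])
                execute = None
    return cycles, retired


# Hypothetical program: a long VMM between two short memory instructions.
program = [("LOAD", 1), ("VMM", 3), ("STORE", 1)]
pipelined_cycles, order = run_pipeline(deque(program))
unpipelined_cycles = sum(2 + latency for _, latency in program)
print(pipelined_cycles, unpipelined_cycles, order)
```

In this model the fetch and decode of later instructions overlap with the long execute phase of the VMM instruction, so the pipelined cycle count is lower than the sum of the per-instruction costs.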
According to Flynn's taxonomy [10], the ODSP shown in Fig. 2 can itself be classified as a single instruction stream, multiple data streams (SIMD) processor: some ODSP instructions, such as the VMM operation, operate on multiple data streams in a single instruction.
Depending on the application, the OVMM processor can be connected to different systems. For example, a digital signal processor can serve as the main controller when high-speed parallel input and output interfaces are provided [11]. This kind of design is particularly suited to handling large, high-speed input and output vector data flows, for example in universal mobile telecommunication system applications.
In other situations, for instance when compatibility is a consideration, the OVMM processor can be connected to a computer [4, 5, 12] or a multiprocessor [13]. Techniques from computer science, such as the distributed shared memory cluster system, can also be used.
The system shown in Fig. 3 is a distributed shared memory cluster system. Besides the shared memory, each processor has its own memory in order to reduce the load on the bus. Because communication between processors is complex, a bus supervisor is introduced to coordinate data and instructions between the ODSP, IO equipment, private memory and public memory.
At the operating system level, the processes all wait in a message queue. When the ODSP is connected to a cluster system as an embedded processor, there is more than one message queue, and the ODSP acts as part of a multiple instruction streams, multiple data streams (MIMD) system [12]. The master CPU has more resources to allocate to processes, so programs can run simultaneously. If more than one ODSP is connected to the network, the master CPU has more freedom to distribute VMM tasks to idle ODSPs through the bus.
Data dependences
When the ODSP is connected to a cluster system (such as that in Fig. 3), multiple copies of data held in the memory of the ODSP and of other processors can cause dependence problems [14]. Fig. 4 shows a program segment that includes ODSP operations. The instruction in L1 dynamically allocates a global pointer, which points to the head address of a matrix memory. The instructions between PushV and EndPushV are passed to the ODSP and executed by it. In L2, data are copied from IO device i to the matrix memory A and to the cache of the ODSP.
In the program, the master CPU only allocates a block of memory and passes instructions to the ODSP. Because the master CPU and the ODSP each have their own message queue, neither knows whether L3 executes before or after L4. If L4 should happen before L3 but L3 reads A before L4 has written it, a read-after-write hazard occurs. If L4 should happen after L3 but L4 writes A before L3 has read it, a write-after-read hazard occurs. Similarly, the processor executing L5 does not know whether the data in A come from L2 or L4, which may lead to a write-after-write hazard. To prevent these hazards, the system makes the memory operations sequential: while the pointer A is in the instruction pool, the system makes the memory block inaccessible. Fig. 5 illustrates the idea: while memory block A is in the queue of an ODSP instruction pool, it is protected from being written or read by other processors. If the master CPU tries to write A, it has to either switch to another process in its message queue or wait until the access ban on A is lifted.
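The access ban described above can be sketched as a small bookkeeping structure. This is an illustrative model only: the class `BlockGuard`, the helper `next_runnable` and the process names are hypothetical, and the sketch ignores the atomicity that a real bus supervisor would need when several processors update the table at once.

```python
class BlockGuard:
    """Tracks memory blocks referenced by instructions queued in an ODSP pool."""

    def __init__(self):
        self._pending = {}  # block name -> number of queued ODSP instructions

    def enqueue(self, block):
        """An ODSP instruction using `block` enters the instruction pool."""
        self._pending[block] = self._pending.get(block, 0) + 1

    def retire(self, block):
        """An ODSP instruction using `block` has completed."""
        count = self._pending.get(block, 0)
        if count == 0:
            raise KeyError(f"no queued instruction uses block {block!r}")
        if count == 1:
            del self._pending[block]  # the access ban is lifted
        else:
            self._pending[block] = count - 1

    def can_access(self, block):
        """Other processors may touch `block` only when nothing is queued on it."""
        return block not in self._pending


def next_runnable(guard, message_queue):
    """Pick the first process whose block is not banned, or None to wait.

    message_queue: list of (process_name, block_name) pairs in queue order.
    """
    for process, block in message_queue:
        if guard.can_access(block):
            return process
    return None


guard = BlockGuard()
guard.enqueue("A")                      # ODSP instruction on block A is queued
queue = [("write_A", "A"), ("write_B", "B")]
first_choice = next_runnable(guard, queue)
guard.retire("A")                       # the ODSP instruction completes
second_choice = next_runnable(guard, queue)
print(first_choice, second_choice)
```

While block A is banned, the master CPU in this sketch switches to the process that works on block B. Once the ODSP instruction retires, the process that writes A becomes runnable again, which corresponds to the two options described for Fig. 5.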
Model of ODSP and analyses
To study the performance of ODSP, a simulator of ODSP is built according to the embedded architecture using the following parameters:
1. The SLM is 256 × 256, 8 bits.
To test the performance of the ODSP, a program that involves many vector matrix multiplications is run on the simulator. A typical example is 2-D template matching [15]. Template matching is a method of finding out whether the target pattern contains an object sufficiently similar to the source object in the template. The k × j template pattern is named picture A, and the m × n picture to be checked is named picture B. The basic operation of template matching is cross-correlation
where p and q represent the x and y distances from the bottom-right corner of picture A to the top-left corner of picture B:
The range of s and t covers the overlapping area of the two patterns, that is, the whole area of pattern B:
To highlight the activity of the ODSP rather than the details of template matching, some preprocessing and postprocessing of the pictures has been applied. First, the pattern is normalised before the operation. Our arithmetic is based on (2)
Secondly, the shape, size, grey level and orientation of the patterns in the two pictures are all assumed to be the same. Finally, the algorithm is optimised, according to the characteristics of the ODSP and the size of the SLM, so that the ODSP is used effectively. In our program, the template pattern is a 32 × 32, 8-bit grey picture and pattern B is a 256 × 256, 8-bit grey picture. As a fast algorithm in the simulator, pattern B is duplicated eight times, as shown in Fig. 6.
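To make the link between template matching and the VMM operation concrete, the sketch below evaluates a generic, unnormalised cross-correlation as a single vector matrix product. It assumes the textbook form, in which the score at offset (p, q) is the sum over s and t of A(s, t) · B(s + p, t + q); the indexing and normalisation may differ from the equations used in this paper, and the mapping of picture windows to matrix rows is an illustrative choice, not the optimised algorithm of Fig. 6. All function names are hypothetical.

```python
def windows_as_matrix(picture, k, j):
    """Build one matrix row per offset (p, q): the flattened k x j window."""
    m, n = len(picture), len(picture[0])
    offsets, rows = [], []
    for p in range(m - k + 1):
        for q in range(n - j + 1):
            rows.append([picture[p + s][q + t]
                         for s in range(k) for t in range(j)])
            offsets.append((p, q))
    return offsets, rows


def vmm(matrix, vector):
    """Vector matrix multiplication: one dot product per matrix row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def match_template(template, picture):
    """Return the best-scoring offset and its cross-correlation score."""
    k, j = len(template), len(template[0])
    flat_template = [value for row in template for value in row]
    offsets, rows = windows_as_matrix(picture, k, j)
    scores = vmm(rows, flat_template)  # the step an OVMM would perform
    best = max(range(len(scores)), key=scores.__getitem__)
    return offsets[best], scores[best]


template = [[1, 2],
            [3, 4]]
picture = [[0, 0, 0, 0],
           [0, 0, 1, 2],
           [0, 0, 3, 4],
           [0, 0, 0, 0]]
best_offset, best_score = match_template(template, picture)
print(best_offset, best_score)
```

Only the call to `vmm` corresponds to work done by the optical core. Building the window matrix and searching for the maximum are memory and register operations, which is consistent with the observation that the electronic part dominates the instruction count.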
With the simulator, the system performance can be analysed at the assembly language level. Fig. 7 shows that the VMM operation period occupies only about 0.83% of the whole program, whereas the percentage is 18.09% in the assembly code generated by Microsoft Visual C++ 6.0 (VC6). A second comparison analyses the 1024-point discrete Fourier transform: the VMM operation accounts for only about 0.006% in the ODSP, whereas it is 13.33% in the VC6 assembly language. The model thus confirms that the OVMM is an effective way to execute the VMM operation, because it reduces the arithmetic operations to a small percentage of the whole operation.
The model also shows that the electronic part is the bottleneck of the hybrid opto-electronic system. To study further how the electronic part affects the system performance, Amdahl's law [16] can be applied to the model above. The law can be summarised as

$$\text{Speedup} = \frac{1}{r_s + r_p / n}$$

where $r_s$ is the part of the entire task that is not enhanced, $r_p = 1 - r_s$ is the part that has been enhanced, and $n$ is the factor by which the $r_p$ part is faster than before.
Amdahl's law is a fundamental law in computer architecture for evaluating the system performance gain when some portion of the system is improved. Fig. 8a is drawn according to Amdahl's law for 2-D template matching on the model when an isolated factor is varied. The ODSP clearly enhances performance, as the speedup increases steeply when n < 0.32. However, the trend flattens when n > 0.32. This shows that, even though the architecture in this paper tries to make the electronic part keep pace with the optical part, the performance of register, memory and vector operations still forms bottlenecks of the system. Simply enhancing the speed of the VMM would do little for the overall performance of the system. Fig. 8a also shows that, even if the performance of these factors is enhanced in isolation, the contribution to the system becomes negligible once the enhancement is greater than 100. For the whole system, performance enhancement requires all parts to speed up in a coordinated way; a lagging factor limits the system performance.
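The saturation described here follows directly from the standard form of Amdahl's law, in which the speedup is 1 / (r_s + r_p / n) with r_s = 1 - r_p. The sketch below evaluates it for a few enhancement factors. The enhanced fraction of 0.8 is a hypothetical value chosen only to make the flattening visible; the second calculation treats the 18.09% VMM share reported for the VC6 assembly code as the enhanced fraction, which is an assumption made for illustration rather than a result from the simulator.

```python
def amdahl_speedup(r_p, n):
    """Overall speedup when a fraction r_p of the task runs n times faster."""
    if not 0.0 <= r_p <= 1.0:
        raise ValueError("r_p must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    r_s = 1.0 - r_p  # fraction of the task that is not enhanced
    return 1.0 / (r_s + r_p / n)


# Hypothetical enhanced fraction: the gain flattens as n grows.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.8, n), 3))

# Upper bound as n grows without limit is 1 / r_s.
# Assumption: use the 18.09% VMM share of the VC6 code as r_p.
vmm_only_bound = 1.0 / (1.0 - 0.1809)
print(round(vmm_only_bound, 3))
```

Under that assumption, accelerating the VMM alone cannot make the program more than about 1.22 times faster, however fast the optics become. This agrees with the argument that the remaining register, memory and vector operations limit the system.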
However, the effort to enhance the performance of the OVMM is still important for the whole system. First, speeding up the VMM shifts the plane shown in Fig. 8b in parallel towards larger values, which gives the other parts more room to enhance system performance. Secondly, the operation of the OVMM is never isolated: operations such as memory or IO operations are closely connected with the VMM operation in the ODSP.
Some program parameters also affect the proportions of the workload. As an example, we study how the VMM operation varies with the size of picture B or of the template in 2-D template matching. As shown in Fig. 9, when picture B or the template is relatively small, memory and register initialisation operations occupy a relatively high percentage of the total operations. As picture B or the template grows, some data in the memory do not need to be updated, so the percentage of VMM operations increases. However, once picture B or the template reaches a certain size, the proportion of VMM operations becomes stable: the increase is limited by both the memory size and the finite percentage of saved operations, in accordance with Amdahl's law.
As mentioned above, changes to some system parameters, such as SLM size, affect more than one operation. Different SLM sizes have been considered for 2-D template matching, as shown in Fig. 10. In order to use the same algorithm, the width and height of input picture B are 2048 times as the original mode (256). Although the VMM operation occupies only a small percentage, the total instruction number of 2-D template matching decreases as the SLM size increases while the SLM is relatively small. The reason is that a larger SLM also removes some register and memory operations, until the limit set by Amdahl's law is reached. Although increasing the SLM size effectively reduces the total instruction number when the SLM is relatively small, it is a real challenge for the available technology. For example, under the conditions above, reducing the total instruction number to 30% of that for a 256 × 256 SLM requires an SLM resolution of 1024 × 1024.
Conclusion
In this paper, an embedded architecture for the VMM is presented. The architecture is designed to avoid bottlenecks in the whole system and to make it compatible with established computer technology. When the ODSP is connected to a cluster, the data dependence problem must be avoided. A simulator of the ODSP is built according to the architecture and a 2-D template matching program is run on it. Based on the program, the factors that speed up the system are analysed according to Amdahl's law. The electronic part and its interaction with the optical VMM are the bottleneck of the hybrid optical-electronic system. Enhancing an isolated factor has little impact on the system. The optical architecture designer should therefore attend to both the optical and the electronic parts of the system, because enhancing overall factors is more effective for the system.
