Abstract-A custom processor called MAPLE, which supports static scheduling by automatic parallelizing compilers, is implemented and evaluated. MAPLE has a high-performance floating-point arithmetic unit and a low-latency data transfer mechanism to other MAPLE chips. The maximum operating frequency is 80 MHz in simulation, and operation on the prototype board with a 23 MHz clock has been confirmed. The chip consumes about 0.56 W at 23 MHz.
I. INTRODUCTION
Although it is easy to raise the peak performance of a multiprocessor simply by adding processing units, it is difficult to deliver effective performance to users without the support of automatic parallelizing compilers. However, such compilers have been tailored to existing multiprocessors, which were designed with little consideration for them. The multiprocessor system ASCA (Advanced Scheduling oriented Computer Architecture) has been proposed based on the idea that, rather than tailoring the parallelizing software to the machine, a multiprocessor system should be designed to make the best use of parallelizing software [1].
In ASCA, a multi-grain parallelizing compilation scheme is adopted [2]. The scheme can exploit the parallelism of a user program at various levels of granularity: coarse-grain parallelism (macro-dataflow computation), medium-grain parallelism (loop-level parallelism), which is used in most current compilers, and near-fine-grain parallelism (statement-level parallelism). Since near-fine-grain parallelizing compilation in particular requires precise static scheduling between operations, uncertain behavior of the processor must be completely excluded. To solve this problem, we have proposed the custom processor MAPLE (Multiprocessor system ASCA Processing eLEment) [1].
CHIP FEATURES
MAPLE is a 32-bit RISC processor that provides a simple structure with highly predictable operations. Its instruction set is an extension of that of DLX [3]. The five-stage pipeline of MAPLE is designed to execute every operation in a fixed number of clocks. 32-bit/64-bit IEEE Std 754-1985 floating-point units are provided [4].
There are 32 integer registers and 32 floating point registers.
Furthermore, 16 32-bit special registers called receive registers are provided for low-latency data transfer between MAPLE chips. The receive registers are directly connected to the instruction decode stage of the MAPLE pipeline. When the source processor executes a register-to-register transfer operation, the data is sent out directly from the memory access stage of its pipeline and is received directly by a receive register of the destination processor.
Fig. 1.
The prototype PE board (Fig. 3) provides a software cache control system, 512-Kbyte main memory, 32-Kbyte instruction RAM, 32-Kbyte flash ROM, and a serial interface. At system start-up, a monitor program in the flash ROM runs. Under the management of the monitor program, the user program code is loaded from the host computer through the serial interface and executed.
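As a rough illustration of the receive-register transfer described in this section, the sketch below models two PEs in Python. It is a toy functional model, not the MAPLE hardware: the class `PE` and the methods `send` and `add_from_recv` are hypothetical names introduced here, and no pipeline timing is modeled. The register counts (32 integer registers, 16 receive registers of 32 bits) are the only details taken from the text.

```python
NUM_INT_REGS = 32       # from the text: 32 integer registers
NUM_RECEIVE_REGS = 16   # from the text: 16 special 32-bit receive registers
MASK32 = 0xFFFFFFFF     # keep every value within 32 bits


class PE:
    """Toy model of one processing element (hypothetical, not the MAPLE RTL)."""

    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.int_regs = [0] * NUM_INT_REGS
        self.recv_regs = [0] * NUM_RECEIVE_REGS

    def send(self, dest, recv_index, src_reg):
        """Register-to-register transfer: copy one of our integer registers
        straight into a receive register of `dest`, without going through memory."""
        dest.recv_regs[recv_index] = self.int_regs[src_reg] & MASK32

    def add_from_recv(self, rd, rs, recv_index):
        """Use a receive register directly as an operand of an add."""
        value = self.int_regs[rs] + self.recv_regs[recv_index]
        self.int_regs[rd] = value & MASK32


def demo():
    pe0, pe1 = PE(0), PE(1)
    pe0.int_regs[1] = 40
    pe1.int_regs[2] = 2
    # In this model the caller plays the role of the static schedule:
    # the send must be ordered before the instruction that consumes the data.
    pe0.send(pe1, recv_index=3, src_reg=1)
    pe1.add_from_recv(rd=4, rs=2, recv_index=3)
    return pe1.int_regs[4]
```

The model has no handshake or flag between sender and receiver; the ordering of `send` before `add_from_recv` stands in for the compiler's static schedule. Whether the real chip adds any synchronization is not stated in this text.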
Fig. 3. Prototype PE board
This board operates with a 23 MHz clock, far below the 80 MHz target frequency of the MAPLE chip.
The main reason for the frequency degradation is that an I/O pin assignment error was found after board fabrication; the MAPLE chip is therefore mounted on a large daughter board that remaps the pin connections, which introduces various electrical problems.
B. Performance
The performance of the MAPLE chip with a 23 MHz clock is evaluated with a π-series calculation of 30,000 iterations, and the result is shown in Table III. For comparison, the execution result on UltraSPARC-II, a chip of a similar technology level, is also shown in the table [6]. The MAPLE chip is designed as a PE for a multiprocessor.
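The text does not say which series the benchmark uses. Purely as an illustration of this kind of floating-point kernel, the sketch below assumes the Leibniz series, pi = 4 (1 - 1/3 + 1/5 - ...), truncated at 30,000 terms. The function names `pi_series` and `pi_error` are hypothetical and are not taken from the benchmark code.

```python
import math


def pi_series(iterations):
    """Approximate pi with the Leibniz series (an assumed choice of series)."""
    total = 0.0
    sign = 1.0
    for k in range(iterations):
        total += sign / (2 * k + 1)   # one floating-point divide and add per term
        sign = -sign
    return 4.0 * total


def pi_error(iterations=30_000):
    """Absolute error of the truncated series against math.pi."""
    return abs(pi_series(iterations) - math.pi)
```

Each iteration is dominated by one floating-point division and one addition, so a kernel of this shape mainly exercises the floating-point unit.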
We simulated 4 PEs as one cluster and found that the π-series calculation performance of the cluster using receive registers is about 2.25 times higher than that of a single PE [5]. If the MAPLE chip runs at the designed 80 MHz clock with 4 PEs, the execution time becomes 95.8 ms, 7.83 times better than the result in the table. A single MAPLE chip with a 23 MHz clock consumes 0.56 W. Since power reduction techniques are not used in this design, this value could be reduced considerably.
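The 7.83 figure can be checked from the numbers quoted above, assuming that execution time scales inversely with clock frequency and that the 2.25 cluster speedup is unchanged at 80 MHz. The sketch below makes that arithmetic explicit; the function names are hypothetical, and the single-PE time it derives is only the value implied by the quoted projection, not a number read from Table III.

```python
def projected_speedup(cluster_speedup=2.25, board_mhz=23.0, target_mhz=80.0):
    """Speedup over one PE at board_mhz, assuming time scales as 1/clock."""
    return cluster_speedup * (target_mhz / board_mhz)


def implied_single_pe_time_ms(projected_time_ms=95.8):
    """Single-PE time at 23 MHz implied by the quoted 95.8 ms projection."""
    return projected_time_ms * projected_speedup()


if __name__ == "__main__":
    print(f"projected speedup: {projected_speedup():.2f}")
    print(f"implied single-PE time: {implied_single_pe_time_ms():.1f} ms")
```

The product 2.25 × 80/23 rounds to 7.83, which matches the stated ratio, and it implies a single-PE time of roughly 750 ms at 23 MHz.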
V. CONCLUSIONS
The MAPLE chip is an element processor for the static-scheduling-centric multiprocessor ASCA. Although its performance is lower than that of UltraSPARC-II, the gate count and power consumption of the MAPLE chip are so small that the cost/performance and power/performance of a MAPLE cluster have the potential to compete with recent superscalar processors.
