We propose a low-power resource-shared VLIW processor (RSVP) for future leaky nanometer process technologies. It consists of several single-way independent processor units (IPUs) that share parallel processor resources. Each IPU works as a variable-way VLIW processor, sharing the parallel resources according to the priorities of the given tasks. The RSVP allocates the shared parallel resources to the IPUs cycle by cycle, which minimizes the number of power-wasting NOPs. The performance per power (P³) of a 4-parallel 4-way RSVP, which corresponds to four 4-way VLIWs, is 3.7% better than that of a conventional 4-parallel 4-way VLIW multiprocessor in the current 90 nm process. We estimate that the RSVP achieves 36% less leakage power and 28% better P³ in the future 25 nm process. We have fabricated an RSVP test chip in a 180 nm process that contains two IPUs and a shared resource, equivalent to two 2-way VLIWs. It is functional at a 100 MHz clock speed and dissipates 130 mW.
Introduction
In this paper, we propose a resource-shared VLIW processor (RSVP) for efficient on-chip multiprocessing. The RSVP architecture is based on simultaneous multithreading (SMT) technology [1]. Conventional SMT architectures tend to be applied to high-performance microprocessors. We instead propose the RSVP architecture for embedded applications, in which area and power are limited.
The RSVP consists of several single-way independent processing units (IPUs) that share parallel resources. Each IPU works as a variable-way VLIW processor with shared processing elements. In conventional parallel processors, most of the hardware resources for parallel processing are idle or execute NOPs almost all the time, since the average IPC (instructions per clock) is much smaller than the number of parallel resources provided. Adding more hardware raises the instantaneous IPC, but that hardware sits idle and wastes power the rest of the time.
Wasted power can be reduced with well-known power-saving techniques. Gated-clock and VT-CMOS techniques are effective for saving active power, while MT-CMOS reduces leakage power. The upcoming nanometer technology, however, makes these conventional power-saving techniques insufficient because of leaky transistors. Leakage current is proportional to the chip area.
Hardware solutions accelerate given tasks considerably when the tasks have explicit data-level parallelism, but such tasks (applications) are limited to multimedia ones. Software solutions on processors can deal with any kind of task, from highly parallel multimedia tasks to serial ones, but processors occupy a large area and are slower than dedicated hardware. Therefore, it is indispensable to reduce the area of processors for low-power on-chip multiprocessing. The RSVP increases the performance per area, which reduces the leakage current that dominates in the nanometer era.
This paper is organized as follows. Section 2 gives the architecture of the proposed resource-shared VLIW processor (RSVP). In Sect. 3, we evaluate the efficiency of the RSVP in terms of performance, area, and power compared with conventional VLIW processors. Section 4 forecasts the total dissipated power, including active and leakage power, in the future nanometer era. Section 5 describes an RSVP test chip that we have fabricated in a 180 nm process. We conclude this paper in Sect. 6.
Figure 1 shows the architecture of the resource-shared VLIW processor, RSVP. The left side shows a conventional symmetric multiprocessor that contains two 2-way VLIWs. The right side shows an RSVP that contains two independent processor units (IPUs, abbreviated as "I") and one shared pipelined processing element (SPPE, abbreviated as "S"). It corresponds to two 2-way virtual VLIW processor cores that share one SPPE.
An IPU gets an SPPE if the priority of its task is higher and it can issue two instructions at the same time. The number of ways of each IPU can change dynamically, cycle by cycle. The IPU that runs the higher-priority task always wins the required number of hardware resources.
Figure 2 shows a detailed block diagram of the RSVP. It consists of independent and shared portions. An IPU in the independent portion works as a scalar processor with a program counter, a register file, and an independent pipelined processing element (IPPE) that executes instructions in a four-stage pipeline. Without SPPEs, an IPU executes a single instruction per cycle; it obtains SPPEs when executing multiple instructions in one cycle. At that time, the IPU and the assigned SPPEs form a VLIW processor. The number of ways is changed dynamically, cycle by cycle, by the Pipeline & Forwarding Controller (PFC).
Figure 4 shows the structure of the PFC. It consists of a priority encoder, shift registers, multiplexers, and demultiplexers. Each stage of the shift registers holds the IPU number associated with the instruction in that stage. The priority encoder chooses the instruction from the IPU that executes the highest-priority task. According to the IPU number, the multiplexer receives forwarding data from other SPPEs and IPUs, and the demultiplexer sends forwarding data.
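The PFC bookkeeping described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the fabricated design: the names `priority_encode` and `OwnerShiftRegister` are hypothetical, a lower IPU index is assumed to mean a higher task priority, and the SPPE pipeline is assumed to have the same four stages as the IPPE.

```python
from collections import deque

PIPELINE_STAGES = 4  # assumption: SPPE pipeline depth equals the IPPE's four stages


def priority_encode(requests):
    """Return the index of the highest-priority requesting IPU, or None.

    `requests[k]` is True when IPU k wants the SPPE this cycle.
    Assumption: a lower index means a higher task priority.
    """
    for ipu, wants_sppe in enumerate(requests):
        if wants_sppe:
            return ipu
    return None


class OwnerShiftRegister:
    """Tracks, per pipeline stage, which IPU owns the instruction in an SPPE."""

    def __init__(self, stages=PIPELINE_STAGES):
        self.stages = deque([None] * stages, maxlen=stages)

    def clock(self, granted_ipu):
        # The new owner enters stage 0; the oldest entry falls off the far end.
        self.stages.appendleft(granted_ipu)

    def owner(self, stage):
        # The IPU number selects the mux/demux path for forwarding data.
        return self.stages[stage]


pfc = OwnerShiftRegister()
for requests in ([False, True], [True, True], [False, False]):
    pfc.clock(priority_encode(requests))
print(list(pfc.stages))  # prints [None, 0, 1, None]
```

In this sketch, the entry at each stage plays the role of the select signal for the forwarding multiplexer and demultiplexer: a result leaving a given stage is routed to the IPU recorded there.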
Architecture and Behavior of RSVP
If the total number of instructions assigned to all IPUs exceeds the sum of IPUs and SPPEs, an IPU with the higher-priority task wins the SPPEs. When the number of ways available to an IPU with a lower-priority task is less than its number of parallel instructions, as many instructions as possible are executed in the current cycle, and the remaining instructions are postponed to the following cycles. In conventional VLIW processors, a group of parallel instructions assigned to a single cycle cannot be ungrouped and executed over multiple cycles, because the results become inconsistent if an instruction uses a source register that is overwritten by another instruction in the same cycle. In the proposed RSVP, we prohibit grouping instructions that cannot be ungrouped for multiple-cycle execution. Figure 3 shows a program execution scheme of an RSVP that contains three IPUs and two SPPEs. It can be configured as either of the following two structures.
1. one 3-way VLIW + two 1-way VLIWs (Cycles n and n + 2 in Fig. 3)
2. two 2-way VLIWs + one 1-way VLIW (Cycle n + 1 in Fig. 3)
Each IPU runs statically scheduled 3-way VLIW code with a priority. The IPU with the highest priority (IPU0) always wins the required number of SPPEs. The remaining IPUs can obtain SPPEs if the code in IPU0 contains NOPs. If an IPU with a lower priority fails to execute all of its prepared VLIW instructions, the remaining ones are postponed to the next cycle. In Fig. 3, IPU2 fails to execute the instruction in s1 at Cycle n, which is postponed to the next cycle (Cycle n + 1).
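The two rules of this section, the grouping restriction and the priority-based allocation with postponement, can be sketched in Python as follows. This is a simplified model under stated assumptions rather than the actual RSVP control logic: the function names and the instruction labels are hypothetical, IPU index 0 is assumed to have the highest priority, NOPs are simply omitted from each bundle, and the example program is invented for illustration (it is not the code shown in Fig. 3).

```python
from collections import deque


def can_be_ungrouped(group):
    """Return True if a bundle may be split over several cycles.

    `group` is a list of (dest_register, source_registers) pairs.
    Following the rule in the text, a bundle is rejected when any
    instruction reads a register that another instruction in the
    same bundle writes.
    """
    for i, (_, sources) in enumerate(group):
        for j, (dest, _) in enumerate(group):
            if i != j and dest in sources:
                return False
    return True


def simulate(programs, num_sppes):
    """Issue instructions cycle by cycle.

    `programs[k]` is the list of bundles for IPU k; index 0 has the
    highest priority. Each bundle lists its non-NOP instructions.
    Returns one list per cycle holding what each IPU issued.
    """
    queues = [deque(deque(bundle) for bundle in prog) for prog in programs]
    trace = []
    while any(queues):
        free_sppes = num_sppes          # SPPEs are re-allocated every cycle
        cycle = []
        for queue in queues:            # visit IPUs in priority order
            issued = []
            if queue:
                bundle = queue[0]
                if bundle:              # the IPU's own IPPE issues one instruction
                    issued.append(bundle.popleft())
                while bundle and free_sppes > 0:
                    issued.append(bundle.popleft())
                    free_sppes -= 1
                if not bundle:          # bundle finished; otherwise the rest is postponed
                    queue.popleft()
            cycle.append(issued)
        trace.append(cycle)
    return trace


programs = [
    [["a0", "a1", "a2"], ["b0"]],       # IPU0, highest priority
    [["c0"], ["d0", "d1"]],             # IPU1
    [["e0", "e1", "e2"], ["f0"]],       # IPU2, lowest priority
]
for n, cycle in enumerate(simulate(programs, num_sppes=2)):
    print(n, cycle)
```

With three IPUs and two SPPEs, the first cycle of this example runs as one 3-way VLIW plus two 1-way VLIWs (IPU2 issues only `e0`), and the second cycle runs as two 2-way VLIWs plus one 1-way VLIW, with the postponed `e1` and `e2` issued by IPU2.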
The peak performance of a parallel processor improves with the number of ways, but so does the amount of idle hardware. For example, the IPC of an 8-way VLIW processor [3] is 3.14, which means that about five of the eight parallel resources (8 - 3.14 ≈ 4.9) are idle or execute NOPs on average. In the upcoming nanometer processes, transistor leakage current becomes dominant, in contrast to the current submicron processes, in which dynamic and short-circuit currents are dominant. Therefore, these idle parallel resources consume power that cannot be neglected in future nanometer processes. If they can be used by other processors, the total performance can be increased while the circuit area is reduced. We propose the RSVP architecture to keep hardware resources active as much as possible while keeping the hardware as small as possible by sharing rarely used parallel resources. The RSVP retains the peak performance of multiple VLIW cores that contain the same number of processing elements, since the IPU with the highest priority always wins the maximum number of shared processing elements.
Efficiency of Parallel Computation: Performance per Power or Area
It is commonly accepted that parallel computing increases performance per power (P³). Here we evaluate P³ for synthesizable VLIW processors and the proposed RSVP, both based on the MIPS architecture [4]. We use the seven reference programs written in C listed in Table 1 to obtain the average IPC and MIPS (million instructions per second) of each processor.
These reference programs are compiled with GCC for ordinary scalar processors and then manually converted to VLIW code. MIPS values are obtained by running these reference programs on instruction-set simulators (ISS).
Comparisons among Single and Multiple VLIW Processors
Here, we evaluate performance per power and performance per area for three configurations: a 2-way VLIW processor (1X-2W), a double-speed 2-way VLIW processor (2X-2W), and a 2-parallel 2-way VLIW processor (2P-2W). First, a 2-way VLIW processor is synthesized at a 100 MHz clock frequency using a 180 nm CMOS library characterized at 1.8 V [6]. It is a MIPS-compatible VLIW processor described at the RTL. It is then re-synthesized at a 200 MHz clock frequency using a library characterized at 2.4 V. The power dissipated by VLIW processors consists of the following two factors.
One is proportional to the number of ways N_w and the number of parallel processors N_p; the other is proportional to the number of instructions per clock (I_VLIW). Equation (1) gives the dissipated power. From netlist-level simulations, we have obtained the function P_f() and the parameter P_I in Eq. (1). Equation (2) is the equation used for 1.8 V. Table 2 shows the results. The area of 2X-2W is just 8.4% larger than that of 1X-2W, but its power is 4.7 times larger. On the other hand, the area and power of 2P-2W are both twice those of 1X-2W. As a result, the P³ of 2P-2W is 2.3 times better than that of 2X-2W. As for the performance per area (P²A), however, the multiprocessor (2P-2W) solution is inferior to 2X-2W. In the future leaky nanometer era, leakage power will be proportional to the area or the number of transistors. Therefore, it is indispensable to reduce the area while maintaining performance. The proposed RSVP architecture reduces the area by sharing rarely used parallel resources, so its performance per area becomes better than that of conventional VLIWs. These results are shown in the following section.
4), which consists of power from the IPUs, SPPEs and PFC.
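The comparison among 1X-2W, 2X-2W, and 2P-2W can be checked with a short calculation. The sketch below uses only the relative area and power figures quoted in the text; the relative performance values are an assumption (both 2X-2W and 2P-2W are taken to deliver exactly twice the throughput of 1X-2W), and the helper names `p3` and `p2a` are hypothetical.

```python
# Values relative to the 1X-2W baseline. Area and power come from the text;
# the performance factors are an idealized assumption (exactly 2x).
configs = {
    "1X-2W": {"perf": 1.0, "area": 1.0, "power": 1.0},
    "2X-2W": {"perf": 2.0, "area": 1.084, "power": 4.7},
    "2P-2W": {"perf": 2.0, "area": 2.0, "power": 2.0},
}


def p3(cfg):
    """Performance per power, relative to the baseline."""
    return cfg["perf"] / cfg["power"]


def p2a(cfg):
    """Performance per area, relative to the baseline."""
    return cfg["perf"] / cfg["area"]


p3_gain = p3(configs["2P-2W"]) / p3(configs["2X-2W"])
p2a_gain = p2a(configs["2X-2W"]) / p2a(configs["2P-2W"])
print(f"P3 of 2P-2W over 2X-2W:  {p3_gain:.2f}x")
print(f"P2A of 2X-2W over 2P-2W: {p2a_gain:.2f}x")
```

Under these assumptions the P³ ratio comes out at about 2.35, in line with the reported factor of 2.3, while 2X-2W keeps a performance-per-area advantage of roughly 1.8 times over 2P-2W.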
Comparison among VLIW Processors and the Proposed RSVP
The performance of the RSVPs is slightly inferior to that of the VLIWs. The areas of the RSVPs, however, are much smaller, and the P³ values are almost equivalent. The P²A of the RSVPs is always better than that of the VLIWs, as shown in Fig. 5. The 2P-2W RSVP is 10% better in MIPS/mm² than the 2P-2W VLIW, while the 2P-3W RSVP is 14% better. The MIPS/mm² advantage of the nP-4W RSVPs over the VLIWs grows with n, the number of independent parallel processors. The MIPS/mm² of the 4P-4W RSVP, for example, is 33% better than that of the conventional 4P-4W VLIW multiprocessor. Note that better P³ will lead to smaller leakage current in the future nanometer processes, as shown in the next section.
The area of the shared portion can be enlarged by migrating rarely-used instructions to SPPEs while keeping IPCs. Note that the above comparisons are estimated assuming that IPUs and SPPEs have the same execution units.
The areas in Table 3 include the areas of the PFCs, which grow with the number of SPPEs and IPUs. However, the PFC of the 4P-4W RSVP occupies just 4.2% (0.34 mm²) of the total area (7.93 mm²). Even if the total number of processors is large, it is not mandatory to connect all IPUs and SPPEs through PFCs, because the number of concurrent instructions is generally limited to four at most. A wider connection through the PFC would increase area and power but contribute little to performance.
Perspectives of Total Dissipating Power in the Nanometer Era
The estimations in the last section were made using a 180 nm process, in which leakage current can be ignored. In the nanometer era, subthreshold and gate leakage currents become dominant. The dissipated power of an integrated circuit is given by Eq. (5).
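For orientation, a commonly used decomposition of CMOS power into dynamic, short-circuit, and leakage terms is given below. It is a textbook form assumed here, not necessarily the exact form of Eq. (5): the symbols α (switching activity), C_L (switched capacitance), f (clock frequency), V_DD (supply voltage), and I_SC (short-circuit current) are assumptions, while I_Leak is the leakage current discussed in the text.

```latex
P = \alpha \, C_L \, V_{DD}^{2} \, f \;+\; I_{SC} \, V_{DD} \;+\; I_{Leak} \, V_{DD}
```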
The factor I_Leak increases with the number of transistors, i.e., the circuit area. Table 4 summarizes several features of the 4P-4W RSVP and VLIW as the process is scaled down. The values for the 90 nm process are obtained from an actual 90 nm high-speed library. They are scaled using the ITRS 2001 high-performance model [7].
Figure 6 shows the dynamic and leakage power of the 4P-4W VLIW and RSVP. From 180 nm to 90 nm, power decreases drastically, since the active power decreases with transistor scaling while leakage power increases only slightly. However, leakage power becomes dominant as the process is scaled further. Below the 90 nm process, power dissipation does not change drastically. The RSVP dissipates less power since it occupies a smaller area than the VLIW multiprocessor.
Figure 7 shows the performance per power (P³) in MIPS/W of the RSVP and VLIW. The P³ of the RSVP is almost equivalent to that of the VLIW in the 180 nm process, but its advantage grows with process scaling. In the 90 nm process, the P³ of the RSVP is just 3.7% better than that of the VLIW. In the 25 nm process, the P³ of the RSVP will be 28.2% better. On the other hand, the performance per area (P²A) in MIPS/mm² increases at almost the same rate for both the RSVP and the VLIW. The P²A of the RSVP is 32.8% better than that of the VLIW in all the processes in Table 4.
Test Chip
We have fabricated a test chip of the RSVP that contains two IPUs and one SPPE in a 5.9 mm die using a 1P/5M 180 nm CMOS process. Table 5 shows the gate counts for the IPU, the SPPE, and the Pipeline & Forwarding Controller (PFC). The SPPE of the implemented LSI contains no multiplier because of the area limitation. Figure 8 shows a chip micrograph and Table 6 shows its specifications. The RSVP core is fully synthesized from RTL SystemC code and placed and routed automatically with conventional DA tools. The test chip is fully functional at a 100 MHz clock frequency, and its power dissipation is 130 mW, including the RSVP core and the data/instruction memory macros. Figure 9 shows a Shmoo plot.
Conclusions
The resource-shared VLIW processor (RSVP) increases the performance per area by sharing parallel hardware resources. The RSVP consists of independent processing units (IPUs) and shared pipelined processing elements (SPPEs). An IPU with the higher-priority task always wins the required number of SPPEs. Estimations based on the ITRS 2001 high-performance transistor models show that the 4P-4W RSVP with four IPUs and four SPPEs achieves 3.7% better performance per power than the conventional 4P-4W VLIW multiprocessor in the current 90 nm process. In the nanometer era, with leakier transistors, leakage power will become dominant. The 4P-4W RSVP shows 28% better performance per power than the 4P-4W VLIW in the future 25 nm process.
We have fabricated a test chip in a 1P/5M 180 nm CMOS process that contains a 2P-2W RSVP. It is fully functional, and its power consumption at 100 MHz is 130 mW.
Parallel processing is frequently used to lower clock frequencies and supply voltages while enhancing performance. In general, the parallel processing resources provided are rarely used except for single-instruction multiple-data (SIMD) operations. These unused resources will consume idle power in future nanometer processes. The proposed resource-shared VLIW processor architecture offers a solution for low power in the nanometer era.
