Near fine grain parallel processing using a multiprocessor with MAPLE
T. ABE, K. IWAI, T. MORIMURA, R. OGAWA, K. YASUFUKU, H. AMANO
Department of Computer Science, Keio University, 3-14-1 Hiyoshi, Yokohama, 223-8522, Japan.
asca@am.ics.keio.ac.jp
National Defence Academy, 1-10-20 Hashirimizu, Yokosuka, 239-8686, Japan.
iwai@nda.ac.jp
Abstract
The multi-grain parallelizing scheme is an effective parallelizing scheme that exploits parallelism at several levels of a sequential program: coarse-grain (macro-dataflow), medium-grain (loop-level parallelizing) and near-fine-grain (statement-level parallelizing). The multi-processor ASCA is designed for efficient execution of multi-grain parallelized programs.
A processing element called MAPLE is designed mainly for near-fine-grain parallelism and has two modules, the MAPLE core and the DTC. The MAPLE core is a simple RISC processor that executes every operation in a fixed time and realizes direct register-to-register transfer. The DTC realizes a software controlled cache driven by instructions generated by the compiler. With static scheduling, near-fine-grain parallel processing is performed efficiently using a communication mechanism with receive registers and a non-synchronized operation mechanism.
Implementation of the prototype chip and clock-level simulation indicate that the performance of a single-chip multi-processor with 4 MAPLEs is close to that of modern super-scalar processors in spite of its small hardware and low clock frequency.
KEYWORDS: processor, cache, static scheduling, multi-grain parallelism, parallel computing system
1 Introduction
Automatic parallelizing compilation schemes are important because they save ordinary programmers the effort of writing effective parallel code for the target machine. Although these schemes are useful for various multi-processors, maximum performance is obtained with a multi-processor architecture whose processor, memory system, and interconnection network are tailored to the schemes. For this purpose, we have proposed the multi-processor system ASCA (Advanced Scheduling oriented Computer Architecture). The ASCA system is designed for the multi-grain parallelizing scheme [1], which is one of the effective parallelizing schemes.
This scheme exploits parallelism in a sequential program at various levels: coarse-grain parallelism (macro-dataflow computation) [2], medium-grain parallelism (loop-level parallelism) and near-fine-grain parallelism (statement-level parallelism) [3]. The former two types of parallelism mainly concern the interconnection network called R-Clos and the total system of ASCA, while the processor core MAPLE and the dedicated cache called DTC are designed for the latter, near-fine-grain parallelism.
Here, a processing element of ASCA, which consists of two chips (the MAPLE processor core and the Data Transfer Controller), is designed and implemented, and near-fine-grain parallel execution using multiple processing elements is evaluated.
2 ASCA multi-processor
2.1 Multi-grain parallel processing
A common compiler focuses on loop structures in a program and detects parallelism between iterations. This is called loop-level parallelism or medium-grain parallelism. Although this type of parallelism is efficient for a class of scientific programs, almost no performance enhancement can be expected for programs that include complicated loop structures.
For such programs, coarse-grain parallel processing [2], which uses the parallelism between large program modules corresponding to whole loops and subroutines, is often effective. Near-fine-grain parallel processing [3], which uses statement-level parallelism inside a program module, is also effective for making the best use of the inherent parallelism. Multi-grain parallel processing [1] is a scheme that exploits all three levels: coarse-grain, medium-grain and near-fine-grain.
Although this scheme is applicable to any parallel machine, architectural support is required. For example, in coarse-grain parallel processing a large program module called a macro task must be assigned dynamically according to the macro-dataflow graph. On the other hand, efficient near-fine-grain parallel processing requires high-speed, light weight communication and synchronization between processing elements. At this level, if communication between processing elements is completely scheduled at compile time, all synchronization code can be eliminated. For this non-synchronized execution mode, the computation time and communication time of each hardware block must be constant so that the compiler can account for them. Thus, a dedicated multi-processor architecture is required for efficient execution of multi-grain parallel processing.
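The role of constant computation and communication times can be illustrated with a small list scheduler. This is a minimal sketch and not the ASCA compiler's algorithm: the `Task` type, the greedy placement rule and the value of `COMM_CYCLES` are assumptions made for illustration. Because every latency is a known constant, the start cycle of each statement is fixed at compile time and no run-time synchronization appears in the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cycles: int          # fixed execution time known to the compiler
    deps: tuple = ()     # names of tasks whose results are needed


COMM_CYCLES = 2  # assumed constant PE-to-PE transfer time (hypothetical value)


def schedule(tasks, num_pes):
    """Greedy list scheduling: place each task on the PE where it finishes earliest."""
    pe_free = [0] * num_pes          # cycle at which each PE becomes idle
    placed = {}                      # task name -> (pe, start, finish)
    for t in tasks:                  # tasks must be given in topological order
        best = None
        for pe in range(num_pes):
            ready = 0
            for d in t.deps:
                dep_pe, _, dep_finish = placed[d]
                # a result produced on another PE arrives a fixed number of cycles later
                ready = max(ready, dep_finish + (COMM_CYCLES if dep_pe != pe else 0))
            start = max(ready, pe_free[pe])
            if best is None or start + t.cycles < best[2]:
                best = (pe, start, start + t.cycles)
        placed[t.name] = best
        pe_free[best[0]] = best[2]
    return placed


tasks = [
    Task("a", 3),
    Task("b", 3),
    Task("c", 2, ("a", "b")),
]
plan = schedule(tasks, num_pes=2)
```

In this example `a` and `b` run in parallel on different PEs, and `c` starts only after the remote operand has had its fixed transfer time to arrive.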
2.2 ASCA Multiprocessor
The ASCA multi-processor [4] has been developed as a project supported by STARC to establish architectural techniques for a dedicated architecture for multi-grain parallel processing.
Figure 1: Structure of ASCA
As shown in Figure 1, the ASCA multi-processor consists of multiple clusters connected by the R-Clos interconnection network [7][8]. A node of the macro-dataflow graph, which exploits coarse-grain parallelism, is assigned to each cluster. Each cluster is a multi-processor system consisting of dedicated processor cores called MAPLE and specialized caches, which are designed to make the best use of near-fine-grain parallelism as well as traditional loop-level parallelism. Although a cluster will be implemented as a single-chip multi-processor with near-future technology, a processing element is implemented on a board in the first prototype of ASCA. Three chips for the key components of the board are now available [9], and a board including a processing element is under development.
In this paper, we focus on near-fine-grain parallel processing in an ASCA cluster consisting of MAPLE processors and dedicated cache systems. Other techniques investigated in the ASCA project are described in our previous papers [4][5][7][8][9].
3 Processing element MAPLE
3.1 Outline of MAPLE
MAPLE (Multiprocessor system ASCA Processing eLEment) is a processing element in ASCA developed for near-fine-grain parallel processing. As shown in Figure 2, it consists of a processor, Local Memory (LM), Communication Memory (CM) and a Network Interface.
[Figure 2 block labels: Processing Element MAPLE; Processor Block (Core with INT pipe and FP pipe, DTC, Cache); CM; LM; Network Interface; To R-Clos]
Figure 2: Structure of processing element MAPLE
[Figure 3 stage labels: Inst. Fetch, followed by an INT. pipe and an F.P. pipe, each with Inst. Decode, Exec., MEM. Access and Write Back]
Figure 3: Structure of Pipeline
The processor in turn consists of the MAPLE core, the Data Transfer Controller (DTC) and a cache.
For efficient near-fine-grain parallel processing, MAPLE provides the following facilities:
• predictable operation time for static scheduling,
• receive registers for quick interprocessor communication,
• a light weight barrier and a non-synchronization mode for eliminating common synchronization code, and
• a software controlled cache managed by the DTC.
3.2 MAPLE Core
The MAPLE core is a 32-bit RISC processor that provides a simple structure with highly predictable operations. Its instruction set is an extension of that of DLX [6]. Like DLX, it has 32 registers each for integer and floating point. As shown in Figure 3, it has a 5-stage pipeline structure, and every operation is executed in a fixed number of clock cycles. The floating-point execution unit is fully pipelined except for the divider, and supports the 32-bit and 64-bit formats of IEEE std 754-1985. Dynamic optimization techniques such as dynamic instruction scheduling are excluded so as to enable precise static scheduling. Out-of-order execution/completion is allowed only when the execution time of the instructions can be predicted.
3.3 Mechanisms for near-fine-grain processing
3.3.1 Receive registers
MAPLE provides two types of fast data transfer mechanism.
For a large data transfer, MAPLE issues a request to the DMA controller, which can load the requested data independently of the MAPLE core operations. A flag on the local DSM indicates whether the requested data transfer has completed. This kind of transfer can exploit a high bandwidth, but it takes considerable time to set up.
On the other hand, for the small one-word transfers required in near-fine-grain processing, MAPLE has special transfer operations and dedicated registers called receive registers to achieve direct register-to-register transfer.
Figure 4 illustrates data transfer using receive registers.
[Figure 4 labels: source PE and destination PE, each a MAPLE with GPR, RR, COMM, LM and a Network Interface to the R-Clos network; the PEs are connected by a receive-register exclusive crossbar]
Figure 4: receive register (RR)
Sixteen 32-bit receive registers (RR in Figure 4) are provided in the ID stage of the pipeline. When the source processor executes a transfer operation, the data in a general purpose register of the source processor is sent directly to a receive register of the destination processor through a crossbar. To detect the arrival of the required data, each receive register has a tag (valid) bit that is set when data is received, as shown in Figure 5.
On a receiving instruction, the pipeline is stalled if the valid bit is not set. Otherwise, the data is moved immediately from the receive register to a general purpose register, and the valid bit is reset. For handling receive registers, MAPLE provides two instructions: sendr (sendri) for sending and rcvr (rcvri) for receiving.
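The valid-bit protocol described above can be sketched as a small behavioral model. This is an illustrative sketch only: the class and method names are hypothetical, the operand formats of sendr and rcvr are not given in the text, and a pipeline stall is represented simply by a returned flag.

```python
class ReceiveRegisterFile:
    """Toy model of 16 receive registers, each with a valid bit."""

    def __init__(self, count=16):
        self.data = [0] * count
        self.valid = [False] * count

    def deliver(self, index, value):
        """Arrival of a word sent by a remote sendr: store it and set the valid bit."""
        self.data[index] = value & 0xFFFFFFFF   # registers are 32 bits wide
        self.valid[index] = True

    def rcvr(self, index):
        """Model of a receive: return (stalled, value).

        If the valid bit is clear the pipeline would stall, so stalled=True is
        reported. Otherwise the word is handed over and the valid bit is reset.
        """
        if not self.valid[index]:
            return True, None
        self.valid[index] = False
        return False, self.data[index]


rr = ReceiveRegisterFile()
first = rr.rcvr(3)      # nothing has arrived yet, so the receiver would stall
rr.deliver(3, 42)       # the remote word arrives
second = rr.rcvr(3)     # the word is consumed and the valid bit is cleared
third = rr.rcvr(3)      # a second receive stalls again
```

Note that `deliver` overwrites a register unconditionally, which is why a second send to the same register before the receive has to be excluded by the schedule or by a barrier.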
Although this mechanism avoids the read-after-write problem, the write-after-read problem is not resolved; that is, the data may be overwritten by new data before it is read. In the near-fine-grain parallel processing of ASCA, this problem is solved with static scheduling or with a light weight barrier mechanism.
In the first prototype chip of MAPLE, each receive register supports only one-word transfers because of the pin limitation. However, this is sufficient for most data transfers in near-fine-grain parallel processing.
Compared with the common shared registers used in single-chip multi-processors such as Sun Microsystems' MAJC or NEC's MP98 [11], the receive register is a loosely coupled approach and is therefore easy to implement. However, performance is not degraded, since synchronization is partly combined with the transfer.
[Figure 5 labels: Receive Register File with valid bit (v) and rr data fields; mux and demux; signals read_rrd, write_rrd, rr_write, rr_read, rr_data_in, rr_data_out, stall_for_rr, valid signal, invalid signal; read judge; From R-Clos; To ID stage]
Figure 5: Structure of receive register
3.3.2 Light weight barrier and asynchronous operation mode
In near-fine-grain parallel processing, if static scheduling is completely successful, all synchronization code can be omitted. For this purpose, all optimization techniques that introduce nondeterministic behavior are eliminated from the MAPLE core. Memory access is also designed to be deterministic by using the software cache supported by the DTC, described in the next section. Even so, network congestion can cause prefetched data not to be loaded into the cache in time. To cope with this problem, MAPLE has a light weight barrier mechanism and two operation modes: synchronous and asynchronous.
Each MAPLE instruction carries a synchronization tag of a few bits, and these tags drive the light weight barrier synchronization mechanism, which consists of a simple open-drain bus. In the synchronous mode, this mechanism is enabled, and processors are stalled until all processors in the cluster execute instructions with the same tag. Using this light weight barrier mechanism, processors can be synchronized without executing instructions dedicated to synchronization.
The two modes of MAPLE are switched as follows using this light weight barrier.
• Normally scheduled code is executed in the asynchronous mode. In this mode, the light weight barrier is disabled.
• If a processor detects a nondeterministic situation (e.g. a cache miss), it changes its mode to the synchronous mode.
• In the synchronous mode, the light weight barrier is enabled, and when all processors are synchronized, the mode returns to the asynchronous mode.
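The mode switching in the list above can be sketched as a small state model. This is a simplified illustration, not the hardware design: the text does not say whether the mode is held per processor or per cluster, so a single cluster-wide flag is assumed here, the open-drain bus is replaced by a list of booleans, and all names are hypothetical.

```python
class Cluster:
    """Toy model of the asynchronous/synchronous mode switch in one cluster."""

    def __init__(self, num_pes):
        self.sync_mode = False               # scheduled code starts asynchronous
        self.arrived = [False] * num_pes     # which PEs have reached the barrier tag

    def report_nondeterminism(self):
        """A PE saw an unpredicted event (for example a cache miss)."""
        self.sync_mode = True
        self.arrived = [False] * len(self.arrived)

    def reach_tag(self, pe):
        """PE executes an instruction carrying the barrier tag.

        Returns True if the PE may proceed, False if it must stall.
        """
        if not self.sync_mode:
            return True                      # barrier disabled: never stall
        self.arrived[pe] = True
        if all(self.arrived):
            self.sync_mode = False           # everyone is aligned again
            return True
        return False


cluster = Cluster(num_pes=2)
free_run = cluster.reach_tag(0)      # asynchronous mode: proceeds immediately
cluster.report_nondeterminism()      # e.g. a cache miss on some PE
waiting = cluster.reach_tag(0)       # PE 0 stalls at the barrier
released = cluster.reach_tag(1)      # PE 1 arrives, barrier opens
back_to_async = not cluster.sync_mode
```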
When a cluster of MAPLEs works in this asynchronous mode, it can be treated as a loosely coupled VLIW processor.
3.4 The Data Transfer Controller
The DTC is an intelligent controller that hides the latency of accessing both the shared and the local memory. It is also designed to suit the multi-grain parallelizing scheme.
For coarse-grain parallelism, transferring the large data set of a Macro Task (MT) can become a critical overhead. If the transfer of the MT data set is completed by the start-up phase of the MT, the overhead can be completely hidden. Although this is difficult to achieve, the DTC attempts it as far as possible according to the code scheduled by the compiler. In this case, the DTC requests a block transfer using the DMA.
In contrast, in near-fine-grain parallelism, frequent communication with many synchronizations between processors would dominate the performance, so we adopt precise static scheduling for blocks that do not involve run-time decisions in order to eliminate these synchronizations. In this scheme, however, one uncertain factor, whether a cache access hits or misses, spoils the precise static scheduling. To cope with this problem, a software controlled cache system managed by the DTC is essential. In this system, data loading and replacement of cache lines are mainly controlled by code generated by the scheduler, so as to realize an always-hit cache except in special cases. Also, since cache lines are controlled by the scheduler, a fully associative scheme can be implemented with a small amount of hardware.
The DTC is a simple processor with a three-stage pipeline and has three control modes: software cache, hardware cache + preload/poststore, and hardware cache only. In the software cache mode, the DTC executes instructions generated by the static scheduler. To place the required data in the cache memory before it is used, the static scheduler calculates the latency of the data transfer and generates the DTC code together with the main processor's code that invokes the DTC instructions. Since the main processor has a five-stage pipeline with out-of-order completion, the precise behavior of the processor is examined by a pipeline simulator included in the scheduler software. When the statically scheduled software cache mode breaks down because of uncertain factors that the scheduler could not predict, the DTC changes its mode from software cache to hardware cache. After that, the cache behaves as a common hardware controlled cache.
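The latency calculation performed by the static scheduler can be pictured with a short sketch. It is an illustration under stated assumptions, not the scheduler's actual algorithm: the function name, the single fixed `transfer_latency`, and the choice to add the three-stage DTC pipeline as a constant overhead are all hypothetical.

```python
def plan_preloads(uses, transfer_latency, dtc_pipeline_depth=3):
    """Return (issue_cycle, address) pairs, earliest first.

    uses: list of (address, use_cycle) pairs taken from the static schedule.
    Each preload is issued as late as possible while still completing
    before the cycle in which the core uses the data.
    """
    plan = []
    for address, use_cycle in uses:
        issue = use_cycle - transfer_latency - dtc_pipeline_depth
        if issue < 0:
            # the data cannot be in the cache on time; a real scheduler would
            # have to move the use later or accept a non-software-cache access
            raise ValueError(f"preload of {address:#x} cannot meet cycle {use_cycle}")
        plan.append((issue, address))
    return sorted(plan)


preloads = plan_preloads([(0x200, 60), (0x100, 40)], transfer_latency=20)
```

With a 20-cycle transfer and a 3-cycle controller overhead, data needed at cycles 40 and 60 must be requested no later than cycles 17 and 37.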
3.5 Operations of the DTC
[Figure 6 labels: Maple Core; Instruction Cache Memory (Maple instruction, Dtc instruction); Data Cache Memory; Tag Memory (Tag, hit/miss); DTC Processor (Run_Signal, lac/sac); DMA Cache Controller (cache address, r/w, Ack); Network Interface (Network Address, LM/DSM Address); Local Memory & DSM; Data and Data Address paths]
Figure 6: The Structure of Cache
As shown in Figure 6, the DTC consists of a DTC Processor, a Cache Controller and a Tag memory.
The DTC Processor is a simple 64-bit processor with a three-stage pipeline and has four instructions for controlling transfers to and from the cache memory. An instruction is executed on receiving a control signal from MAPLE, and the timing of the control signal is embedded in the MAPLE instructions generated by the compiler. If the DTC instruction is a data transfer operation, it sends a request to the Cache Controller. Although loading and replacing data are triggered by the DTC instruction, the operation itself is executed in the Cache Controller. Once triggered, the Cache Controller manages the data transfers between the memory systems (DSM, CSM and LM).
When the scheduler in the compiler judges that the software cache is not effective, the cache can also be used as a common 4-way hardware controlled cache. In this case, the DTC behaves as a simple prefetch controller.
Figure 7: The Structure of Software Cache Controller
Figure 7 shows a part of the software cache controller. The right half of the figure shows the data flow between the cache memory, the local memory and the external memory. The rest of the figure mainly shows the data flow between the MAPLE core and the cache memory used for judging the effectiveness of the software cache. In the software cache control mode, all of these data flows are controlled by the DTC instructions generated by static analysis in the compiler.
The DTC Processor has four instructions: Load Address Conversion (LAC), Store Address Conversion (SAC), PreLoad (PL) and PostStore (PS). These instructions work as follows:
• PL prefetches data from a local/shared memory to the cache memory through the Memory Access Controller (MAC), and writes the entry in the tag memory at the end of preloading.
• PS transfers data from the cache memory to a local/shared memory through the MAC, and deletes the entry in the tag memory at the end of the operation.
• LAC converts a cache address into a memory address through the Address Converter and pushes both addresses into the FIFO. Since the valid data corresponding to the cache address stored at the head of the FIFO is already on the data bus, acknowledging a load instruction from the MAPLE core requires just one clock cycle when the MAPLE core issues the load. The memory address at the head of the FIFO is compared with the memory address of the load instruction from the MAPLE core, and the result is used for judging whether the software cache has broken down.
• SAC operates in the same manner as LAC up to the queuing. The data from the MAPLE core is stored at the cache address specified by the SAC instruction when the MAPLE core issues a store instruction.
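The LAC check described above can be sketched as a queue of predicted accesses that each load is compared against. This is a behavioral illustration with hypothetical names: the Address Converter is reduced to a dictionary, and treating an empty FIFO as a broken prediction is an assumption, since the text does not describe that case.

```python
from collections import deque


class SoftwareCacheCheck:
    """Toy model of the LAC queue that validates the software cache."""

    def __init__(self, address_map):
        self.address_map = address_map   # cache address -> memory address
        self.fifo = deque()
        self.software_mode = True

    def lac(self, cache_address):
        """Queue the (cache address, memory address) pair ahead of the load."""
        self.fifo.append((cache_address, self.address_map[cache_address]))

    def load(self, memory_address):
        """Core issues a load; return the cache address used, or None on fallback."""
        if self.software_mode and self.fifo:
            cache_address, expected = self.fifo.popleft()
            if expected == memory_address:
                return cache_address     # prediction held: data is already on the bus
        # prediction broke: behave as a hardware controlled cache from now on
        self.software_mode = False
        return None


check = SoftwareCacheCheck({0: 0x1000, 1: 0x1004})
check.lac(0)
check.lac(1)
hit = check.load(0x1000)      # matches the head of the FIFO
miss = check.load(0x2000)     # does not match 0x1004, so the mode falls back
still_software = check.software_mode
```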
As long as it runs in the software cache control mode, this system can realize the most efficient cache utilization, data localization and quick cache access based on the static analysis. Our goal is for this analysis to coincide perfectly with real execution by implementing a processor (the MAPLE core) and network switches (R-Clos II) tailored to the static analysis of the compiler, but some exceptional dynamic decisions still remain. If the memory address comparison fails, the cache system behaves as a general hardware controlled cache from then on.
4 Prototype Implementation
Although four or five processing elements corresponding to a cluster will be implemented in a single chip in the near future, a prototype MAPLE is implemented with two prototype chips: the MAPLE core chip and the DTC chip.
4.1 MAPLE Core
The MAPLE core chip is implemented on Rohm's 0.35 µm CMOS cell-based LSI. Libraries are supported by VDEC Japan. About 80% of the gates are used for the floating-point pipeline, and the receive registers and light weight barrier mechanism require only 6000 gates. Although a rather conservative process is used, it works with an 80 MHz clock.
Figure 8: The Specification of MAPLE Core
Chip                 Rohm, CMOS 0.35 um, poly 2, metal 3
Maximum clock        80 MHz
Gates                174010
The number of pins   466
Figure 9: The Layout of MAPLE
Figure 9 shows the layout of the prototype MAPLE chip. Since the required hardware is small, the gates actually used occupy a rather small chip area.
4.2 DTC
The specification of the DTC chip is shown in Table 1. The first DTC chip was implemented on a 0.35 µm Hitachi gate array, also supported by VDEC. Since it is a prototype chip with a small number of gates, a small off-chip cache memory (8K bytes) is assumed, and a 64 word × 27 bit × 4 way tag memory is mounted on the chip. However, computer simulation results show that the software cache delivers better performance than the hardware cache [10].
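As a quick check on these sizes, the tag memory capacity follows directly from the stated organization. The second line is only an inference: it assumes that 8K bytes means 8192 bytes and that each tag entry corresponds to exactly one cache line, neither of which is stated in the text.

```latex
\begin{align*}
\text{tag memory} &= 64 \times 27\ \text{bit} \times 4 = 6912\ \text{bit} = 864\ \text{byte} \\
\text{data per tag entry} &= \frac{8192\ \text{byte}}{64 \times 4} = 32\ \text{byte}
\end{align*}
```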
Table 1: The Speciﬁcation of DTC
Chip                 Hitachi, 0.35 um gate array
Feature              poly 1, metal 5, 190 pins, 143k gates
Maximum clock        81.97 MHz
Logics               35,761 BC (71,522 gates)
Area utilization     50.02 %
The number of pins   185
5 Performance estimation
Performance of a single processor. The performance of a single MAPLE core is estimated with benchmark programs (FFT and FLOPS) using the clock-level logic simulator. FLOPS includes eight subprograms. Each subprogram calculates a numerical integration or a Maclaurin series expansion in double format. Figure 10 shows the reciprocal of the FFT execution time and the MFLOPS value for the subprogram that includes 3.4% DIV instructions.
[Figure 10 plot: bars for FFT and FLOPS; vertical axis: FFT execution time^(-1) [1/sec] / MFLOPS, scale 0 to 20; processors compared: R3000(25), SPARC(40), PA-7100(64), SuperSPARC(50), HyperSPARC(125)]
Figure 10: Evaluation of FFT and FLOPS
The processing capacity of MAPLE is comparable to that of the early super-scalar processors even when it is used as a simple single-processor system.
Performance of a cluster (4 MAPLEs). Here, we analyze the performance of a cluster with four MAPLEs when an application program called "Picalc" is executed with near-fine-grain parallel processing. Considering the hardware requirement, a cluster can be integrated on a chip and is comparable to recent high performance microprocessors. Note that every communication between PEs uses the direct register-to-register data transfer mechanism of MAPLE.
Picalc is a series calculation program that finds the value of π with many loop iterations.
[Figure 11 panels: calculation time [sec] (0.05 to 0.20) and speedup (1 to 4) versus number of processors (1 to 4), each plotted for the -O 0 and -O 3 options]
Figure 11: Speedup of execution Picalc
Figure 11 shows the speedup against the number of PEs. With the compiler's optimization options, useless code is removed, so a large speedup is obtained, and the performance of 4 PEs is 2.25 times higher than that of a single PE.
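To put the 2.25 figure in context, the parallel efficiency and an implied parallel fraction can be computed from it. This is an interpretive sketch, not part of the reported evaluation: applying Amdahl's law to this result is an assumption, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, num_pes):
    """Fraction of the ideal linear speedup that was achieved."""
    return speedup / num_pes


def amdahl_parallel_fraction(speedup, num_pes):
    """Parallel fraction p solving speedup = 1 / ((1 - p) + p / num_pes)."""
    return (1 - 1 / speedup) / (1 - 1 / num_pes)


efficiency = parallel_efficiency(2.25, 4)        # 0.5625
fraction = amdahl_parallel_fraction(2.25, 4)     # about 0.74
```

Under this simple model, a speedup of 2.25 on 4 PEs corresponds to an efficiency of about 56% and to roughly three quarters of the single-PE execution time being parallelized.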
Figure 12: Performance of a cluster
Figure 12 demonstrates that the performance is comparable to that of recent high performance processors when a cluster is integrated into a single chip. Note that the power consumption is much lower, since the clock frequency is much lower than those of recent microprocessors.
6 Conclusion
A processor architecture dedicated to efficient execution of near-fine-grain parallel processing is proposed, implemented and evaluated. Performance evaluation based on a real design shows that the performance of a cluster consisting of MAPLE processors is comparable with that of recent high-end super-scalar processors in spite of its simple structure and low-frequency operation.
The two prototype chips described in this paper, the MAPLE core and the DTC, are now available. The printed circuit board that mounts these chips, memory, and interfaces is now under development. Using these boards, a multi-processor corresponding to a single cluster can be built. Simulation studies of multi-cluster systems including the static scheduler are also our future work.
References
[1] Okamoto, M., Yamasita, K., Kasahara, H. and Narita, S., “Hierarchical Macro-Dataflow Computation Scheme on a Multiprocessor System OSCAR”, Proc. IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, pp.44-49, May 1995.
Scheme of Fortran Programs
[2] Constantine Polychronopoulos, Milind B. Girkar, Mohammad R. Haghighat, Chia L. Lee, Bruce P. Leung, Dale A. Schouten, “Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessor”, Proceedings of the International Conference on Parallel Processing, St. Charles IL, August 1989, pp. II39-48
[3] Ogata, W., Fujimoto, K., Oota, M. and Kasahara, H., “Compilation Scheme for Near Fine Grain Parallel Processing on a Multiprocessor System without Explicit Synchronization”, Proc. IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, pp.327-332, May 1995
[4] K.Iwai, T.Morimura, T.Fujiwara, K.Sakamoto and H.Amano, “An interconnection network of ASCA: a multiprocessor for multi-grain parallel processing”, Proc. of 6th IASTED Symposium on Applied Informatics, pp.255-257, Feb. 1998
[5] T.Fujiwara, T.Kawaguchi, K.Sakamoto, K.Iwai, H.Amano, “Custom Processor for the Multiprocessor ASCA”, Proc. of 6th IASTED Symposium on Applied Informatics, Feb. 1998
[6] John L. Hennessy and David A. Patterson, “COMPUTER ARCHITECTURE A QUANTITATIVE APPROACH SECOND EDITION”, Morgan Kaufmann Publishers, 1996.
[7] Tomohiro Morimura, Keisuke Iwai, Hideharu Amano, “Multistage Interconnection Network Recursive-Clos(R-Clos) : Emulating the hierarchical multi-bus”, PDCS ’98, pp.99-104, Sep. 1998
[8] Tomohiro Morimura, Kensuke Tanaka, Keisuke Iwai, Hideharu Amano, “Multistage Interconnection Network Recursive-Clos(R-Clos II) : a scalable hierarchical network for a compiler directed multiprocessor ASCA”, PDPTA 2001
[9] T.Abe, T.Morimura, T.Suzuki, K.Tanaka, M.Koibuchi, K.Iwai, H.Amano, “ASCA chip set: Key components of multiprocessor architecture for multi-grain parallel processing”, Proc. of COOL Chips IV, pp.223-247, Apr. 2001
[10] K.Iwai, T.Morimura, T.Kawaguti, A.Sakai, T.Abe, H.Amano, “ASCA: A multiprocessor architecture”, JSPP2000, pp.2-10, June.200
[11] http://www.labs.nec.co.jp/MP98/, “MP98 Project web page”