SHAP — Scalable Multi-Core Java Bytecode Processor by Zabel, Martin & Spallek, Rainer G.
SHAP — Scalable Multi-Core
Java Bytecode Processor
Martin Zabel
Rainer G. Spallek
Institut für Technische Informatik
TUD-FI09-13 December 2009
Technische Berichte
Technical Reports
ISSN 1430-211X
Fakultät Informatik
Technische Universität Dresden
Fakultät Informatik
D-01062 Dresden
Germany
URL: http://www.inf.tu-dresden.de/
{martin.zabel, rainer.spallek}@tu-dresden.de
Abstract This paper introduces a new embed-
ded Java multi-core architecture which shows
a significantly better performance for a large
number of cores than the related projects
JopCMP and jamuth IP multi-core. The cores
gain fast access to the shared heap by a full-
duplex bus with pipelined transactions. Each
core is equipped with local on-chip memory for
the Java operand stack and the method cache to
further reduce the memory bandwidth requirements.
As opposed to the related projects, synchronization
is supported on a per-object basis
instead of a single lock. Load balancing is imple-
mented in Java and requires no additional hard-
ware. The multi-port memory manager includes
an exact and fully concurrent garbage collector
for automatic memory management. The de-
sign can be synthesized for a variable number
of parallel cores and shows a linear increase in
chip-space. Three different benchmarks demonstrate
the very good scalability of our architecture.
Due to limited chip-space on our evaluation
platform, the core count could not be increased
beyond 8, but we expect performance to degrade
only gradually for higher core counts.
1 Introduction
Object-oriented programming has led to fast
and easy development of complex applications
with a short time-to-market. In this domain,
Java is very popular as it addresses portability
and security features by the standard definition
of the Java Virtual Machine (JVM). As more
and more target systems implement the JVM,
the same application can be executed anywhere.
Another important feature is the compact Java
bytecode leading to small memory requirements
and reduced download times. Java is therefore
well suited for use in resource-constrained
devices.
Because execution of Java bytecode by in-
terpretation is known to be rather slow, Just-
In-Time (JIT) compilation has been used to
translate Java bytecode to the host processor’s
native instruction set. As this requires much
memory, the alternative of executing Java byte-
code natively has been considered for embed-
ded Java implementations. But, because the
required computational performance is rising
even in the embedded market, the performance
of these Java bytecode processors must be in-
creased, too.
The single-thread performance can be im-
proved by smarter pipelines. Instruction-level
parallelism is typically exploited by bytecode
folding which merges several Java bytecodes
into a single RISC-like instruction (e.g. [1–3]).
For multimedia applications, VLIW packets [4]
as well as bit-level parallelism [5] have also
been investigated.
The Java language as well as the Java API
provide good support for multi-threaded
applications. The resultant thread-level parallelism is
exploited by multi-threaded and multi-core ar-
chitectures. Even single-threaded applications
can benefit here, because additional threads are
required for system tasks such as garbage collec-
tion and controlling of I/O units. For example,
the application can utilize the full throughput of
the first core while the system tasks are running
on the second core.
This paper focuses on building an embedded,
scalable multi-core system with an arbitrary
number of parallel Java bytecode cores.
Whereas related work reported only moderate
speed-ups with increasing core count, this
paper shows that good performance is also
achievable with many cores. Unfortunately,
due to limited chip-space on the evaluation
platform, measurements could only be
performed for up to 8 cores.
Sec. 2 explores the design space and related
work is reviewed in Sec. 3. A concrete
implementation is presented in Sec. 4. Benchmark
results as well as comparisons to related multi-
core systems are given in Sec. 5.
2 Design Space Exploration
Due to the nature of the Java Virtual Machine
[6], Java applications cannot directly address
the memory. Instead, several code and data ar-
eas are accessed “implicitly” depending on the
function, cf. [7]:
Heap: The heap stores all data objects (incl.
arrays) allocated by the Java application.
The objects are freed automatically by the
garbage collector (GC) if they are not ref-
erenced any more. As all threads of an ap-
plication work together on the same set of
objects, a shared memory approach is sen-
sible.
Stack: Furthermore, each thread has its own
operand stack. A thread can only access its
own stack; thus, local memory is sufficient.
Code: The bytecode of the Java methods,
together with associated information such as
the constant pool and exception handlers, is
either stored in special code areas or,
alternatively, in special class objects also allocated
on the heap. Either local or shared memory
is suitable as discussed below. Static class
members must be read-/write-able from all
threads, and should therefore be stored on
a shared memory, e.g., the heap.
I/O: Access to the I/O devices is only possi-
ble through native methods. In a desk-
top JVM, they wrap the underlying oper-
ating system functions. In embedded Java
processors such as JOP [8] and SHAP [9]
the operating system is omitted. Instead, a
small set of native functions gives access to
the I/O address space, and all other
functionality is implemented in Java as well.
This partitioning gives some flexibility in
connecting the multiple Java bytecode cores.
To maximize performance, the connection of the
heap, stack and code area to the multiple cores
is very important. The I/O access can be
neglected here: if high data throughput is
required, the I/O device should get direct
memory access to the heap area. For “slow”
devices, programmed I/O over a shared bus
fabric is sufficient.
During bytecode execution, the code area is
accessed frequently. A local memory approach
seems to be a good choice, but then the code
must be duplicated, consuming a lot of mem-
ory. In embedded systems, a shared memory
together with instruction (bytecode) caches is a
better approach with respect to chip space and
leakage power consumption. On the other hand,
the performance is degraded, especially when
the code is stored in class objects on the heap as
well. An efficient caching algorithm must han-
dle this issue.
More aspects must be regarded when connecting
the cores to the Java heap. First, as mentioned
above, a shared memory approach should be used.
This might be implemented as a uniform or
non-uniform memory architecture (UMA/NUMA)
depending on the latency requirements for
memory accesses. Second, atomic operations must
be supported, so that locks/monitors can be
implemented for entry and exit of critical
sections. These operations should be fast, so
that commonly used resources can be acquired
and released quickly.
Third, data caches might be useful to hide the
memory access latency as well as to reduce the
required bandwidth on the memory backplane.
Apart from the required chip-space and leakage
power, snoop logic in combination with a
MESI/MOESI protocol must then be implemented,
too, so that cache lines are kept up-to-date
when one core writes to its cache. If data
caches are omitted, each memory access results
in a bus transaction. Hence, the bus should
support pipelined transactions with
outstanding reads, as in the AMBA AXI
interconnect [10]. Otherwise, blocking the
other cores until the read data is returned
would massively degrade the effective
bandwidth.
When using a non-uniform memory architecture,
a distributed garbage collector is a good
choice, so that each GC has fast access to its
own part of the heap. One implementation for
distributed object-oriented systems is described
by Gupta and Fuchs [11].
As mentioned above, local memory is suffi-
cient to hold the stack which is also accessed
frequently. This memory might be implemented
as a cache, so that threads could be easily moved
to other cores by stack filling and emptying. No
complex cache coherence is required, as each
thread can only access its stack. On the other
hand, memory must be reserved to store the
complete stack upon stack emptying.
Alternatively, the stack could be kept
exclusively in local memory, which saves heap
memory. In that case, no bandwidth on the
memory backplane is required, but additional
hardware is needed for moving stacks between
cores.
Besides the mentioned memory bandwidth and
thread synchronization, thread scheduling is
another challenging aspect. Threads must be
properly distributed over the cores, so that
the workload is well balanced. A thread’s core
affinity plays a big role, because it
influences the latency of memory transactions.
When moving a thread from one core to another
on a UMA system, only the cache of the target
core must be reloaded. Note that all memory
transfers between cache and memory have the
same latency (and throughput) independent of
the core.
On NUMA systems, in contrast, the memory
latency increases because the data must be
transferred to/from a memory which is connected
to the other core and only accessible through
an additional interconnect. Moving the objects
to memory associated with the target core as
well is an option.
Scheduling algorithms might be implemented in
hardware and/or software. These algorithms are
an essential part of operating-systems research
and thus beyond the scope of this paper.
3 Related Work
To measure the performance of multi-core
architectures, the following two metrics are
used throughout the paper. The speed-up is
calculated by comparing the runtime t_1 on a
single core with the runtime t_n on n cores:
S_n = t_1 / t_n    (1)
The efficiency relates the speed-up gained to
the costs of running the multi-core system.
Typical costs involved are chip-space and
power, which increase almost proportionally to
the number of cores n. Thus, the efficiency is
defined by
E_n = S_n / n    (2)
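As a minimal sketch of Eq. (1) and Eq. (2), the two metrics can be computed as follows. The class and method names are hypothetical, and the runtimes in main are made-up values for illustration only, not measurements from this paper.

```java
/** Speed-up and efficiency as defined in Eq. (1) and Eq. (2). */
final class ScalingMetrics {
    /** S_n = t_1 / t_n: single-core runtime divided by n-core runtime. */
    static double speedUp(double t1, double tn) {
        return t1 / tn;
    }

    /** E_n = S_n / n: speed-up relative to the number of cores. */
    static double efficiency(double speedUp, int cores) {
        return speedUp / cores;
    }

    public static void main(String[] args) {
        // Hypothetical runtimes in milliseconds (illustration only).
        double t1 = 800.0;
        double t4 = 250.0;
        double s4 = speedUp(t1, t4);
        System.out.println("S_4 = " + s4 + ", E_4 = " + efficiency(s4, 4));
    }
}
```

Applied to the speed-ups quoted below for the related projects, efficiency(2.4, 4) gives 0.6 and efficiency(2.5, 8) gives about 0.31.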
Chip-level multiprocessing using JOP cores is
introduced by the JopCMP system [12]. A
variable number of JOP cores is connected
through a SimpCon bus to the I/O subsystem and
the shared memory. The bus is arbitrated using a
“fairness-based arbiter”. The shared memory
holds the heap and code area of the JVM and
is located in external memory which is accessed
through an on-chip memory controller. Each
core has its own instruction cache and its own
stack cache for the currently running thread. A
synchronization unit provides only one global
lock for the implementation of monitors. This
disadvantage must be considered during appli-
cation design to avoid deadlocks.
The JopCMP achieves only a low efficiency even
for fully parallelized applications without any
access to shared data structures (and without
synchronization). The Lift benchmark shows
speed-ups of only 1.8 for 2 cores, 2.4 for 4
cores and 2.5 for 8 cores. This gives marginal
efficiencies of 60% for 4 cores and 31% for 8
cores. The main bottleneck seems to be the
SimpCon bus arbiter, which does not support
pipelined transactions. Each read of a single
memory word blocks the bus for 3 cycles,
effectively dividing the available bandwidth
by 3. Data caches are not available either. We
also disagree that a “speedup logarithmic to
the number of cores would satisfy future
processing demands”, as concluded in the above
reference.
Uhrig evaluated a multi-core system using the
jamuth IP core [13]. The multiple cores are
connected to the shared memory through an
Avalon bus which supports pipelined
transactions. Each core has its own instruction
cache but no data cache. The jamuth IP core is
multi-threaded in the sense that instructions
for up to 4 threads can be fetched and decoded
in parallel. An embedded hardware scheduler
then assigns one thread to the shared operand
fetch and execution unit. For each thread, a
separate stack cache is included. For the
connection of the peripherals, a separate
Avalon bus is provided. Here again, only a
single lock is provided for synchronization.
The system is measured only for up to 3 cores.
When running only one benchmark thread per
core, the “relative performance” (equals effi-
ciency) is about 92% for 2 cores and about 86%
for 3 cores. (These numbers have been extracted
from Fig. 4 of the cited paper.) This would give
speed-ups of 1.84 and 2.58, respectively. The
benchmarks only allocate memory in the
initialization phase, so that the GC thread
(running on each core) should be idle most of
the time. It is unclear why the performance
drops so quickly.
4 Implementation & Scalability
The SHAP micro-architecture consists of four
main parts:
CPU: The CPU natively executes Java byte-
code utilizing on-chip memory for the
Java operand stack, with a typical size of
8KByte. The memory module holds a vari-
able number of operand stacks facilitating
fast switching between threads running on
the same core [9]. The bytecode is executed
without an underlying OS, solely with the help
of microcode. The “native” functions of the
CLDC Java API (like thread
handling) are either implemented as Java
methods or as microcode.
The heap is accessed through the memory
manager, described further down. The I/O
devices are accessed through a dedicated
Wishbone bus.
Method Cache: As the CPU frequently ac-
cesses the method’s bytecode an instruction
cache is attached. This method cache uses
a stack principle and has a typical size of
2KByte.
Memory Manager: The memory management
unit (MMU) handles object allocation, ac-
cesses, and deletion. It also integrates
a concurrently running automatic garbage
collector which scans and compacts the
heap memory in parallel to the Java ap-
plication.
Memory Controller: To reduce memory access
latency and save chip-area for caches,
the memory controller for external SRAM
or SDRAM is integrated into the micro-
architecture.
To build up a multi-core Java bytecode pro-
cessor, several issues have been addressed. The
simplest task was the extension of the I/O bus,
as the Wishbone bus protocol [14] already ad-
dresses multiple bus-masters and arbitration. A
shared bus with a round-robin arbitration has
been favored over a crossbar to save chip-area
on the FPGA development platform.
The UMA principle has been chosen for the
shared heap memory stored in external memory.
All cores access the same heap memory through
a single memory manager, because measurements
on the single-core system show that only 8%
(compute-intensive applications) to 16%
(memory-intensive applications) of the
available memory bandwidth is used. These
numbers include both data and code accesses,
the latter already “filtered” by the method
cache. They are in the expected range, as an
analysis of Java bytecode shows that only 23%
of the executed bytecodes access the heap
memory [15]. Stack operations do not count, as
these read from and write to the core’s local
memory. Consequently, the memory bandwidth
should be sufficient for 6 to 12 cores. The
setup for n cores is depicted in Fig. 1.
Figure 1: SHAP Multi-Core Architecture using n Cores
[Block diagram showing Core 0 to Core n-1, each with stack memory and method cache; the memory manager with garbage collector; the memory controller with external memory (SRAM or DDR-RAM); and the Wishbone bus with DMA control and configurable I/O devices (UART, graphics unit, Ethernet MAC).]
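The estimate of 6 to 12 cores follows from simple arithmetic on the measured single-core utilization. The following sketch assumes that the bandwidth demand of the cores simply adds up and that arbitration overhead and garbage collector traffic are ignored; both are simplifying assumptions of this example, and the class name is hypothetical.

```java
/** Back-of-the-envelope estimate of how many cores a shared memory can feed. */
final class BandwidthEstimate {
    /**
     * Largest core count whose summed demand still fits into the available
     * bandwidth, given the fraction of bandwidth used by a single core.
     * Assumes linear scaling of demand (a simplification).
     */
    static int maxCores(double utilizationPerCore) {
        return (int) Math.floor(1.0 / utilizationPerCore);
    }

    public static void main(String[] args) {
        System.out.println("memory-intensive (16%):  " + maxCores(0.16) + " cores");
        System.out.println("compute-intensive (8%):  " + maxCores(0.08) + " cores");
    }
}
```

With 16% utilization per core this yields 6 cores (100/16 = 6.25), and with 8% it yields 12 cores (100/8 = 12.5).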
The memory manager already had a freely
configurable number of ports to support
independent channels for code and data as well as direct
memory access (DMA) for I/O devices. Each
additional core simply requires two additional
ports. Because all cores use the same MMU,
the GC need not be distributed. All memory
accesses issued through the ports as well as gen-
erated by the MMU itself are arbitrated inter-
nally. To allow a good bandwidth utilization,
the bus between the MMU and the integrated
memory controller is full-duplex and all trans-
actions are pipelined.
SHAP supports synchronization on a per-
object basis, as defined by the standard JVM.
To provide fast atomic locking and unlocking,
the corresponding lock state information is
updated inside the memory controller. The
update process requires two simple operations:
owner comparison and lock depth
increment/decrement, which can both be
calculated in a
single cycle. Together with the read-out of the
old state and the write-back of the new one, the
operation requires four clock cycles.
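The two operations, owner comparison and lock depth increment/decrement, can be illustrated by a small software model of the per-object lock state. This is only a sketch: in SHAP the update is performed atomically in hardware, whereas the field layout, the thread identifiers and the method names used here are assumptions made for illustration.

```java
/** Software model of a per-object, reentrant lock state (illustrative only). */
final class LockState {
    static final int NO_OWNER = -1;

    private int owner = NO_OWNER; // id of the thread holding the lock
    private int depth = 0;        // number of nested acquisitions

    /** Returns true if the lock was acquired or re-entered by threadId. */
    boolean lock(int threadId) {
        if (depth == 0) {            // free: take ownership
            owner = threadId;
            depth = 1;
            return true;
        }
        if (owner == threadId) {     // owner comparison succeeded: increment depth
            depth++;
            return true;
        }
        return false;                // held by another thread
    }

    /** Returns true if threadId held the lock and one nesting level was released. */
    boolean unlock(int threadId) {
        if (depth == 0 || owner != threadId) {
            return false;            // not the owner
        }
        depth--;                     // decrement depth
        if (depth == 0) {
            owner = NO_OWNER;        // fully released
        }
        return true;
    }
}
```

A thread that enters the same monitor twice must also leave it twice before another thread can acquire the lock, which is the reentrant behavior required for Java monitors.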
Each core has its own attached stack memory.
Currently, this dual-port memory can only be
accessed by the core itself or by the GC.
Additional memory ports are too
chip-space-intensive even for ASICs. Depending
on the application, the GC port is not used all
the time. Using this free bandwidth to move a
thread’s stack from one core’s stack memory to
another is an option and might be implemented
in the future. Until then, the current
implementation imposes a serious constraint on
the scheduler: threads are fixed to the core on
which they are started.
The system is booted using Core 0. Its
microcode loads the boot-loader Java class
(from an I/O device like UART or USB) and calls
the contained method “loadup”. This method,
already implemented in Java, loads the rest of
the Java API into the heap memory. It also
initializes all other cores by creating a
special control thread object per core. The
object references are written to special
registers via the I/O bus. Each core reads from
its own register and starts the specified
thread using microcode. The control thread
observes a queue, so that any thread can start
new threads on any core.
Table 1: Chip-Space in Dependence on Core Count
Cores  LUTs   FFs    18 kbit BRAMs  36 kbit BRAMs  18x18 Multipliers
1       5993   3474   7             2               6
2       8658   4866  13             3               9
3      11369   6253  19             4              12
4      14001   7642  25             5              15
5      16787   9029  31             6              18
6      19500  10415  37             7              21
7      22318  11802  43             8              24
8      24969  13191  49             9              27
LUTs(n) ≈ 3218 + 2718 * n
FFs(n) ≈ 2089 + 1388 * n
18 kbit BRAMs(n) = 1 + 6 * n
36 kbit BRAMs(n) = 1 + n
Multipliers(n) = 3 + 3 * n
The control threads are used to implement
simple load balancing using only Java code.
First, the threads already running on the same
core are scheduled using a round-robin scheme.
Thus, the control thread can determine the load
on its core by simply measuring the time
between two of its wake-ups. This information
is stored in a table shared by all cores. New
threads are started either a) manually on a
specific core using a special API function, or
b) automatically on an appropriate core using
the standard API method. The latter method
scans the load table to find the core with the
lowest current load. In both cases, the
thread-to-start is inserted into the queue of
the target core.
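The load-balancing scheme described above can be sketched in a few lines of Java. This is not the SHAP runtime code: the class LoadBalancer, its method names and the use of java.util collections are assumptions chosen for illustration, and the load value is simply the time between two wake-ups as reported by a control thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Model of the shared load table and the per-core start queues (hypothetical names). */
final class LoadBalancer {
    private final long[] loadTable; // time between two control-thread wake-ups, per core
    private final List<Queue<Runnable>> startQueues = new ArrayList<>();

    LoadBalancer(int cores) {
        loadTable = new long[cores];
        for (int i = 0; i < cores; i++) {
            startQueues.add(new ArrayDeque<>());
        }
    }

    /** Called by the control thread of a core after each wake-up. */
    synchronized void reportLoad(int core, long timeBetweenWakeUps) {
        loadTable[core] = timeBetweenWakeUps;
    }

    /** Scans the load table for the core with the lowest current load. */
    synchronized int leastLoadedCore() {
        int best = 0;
        for (int core = 1; core < loadTable.length; core++) {
            if (loadTable[core] < loadTable[best]) {
                best = core;
            }
        }
        return best;
    }

    /** Variant a): start a thread manually on a specific core. */
    synchronized void startOn(int core, Runnable thread) {
        startQueues.get(core).add(thread);
    }

    /** Variant b): start a thread on the least loaded core; returns the chosen core. */
    synchronized int start(Runnable thread) {
        int core = leastLoadedCore();
        startOn(core, thread);
        return core;
    }

    /** Called by the control thread of a core; returns null if nothing is queued. */
    synchronized Runnable poll(int core) {
        return startQueues.get(core).poll();
    }
}
```

In both variants the thread ends up in the queue of the target core, where the control thread of that core picks it up.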
Tab. 1 lists the chip-space required on a
Xilinx Virtex-5 FPGA (XC5VLX50T FF1136-1) for
up to 8 cores. As can be seen, the chip-space
increases almost linearly with the number of
cores but also includes a large fixed amount
dedicated to the memory manager. The fixed
amount used by I/O devices is negligible, as
all configurations use only the minimal set of
I/O devices: a UART. Thus, when the chip-space
of the memory manager is divided among the
cores, the chip-space per core decreases as
more cores are used.
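The linear resource model given with Tab. 1 can be evaluated directly. The sketch below encodes the fitted formulas from the table; the class name and the helper lutsPerCore are hypothetical and only serve to show how the fixed share of the memory manager is amortized over the cores.

```java
/** Linear chip-space model taken from the formulas below Tab. 1. */
final class ChipSpaceModel {
    static int luts(int n)        { return 3218 + 2718 * n; } // approximation
    static int ffs(int n)         { return 2089 + 1388 * n; } // approximation
    static int bram18(int n)      { return 1 + 6 * n; }
    static int bram36(int n)      { return 1 + n; }
    static int multipliers(int n) { return 3 + 3 * n; }

    /** LUTs per core including the amortized fixed part. */
    static double lutsPerCore(int n) {
        return (double) luts(n) / n;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 8; n++) {
            System.out.println(n + " cores: " + luts(n) + " LUTs, "
                    + lutsPerCore(n) + " LUTs per core");
        }
    }
}
```

For 8 cores the model gives 24962 LUTs, close to the 24969 LUTs listed in Tab. 1, and the LUT count per core falls from 5936 for one core to about 3120 for eight cores.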
All configurations run at a clock frequency of
66 MHz with only automatic place and route. For
the lower core counts, the timing analyzer tool
reports a maximum frequency of up to 80 MHz.
This has not been verified on hardware. On the
other hand, the maximum frequency might also
be increased for high core counts by manually
assigning placement regions to assist the place
and route tool.
5 Benchmarks
The multi-core (2 to 8 cores) and single-core
runtimes have been measured for applications
from three different domains: matrix
multiplication, linear search, and a script
interpreter. The resulting speed-ups and
efficiencies are depicted in Fig. 2 and Fig. 3,
respectively. The numbers are calculated using
Eq. (1) and Eq. (2).
Figure 2: Speed-up on the ML505 Evaluation Board
[Plot: speed-up S (1 to 8) over the number of cores n (1 to 8) for Linear Search, Matrix Multiply, and FScript Test, with the linear speed-up as reference.]
The matrix multiplication is almost fully
parallelizable. Typically, the columns or rows
of the result matrix are distributed over the
multiple cores. Only if the column/row count is
not divisible by n do some cores have slightly
more work to do. We used 48 × 48 matrices
because 48 is divisible by 1, 2, 3, 4, 6, and 8
and because the runtimes are well above the
timing measurement granularity. Up to 5 cores,
the efficiency is above 95%. With more cores,
the efficiency falls perceptibly (87% at 8
cores) because the memory accesses now collide
frequently.
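The row-wise distribution of the result matrix can be sketched as follows. This is not the benchmark code used for the measurements: it is a plain Java SE illustration (it uses lambdas, which the CLDC API does not offer), the class name is hypothetical, and one worker thread per core is assumed.

```java
/** Row-partitioned matrix multiplication with one worker thread per core. */
final class ParallelMatMul {
    static int[][] multiply(int[][] a, int[][] b, int cores) {
        int rows = a.length;
        int inner = b.length;
        int cols = b[0].length;
        int[][] c = new int[rows][cols];
        Thread[] workers = new Thread[cores];
        for (int k = 0; k < cores; k++) {
            // Worker k computes the result rows [first, last). If rows is not
            // divisible by cores, some workers get one row more than others.
            final int first = k * rows / cores;
            final int last = (k + 1) * rows / cores;
            workers[k] = new Thread(() -> {
                for (int i = first; i < last; i++) {
                    for (int j = 0; j < cols; j++) {
                        int sum = 0;
                        for (int x = 0; x < inner; x++) {
                            sum += a[i][x] * b[x][j];
                        }
                        c[i][j] = sum;
                    }
                }
            });
            workers[k].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // wait until all row blocks are finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return c;
    }
}
```

Each worker writes only its own rows of the result, so no synchronization is needed apart from the final join. With 48 rows and 5 workers, the blocks contain 9 or 10 rows, which is the slight imbalance mentioned above.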
The linear search over 288 elements shows al-
most the same results as the matrix multiplica-
tion. The memory access ratio is slightly lower
due to more control overhead in the search loop.
Thus, speed-ups and efficiencies are better. Up
to 5 cores, 99% efficiency is reached. Even for
8 cores, the speed-up is 7.66 and the efficiency
96%.
The last application runs the script
interpreter FScript, which is written in Java.
This application has been chosen because it has
a higher memory bandwidth utilization (about
16%) than the other benchmarks and frequently
allocates new objects. As the interpreter
cannot be parallelized (easily), the task is
simply run in parallel on all cores. The
longest execution time is taken and divided by
the number of cores n, as if the interpreter
could be parallelized. Unfortunately,
measurements were only possible for up to 7
cores. With 8 interpreter instances running in
parallel, the system runs out of memory. Up to
7 cores, the efficiency stays above 90%; with
increasing core count, it decreases similarly
to the matrix multiplication. Thus, the memory
bandwidth is also sufficient for such
application types.
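The evaluation rule for the replicated interpreter benchmark can be written down as a short sketch. The class and method names are hypothetical, and the instance runtimes used in main are invented values for illustration, not measurements.

```java
/** Speed-up of a benchmark that is replicated on all cores instead of parallelized. */
final class ReplicatedBenchmark {
    /** Longest instance runtime divided by the number of instances (cores). */
    static double effectiveRuntime(double[] instanceTimes) {
        double longest = 0.0;
        for (double t : instanceTimes) {
            longest = Math.max(longest, t);
        }
        return longest / instanceTimes.length;
    }

    /** Speed-up according to Eq. (1), using the effective runtime as t_n. */
    static double speedUp(double t1, double[] instanceTimes) {
        return t1 / effectiveRuntime(instanceTimes);
    }

    public static void main(String[] args) {
        // Hypothetical runtimes in seconds of four parallel instances.
        double[] times = {10.0, 10.5, 10.2, 10.4};
        double s = speedUp(10.0, times);
        System.out.println("S_4 = " + s + ", E_4 = " + (s / times.length));
    }
}
```

Taking the longest instance runtime is the pessimistic choice: the slowest core determines the result, just as it would for a truly parallelized task.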
To conclude, as the memory bandwidth is
sufficient even for a high degree of
parallelism, no data caches are necessary.
This holds especially because the full-duplex
memory bus supports pipelined transactions and
bandwidth is required only for heap operations,
not for stack operations.
In comparison to the JopCMP system and
the jamuth IP multi-core, the SHAP multi-core
shows a significantly better performance, cf.
Sec. 3. Our speed-ups show no saturation and
an efficiency above 87% for up to 8 cores. Due to
limited chip-space on the evaluation platform,
the core count could not be further increased.
6 Summary
This paper evaluates a scalable multi-core
architecture directly executing Java bytecode
in hardware. First, several concepts for the
distinct memory regions of the Java Virtual
Machine are explored. Second, related
multi-core Java bytecode processors are
reviewed and their bottlenecks are discussed.
Based on this knowledge, the structure and main
features of our multi-core architecture using
the SHAP Java bytecode processor are described:
The multi-port memory manager includes the
exact and fully concurrent garbage collector
and connects all cores to a shared memory by a
full-duplex bus with pipelined transactions.
Each core includes local on-chip memory for the
Java operand stack and the method cache to
reduce the required memory bandwidth.
Synchronization is supported on a per-object
basis instead of a single lock as found in the
related projects. Load balancing is implemented
in Java as well. Synthesis on a Xilinx Virtex-5
FPGA shows that the chip-space increases almost
linearly with the number of cores.
Figure 3: Efficiency on the ML505 Evaluation Board
[Plot: efficiency E (0.75 to 1.05) over the number of cores n (1 to 8) for Linear Search, Matrix Multiply, and FScript Test.]
Afterwards, the design is evaluated using
different benchmarks. Two of the three
benchmarks show good performance of the SHAP
multi-core architecture for up to 8 cores. The
shared memory bandwidth is sufficient even for
eight tasks executed in parallel, leading to an
efficiency of over 87%. Here, the multi-core
approach benefits from the stack operations,
which are executed on local memory and thus do
not require any bandwidth to the shared memory.
The third benchmark could only be run sevenfold
in parallel because it requires too much
memory. This application also requires a higher
memory bandwidth, but the efficiency is still
above 90%. Finally, the benchmark results show
that our multi-core system performs
significantly better than the related JopCMP or
the jamuth IP multi-core.
References
[1] Radhakrishnan, R., Talla, D., John, L.K.:
Allowing for ILP in an embedded Java
processor. In: ISCA ’00: Proceedings of
the 27th annual international symposium
on Computer architecture, New York, NY,
USA, ACM (2000) 294–305
[2] El-Kharashi, M.W., Elguibaly, F., Li,
K.F.: Adapting Tomasulo’s algorithm for
bytecode folding based Java processors.
SIGARCH Comput. Archit. News 29(5)
(2001) 1–8
[3] Sideris, I., Economakos, G., Pekmestzi,
K.: A cache based stack folding technique
for high performance Java processors. In:
JTRES ’06: Proceedings of the 4th in-
ternational workshop on Java technologies
for real-time and embedded systems, New
York, NY, USA, ACM (2006) 48–57
[4] Beck, A.C.S., Carro, L.: A VLIW low
power Java processor for embedded appli-
cations. In: SBCCI ’04: Proceedings of the
17th symposium on Integrated circuits and
system design, New York, NY, USA, ACM
(2004) 157–162
[5] Berekovic, M., Kloos, H., Pirsch, P.: Hard-
ware realization of a Java Virtual Machine
for high performance multimedia applica-
tions. J. VLSI Signal Process. Syst. 22(1)
(1999) 31–43
[6] Lindholm, T., Yellin, F.: The Java(TM)
Virtual Machine Specification. 2nd edn.
Addison-Wesley Professional (April 1999)
[7] Pitter, C., Schoeberl, M.: Towards a Java
multiprocessor. In: JTRES ’07: Proceed-
ings of the 5th international workshop on
Java technologies for real-time and embed-
ded systems, New York, NY, USA, ACM
(2007) 144–151
[8] Schoeberl, M.: A time predictable Java
processor. In: DATE ’06: Proceedings of
the conference on Design, automation and
test in Europe, 3001 Leuven, Belgium,
European Design and Automation
Association (2006) 800–805
[9] Zabel, M., Preußer, T.B., Reichel, P.,
Spallek, R.G.: Secure, real-time and multi-
threaded general-purpose embedded Java
microarchitecture. In: Proceedings of the
10th Euromicro Conference on Digital Sys-
tem Design Architectures, Methods and
Tools (DSD 2007), IEEE (August 2007)
59–62
[10] ARM Ltd.: AMBA AXI Protocol, Specifi-
cation. v1.0 edn. (3 2004)
[11] Gupta, A., Fuchs, W.K.: Garbage collec-
tion in a distributed object-oriented sys-
tem. IEEE Trans. on Knowl. and Data
Eng. 5(2) (1993) 257–265
[12] Pitter, C., Schoeberl, M.: Performance
evaluation of a Java chip-multiprocessor.
In: SIES, IEEE (2008) 34–42
[13] Uhrig, S.: Evaluation of different
multithreaded and multicore processor
configurations for SoPC. In Bertels, K.,
Dimopoulos, N.J., Silvano, C., Wong, S., eds.:
SAMOS. Volume 5657 of Lecture Notes in
Computer Science., Springer (2009) 68–77
[14] OPENCORES.ORG: Specification for the:
WISHBONE System-on-Chip (SoC) Inter-
connection Architecture for Portable IP
Cores. Rev. b.3 edn. (7 2002)
[15] El-Kharashi, M.W., ElGuibaly, F., Li,
K.F.: A quantitative study for Java micro-
processor architectural requirements. Part
II: high-level language support. Micropro-
cessors and Microsystems 24(5) (Septem-
ber 2000) 237–250
