Modeling and Simulation of a Many-core Architecture Using SystemC  by Silva, Ana Rita et al.
 Procedia Technology  17 ( 2014 )  146 – 153 
2212-0173 © 2014 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license 
(http://creativecommons.org/licenses/by-nc-nd/3.0/).
Peer-review under responsibility of ISEL – Instituto Superior de Engenharia de Lisboa, Lisbon, PORTUGAL.
doi: 10.1016/j.protcy.2014.10.222 
Conference on Electronics, Telecommunications and Computers – CETC 2013
Modeling and Simulation of a Many-Core Architecture Using
SystemC
Ana Rita Silva^a, Wilson José^a, Horácio Neto^b, Mário Véstias^c,∗
^a INESC-ID, Lisbon, Portugal
^b INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
^c INESC-ID, Instituto Superior de Engenharia de Lisboa, Instituto Politécnico de Lisboa, Lisbon, Portugal
Abstract
Transistor density has made possible the design of massively parallel architectures with hundreds of cores on a single chip. Designing efficient architectures with such a high number of cores is a very challenging task. Simulation of many-core architectures is now a fundamental tool for designers to explore the design space. This paper addresses the applicability of SystemC to simulate many-core architectures. We demonstrate the use of SystemC to model a system of P cores and then simulate the execution of matrix multiplication. The simulation of the model allows analyzing the results regarding the number of transfers and the number of clock cycles required to complete each transaction. A theoretical model of the algorithm execution time is used to evaluate the precision of the system-level simulator. Simulation results indicate that the simulation models are quite precise and that simulation times of a few minutes are possible for systems with a hundred cores.
Keywords: Parallel computing; Many-core Processor; SystemC; Simulation
1. Introduction
During the last decade massively parallel systems have been proposed as high-performance computing architectures. The design process of architectures with hundreds or even thousands of cores on a single die must take care of the enormous design space where cores can be heterogeneous and there are many structures available for memory hierarchies and interconnection networks.
These are very complex systems whose simulations must be done at system level since Register-Transfer Level (RTL) simulations, which are two or three orders of magnitude slower, are unacceptable.
System designers generally use software models as an aid in the development process. These models are used to validate the performance and correctness of proposed hardware designs through simulation before they are built. Software models can also be used to compare various designs and configurations.
∗ Corresponding author. Tel.: +351-218317000.
E-mail address: mvestias@deetc.isel.ipl.pt
Software models for simulation of hardware systems are designed with three issues in mind: performance, flexibility, and level of abstraction. The performance of the model itself is restricted by the resources available for simulation support. Flexibility is also important for designers. The ability to make simple modifications, to vary the design more significantly, or even to use entirely different designs is important while exploring different architectures or configurations. The level of abstraction relates to the level of detail with which the physical hardware is modeled in software. Designers can choose a level of abstraction that suits their purposes. Generally, flexibility comes at the expense of detail because more complexity reduces the flexibility of the model. Variations in performance, flexibility and level of abstraction lead to many different approaches for modeling hardware designs in software [1].
In this work, we describe a system-level simulator of a many-core architecture that we developed, and we use it to simulate a matrix multiplication algorithm. Different aspects of the architecture, such as the number of cores, as well as algorithmic variables, are explored with the simulator to identify the best architectural and algorithmic configurations. Theoretical models of the algorithm's execution time are used to evaluate the precision of the system-level simulator. Considering the three simulation issues described above, the developed simulator runs at system level to decrease simulation time and already offers some flexibility, allowing some parameters of the architecture to be configured.
Section 2 describes previous proposals of system-level simulators for many-core architectures. Section 3 gives an overview of SystemC. Section 4 describes the many-core architecture to be modeled, the matrix multiplication algorithm to be simulated and a theoretical model for the number of execution cycles. Section 5 describes the SystemC model of the architecture. Section 6 presents the simulation results.
2. Related Work
HORNET [2] is a parallel, cycle-level multicore simulator based on a wormhole router network-on-chip architecture. The parallel simulation engine offers cycle-accurate as well as periodic synchronization, while preserving functional accuracy. This permits tradeoffs between perfect timing accuracy and high speed with very good accuracy. Most hardware parameters are configurable, including memory hierarchy, interconnect geometry, bandwidth, and crossbar dimensions. A highly parameterized table-based NoC design supports a variety of routing and virtual channel allocation algorithms out of the box.
The SimMc simulator [3] is an infrastructure used to accelerate the design of many-core processors. The infrastructure consists of the many-core processor architecture, M-Core, the process simulator, SimMc, and the software library, MClib, which helps the development of application programs for M-Core. The simulator is designed to simulate various many-core processor architectures.
SiMany [4] is a new discrete-event simulator for many-core architectures supporting modern task-based programming models, like Cilk [5]. The simulator includes simple models for caches and cores, decreasing the time required to simulate these components. SiMany can be configured to explore a number of many-core architectures. The number of cores and their computing power are tunable, enabling the exploration of differently sized homogeneous but also heterogeneous architectures.
The COTSon team at HP Labs describes [6] a methodology to efficiently simulate a chip multiprocessor of hundreds or thousands of cores using a full system simulator. They consider an idealized architecture with a perfect memory hierarchy, i.e., without any interconnect, caches, or distribution of memory banks. Their experiments show that the simulator can scale up to 1024 cores with an average simulation speed overhead of only 30% with respect to the single-core simulation.
Our system-level simulator is based on SystemC. Using SystemC is advantageous since it allows hardware/software co-design and co-simulation in a single environment. The existing SystemC tools allow us to concentrate on the architectural models. SystemC also allows us to structure the architecture in modules that can be changed independently, which makes it easier to explore the design space. The communication network is modeled independently, so the communication can be explored without changing the modules. Another important advantage of SystemC is that it allows us to choose the level of abstraction used in specific models, trading off accuracy for simulation speed.
The main objective of this work was the development of a many-core simulator in standard SystemC with acceptable simulation times for hundreds of cores, to permit a controlled design space exploration, and accurate enough to support system-level decisions.
3. Overview of SystemC
SystemC is a system design language that has evolved in response to a pervasive need for improving the overall productivity for designers of electronic systems [7]. One of the primary goals of SystemC is to enable system-level modeling, that is, modeling of systems above RTL, including systems that might be implemented in software, hardware, or some combination of the two [8].
SystemC is based on the C++ programming language, which is an extensible object-oriented modeling language. It extends the C++ data types with additional types useful for modeling hardware that support all the common operations and provide special methods to access bits and bit ranges. SystemC adds a class library to C++ to extend its capabilities, effectively adding important concepts such as concurrency, timed events and data types. This class library is not a modification of C++, but a library of functions, data types and other language constructs that are legal C++ code [9].
In order to reduce complexity and simulation time, the design abstraction of a simulation model can be raised. Transaction-Level Modeling (TLM) uses a high level of abstraction compared to that of the more detailed register-transfer level (RTL), and it can also be used to validate systems through simulation. In TLM, the communication between components is separated from the implementation of those components. This division allows different communication and component modules to be used in order to explore a variety of designs while reusing already created modules as much as possible. Since the release of version 2.0, it has been possible to do TLM using SystemC [10].
Modules are the basic building blocks for partitioning a design in SystemC. A typical module contains processes that describe the functionality of the module, ports through which the module communicates with the environment, internal data and channels for maintaining the model state and for communication among the module's processes, and possibly other modules. A SystemC module is simply a C++ class definition and is described with the SC_MODULE macro.
In SystemC the basic unit of functionality is called a process. A process must be contained in a module. It is defined as a member function of the module and declared to be a SystemC process in the constructor of the module. The processes within a module are concurrent. An event is an object, represented by class sc_event, that determines whether and when a process's execution should be triggered or resumed; it has no value and no duration. An event is used to represent a condition that may occur during the course of simulation and to control the triggering of processes accordingly. Only two actions can be performed with an sc_event: wait for it or cause it to occur.
A SystemC interface is an abstract class that inherits from sc_interface and provides only pure virtual declarations of methods referenced by SystemC channels and ports. No implementations or data are provided in a SystemC interface. A SystemC channel is a class that implements one or more SystemC interface classes and inherits from either sc_channel or sc_prim_channel. SystemC has two types of channels: primitive and hierarchical. The simplest channels are sc_signal, sc_mutex, sc_semaphore, and sc_fifo.
4. Many-core Architecture and Algorithm Analysis
The many-core architecture to be modeled and simulated is organized as a 2D mesh of homogeneous cores (see Figure 1). Each core unit consists mainly of a floating-point multiply and accumulate unit (FPMAC) and a dual-port memory. The FPMAC is able to provide a sustained multiply-add result (2 FLOPs) every cycle. Data is received and sent by the core using FIFOs. Access to the external memory is controlled by a direct memory access (DMA) module that can deliver burst transactions with one transfer per cycle. The architecture supports horizontal and vertical broadcasts of data.
Considering this architecture, we now analyze an algorithm for matrix multiplication on a system with p cores. To facilitate the presentation, we consider that the mesh architecture is square. The results can be easily generalized to non-square meshes.
We also consider that all matrices involved are square and have the same size, and that their dimensions are multiples of the dimensions of the partitioned blocks. Again, this consideration does not limit in any way the generality of the results.
We define matrix C to be the result of the product between two square matrices, A and B, both with dimension n × n. Also, we define $C_{ij}$ to be the block matrix of size y × x that results from the product of block $A_{in}$ (with size y × n) of matrix A and block $B_{nj}$ (with size n × x) of matrix B.
[Figure 1: 4 × 4 mesh of processing elements (PE), each containing an FMAC unit and a local memory (MEM), connected through a DMA module to the external memory.]
Fig. 1. Many-core architecture
Each of the $p = q \times q$ cores, where $q = \sqrt{p}$, is responsible for calculating one block of matrix C with size $\frac{n}{q} \times \frac{n}{q}$. Each one of these blocks is partitioned, according to the memory limitations of the core, into sub blocks $C_{ij}$ with dimension y × x.
To generate a $C_{ij}$ block, the core must multiply a y × n block of matrix A with an n × x block of matrix B. This multiplication can be segmented as a sequence of $k_0 = n/z$ block multiplications as specified in equation (1).

$$C_{ij} = \sum_{k=1}^{k_0} A_{ik} \times B_{kj} \tag{1}$$

Therefore, each partial block multiplication consists of the multiplication of a y × z sub block $A_{ik}$ from matrix A with a z × x sub block $B_{kj}$ from matrix B, resulting in a partial sub block of size y × x. The final $C_{ij}$ result is obtained after accumulating the $k_0$ (partial) block multiplications.
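As an illustration of equation (1), the following is a minimal C++ sketch of how one y × x block of C can be accumulated from $k_0 = n$ partial products when z = 1. It is not the authors' code: the `Matrix` type and the function `block_product` are hypothetical helpers introduced only for this example, and the matrices are assumed to be stored in row-major order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix; a hypothetical helper type for this sketch.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Computes the y-by-x block of C = A*B whose top-left element is (row0, col0).
// With z = 1, step k uses the y elements of column k of the A block and the
// x elements of row k of the B block, and accumulates their outer product.
Matrix block_product(const Matrix& A, const Matrix& B,
                     std::size_t row0, std::size_t col0,
                     std::size_t y, std::size_t x) {
    assert(A.cols == B.rows);
    assert(row0 + y <= A.rows && col0 + x <= B.cols);
    Matrix C(y, x);
    for (std::size_t k = 0; k < A.cols; ++k) {          // k0 = n partial products
        for (std::size_t i = 0; i < y; ++i) {
            const double a = A.at(row0 + i, k);         // stored element of A_ik
            for (std::size_t j = 0; j < x; ++j) {
                C.at(i, j) += a * B.at(k, col0 + j);    // one multiply-add
            }
        }
    }
    return C;
}
```

Calling `block_product(A, B, 0, 0, y, x)` returns the top-left y × x block of the product; each iteration of the innermost loop corresponds to one multiply-add operation of a core.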
With this algorithm all the cores in the same row require the same data from matrix A, while all the cores in the same column require the same data from matrix B. Therefore, each sub block fetched from memory is broadcast to a row (column) of $\sqrt{p}$ cores. The total number of communications is given by

$$N_{\mathrm{comm}} = \frac{n^3}{\sqrt{p}} \left( \frac{1}{x} + \frac{1}{y} \right) + n^2 \tag{2}$$
We note that the number of communications does not depend on the dimension z of the sub blocks from matrix A
and matrix B, thus z can be simply made equal to 1 in order to minimize the local memory required.
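Equation (2) can be evaluated directly for a given configuration. The sketch below is a hedged illustration in C++, not part of the simulator: the function name `n_comm` and its parameter `q` (the mesh side, so that $p = q^2$) are assumptions made for this example, and exact integer arithmetic is used under the assumption that $q x$ and $q y$ divide $n^3$.

```cpp
#include <cassert>
#include <cstdint>

// Equation (2): N_comm = n^3/sqrt(p) * (1/x + 1/y) + n^2.
// q is the mesh side (p = q*q cores); y and x are the dimensions of block C.
// Assumes q*x and q*y divide n^3 exactly, so integer division loses nothing.
std::uint64_t n_comm(std::uint64_t n, std::uint64_t q,
                     std::uint64_t y, std::uint64_t x) {
    assert(q > 0 && x > 0 && y > 0);
    const std::uint64_t n3 = n * n * n;
    assert(n3 % (q * x) == 0 && n3 % (q * y) == 0);
    return n3 / (q * x)    // transfers of matrix B data
         + n3 / (q * y)    // transfers of matrix A data
         + n * n;          // write-back of the n^2 elements of C
}
```

For instance, `n_comm(128, 2, 4, 4)` evaluates to 540672 and `n_comm(1024, 2, 4, 8)` to 202375168, which are the communication counts reported for those configurations in Section 6.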
The partial block multiplication is implemented such that, first, all cores receive the sub block $A_{ik}$ with y elements and store it. Then each core receives the elements of a sub block $B_{kj}$, which are multiplied by the corresponding locally stored elements of $A_{ik}$, resulting in a partial sub block $C_{ij}$. This process repeats n times, after which the final sub block $C_{ij}$ is obtained.
The total number of computation cycles, assuming a core throughput of one accumulation per cycle, is given by

$$N_{\mathrm{compcycles}} = \frac{n^3}{p} \tag{3}$$

where $n^3$ is the total number of multiply-add operations to be performed by the p cores.
The minimal execution time is achieved when all the communications, except the initial overhead, can be totally overlapped with the computations, that is, when the number of communication cycles required is lower than the number of computation cycles, $N_{\mathrm{commcycles}} \le N_{\mathrm{compcycles}}$.
If there is full overlap, the total number of execution cycles is given by

$$N_{\mathrm{execcycles}} = \frac{n^3}{p} + \mathrm{Ovhd}_{\mathrm{initial}} + \mathrm{Ovhd}_{\mathrm{end}} \approx \frac{n^3}{p} \tag{4}$$
[Figure 2: modules inData, Processors (2D mesh) and outData connected through FIFOs.]
Fig. 2. Model in SystemC
The initial overhead, $\mathrm{Ovhd}_{\mathrm{initial}}$, corresponds to the number of cycles it takes until all data becomes available for the last cores to initiate the computations (that is, when the first element of the last required block of B arrives at the last core). The final overhead, $\mathrm{Ovhd}_{\mathrm{end}}$, corresponds to the additional number of cycles needed to write back the results obtained by the last core. The expression for the initial overhead is as follows:

$$\mathrm{Ovhd}_{\mathrm{initial}} = \sqrt{p}\, y + (\sqrt{p} - 1)\, x + 1 \tag{5}$$
The initial overhead is the sum of the number of cycles required to send $\sqrt{p}$ blocks A and $\sqrt{p} - 1$ blocks B to the respective processors, plus one more clock cycle for the last processor to receive the first value of the last block B and start the calculations.
The expression for the final overhead is approximated as

$$\mathrm{Ovhd}_{\mathrm{end}} \approx \frac{n^2}{p} \tag{6}$$

This overhead is the number of cycles that the last processor needs to send its results to the memory, which equals the number of C elements it calculates.
When the number of communication cycles required is higher than the number of computation cycles, $N_{\mathrm{commcycles}} \ge N_{\mathrm{compcycles}}$, the total execution time is given by $N_{\mathrm{execcycles}} \approx N_{\mathrm{commcycles}}$.
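The theoretical model of equations (2) to (6) can be collected in a few lines of C++. This is only a sketch under stated assumptions: the type `MeshConfig` and the function names are hypothetical, one transfer is assumed to take one cycle (so the number of communication cycles equals $N_{\mathrm{comm}}$), and integer division is assumed to be exact for the chosen parameters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter set for this sketch.
struct MeshConfig {
    std::uint64_t n;  // matrix dimension (n x n)
    std::uint64_t q;  // mesh side, p = q*q cores
    std::uint64_t y;  // rows of a block C
    std::uint64_t x;  // columns of a block C
};

// Equation (2), assuming one transfer per cycle.
std::uint64_t comm_cycles(const MeshConfig& c) {
    const std::uint64_t n3 = c.n * c.n * c.n;
    return n3 / (c.q * c.x) + n3 / (c.q * c.y) + c.n * c.n;
}

// Equation (3): n^3 multiply-adds shared by p cores.
std::uint64_t comp_cycles(const MeshConfig& c) {
    return c.n * c.n * c.n / (c.q * c.q);
}

// Equation (5): sqrt(p)*y + (sqrt(p) - 1)*x + 1.
std::uint64_t ovhd_initial(const MeshConfig& c) {
    return c.q * c.y + (c.q - 1) * c.x + 1;
}

// Equation (6): write-back of the results of the last core.
std::uint64_t ovhd_end(const MeshConfig& c) {
    return c.n * c.n / (c.q * c.q);
}

// Equation (4) when communication fully overlaps computation,
// otherwise the communication-bound approximation N_exec ~ N_comm.
std::uint64_t exec_cycles(const MeshConfig& c) {
    assert(c.q > 0 && c.x > 0 && c.y > 0);
    const std::uint64_t comm = comm_cycles(c);
    const std::uint64_t comp = comp_cycles(c);
    if (comm <= comp) {
        return comp + ovhd_initial(c) + ovhd_end(c);
    }
    return comm;
}
```

With n = 1024, q = 2, y = 4 and x = 8 the computation dominates and `exec_cycles` returns 268697617; with n = 128, q = 2 and x = y = 4 the communication dominates and the function returns the communication count, 540672.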
5. SystemC Modeling of the Architecture
The model consists of three modules, where modules inData and outData represent the DMA and module Processors represents the 2D array (see Figure 2).
Communication between modules is performed through FIFOs. One FIFO is needed to communicate with module outData and, for each core, one FIFO is needed to communicate with module inData.
Module inData depends on the specific algorithm. In the analyzed algorithm, it stores the values of matrices A and B and consists of two thread processes (SC_THREAD), which are responsible for sending data in the correct order to the cores through their FIFOs. The first process sends data from matrix A and the second sends data from matrix B. The module sends the data words one at a time, starting with one block A followed by one block B, and so on. The two processes are synchronized by two sc_event objects, each triggered by the respective thread (see code in Listing 1).
Module inData stops and waits whenever module Processors wants to send data to module outData.
Module Processors consists of p thread processes, where p is the number of cores. This is basically the model of the 2D mesh array of cores. Each thread process receives and stores the two blocks, from A and B, and performs the operations. The operations start as soon as the first value of the second block is available. The product of the two blocks corresponds to partial results of block C, which are stored in the core (see code in Listing 2). After obtaining the final results, the cores send the data one word at a time to module outData through a FIFO.
In order to reduce simulation time we use FIFOs implemented with simple variables and sc_event objects instead of sc_fifo or tlm_fifo. The producer writes data to the FIFO when it is not full and the consumer reads data from the FIFO when it is not empty (see the example of a consumer and a producer in Listings 3 and 4).
To support the execution and simulation of different algorithms and applications on the many-core architecture, module inData can be configured with the order in which data is sent to module Processors. Module Processors is also configurable with the number of cores, the size of the FIFOs and the code to run in each processor.
Listing 1 - Model in SystemC of module inData

    SC_MODULE(inData) {
        sc_event MA_event, MB_event;   // synchronize the two sender threads

        void blocoA_thread();          // sends the data of matrix A
        void blocoB_thread();          // sends the data of matrix B

        SC_CTOR(inData) {
            SC_THREAD(blocoA_thread);
            SC_THREAD(blocoB_thread);
        }

        void writeFIFOA(...);
        void writeFIFOB(...);
        void WAL(...);
        void WBL(...);
        void WA(...);
        void WB() {
            MB_event.notify();         // trigger MB_event
            wait(MA_event);            // suspend until MA_event occurs
        }
    };

Listing 2 - Model in SystemC of module Processors

    SC_MODULE(Processors) {
        void Calc_BA(int *C, int *A, int *B) {
            (...)
            for (int i = 0; i < tAl; i++) {
                z = i * tAc;
                for (int j = tBc; j < (tBl * tBc); j++) {
                    mul = A[z] * B[j];
                    sum = mul + sum;
                    z++;
                    if (z == tBc * (i + 1)) {
                        z = i * tAc;
                        C[x] = C[x] + sum;
                        x = x + 1;
                        if (x % tBl == 0) x = x + 1;
                        sum = 0;
                    }
                }
            }
        }
        (...)
    };
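Listing 3 - Model in SystemC of a FIFO read

    // limit is the FIFO capacity, defined elsewhere in the model.
    void readFIFO(int *count, int *i1, int *A,
                  int *FIFO, sc_event *ev_in, sc_event *ev_out) {
        if (*count == 0) wait(*ev_in);   // FIFO empty: wait for a write
        *A = FIFO[*i1];                  // read the oldest word
        *i1 = *i1 + 1;
        *count = *count - 1;
        ev_out->notify();                // signal that a slot was freed
        if (*i1 == limit) *i1 = 0;       // wrap the read index
    }

Listing 4 - Model in SystemC of a FIFO write

    void writeFIFO(int *f, int *count, int *FIFO,
                   sc_event *ev_in, sc_event *ev_out, int A) {
        if (*count == limit) wait(*ev_out);  // FIFO full: wait for a read
        FIFO[*f] = A;                        // store the new word
        *f = *f + 1;
        *count = *count + 1;
        ev_in->notify();                     // signal that data is available
        if (*f == limit) *f = 0;             // wrap the write index
    }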
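The index and counter bookkeeping used by the FIFO read and write functions is that of a circular buffer. The following C++ sketch isolates that logic outside SystemC, so it can be compiled and tested on its own. It is an illustration only: the class `RingFifo` and its methods are hypothetical names, and where the SystemC model would block on an sc_event, this version simply returns `false`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Circular buffer with the same read/write index and count logic as the
// event-based FIFO, but non-blocking: a failed operation returns false
// where the SystemC model would wait on an event.
class RingFifo {
public:
    explicit RingFifo(std::size_t limit)
        : buf_(limit), head_(0), tail_(0), count_(0) {
        assert(limit > 0);
    }

    bool try_write(int value) {
        if (count_ == buf_.size()) return false;  // full
        buf_[tail_] = value;
        tail_ = (tail_ + 1) % buf_.size();        // wrap the write index
        ++count_;
        return true;
    }

    bool try_read(int& value) {
        if (count_ == 0) return false;            // empty
        value = buf_[head_];
        head_ = (head_ + 1) % buf_.size();        // wrap the read index
        --count_;
        return true;
    }

    std::size_t size() const { return count_; }

private:
    std::vector<int> buf_;
    std::size_t head_;   // next position to read
    std::size_t tail_;   // next position to write
    std::size_t count_;  // number of stored words
};
```

A FIFO of capacity 2 accepts two writes, rejects a third, and then returns the words in the order in which they were written.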
6. Results
The simulator accepts a set of configuration parameters of the architecture, the algorithm to execute and the configuration of the input and output models, and generates a SystemC description of the system to be simulated.
The number of cycles is obtained considering one cycle to write to the FIFO and a multiply-add unit that produces one result per cycle. We present a set of initial results with only four processors since the objective is to show some tradeoffs in the design of the simulator. At the end, results with 16 and 100 processors are included.
Tables 1 and 2 show the results considering FIFOs large enough to obtain the minimum number of cycles.
Table 1. Results of the simulation considering 4 cores, matrices of size 128 × 128 and different sizes of sub-blocks

| Block A | Block B | Communications | Cycles | Simul. time |
|---|---|---|---|---|
| 4 × 1 | 1 × 4 | 540,672 | 540,682 | 13 s |
| 4 × 1 | 1 × 8 | 409,600 | 528,488 | 7 s |
| 4 × 1 | 1 × 16 | 344,064 | 528,584 | 6 s |
| 8 × 1 | 1 × 2 | 671,744 | 671,751 | 20 s |
| 8 × 1 | 1 × 4 | 409,600 | 528,492 | 8 s |
| 8 × 1 | 1 × 8 | 278,528 | 528,588 | 5 s |
From the results, we conclude that the number of communications and the number of clock cycles are consistent with equations (2) and (4). Considering the case of matrices with size 1024 × 1024 and blocks A and B with sizes, respectively, 4 × 1 and 1 × 8, the total number of communications and processing cycles are given by

$$N_{\mathrm{comm}} = \frac{1024^3}{\sqrt{4}} \left( \frac{1}{8} + \frac{1}{4} \right) + 1024^2 = 202{,}375{,}168 \tag{7}$$

$$N_{\mathrm{execcycles}} = \frac{1024^3}{4} + (2 \cdot 4 + 8 + 1) + \frac{1024^2}{4} = 268{,}697{,}617 \tag{8}$$
Table 2. Results of the simulation considering 4 cores, matrices of size 1024 × 1024 and different sizes of sub-blocks

| Block A | Block B | Communications | Cycles | Simul. time |
|---|---|---|---|---|
| 4 × 1 | 1 × 8 | 202,375,168 | 268,697,704 | 70 min |
| 4 × 1 | 1 × 16 | 168,820,736 | 268,697,800 | 40 min |
| 8 × 1 | 1 × 8 | 135,266,304 | 268,697,804 | 37 min |
| 8 × 1 | 1 × 16 | 101,711,872 | 268,697,996 | 26 min |
[Figure 3a: simulation time (s) versus size of block C (16, 32, 64, 128) for models M0 and M1, with 4 processing elements and a matrix of size 1024 × 1024. Figure 3b: number of communication and computation cycles versus size of block C (16 to 256) for model M1, with 4 processing elements and a matrix of size 1024 × 1024; the value 268,435,456 is labeled on the plot.]
Fig. 3. a) Results of the simulation considering 4 cores, matrices of size 1024 × 1024 and different sizes of sub-blocks; b) Results of the simulation considering 4 cores, matrices of size 1024 × 1024 and model M1
With more processors, larger matrices are multiplied in fewer cycles. However, as seen in Tables 1 and 2, increasing the size of the matrices increases the number of communications and consequently the simulation time.
In order to reduce the simulation time we made some changes to the model and raised the level of abstraction. For example, in the model considered in the previous simulations, M0, the final results are sent from the core to the memory cycle by cycle. In the new model, M1, the core sends the final results and then performs a single wait whose duration equals the number of results. Table 3 shows the results with model M1.
Table 3. Results of the simulation considering 4 cores, matrices of size 1024 × 1024 and model M1

| Block A | Block B | Communications | Cycles | Simul. time |
|---|---|---|---|---|
| 4 × 1 | 1 × 8 | 202,375,168 | 268,697,702 | 30 min |
| 4 × 1 | 1 × 16 | 168,820,736 | 268,697,798 | 16 min |
| 8 × 1 | 1 × 8 | 135,266,304 | 268,697,802 | 14 min |
| 8 × 1 | 1 × 16 | 101,711,872 | 268,697,994 | 8 min |
| 16 × 1 | 1 × 8 | 101,711,872 | 268,698,002 | 10 min |
| 16 × 1 | 1 × 16 | 68,157,440 | 268,698,386 | 5 min |
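The difference between the write-back in models M0 and M1 can be illustrated with a toy stand-in for a simulation kernel. The sketch below is not SystemC and not the authors' implementation: `ToyKernel`, `write_back_m0` and `write_back_m1` are hypothetical names, and the kernel only records the simulated time and the number of process suspensions. It shows that both styles advance the simulated clock by the same amount while M1 suspends the process only once, which is a plausible reason for its shorter simulation times.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a simulation kernel (hypothetical, not SystemC):
// tracks simulated time and how many times a process suspended.
struct ToyKernel {
    std::uint64_t now = 0;    // simulated clock cycles
    std::uint64_t waits = 0;  // number of wait() calls issued
    void wait(std::uint64_t cycles) {
        now += cycles;
        ++waits;
    }
};

// M0 style: results are written back one word per cycle,
// with one suspension per word.
void write_back_m0(ToyKernel& k, std::uint64_t results) {
    for (std::uint64_t i = 0; i < results; ++i) {
        k.wait(1);
    }
}

// M1 style: all results are handed over, followed by a single
// suspension that lasts as many cycles as there are results.
void write_back_m1(ToyKernel& k, std::uint64_t results) {
    k.wait(results);
}
```

For the 262,144 results produced by each core when n = 1024 and p = 4, both functions advance the clock by 262,144 cycles, but the M0 style issues 262,144 suspensions and the M1 style issues one. The sketch does not capture interactions with other processes during the write-back, which is where a coarser model can lose accuracy.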
Figure 3a compares model M0 with model M1 regarding the simulation time. As expected, the number of transfer cycles required decreases with the size of the C blocks (y × x) (see Figure 3b). The minimal number of execution cycles is achieved when all the communications, except the initial overhead, can be totally overlapped with the computations. The size of block C should be sufficient for the computation cycles to dominate, that is, for the transfers to fully overlap with the computations, but not larger. As long as there is full overlap, there is no need to further increase the size of the C blocks. In fact, as shown, above the ideal values the total number of clock cycles slightly increases with x and y. This is because the initial and final overheads are proportional to the sizes of the matrix blocks.
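The condition for full overlap can be checked without divisions. Writing $q = \sqrt{p}$, the inequality $N_{\mathrm{comm}} \le N_{\mathrm{compcycles}}$ from equations (2) and (3), multiplied by $q^2 x y / n^2$, becomes $n q (x + y) + q^2 x y \le n x y$. The C++ sketch below uses this form; the function names are hypothetical, the overheads of equations (5) and (6) are ignored, and one transfer per cycle is assumed.

```cpp
#include <cassert>
#include <cstdint>

// True when N_comm <= N_comp according to equations (2) and (3),
// using the equivalent integer form n*q*(x+y) + q^2*x*y <= n*x*y.
bool full_overlap(std::uint64_t n, std::uint64_t q,
                  std::uint64_t y, std::uint64_t x) {
    assert(n > 0 && q > 0 && x > 0 && y > 0);
    return n * q * (x + y) + q * q * x * y <= n * x * y;
}

// Smallest square block size b (y = x = b) that divides the n/q side of a
// core's portion of C and gives full overlap; returns 0 if none exists.
std::uint64_t smallest_square_block(std::uint64_t n, std::uint64_t q) {
    assert(q > 0 && n % q == 0);
    const std::uint64_t side = n / q;
    for (std::uint64_t b = 1; b <= side; ++b) {
        if (side % b == 0 && full_overlap(n, q, b, b)) {
            return b;
        }
    }
    return 0;
}
```

For n = 1024 and 4 cores, blocks with y = x = 4 do not give full overlap, while y = 4, x = 8 does, and the smallest square block with full overlap is 8 × 8.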
Comparing the results of Table 2 with those of Table 3, we find that, when the size of the C block is not sufficient for the computation cycles to dominate, model M1 is not as accurate as a lower-level model (the total number of execution cycles is slightly smaller than the number of communication cycles).
When the size of block C is not sufficient, the number of cycles increases because the sub blocks are too small, and therefore the time taken to perform the calculations is not enough for the memory to send new data to the core. The core does not have enough data to start the calculations after writing the final results to the memory, and so there is a delay. This delay is not precisely counted when we use the model with reduced simulation time.
Whenever we obtain a number of communications higher than the number of cycles in model M1, the result is not correct and the size of block C is not sufficient to obtain the minimum number of cycles. In these cases, we should use model M0 instead of M1. In cases where the size of block C is sufficient, the difference in the execution cycles is negligible. Therefore, the faster model, M1, is valid and acceptable for a system-level simulation in those cases.
The simulation times are tightly related to the number of communication cycles. The number of processors has little effect on the simulation times. We have simulated systems with 16 and 100 cores with model M1 (see Table 4).
Table 4. Results considering 16 cores with matrices of size 1024 × 1024 and 100 cores with matrices of size 1000 × 1000, both with model M1

| Cores | Block A | Block B | Communications | Cycles | Simul. time |
|---|---|---|---|---|---|
| 16 | 8 × 1 | 1 × 16 | 51,380,224 | 67,219,869 | 13 min |
| 16 | 16 × 1 | 1 × 8 | 51,380,224 | 67,176,340 | 9 min |
| 16 | 32 × 1 | 1 × 8 | 26,214,400 | 67,182,116 | 3 min |
| 100 | 20 × 1 | 1 × 20 | 11,000,000 | 10,058,129 | 3 min |
| 100 | 25 × 1 | 1 × 100 | 7,000,000 | 10,257,541 | 4 min |
| 100 | 50 × 1 | 1 × 50 | 5,000,000 | 10,257,568 | 3 min |
Simulation times are low and decrease with the number of cores since the number of communications also decreases.
7. Conclusion
In this paper we have described an approach to simulate a many-core architecture using SystemC. A matrix multiplication algorithm to execute on a 2-dimensional multiprocessor array was presented and analyzed theoretically. An architecture was developed and implemented in SystemC in order to model the many-core system. We simulated the model to evaluate the number of transfers and the number of clock cycles required for the complete execution of the algorithm. The simulated results confirmed the theoretical analysis (the differences are less than 2%) and demonstrated the very good accuracy of the simulation model. We have also proposed and tested two different models of the architecture with different tradeoffs between simulation time and accuracy.
Acknowledgment
This work was supported by national funds through FCT, Fundação para a Ciência e Tecnologia, under projects PEst-OE/EEI/LA0021/2013 and PTDC/EEA-ELC/122098/2010.
References
[1] Nicholas Ma, "Modeling and Evaluation of Multi-core Multithreaded Processor Architectures in SystemC", Master Thesis, 2007.
[2] M. Lis, R. Pengju, Myong Hyon Cho, Keun Sup Shim, C.W. Fletcher, O. Khan, S. Devadas, "Scalable, accurate multicore simulation in the 1000-core era," 2011 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 175-185, April 2011.
[3] K. Uehara, S. Sato, T. Miyoshi, K. Kise, "A Study of an Infrastructure for Research and Development of Many-Core Processors," 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 414-419, 8-11 Dec. 2009.
[4] O. Certner, Zheng Li, A. Raman, O. Temam, "A Very Fast Simulator for Exploring the Many-Core Future," IEEE International Symposium on Parallel & Distributed Processing, pp. 443-454, 16-20 May 2011.
[5] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou, "Cilk: An efficient multithread runtime system," in 5th Symposium on Principles and Practice of Parallel Programming, 1995.
[6] M. Monchiero, J. Ho Ahn, A. Falcon, D. Ortega, and P. Faraboschi, "How to simulate 1000 cores", Technical Report HPL-2008-190, Hewlett Packard Laboratories, November 9, 2008.
[7] D. Black and J. Donovan, "SystemC: from the Ground Up", Kluwer Academic Publishers, 2004.
[8] T. Grötker, S. Liao, G. Martin and S. Swan, "System Design with SystemC", Kluwer Academic Publishers, 2002.
[9] J. Bhasker, "A SystemC Primer", Star Galaxy Publishing, 2002.
[10] A. Rose, S. Swan, J. Pierce, J. Fernandez, "Transaction Level Modeling in SystemC," Cadence Design Systems, Inc.
