Simulações concorrentes em SystemC TLM-2 by Falcão, Tiago Rezende Campos, 1987-
Universidade Estadual de Campinas
Instituto de Computação
Tiago Rezende Campos Falcão
Concurrent SystemC TLM-2 Simulations
Simulações Concorrentes em SystemC TLM-2
CAMPINAS
2017
Thesis presented to the Institute of Computing
of the University of Campinas in partial
fulfillment of the requirements for the degree of
Master in Computer Science.
Supervisor: Prof. Dr. Rodolfo Jardim de Azevedo
This copy corresponds to the final version of the
dissertation defended by Tiago Rezende
Campos Falcão and supervised by Prof. Dr.
Rodolfo Jardim de Azevedo.
Funding agency and process number: CNPq, 182379/2008-6
ORCID: http://orcid.org/0000-0003-4060-053X
Cataloging record
Universidade Estadual de Campinas
Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467
Falcão, Tiago Rezende Campos, 1987-
F181c Concurrent SystemC TLM-2 simulations / Tiago Rezende Campos Falcão. –
Campinas, SP : [s.n.], 2017.
Advisor: Rodolfo Jardim de Azevedo.
Dissertation (master’s) – Universidade Estadual de Campinas, Instituto de
Computação.
1. Simulation (Computers). 2. SystemC. 3. Parallel programming
(Computer science). 4. Hardware - Systems engineering. 5. Computer
architecture. 6. Embedded computer systems. I. Azevedo, Rodolfo
Jardim de, 1974-. II. Universidade Estadual de Campinas. Instituto de
Computação. III. Title.
Information for the Digital Library
Title in another language: Simulações concorrentes em SystemC TLM-2
Keywords in English:
Computer simulation
SystemC
Parallel programming (Computer science)
Hardware - Systems engineering
Computer architecture
Embedded computer systems
Area of concentration: Computer Science
Degree: Master in Computer Science
Examination board:
Rodolfo Jardim de Azevedo [Advisor]
Bruno de Carvalho Albertini
Sandro Rigo
Defense date: February 3, 2017
Graduate program: Computer Science
Examination Board:
• Prof. Dr. Rodolfo Jardim de Azevedo
Universidade Estadual de Campinas
• Prof. Dr. Bruno de Carvalho Albertini
Universidade de São Paulo
• Prof. Dr. Sandro Rigo
Universidade Estadual de Campinas
The defense minutes, with the respective signatures of the board members, are on file in
the student’s academic record.
Campinas, February 3, 2017
Acknowledgements
First of all, I would like to thank you for reading this text, especially the members of the
examination board. This thesis is a significant step in this journey of my life. It tries
to represent the knowledge absorbed during this time. Few people understand how hard
it was and how much support I needed not to give up.
A long time ago, I heard that an advisor can have many responsibilities that make
him like a father, a psychologist, a friend, a guru, an example, and more. I could not
have had an advisor who better meets this description. I have a lot to thank Prof. Rodolfo
for: his time, words, presence, and support in this journey. I also mention Prof. Torres, a
great professor who taught my first programming classes and was my undergraduate
advisor, and for whom I hold a strong appreciation.
This master’s dissertation would never have been completed without Vanessa’s unconditional
companionship. She was the foundation of this step of my life, and to her I dedicate the end of
this work. Many nights, trips, and moments were lost to this work, including nights spent
sleeping in the lab to support me when I had no more energy to complete it.
I cannot leave my family out. All of them gave me support from afar to keep going,
many times with extreme confidence in my work. My father is an example of an academic
trajectory, with a long and prestigious career in Dentistry as a professor and researcher.
Many laboratory colleagues passed through during this course. Each one left their mark and
inspired me. Now, they pursue their careers in great companies and universities. To cite a few
of them: Maxiwell, Auler, Liana, Gabriel, Piga, Raoni, Alexandro, Portavales, Lois, Laís,
Mario, Nicácio, and João. In particular, a last-minute friend, Emílio.
Some companies have directly and indirectly helped this work. First of all, ProFUSION
gave me a job and developed my skills on real-world projects. All of my colleagues were
fantastic, and this helped me love the company and the team. Second, Intel (which later
bought ProFUSION) supported us with experimental hardware. Last but not least,
Google gave me an opportunity to represent them at my university, and through
them I met people like Diego and Rodrigo; it now motivates me to go further.
I could define this time as a change in my life. I tried and learned many things
that are not limited to deepening my research area. Unlike at my admission,
I am now a citizen with political participation in the future of our society, not a critic
inside a social network. And I opened my mind to learning medicine, Western and Chinese,
martial arts, and esoteric things.
I dedicated much time to the Sonhar Acordado NGO. They work toward a better society by
developing its younger citizens, through helping children. We have more than 500 monthly
volunteers in Campinas alone and have held events for more than 5 thousand people.
My participation at Sonhar is another phase ending with this dissertation; new social
commitments will come. They will also always have access to my counseling and aid
from a distance.
I must acknowledge the governmental institutions, CAPES and CNPq, and the Computing
Institute for the financial support, which lasted only a few months because of their
respective rules.
Resumo
Simulation is an important stage in the development of computer systems;
in it, the correctness, behavior, and performance of the system under development are
evaluated. SystemC is a System Level Description Language (SLDL), an extension of the
C++ language that supports different abstractions. It simulates the whole system
sequentially, without exploiting the potential for parallel processing. This dissertation
proposes a generic approach that allows SystemC components, isolated or grouped, to be
simulated in separate processes, which can be scheduled on different cores of one
computer or across a distributed system. The simulations are annotated
and can run as a traditional SystemC simulation or partitioned into
multiple processes with different communication options. The main advantage of this
approach is that it parallelizes SystemC simulations without requiring changes to the
models. To that end, the asynchronous transaction-level communication is serialized and
encapsulated in inter-process communication by modules that remove the need for a shared
memory address space. Replacing the communication introduces an
overhead on each encapsulated channel, which must be offset by the load on each
simulation node. This work also contributes the emulation of system calls,
which must behave locally even when the simulation is distributed.
Abstract
Simulation is one of the main stages of the validation process in systems design; in this
stage, system architects can verify the correctness, behavior, and performance of the
target system. SystemC is a System Level Description Language (SLDL), a C++ language
extension that supports different abstraction levels. Its downside is that simulations
execute sequentially and do not take advantage of parallel processing capabilities. This
dissertation proposes a generic technique that allows the simulation of a set of SystemC
components by encapsulating each one in a process, which can be scheduled over the cores
of one machine or distributed on a cluster. The simulations are annotated and can execute
as a traditional SystemC simulation or be partitioned into multiple processes with
different communication options. The main advantage of this approach is that it
parallelizes SystemC TLM-2 simulators using the original SystemC kernel and models. To
that end, the asynchronous transaction-level communication is serialized and encapsulated
over TCP/IP by wrapper modules that remove the requirement of a shared memory address
space. Replacing a simple function call with an internet protocol adds an overhead of 37x
to each communication hop, which should be offset by increasing the load on each
simulation node. This work also contributes syscall emulation, which requires local
behavior even when the simulation is distributed.
List of Figures
1.1 ENIAC – first computer (U.S. Army photo)
1.2 Transistors
1.3 Moore’s Law
1.4 Transistors evolution on processors [9]
1.5 From frequency to multi-cores [9]
1.6 NoC Example: Arbitrary number of processing units (PU) connected to routers (R) by their network interfaces (NI).
1.7 DES loop: Fetch current events, process, stores new events, advances simulation time.
2.1 An idealized top-down ESL design flow [1]
2.2 DES loop: Fetch current events, process, stores new events, advances simulation time. (Same as Figure 1.7)
2.3 Distributed system: This figure shows an example of distributed system with 3 Logical Processes (LPs). LP1 generates a signal A that is evaluated by LP2, and a signal B that is evaluated by LP3. LP2 receives signal C from LP3 and generates signal D that is received by LP1. [19]
2.4 Pin’s software architecture [20]
2.5 SystemC
2.6 Diagram of Transaction Level Modeling (TLM) payload
2.7 Diagram of OSCI TLM-2.0 (TLM-2.0) communication between core and memory
2.8 ADL-based (ArchC) exploration flow [1]
2.9 AMPSoCBench platform designed as a mesh consisting of processors, memory and IPs
2.10 A taxonomy of Unix Inter-Process Communication (IPC) facilities [51]
2.11 Two processes with a shared mapping of the same memory region [51]
2.12 Memory address space for Linux/x86-32 [51]
2.13 Overview of system calls used with sockets [51]
2.14 Connected TCP sockets [51]
2.15 Top-level module and automated partition from Cox [58]
2.16 Modification of SystemC scheduler by Ezudheen et al. [24]
2.17 Comparing sequential and parallel simulation for Schumacher et al. [61]
2.18 Dynamic management of processes groups by Schumacher et al. [62]
2.19 RAVES architecture [65]
2.20 The architecture of the ArchSim simulation platform [67]
2.21 A RTL platform to parallel simulation using ArchSC [67]
2.22 SCGPSim [68]
2.23 Design flow overview of parallel GPU-CPU [70]
2.24 SAGA methodology steps [71]
2.25 Reder et al. [73] tool flow integrating static and dynamic model analysis.
2.26 SystemC-SMP with TLM-DT modules [75]
2.27 Peeters et al. [77] example.
2.28 Peeters et al. [77] wrapper.
2.29 Sauer et al. [79] CoMix manual partition.
2.30 SCale kernel diagram [84].
2.31 Execution of a Program with During Tasks [4]
3.1 TLM-2.0 communication between core and memory (Similar to Figure 2.7).
3.2 TLM-2.0 communication between core and memory with YAPSC executing on 2 distinct domains.
3.3 Diagram of 2×2 NoC example
3.4 2×2 Network on-chip (NoC) example divided by tiles
3.5 2×2 NoC example divided by components
3.6 Peeters et al. [77] wrappers (Same as Figure 2.28).
3.7 Yet Another Parallel SystemC (YAPSC) wrappers (Same as Figure 3.2).
3.8 YAPSC connection of the top routers of example with 2×2 NoC (Figure 3.4)
3.9 Diagram of TLM payload with YAPSC extension (Similar to Figure 2.6)
4.1 Producer-Consumer simulation (Similar to Figure 3.7).
4.2 Latency (seconds) waiting payload answer
4.3 Latency (seconds) without wait payload answer
4.4 YAPSC simulation of processor and memory in 2 domains (Similar to Figure 3.2).
4.5 YAPSC simulation time (seconds) of processor without cache
4.6 YAPSC simulation time (seconds) of processor with cache
A.1 TLM2 communication between core and memory components
A.2 Connections with the TCP module
A.3 The TCP message
A.4 TLM2 communication between a core and a memory with the TCP module
A.5 Example of a 2x2 NoC
List of Tables
2.1 Related work
4.1 Localhost latency (seconds) waiting payload answer
4.2 Latency (seconds) without wait payload answer
4.3 YAPSC simulation time (seconds) of processor without cache
4.4 YAPSC simulation time (seconds) of processor with cache
A.1 Configurations tested for the Producer-Consumer module
List of Listings
2.1 SystemC top-level description of a minimalist system [1]
2.2 Example of Register-Transfer Level (RTL) and port in SystemC
2.3 Generic layout of a TLM2 processor
2.4 Generic layout of a TLM2 memory
2.5 Generic layout of a processor and memory bind with TLM2 sockets
2.6 Excerpt of the MIPS AC_ISA Description
2.7 Excerpt of the MIPS AC_ARCH Description
2.8 MPI program that prints greetings from the processes [2]
2.9 Connection of two RTL routers using Trams [3]
2.10 Parallelism With Duration in SystemC [4]
3.1 Processor and Memory using TLM-2.0
3.2 Processor and Memory using TLM-2.0 with YAPSC API
3.3 Controller Message Description
3.4 Partition description file to Figure 3.4
3.5 Partition description file to Figure 3.5
3.6 Payload Message Description
3.7 Syscall Message Description: included in Listing 3.3
4.1 Consumer module
4.2 Producer module with sc_event
4.3 Producer module without sc_event
4.4 Latency experiment: SystemC, standard back-end
4.5 Latency experiment: UDS back-end
4.6 Latency experiment: MPI back-end
4.7 Latency experiment: TCP back-end
4.8 Partition description file to Figure 4.4
4.9 YAPSC simulation of processor and memory
A.1 Excerpt of the connection of two routers without the TCP module
A.2 Excerpt of the connection of two routers using the TCP module
Glossary
SC_MODULE SystemC module.
SC_PROCESS SystemC module processes.
SC_THREAD SystemC module threads.
sc_main SystemC main function.
sc_signal SystemC blocking signal.
sc_start SystemC start method.
ADL Architecture Description Language.
API Application Programming Interface, a set of routines, protocols, and tools for building software applications.
ArchC An Architecture Description Language (ADL) that follows the SystemC syntax style to describe processor models for system and platform representations.
AST Abstract Syntax Tree.
AT Approximately Timed.
back-end In YAPSC, the IPC mechanism used.
BCA Bus Cycle Accurate.
CPU Central Processing Unit.
CUDA A parallel computing platform and API model created by Nvidia for GPUs.
DAR Dynamic abstract representation.
DBT Dynamic Binary Translation.
DES Discrete Event Simulations.
DFS Dynamic frequency scaling.
DMI Direct Memory Interface.
DSM Distributed Shared Memory.
EDA Electronic Design Automation.
endianness The order of the bytes that compose a digital word in computer memory. It also describes the order of byte transmission over a digital link. Words may be represented in big-endian, middle-endian, or little-endian format.
ESL Electronic System Level.
GPU Graphics Processing Unit.
ID Identifier.
IEEE Institute of Electrical and Electronics Engineers.
IMC Interface Method Calls.
IP Internet Protocol.
IPC Inter-Process Communication.
ISA Instruction Set Architecture.
ISS Instruction-Set Simulator.
JIT Just-in-time.
Linux In this text, the GNU/Linux OS.
LLVM LLVM Compiler Infrastructure.
LP Logical Process.
LT Loosely Timed.
mDNS Multicast Domain Name System.
MPI Message Passing Interface.
MPSoC Multiprocessor System-on-Chip.
NoC Network on-chip.
OpenMP An API that supports multi-platform shared memory multiprocessing programming.
OS Operating system.
payload TLM message object.
PDES Parallel Discrete Event Simulations.
POSIX Portable Operating System Interface, a family of standards specified by the IEEE for maintaining compatibility between OSs.
protobuf Protocol Buffers, Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.
PW YAPSC proxy wrapper module.
QuickThread A toolkit for building user-level threads packages.
RAW Read-after-write.
RISC Reduced Instruction Set Computer.
RTL Register-Transfer Level.
RW YAPSC remote wrapper module.
SAR Static abstract representation.
SHM Shared memory.
SLDL System Level Description Language.
SMP Symmetric multiprocessing.
SoC System-on-Chip.
syscall The programmatic way in which a computer program requests a service from the kernel of the OS it is executed on.
SystemC A System Level Description Language built as a C++ library [5].
TCP Transmission Control Protocol.
TCP/IP TCP over IP stack.
TLM Open SystemC Initiative Transaction Level Modeling.
TLM-2.0 OSCI Transaction Level Modeling version 2.0.
TLM/LT Loosely Timed TLM.
TTY Unix terminal.
UDP User Datagram Protocol.
UDP/IP UDP over IP stack.
UDS Unix domain sockets.
URI Uniform Resource Identifier.
VLSI Very-large-scale integration.
WAR Write-after-read.
WAW Write-after-write.
YAPSC Yet Another Parallel SystemC.
Contents
1 Introduction
1.1 Contributions
1.2 Organization
2 Basic Concepts and Related Work
2.1 Electronic System Level design
2.1.1 Discrete Event Simulations
2.1.2 Instrumented simulation tools
2.2 SystemC
2.3 ArchC
2.4 Benchmarks
2.5 IPC
2.5.1 Shared Memory
2.5.2 Sockets
2.5.3 Message Passing Interface (MPI)
2.6 Parallel SystemC Simulations
2.6.1 Parallelization Inside Cycles
2.6.2 Dependency Analysis
2.6.3 Distributed Time / Relaxing Synchronizations
2.6.4 Without SystemC
2.6.5 Parallelization Review
2.7 ESL and Simulators Parallelizations
3 Concurrent TLM-2.0
3.1 YAPSC_INIT
3.1.1 Domains
3.1.2 Controller
3.1.3 Partitioning
3.2 YAPSC_MODULE
3.3 YAPSC_TARGET_SOCKET
3.4 YAPSC_INITIATOR_SOCKET
3.5 YAPSC_START
3.5.1 TLM-2.0 Wrappers
3.5.2 Payload Managing
3.5.3 Serialization
3.5.4 Syscalls
3.5.5 End of Simulation
3.6 YAPSC_FINALIZE
3.7 YAPSC_PAYLOAD_EXTENSION
3.8 Sequential SystemC
4 Experiments
4.1 Characterization
4.2 Exploration
5 Conclusions
5.1 Publications
5.2 Future Work
5.2.1 MPSoCBench
5.2.2 Timing
5.2.3 Offload of simulations
5.2.4 Diversity of communication channels
5.2.5 New techniques
5.3 Source Code
A WSCAD 2015 Paper
A.1 Abstract
A.2 Resumo
A.3 Introduction
A.3.1 Related Work
A.4 TLM2 over TCP
A.4.1 Example: System with 1 Processor
A.4.2 Example: NoC - 2x2 Mesh with 2 Processors
A.5 Experimental Evaluation
A.5.1 Producer - Consumer
A.5.2 MPSoCBench
A.6 Conclusions
Chapter 1
Introduction
Figure 1.1: ENIAC – first computer (U.S. Army photo)
The evolution of the digital computer, from machines that occupied entire rooms
(Figure 1.1) to smartphones that fit in our pockets, shows how science and industry keep
delivering components that are smaller, faster, and lower-power than those of past generations.
Today, the key components of computing are the transistors (Figure 1.2a) in integrated
circuits, which, for a given input, either propagate a value or not (Figure 1.2b). Logic
gates, such as AND and OR (Figure 1.2c), are arrangements of transistors, and these gates
together implement the operations of a processor or other circuits.
The speed of a circuit is limited by how fast its transistors can respond to an input
(frequency), by the complexity of the operations the circuit can perform, and by how many
operations it can execute at the same time. Smaller transistors need less power and can
operate at a higher frequency than larger ones. Furthermore, more of them fit in the same
area, which enables improvements in operation complexity, parallelism, and
optimizations.
(a) An NMOS transistor: no electrical current flows if the gate is grounded.
(b) An NMOS transistor: a positive voltage on the gate induces an electron path in the channel.
(c) CMOS OR gate.
Figure 1.2: Transistors
Gordon E. Moore, co-founder and Chairman Emeritus of Intel, stated in 1965 that
the number of components on a chip would double every year [6] (Figure 1.3a), a prediction
that became known as Moore’s Law. Later, in 1975, Moore revised his prediction, slowing
the doubling period from 12 to 18 months [7] (Figure 1.3b). In 1995, the law was revised
again to an average doubling period of 2 years [7] (Figure 1.3c).
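The three doubling periods quoted above lead to very different projections. As a minimal sketch, assuming an idealized exponential growth N(t) = N0 · 2^(t/T), where T is the doubling period, the projection can be computed as follows. The function name and its parameters are illustrative and do not come from the cited sources.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: projected component count after `years`, assuming the
// count doubles once every `doubling_period_years` (idealized Moore's Law).
double projected_components(double initial_count, double years,
                            double doubling_period_years) {
    assert(doubling_period_years > 0.0);
    // Number of doublings is years / period; each doubling multiplies by 2.
    return initial_count * std::exp2(years / doubling_period_years);
}
```

Starting from 1000 components, a decade of yearly doubling gives 1,024,000 components, while a two-year doubling period over the same decade gives 32,000.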
The fact is that the number of components inside a chip has grown fast and may keep
growing (Figure 1.4) for as long as the transistor size can be reduced. More transistors
can mean more cores or specialized features in processors, and many specific
functions have already been integrated into what is commonly called a System-on-Chip (SoC).
(a) 1965: doubling each year. [8]
(b) 1975: double each 18 months. [6]
(c) 1995: double each 2 years. [7]
Figure 1.3: Moore’s Law
In the last 15 years, designers stopped increasing processor frequency because of the
limits of heat dissipation. The performance improvements that came from Dennard scaling
[10, 11] leveled off (Figure 1.5). Smaller transistors need less energy when operated
at a lower frequency, so the same chip area can hold more transistors without raising heat
production. The growing number of transistors is commonly used to include more
cores on the processor chip, so the processor can execute more tasks simultaneously and
is sold as a faster processor. Today, there are processors with more than 100 cores in
one chip, and research aims to fit thousands.
[Plot, 1970 to 2010, logarithmic vertical axis from 1e+00 to 1e+09. Series: transistors, transistor density (mm^-2), feature size (µm), die size (mm^2). Reference lines: Moore’s Law 1965, 1975, and 1995.]
Figure 1.4: Transistors evolution on processors [9]
[Plot, 1970 to 2010, logarithmic vertical axis from 1e+00 to 1e+09. Series: transistors, clock (kHz), cores, TDP per die size (W mm^-2). Annotation: power wall.]
Figure 1.5: From frequency to multi-cores [9]
Fitting many cores in one chip requires understanding the computer system as a whole:
the size, complexity, and number of cores; the other components included; and how all the
processor’s elements interact with each other. A bottleneck can emerge from any centralised
or shared resource in the system [12], in line with Amdahl’s law [13], which describes how
one non-optimised resource limits the theoretical speedup of the whole system.
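Amdahl’s law is usually written as S = 1 / ((1 - p) + p / s), where p is the fraction of the execution time that benefits from an improvement and s is the speedup of that fraction. The sketch below evaluates this formula; the function name and parameter names are illustrative assumptions, not taken from the cited work.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction `p` of the run time is
// accelerated by a factor `s` and the remaining (1 - p) is left unchanged.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.9, the overall speedup stays below 10 no matter how large s becomes, because the untouched 10% of the run time dominates.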
The traditional communication between components in a chip is a shared bus, one set
of wires where a component broadcasts its message to all others attached to the same
bus. Each component needs to check whether the transmitted data is intended for itself.
To avoid a conflict when two or more components send data at the same time, they
must implement bus arbitration to grant only one bus access at a time [14]. So, the
more components connected to the bus, the greater the contention for this resource. A
common solution uses multiple buses, each shared by a subset of the
components, or even point-to-point links between them. However, this mechanism requires
too many wires in the chip. The next possible step is to include routers connecting and
managing the communication between components.
Figure 1.6: NoC Example: Arbitrary number of processing units (PU) connected to
routers (R) by their network interfaces (NI).
Components can communicate by message passing through a NoC (Figure 1.6): wires and routers
connecting the components, much like a distributed computer network exchanging UDP over IP
(UDP/IP) packets. The messages need metadata describing the source and the target, so the
routers can direct each message to the correct target. Internally, routers can choose
predefined paths or balance the load over the available paths to the target. The overhead of
the many routing hops needed to reach the target and get a response must be lower than the
cost of the conflicts on a traditional unified bus. If all components send many requests to
a single component, the connection with this common target will compromise the performance
of the whole system.
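The role of the source/target metadata can be sketched with a router decision function. This is only an illustration: it assumes a 2D mesh and dimension-ordered (XY) routing, one of the simplest predefined-path policies, and the `Packet`, `Port`, `route_xy`, and `hop_count` names are hypothetical.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical packet: the header carries the coordinates a router needs.
struct Packet {
    int src_x, src_y;        // source router coordinates in the mesh
    int dst_x, dst_y;        // target router coordinates in the mesh
    std::uint32_t payload;   // the data being transported
};

enum class Port { Local, East, West, North, South };

// Dimension-ordered (XY) routing: the router at (x, y) first corrects the
// X coordinate, then the Y coordinate, and finally delivers locally.
Port route_xy(int x, int y, const Packet& p) {
    if (p.dst_x > x) return Port::East;
    if (p.dst_x < x) return Port::West;
    if (p.dst_y > y) return Port::North;
    if (p.dst_y < y) return Port::South;
    return Port::Local;
}

// Number of router-to-router hops the packet needs on this predefined path.
int hop_count(const Packet& p) {
    return std::abs(p.dst_x - p.src_x) + std::abs(p.dst_y - p.src_y);
}
```

Each hop adds routing overhead, which is why the text compares the total routing cost against the contention cost of a single shared bus.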
The development and exploration of these complex systems require more and better simulations
to understand the entire SoC behavior before physical production. Simulation is an important
technique in many engineering processes to reduce the development cost of a new product,
especially when exploring innovations. It avoids spending resources on physical prototypes
and release candidates before system architects understand how the specified components
impact the performance of the whole system for each possible use.
Simulation mitigates the cost and the time required to explore and develop new processor
architectures and to try out software on these future machines [15]. However, poor
simulation performance often limits the scope and depth of what can be tested in a project.
This limitation is particularly evident when simulators must multiplex the cores of future
manycore processors, with thousands of them, on the available cores of current machines.
Furthermore, the majority of the available simulators do not use all cores; they are limited
to executing one or a few tasks at a time.
Figure 1.7: DES loop: fetch current events, process them, store new events, advance
simulation time.
Sequential Discrete Event Simulation (DES) uses the simulation state, an event queue, and a
global clock [16]. Each event in the simulation has a time annotation that denotes when to
process it in the global simulation progress. The DES simulator advances the global clock by
managing the event queue: it removes the current events from the queue, processes them,
changes the affected states, and stores the new events (Figure 1.7). The idea of parallel
discrete event simulation (PDES) is not recent, but the problem is not solved. On one side,
there are sequential simulators widely accepted by industry and science; on the other, there
are many PDES simulator proposals that are still little tested and lack consistent
performance.
Cutting-edge research is working on thousands of cores inside a chip: how to fit them
physically and how to program that many cores to exploit their potential performance.
Today, the industry sells processors with hundreds of cores, but many current simulators
will not use this capability to develop the desired next generations of cores.
1.1 Contributions
This dissertation continues the work of Falcão et al. [17], which proposed an SC_MODULE that
encapsulates TLM-2.0 with IPC over a SystemC simulation partitioned into many processes
using TCP over IP (TCP/IP). This work proposes the YAPSC API, which allows the execution of
the same simulation code on the unmodified SystemC kernel, in sequential or concurrent mode,
using several IPC back ends.
Parallel Allows running simulations with more processing power than the simulated systems,
using a cluster or even all cores of the machine;
Simple A proof of concept that can be enhanced for specific contexts but works out of the box;
Modular The required wrappers are attached and detached without changing the existing
modules, and communicate with minimal or no configuration;
Validation Friendly Does not touch the simulation kernel, which is widely used and tested.
Our solution does not require a forked library or project.
1.2 Organization
This Dissertation is organized as follows:
Chapter 2 - Basic Concepts and Related Work introduces hardware simulation concepts and
tools, describes relevant simulators, and explores the prior work on accelerating
simulations.
Chapter 3 - Concurrent TLM-2.0 introduces and describes our proposal.
Chapter 4 - Experiments presents the tests and the results achieved with our proposal.
Chapter 5 - Conclusions concludes this text by reviewing the contributions.
Chapter 2
Basic Concepts and Related Work
This chapter describes the basic concepts and tools related to our research. Section 2.1
introduces Electronic System Level (ESL) design, modeling, and simulation; Section 2.2
describes SystemC, an IEEE standard for ESL, and its reference implementation, which is the
main application of the work of this dissertation; Section 2.3 presents the ADL that
simplifies the creation of a processor model and its software tools, which allows
instantiating hundreds of cores and motivates our desire for faster SystemC simulations;
Section 2.4 presents the benchmark suites used to test and evaluate our work; Section 2.5
then lists and describes the IPC options, the approach this work employs for communication
between the individual parts of the simulation.
Section 2.6 reviews, in chronological order, more than a decade of proposals for SystemC
parallelization.
2.1 Electronic System Level design
In the mid-nineties, System-on-Chips (SoCs) became a reality as a result of the long
evolution of Very-Large-Scale Integration (VLSI) technology. The growing complexity of
integrated circuits, and its impact on time-to-market, drew attention to the following
notions:
Abstraction: The hardware representation level follows the complexity of the project.
Current SoCs use a functional/behavioral level instead of the lower levels:
register-transfer, gate, circuit, or physical.
Reuse: Previous components should be reusable in new designs. Reuse reduces the testing
requirements compared to modules developed from the ground up. The capability of improving
a past version of a component is essential to its reuse, which can reflect the abstraction
choice: lower-level designs are less flexible to improvements over time.
Automation: Automatic tools reduce errors and the time to market. The complexity and the
number of steps in the manufacture of electronic systems require automatic tools for
synthesis, physical cell allocation, and verification. Designers rely on synthesis tools
that refine the representation from higher to lower abstraction levels; on Electronic
Design Automation (EDA) tools to design physical integrated circuits; and on an
infrastructure of verification tools.
Exploration: Designers should be concerned with the area, performance, and power of their
integrated circuits. The possibility of analyzing alternative solutions for a given model
increases the chances of meeting the project requirements. Moreover, refined lower levels
of abstraction reduce the number of alternatives for one model.
From the former gate level to the now common RTL, the current complexity of SoCs requires
even higher abstraction levels. System-level representations with functional behaviour
enable tools that automatically provide both hardware and software solutions for the
designed system.
Figure 2.1: An idealized top-down ESL design flow [1]
An idealized ESL design starts from a natural-language specification of the system
(Figure 2.1). The system is modeled from this specification and, before the hardware or
software model is created, an executable model is produced and used to test whether the
initial requirements of the desired system are met: functional behaviour, performance,
storage space, power, and communication traffic.
The pre-partition analysis requires interactive specification modeling. The executable
model is written to meet the functional description of the system, focused on fast analysis
and validation. This dissertation works at this step of the ESL design, where fast
preliminary analysis provides an early model ready to be partitioned into hardware and
software models.
2.1.1 Discrete Event Simulations
A DES describes the functional behaviour of an executable model. A discrete event simulation
model assumes that the simulated system only changes state at discrete points of simulated
time [16]. An event occurs at a discrete time and can generate future events. For example,
a cache memory receives a data request at t = x and schedules the answer for t = x + r,
where r is the expected time that this cache needs to read its internal structure.
Typically, a DES uses three data structures [16]: the state variables that describe the
current state, an event list with the pending events, and a global clock. The main loop
removes the next events from the queue, processes them, updates the current state, and
advances the global clock (Figure 1.7).
Figure 2.2: DES loop: fetch current events, process them, store new events, advance
simulation time. (Same as Figure 1.7)
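The loop above can be sketched in a few lines of standard C++. This is a minimal illustration, not the SystemC kernel: the `Simulator` class, its methods, and the `cache_answer_time` helper are hypothetical names, and the only state is whatever the scheduled actions capture.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical minimal sequential DES kernel: event queue plus global clock.
class Simulator {
public:
    using Time = std::uint64_t;
    using Action = std::function<void(Simulator&)>;

    // Store a new event `delay` time units after the current simulation time.
    void schedule(Time delay, Action action) {
        queue_.push(Event{now_ + delay, next_id_++, std::move(action)});
    }

    // Main loop: fetch the earliest event, advance the clock, process it.
    void run() {
        while (!queue_.empty()) {
            Event ev = queue_.top();
            queue_.pop();
            now_ = ev.when;      // the global clock never moves backwards
            ev.action(*this);    // may change state and schedule new events
        }
    }

    Time now() const { return now_; }

private:
    struct Event {
        Time when;
        std::uint64_t id;        // insertion order, breaks timestamp ties
        Action action;
    };
    struct Later {               // orders the queue by (timestamp, insertion)
        bool operator()(const Event& a, const Event& b) const {
            return a.when != b.when ? a.when > b.when : a.id > b.id;
        }
    };
    std::priority_queue<Event, std::vector<Event>, Later> queue_;
    Time now_ = 0;
    std::uint64_t next_id_ = 0;
};

// The cache example from the text: a request at t = x schedules its answer
// for t = x + r. Returns the simulated time at which the answer is processed.
Simulator::Time cache_answer_time(Simulator::Time x, Simulator::Time r) {
    Simulator sim;
    Simulator::Time answered_at = 0;
    sim.schedule(x, [&answered_at, r](Simulator& s) {    // request arrives
        s.schedule(r, [&answered_at](Simulator& s2) {    // answer is due
            answered_at = s2.now();
        });
    });
    sim.run();
    return answered_at;
}
```

With x = 10 and r = 3, `cache_answer_time` returns 13, matching the t = x + r annotation described above.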
Parallel Discrete Event Simulation (PDES), or distributed DES, tries to process events in
parallel. The main difficulty in executing events in parallel is keeping their data
consistent. If the queue has any synchronization problem, an event with a larger timestamp
can execute before an event with a smaller one, which is clearly a data hazard for the state
variables.
A PDES can implement a conservative or an optimistic behaviour. The first approaches adopted
in the literature are the conservative ones, where all time constraints are strictly
fulfilled [18]. The simulation only processes an event if all events with lower timestamps
are done.
In general, PDES strategies consist of logical processes (LPs), each representing a process
of the modeled system (Figure 2.3). These LPs are essentially self-contained discrete event
simulations. This approach requires strong synchronization between the LPs, but simplifies
the state management of the full simulation. Executing all events in strict global time
order avoids data hazards. Each LP keeps its own clock and synchronizes it when sending
messages to other LPs. The system can reach a deadlock if an LP stays a long time without
communicating with the others; this requires sending null messages to keep the clocks
synchronized and advancing.
Figure 2.3: Distributed system: This figure shows an example of distributed system with
3 Logical Processes (LPs). LP1 generates a signal A that is evaluated by LP2, and a
signal B that is evaluated by LP3. LP2 receives signal C from LP3 and generates signal
D that is received by LP1. [19]
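The conservative rule and the role of null messages can be sketched as follows. This is an illustration in the style of the classic null-message protocol, under the assumption that each input channel delivers messages in non-decreasing timestamp order; the `LogicalProcess` structure and its members are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using Time = std::uint64_t;

// Hypothetical conservative LP: it only tracks the synchronization state.
struct LogicalProcess {
    std::vector<Time> channel_clock;  // last timestamp seen on each input channel
    Time lookahead;                   // minimum delay the LP adds to any output

    // No input channel can still deliver a message earlier than this time.
    Time safe_time() const {
        if (channel_clock.empty()) return std::numeric_limits<Time>::max();
        return *std::min_element(channel_clock.begin(), channel_clock.end());
    }

    // Conservative rule: process an event only if nothing earlier can arrive.
    bool can_process(Time event_time) const { return event_time <= safe_time(); }

    // A real or null message advances the clock of the channel it came from.
    void receive(std::size_t channel, Time timestamp) {
        channel_clock[channel] = std::max(channel_clock[channel], timestamp);
    }

    // Timestamp for a null message sent while the LP is blocked: a promise
    // that no output earlier than this time will ever be produced.
    Time null_message_time() const {
        Time safe = safe_time();
        if (safe > std::numeric_limits<Time>::max() - lookahead)
            return std::numeric_limits<Time>::max();
        return safe + lookahead;
    }
};
```

In the scenario of Figure 2.3, a blocked LP2 would send such a null message to LP1 so that LP1 can advance its channel clock instead of waiting indefinitely.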
The optimistic approach relaxes the time synchronization using a time quantum that enables
the simulation to process future events. The simulator expects that the selected future
events do not conflict with the not-yet-executed events with lower timestamps. If a conflict
is detected, the simulation must postpone the event, or roll back the state if the detection
happens after the event was processed.
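A rollback can be sketched with a process that checkpoints its state before every event. This is a simplified illustration: the `OptimisticProcess` class is hypothetical, its state is a single number so that the event order is visible in the result, and the cancellation of messages already sent to other processes (needed by a complete optimistic simulator) is left out.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Time = std::uint64_t;

// Hypothetical optimistic process. Each event appends one decimal digit to
// the state (state = state * 10 + digit), so the processing order matters.
class OptimisticProcess {
public:
    // Processes the event immediately. Returns true when the event arrived
    // in the local past (a straggler) and therefore forced a rollback.
    bool receive(Time timestamp, int digit) {
        bool straggler = timestamp < clock_;
        if (!straggler) {
            apply(timestamp, digit);
            return false;
        }
        // Roll back: restore the checkpoint taken before the first event
        // that is later than the straggler, and remember what was undone.
        auto first_later = log_.upper_bound(timestamp);
        state_ = first_later->second.state_before;
        std::vector<std::pair<Time, int>> redo;
        for (auto it = first_later; it != log_.end(); ++it)
            redo.emplace_back(it->first, it->second.digit);
        log_.erase(first_later, log_.end());
        // Re-execute in timestamp order: the straggler first, then the rest.
        apply(timestamp, digit);
        for (const auto& ev : redo) apply(ev.first, ev.second);
        return true;
    }

    long state() const { return state_; }

private:
    struct Record {
        int digit;
        long state_before;   // checkpoint taken before applying the event
    };

    void apply(Time timestamp, int digit) {
        log_.emplace(timestamp, Record{digit, state_});
        state_ = state_ * 10 + digit;
        clock_ = timestamp;
    }

    std::multimap<Time, Record> log_;  // processed events, ordered by time
    long state_ = 0;
    Time clock_ = 0;
};
```

Receiving events at t = 1 and t = 3 and then a straggler at t = 2 ends with the state 123, the same result a strictly ordered execution would produce.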
2.1.2 Instrumented simulation tools
A binary translator converts binary software from one system to another, and can be
implemented as Static Binary Translation or Dynamic Binary Translation (DBT). Instead of
interpreting each instruction, the translator reads, decodes, and converts a block of
instructions. The translated binary then executes on the host as if it had been originally
compiled for it.
Pin [20] is a dynamic instrumentation framework for profiling, performance evaluation, and
bug detection. It provides efficient instrumentation by using a just-in-time (JIT) compiler
to insert and optimize code, which can use the processor's internal performance counters
(Figure 2.4). The instrumentation is transparent and preserves addresses and memory values.
Like a debugger, Pin can attach to a process, collect the desired information, and
eventually detach. The tool thus allows sampling an executable only during slices of time.
Figure 2.4: Pin’s software architecture [20]
An architecture simulator can use DBT to instrument binaries and simulate them on current
hardware with the desired behavior. Many simulators [21, 15, 22, 23] use the Pin framework
to instrument software and simulate its behavior on a modified system. With dynamic
translation, these simulators achieve higher speed than fully interpreted simulators.
The current version of Pin only supports the IA-32 and x86-64 instruction set architectures.
This proprietary tool requires the binary software and the host hardware to use the same
architecture, one of the two supported. These requirements severely limit the scope of
systems that can be simulated using Pin.
2.2 SystemC
SLDLs provide a collection of libraries of data types, a simulation kernel, and components
for high-level system modeling and simulation. SystemC [5] is an SLDL, based on C++, used
for modeling and verification of systems at different abstraction levels. Currently,
SystemC is one of the most important languages for ESL design [1] and has become the
leading choice of designers of SoCs and embedded processors [24].
SystemC is a collection of C++ classes and templates that provides a sequential discrete
event simulation framework. All simulated components are described as SystemC modules, and
any communication between them is a simple method call. For example, a PowerPC processor
can connect to a read-write memory as in Listing 2.1.
int sc_main(int argc, char **argv) {
    // Phase 1 : ELABORATION

    // An instance of a PowerPC processor
    ppc PPC("PPC");
    // An instance of a read-write memory
    mem MEM("MEM");

    // Connection between PPC's and MEM's data ports
    PPC.DM_port(MEM.target_export);

    // Phase 2 : SIMULATION
    sc_start();

    // Phase 3 : DIAGNOSIS
    PPC.PrintStats();

    return PPC.exit_status;
}
Listing 2.1: SystemC top-level description of a minimalist system [1]
SystemC simulations are divided into three phases: pre-simulation (elaboration), main loop
(simulation), and post-simulation (diagnosis). In the elaboration phase, all system
components are instantiated, configured, and connected. All components must be instantiated
in this phase, and the simulation fails to start if any port is not bound at the end of
elaboration.
The simulation itself happens in the SystemC main loop, where events are processed in order,
following the DES model. After the elaboration phase, the simulation evaluates events,
updates states, and notifies modules (Figure 2.5a). Initially, the SystemC scheduler makes
all processes runnable and executes only one at a time. Internally, all SystemC processes
and threads are implemented using user-level threads (QuickThreads) or system-level threads
(pthreads). A process executes until it returns or waits for an event to occur. The current
process can generate new events of three types: immediate, delta, and timed (Figure 2.5b).
When an immediate event occurs, all processes sensitive to it are made runnable
immediately. Delta events are processed after all currently runnable processes, but at the
same timestamp. After all delta events have executed, the global clock advances and the
timed event with the lowest timestamp executes.
Figure 2.5: SystemC. (a) Simplified loop (similar to the DES loop, Figure 1.7). (b) Scheduler [25]
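The ordering of the three notification types can be illustrated with a toy scheduler written in plain C++. It is a model for explanation only, not the SystemC kernel: in SystemC the three kinds are requested on an `sc_event` with `notify()`, `notify(SC_ZERO_TIME)`, and `notify(sc_time)`, and they wake the processes sensitive to that event, while the hypothetical `ToyScheduler` below queues callbacks directly and has no update phase.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <utility>

// Toy model of the evaluate / delta / timed ordering (hypothetical class).
class ToyScheduler {
public:
    using Time = std::uint64_t;
    using Process = std::function<void(ToyScheduler&)>;

    // Immediate: runnable in the current evaluation phase.
    void notify_immediate(Process p) { runnable_.push_back(std::move(p)); }
    // Delta: runnable in the next delta cycle, at the same simulated time.
    void notify_delta(Process p) { next_delta_.push_back(std::move(p)); }
    // Timed: runnable once the clock reaches now + delay.
    void notify_timed(Time delay, Process p) {
        timed_.emplace(now_ + delay, std::move(p));
    }

    void run() {
        for (;;) {
            // Evaluation phase: run every runnable process, one at a time.
            while (!runnable_.empty()) {
                Process p = std::move(runnable_.front());
                runnable_.pop_front();
                p(*this);
            }
            // Delta cycle: same timestamp, new set of runnable processes.
            if (!next_delta_.empty()) {
                runnable_.swap(next_delta_);
                continue;
            }
            // Nothing left at this time: advance to the earliest timed event.
            if (timed_.empty()) break;
            now_ = timed_.begin()->first;
            auto range = timed_.equal_range(now_);
            for (auto it = range.first; it != range.second; ++it)
                runnable_.push_back(std::move(it->second));
            timed_.erase(range.first, range.second);
        }
    }

    Time now() const { return now_; }

private:
    std::deque<Process> runnable_;
    std::deque<Process> next_delta_;
    std::multimap<Time, Process> timed_;
    Time now_ = 0;
};
```

A process that issues one notification of each kind sees the immediate one run first, the delta one next at the same timestamp, and the timed one only after the clock advances.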
SystemC is an interesting ESL design tool considering the goals described in Section 2.1.
It supports multiple abstraction levels, from the common RTL to a higher level called TLM.
Each component in SystemC is a C++ class, and its instances are runtime objects. SystemC
modules are reusable through simple component binding. The language adopted to describe
functionality, C++, is broadly used and should not be an obstacle for designers.
The executable model generated from a SystemC specification is a binary that executes the
desired tests, producing the results of the model as soon as possible in the design process.
The SystemC system model can keep the abstraction above RTL during the early stages of the
project and reach RTL in the synthesis stage after successive refinement stages.
The initial high abstraction level gives an early and fast functional model that meets the
project requirements. Moreover, it enables exploring a variety of refined models to
evaluate area, performance, and power results; all of them must produce the same response
as the executable model.
A limiting factor in accelerating the simulation of systems modeled in SystemC is that the
SystemC kernel is sequential and based on DES, as discussed in Section 2.1. When simulating
multi-core systems, the simulation spends around 50% of the time in the kernel and leaves
around 30% for the behaviour of all core modules [26]. Peripheral modules (such as routers
and memories) and system calls use the remaining execution time. Consequently, the
simulator executes one module (event) at a time, even if the host hardware supports the
execution of concurrent processes [24]. For example, a virtual platform containing 16
processors running on an actual multi-core machine uses only one core to execute the
simulation, scheduling the processors and other peripherals to execute one at a time.
Modeling
A SystemC module is represented by a C++ class, its connection ports, and its behavior. A
module must extend the superclass sc_core::sc_module, which defines the superclass
constructor, port binding support, signal binding support, the module name, and hooks for
the start and end of the elaboration and simulation phases. The library provides
SC_MODULE(), a macro to declare a new class that extends sc_module. The macro SC_CTOR()
defines the current class type and declares the module's constructor. For example, a
logical AND port is modeled in RTL by setting the signals and the behaviour in C++ as in
Listing 2.2.
#include "systemc.h"

SC_MODULE(and_port) // declare the module
{
    sc_in<bool> A, B; // input signal ports
    sc_out<bool> F;   // output signal port

    void do_and() // a C++ function
    {
        F.write(A.read() && B.read());
    }

    SC_CTOR(and_port) // constructor for and_port
    {
        SC_METHOD(do_and);   // register do_and with kernel
        sensitive << A << B; // sensitivity list
    }
};
Listing 2.2: Example of an RTL AND port in SystemC
Transaction Level Modeling (TLM)
The TLM methodology represents an intermediate level of abstraction between paper
specification and RTL models. This abstraction is a key level in the early stages of ESL
design and explains the role of SystemC and TLM in ESL technologies.
By using higher levels of abstraction, TLM improves simulation speed and modeling
productivity. The methodology avoids unnecessary details in the early phases of the design
flow. The goal is to provide an early reference model to teams working on software,
hardware, architecture analysis, and verification. So, the target use of TLM is the
executable model in the design flow.
TLM abstracts the pin-level communication of the physical model to a higher-level protocol
of word/frame transactions. Communication consists of packets (called payloads) exchanged
among modules; these exchanges are called transactions and use the Interface Method Call
(IMC) approach. The main focus of TLM is communication in memory-mapped bus models.
Moreover, it accepts both loosely-timed and approximately-timed simulation using time
management and annotation of events. The payload (Figure 2.6) contains the request,
extensions to support more bus models, and a pointer to the data in host memory.
TLM-2.0 extends the previous version with asynchronous communication. A component does not
need to block waiting for an answer and can continue the simulation. For example, a cache
memory can pre-fetch data while it continues accepting data requests from the processor.
Figure 2.6: Diagram of TLM payload
SystemC modules that initiate TLM transactions are called initiators, and the modules that
receive them are called targets. Each payload exchange needs a specific socket pair: from
an initiator socket to a target socket when requesting, and the opposite when answering.
TLM-2.0 sockets are bi-directional, but an initiator socket expects to receive only answers
with completed payloads. Every socket must be connected to a socket of the opposite type
before the end of the elaboration phase.
The simulation of a simple system (Figure 2.7) composed of a processor module (Listing 2.3)
and a memory (Listing 2.4) is done by connecting them directly through TLM-2.0
(Listing 2.5). Each read/write must be encapsulated in a payload that contains all request
data and is passed from the core's initiator socket to the memory's target socket.
Asynchronously, the request is processed by the memory and answered from its target socket
to the requester. In case of a read command, the data must be written to the data pointer
inside the request payload.
SC_MODULE(processor) {
    tlm_utils::simple_initiator_socket<processor> socket;
    // Queue of the responses delivered on the backward path
    tlm_utils::peq_with_get<tlm_generic_payload> peq;

    // Backward path: called by the target to answer a pending request
    tlm_sync_enum nb_transport_bw(tlm_generic_payload & payload,
                                  tlm_phase & phase, sc_time & delay) {
        peq.notify(payload, delay);
        return TLM_COMPLETED; // the transaction ends here, phase is untouched
    }

    void proc() {
        tlm_generic_payload *payload;
        unsigned char data[4];
        while (true) { /* Processor Simulation */
            /* ... */
            if (is_memory_request) {
                payload = new tlm_generic_payload();
                payload->set_address(addr);
                payload->set_data_length(4);
                payload->set_data_ptr(data);
                if (is_write)
                    payload->set_write();
                else
                    payload->set_read();
                // phase and delay are passed by non-const reference
                tlm_phase phase = BEGIN_REQ;
                sc_time delay(DELAY, SC_NS);
                socket->nb_transport_fw(*payload, phase, delay);
                wait(peq.get_event());
                payload = peq.get_next_transaction();
                /* ... */
            }
            /* ... */
        }
    }

    SC_HAS_PROCESS(processor); // required because SC_CTOR is not used
    processor(sc_core::sc_module_name module_name)
        : sc_module(module_name), socket("socket"), peq("peq") {
        SC_THREAD(proc);
        socket.register_nb_transport_bw(this, &processor::nb_transport_bw);
    }
};
Listing 2.3: Generic layout of a TLM2 processor
SC_MODULE(memory) {
    tlm_utils::simple_target_socket<memory> socket;
    tlm_utils::peq_with_get<tlm_generic_payload> peq;
    // Backing storage; MEM_SIZE, like DELAY, is assumed to be defined elsewhere
    unsigned char data[MEM_SIZE];

    // Forward path: called by the initiator to start a request
    tlm_sync_enum nb_transport_fw(tlm_generic_payload & payload,
                                  tlm_phase & phase, sc_time & delay) {
        peq.notify(payload, delay);
        phase = END_REQ;
        return TLM_UPDATED;
    }

    void proc() {
        tlm_generic_payload *payload;
        while (true) {
            wait(peq.get_event());
            payload = peq.get_next_transaction();
            tlm_command cmd = payload->get_command();
            unsigned int addr = payload->get_address();
            unsigned char *ptr = payload->get_data_ptr();
            unsigned int len = payload->get_data_length();

            switch (cmd) {
            case TLM_READ_COMMAND:
                memcpy(ptr, &(this->data[addr]), len);
                break;
            case TLM_WRITE_COMMAND:
                memcpy(&(this->data[addr]), ptr, len);
                break;
            default: // TLM_IGNORE_COMMAND: nothing to copy
                break;
            }
            payload->set_response_status(TLM_OK_RESPONSE);
            // phase and delay are passed by non-const reference
            tlm_phase phase = BEGIN_RESP;
            sc_time delay(DELAY, SC_NS);
            socket->nb_transport_bw(*payload, phase, delay);
        }
    }

    SC_HAS_PROCESS(memory); // required because SC_CTOR is not used
    memory(sc_core::sc_module_name module_name)
        : sc_module(module_name), socket("socket"), peq("peq") {
        SC_THREAD(proc);
        socket.register_nb_transport_fw(this, &memory::nb_transport_fw);
    }
};
Listing 2.4: Generic layout of a TLM2 memory
int sc_main(int argc, char **argv) {
    processor *PROC = new processor("processor");
    memory *MEM = new memory("memory");

    // Bind the initiator socket to the target socket
    PROC->socket.bind(MEM->socket);

    sc_start();

    return (0);
}
Listing 2.5: Generic layout of a processor and a memory bound with TLM2 sockets
Figure 2.7: Diagram of TLM-2.0 communication between core and memory
This work focuses on the early stage of the executable model in the ESL design flow, when
more functional changes are expected and there is intense interaction between
specification, modeling, and tests. TLM-2.0 fits this stage and provides the asynchronous
communication between modules that is desirable for PDES.
2.3 ArchC
In the ESL design flow, the development and exploration of new processor models require
producing, in the early stages, the software toolkit (assembler, linker, compiler,
debugger) for each distinct processor [27]. The software toolkit is commonly generated from
an Architecture Description Language (ADL). Although the generation of an Instruction-Set
Simulator (ISS) is a common target of ADLs, the cooperation of ISSs to build a SoC is not
common.
ArchC [28, 29] is an ADL that follows the SystemC syntax style to describe processor models
for system and platform representations [26, 30, 31]. The goal is to provide a high-level
abstraction to specify the processor model, allowing exploration and verification of new
and legacy processor architectures by generating both software tools and an executable
model (ISS) for the desired platform (Figure 2.8) [32, 33, 34]. Its simplicity and
flexibility have shown that this ADL is a useful tool not only for research, but also for
computer architecture education [35].
Figure 2.8: ADL-based (ArchC) exploration flow [1]
From the ADL description, ArchC synthesizes the compiler's backend [36, 37, 38], a
SystemC-based ISS [39], and binary utilities [40]. The set of tools generated from the same
ADL gives the user the opportunity to iterate on specification changes and validation tests
with real compiled software until the requirements are met.
An ArchC description is split into two separate descriptions: the Instruction Set
Architecture (AC_ISA) and the architecture resources (AC_ARCH) [41, 42]. The AC_ISA
description (Listing 2.6) has the information about the architecture's instructions, with
their binary representation (lengths and fields) and the behavior of each instruction in
C++. The AC_ARCH description (Listing 2.7) contains the processor organization: storage
devices, pipeline structure, endianness, etc. From AC_ARCH and AC_ISA, ArchC generates the
set of tools described earlier.
AC_ISA(mips) {
    ac_format Type_R = "%op:6 %rs:5 %rt:5 %rd:5 %shamt:5 %func:6";
    ac_format Type_I = "%op:6 %rs:5 %rt:5 %imm:16:s";
    ac_format Type_J = "%op:6 %addr:26";
    ac_instr<Type_I> lb, lbu, lh, lhu, lw, lwl, lwr;
    ac_instr<Type_R> add, addu, sub, subu, slt, sltu;
    ac_instr<Type_J> j, jal;
    /* ... */
    ac_asm_map reg {
        "$"[0..31] = [0..31];
        "$zero" = 0;
        "$at" = 1;
        "$kt"[0..1] = [26..27];
        "$gp" = 28;
        "$sp" = 29;
        "$fp" = 30;
        "$ra" = 31;
    }

    ISA_CTOR(mips) {
        lb.set_asm("lb %reg, \%lo(%exp)(%reg)", rt, imm, rs);
        lb.set_asm("lb %reg, (%reg)", rt, rs, imm = 0);
        lb.set_asm("lb %reg, %imm (%reg)", rt, imm, rs);
        lb.set_decoder(op = 0x20);

        addi.set_asm("addi %reg, %reg, %exp", rt, rs, imm);
        addi.set_asm("add %reg, %reg, %exp", rt, rs, imm);
        addi.set_asm("add %reg, $0, %exp", rt, imm, rs = 0);
        addi.set_decoder(op = 0x08);
        addi.set_cycles(4);
        /* ... */
        sys_call.set_asm("syscall");
        sys_call.set_decoder(op = 0x00, func = 0x0C);
        /* ... */
        pseudo_instr("li %reg, %imm") {
            "lui %0, \%hi(%1)";
            "ori %0, %0, %1";
        }
        /* ... */
    };
};
Listing 2.6: Excerpt of the MIPS AC_ISA Description
AC_ARCH(mips) {

    ac_mem DM : 512M;
    ac_regbank RB : 32;
    ac_reg npc;
    ac_reg hi, lo;
    ac_reg id;
    ac_wordsize 32;

    ARCH_CTOR(mips) {

        ac_isa("mips_isa.ac");
        set_endian("big");
    };
};
Listing 2.7: Excerpt of the MIPS AC_ARCH Description
The AC_ARCH can describe a full system of a processor and memory, or can produce a SystemC
module with TLM bindings, synchronous and asynchronous. Used as a SystemC module, the ArchC
processor model allows integration within bigger, custom SoC designs.
The project maintains four official processor models, each with a different ISA, and
provides the necessary toolchain for each one: ARM, Power, MIPS, and SPARC. The official
models offer example processors to run code and to test processor modifications, and serve
as starting code to explore new processors.
In the AC_ARCH description, users can specify the cache configuration [43]. The last memory
hierarchy level is either a read-write memory component or TLM sockets that connect to a
system that must answer the read and write requests using the TLM standard.
In the generated ISS, ArchC models emulate some Linux syscalls [44]. The guest software can
interact with the real filesystem, request dynamically allocated memory, and more. The
SystemC executable module runs a batch of instructions before calling the SystemC
scheduler. This batch is a flexible way to execute lookahead instructions with optimistic
behavior and to avoid many execution swaps between modules in the SystemC kernel. When the
processor needs to interact with other modules, e.g., memory, it calls the target socket's
forward method and blocks waiting for the answer through its own socket's backward method.
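The effect of the instruction batch on the number of scheduler swaps can be sketched as below. This is only a counting model with hypothetical names (`BatchedCore`, `execute_instruction`, `yield_to_scheduler`); it does not reproduce the code generated by ArchC, where the yield would be a SystemC wait and a batch would also end early when the processor must access another module.

```cpp
#include <cstdint>

// Hypothetical core model that only counts instructions and scheduler calls.
struct BatchedCore {
    std::uint64_t batch_size;           // instructions executed per activation
    std::uint64_t executed = 0;         // instructions executed so far
    std::uint64_t scheduler_calls = 0;  // times control returned to the kernel

    explicit BatchedCore(std::uint64_t size)
        : batch_size(size == 0 ? 1 : size) {}

    void execute_instruction() { ++executed; }         // decode + behaviour
    void yield_to_scheduler() { ++scheduler_calls; }   // a wait in SystemC

    // Runs `total` instructions, returning to the scheduler once per batch.
    void run(std::uint64_t total) {
        while (executed < total) {
            for (std::uint64_t i = 0; i < batch_size && executed < total; ++i)
                execute_instruction();
            yield_to_scheduler();
        }
    }
};
```

Running 1000 instructions with a batch of 1 returns to the scheduler 1000 times, while a batch of 50 does so only 20 times, which is the swap reduction the paragraph above refers to.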
The ArchC ADL can provide power estimation for processor models [45, 46, 47, 48]. From the
ArchC model, a tool automatically extracts and generates a testbench that runs on an RTL
model of the same processor in third-party commercial synthesis tools. The result is the
power characterization of the processor, which the ArchC executable model uses to report
the power usage of the target processor.
2.4 Benchmarks
MPSoCBench [49] is a simulation toolset composed of a scalable set of Multiprocessor
System-on-Chips (MPSoCs) to enable the development and evaluation of new tools,
methodologies, parallel software, and hardware components. It provides a high-level way to
specify 2D NoCs with up to 64 processors and other components.
Based on ArchC [29, 50], processors are generated from the ADL descriptions and emulated at
system level. MPSoCBench includes the four officially supported ArchC models, using their
TLM versions. The processor models annotate the timing of each instruction to improve the
precision of the simulation time. All processors share the same memory address space and
one SystemC kernel that schedules all modules, one at a time.
The toolset offers two memory modules: the first is a simple read-write SystemC TLM module,
and the other is a cycle-accurate model that is able to explore memory performance and
energy consumption based on a DDR memory characterization library. Any memory access from a
processor must be routed through the NoC from the requester to the memory component.
Syscalls are called with shared guest memory, and they have to be consistent. For example,
a file descriptor returned by an open syscall on processor 1 must be valid for a read
syscall on processor 2.
MPSoCBench connects the last-level cache of the ADL model to one shared directory
component, a device that implements the MSI cache coherence algorithm. Without cache
coherence, the parallel software must use other synchronization techniques, which require a
dedicated hardware component used explicitly by the guest software.
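For reference, the state transitions of a cache line under MSI can be sketched as a small function. This is the textbook protocol seen from a single cache, not the MPSoCBench directory implementation; the `State`, `Access`, `msi_next`, and `needs_writeback` names are ours.

```cpp
// States of one cache line under the MSI protocol.
enum class State { Modified, Shared, Invalid };

// Accesses observed by the cache holding the line: its own processor's
// reads and writes, and reads and writes issued by other caches.
enum class Access { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

// Next state of the line after the given access.
State msi_next(State s, Access a) {
    switch (a) {
    case Access::LocalRead:    // a miss fetches the line as Shared
        return s == State::Invalid ? State::Shared : s;
    case Access::LocalWrite:   // writing requires exclusive ownership
        return State::Modified;
    case Access::RemoteRead:   // the owner downgrades so others can read
        return s == State::Modified ? State::Shared : s;
    case Access::RemoteWrite:  // another cache takes ownership
        return State::Invalid;
    }
    return s;
}

// A Modified line holds the only valid copy, so it must be written back
// before another cache reads or overwrites it.
bool needs_writeback(State s, Access a) {
    return s == State::Modified &&
           (a == Access::RemoteRead || a == Access::RemoteWrite);
}
```

A directory component applies these transitions on behalf of all caches, tracking which ones hold each line so that only the affected copies are invalidated or downgraded.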
Figure 2.9: A MPSoCBench platform designed as a mesh consisting of processors, memory
and IPs
Figure 2.9 shows an example of a platform built with MPSoCBench. There are eight cores with
instruction caches, connected to the NoC through a TLM wrapper module. One memory module
simulates the memory controller and the memory. A lock device provides hardware support for
mutual exclusion when the system is not using cache coherence. The dynamic frequency
scaling (DFS) module manages the power consumption and performance of the cores. Also, an
interrupt controller (Intr_Ctrl) manages and notifies external events.
2.5 IPC
Figure 2.10: A taxonomy of Unix IPC facilities [51]
Processes executing in the operating system (OS) can be classified as independent or
cooperating processes [52]. An independent process cannot affect or be affected by other
processes, so it does not share data. Any process that shares data with others is a
cooperating process; the motivation for cooperation can be information sharing, parallel
computation, modularity, or convenience.
Modern OSs, in addition to process scheduling and resource allocation [53], must
provide Inter-Process Communication (IPC) mechanisms for cooperating processes
(Figure 2.10). There are two fundamental models of communication: shared memory and data
transfer. In the shared memory model, processes exchange information by reading and
writing in a shared memory region. In the data transfer model, the communication mechanism
receives the information from one process and passes it to another.
To select an IPC mechanism, the developer should analyze the transfer buffer sizes, data
transfer mechanisms, memory allocation schemes, locking mechanism implementations, and
even code complexity. Throughput can vary significantly between mechanisms on the
same platform.
2.5.1 Shared Memory
Figure 2.11: Two processes with a shared mapping of the same memory region [51]
In multithreaded software, all threads have direct access to all pointers because they
share the same memory address space. Memory mapping enables the creation of shared
regions between different processes so that they can use the techniques of multithreaded
programming (Figure 2.11). Multiple processes can explicitly share the same memory
region, which requires synchronized access to it. Cache coherence must
act if two or more processes write to the same region of shared memory.
Shared memory is the fastest form of IPC [52] because the data is visible and accessible
to all processes without requiring copies, transmissions, or any syscall. This speed
advantage is offset by the need to synchronize operations; the typical solution on
GNU/Linux uses a mutex¹ to enforce mutual exclusion when accessing shared data.
Shared memory requires access control for concurrent write operations to avoid
inconsistent values. Consistency can be managed in three ways: using structures such as
semaphores or mutexes to restrict access to a critical section [51]; using atomic operations
to perform more than one memory operation indivisibly and ensure consistent reads; or
accessing the memory speculatively and checking for consistency problems afterwards.
¹ The glibc implementation allows process-shared mutexes; the portable Unix solution is
semaphores [51].
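As an illustration of the atomic-operation option, the following minimal sketch (not taken from [51] or [52]) lets several forked processes increment one counter placed in a shared anonymous mapping. It assumes Linux, a C++11 compiler, and a platform where std::atomic<long> is lock-free; the function name shared_counter is hypothetical.

```cpp
#include <atomic>
#include <new>
#include <vector>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks `children` processes that each add 1 to a shared counter
// `increments` times. Returns the final value, or -1 on error.
long shared_counter(int children, int increments) {
    // Anonymous MAP_SHARED memory is inherited across fork(), so parent and
    // children all see the same bytes.
    void* mem = mmap(nullptr, sizeof(std::atomic<long>), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    std::atomic<long>* counter = new (mem) std::atomic<long>(0);

    std::vector<pid_t> pids;
    bool ok = true;
    for (int c = 0; c < children && ok; ++c) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child: each fetch_add is one indivisible read-modify-write.
            for (int i = 0; i < increments; ++i) counter->fetch_add(1);
            _exit(0);
        }
        if (pid < 0) ok = false;
        else pids.push_back(pid);
    }
    // Parent: wait for every child before reading the result.
    for (pid_t pid : pids) waitpid(pid, nullptr, 0);

    long result = ok ? counter->load() : -1;
    munmap(mem, sizeof(std::atomic<long>));
    return result;
}
```

With a plain long instead of the atomic, concurrent increments could be lost; a process-shared mutex or semaphore around the increment would be the lock-based alternative mentioned in the text.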
Using the address space shared between threads is easier than using mapped memory.
In a multithreaded process, all threads share the same memory address space (Figure 2.12b).
Any newly allocated data is instantly accessible to the other threads [51]. With mapped
memory, the programmer must modify the allocator to use the specific memory region or copy
the data to be shared. The memory region is, by default, placed at a virtual address that is
not fixed in advance, so any pointer stored in this region must be stored as an offset from
the base address of the mapped region.
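The offset rule can be sketched as follows. This is an illustrative example only: the names ListNode, build_list, sum_list, and sum_after_relocation are hypothetical, and copying the bytes to a second buffer stands in for a second process that maps the same region at a different base address.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// A list node that refers to its successor by byte offset from the region
// base instead of by raw pointer. Offset 0 marks the end of the list.
struct ListNode {
    int value;
    std::size_t next_off;
};

// Builds a list with values 1..n inside the region starting at `base`.
// Slot 0 is left unused so that offset 0 can act as "null".
// Returns the offset of the head node (0 for an empty list).
std::size_t build_list(unsigned char* base, int n) {
    for (int i = 1; i <= n; ++i) {
        ListNode node;
        node.value = i;
        node.next_off = (i < n) ? (i + 1) * sizeof(ListNode) : 0;
        std::memcpy(base + i * sizeof(ListNode), &node, sizeof(ListNode));
    }
    return n > 0 ? sizeof(ListNode) : 0;
}

// Walks the list relative to whatever base address the caller has.
int sum_list(const unsigned char* base, std::size_t head_off) {
    int sum = 0;
    for (std::size_t off = head_off; off != 0;) {
        ListNode node;
        std::memcpy(&node, base + off, sizeof(ListNode));
        sum += node.value;
        off = node.next_off;
    }
    return sum;
}

// Builds the list at one address and reads it back at another one.
int sum_after_relocation(int n) {
    std::vector<unsigned char> a((n + 1) * sizeof(ListNode));
    std::size_t head = build_list(a.data(), n);
    std::vector<unsigned char> b(a);  // same bytes, different base address
    return sum_list(b.data(), head);
}
```

Had the nodes stored raw pointers into the first buffer, the traversal through the second buffer would have followed addresses that are meaningless there.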
(a) Typical process (b) Four threads executing
Figure 2.12: Memory address space for Linux/x86-32 [51]
Developers can even simulate many other IPC types with shared memory. However,
it does not scale to distributed environments. Libraries can implement shared memory
on top of other types of communication [54], but this adds one more layer, which includes
the overhead of managing the coherence of all shared regions.
2.5.2 Sockets
Sockets are a data transfer IPC mechanism [51] that works either on the same host or between
different hosts connected by a network. In a typical client-server scenario, each process
creates its own socket. The server binds its socket to an address, and clients use this
address to locate it. Unix OSs use the socket() syscall to create a socket and return its
file descriptor. They support at least three communication domains, Unix (AF_UNIX),
IPv4 (AF_INET), and IPv6 (AF_INET6), and two socket types, stream and datagram.
Stream sockets (SOCK_STREAM, Figure 2.13a) are connection-oriented, reliable, and
bidirectional. All transmitted data is sent as a byte stream without message boundaries.
Under normal conditions¹, in-order delivery of the transmitted data is guaranteed. Stream
sockets are often distinguished by their behavior, active or passive: a passive socket
binds to an address and waits for connections, while an active socket requests
the connection.
Datagram sockets (SOCK_DGRAM, Figure 2.13b) are connectionless sockets that
send individual messages with preserved boundaries, but without any guarantee of
delivery. A socket only needs to know the address of the target socket. When a message
is received, the process gets the sender address along with the message content, which
allows it to send a response.
(a) Stream sockets (b) Datagram Sockets
Figure 2.13: Overview of system calls used with sockets [51]
¹ No connection is lost.
Internet Sockets
There are two types of Internet domain sockets: IPv4 and IPv6 [51]. They are
similar, and both allow message exchange between processes on different hosts. The
major difference is the addressing: IPv4 uses 32-bit addresses, and IPv6 uses 128-bit
addresses. Both support loopback communication; each one has a special address that
is interpreted as localhost.
The two protocols used with Internet sockets are the Transmission Control Protocol
(TCP) and the User Datagram Protocol (UDP) [51]. UDP is the protocol for datagram
sockets and, as presented above, does not offer any reliability at the
protocol layer. TCP is the protocol for stream sockets and includes packaging data
into segments, sequencing, and congestion control. TCP stores data in buffers to group
several pieces of data into one Internet Protocol (IP) datagram, reducing the number of
transmissions (Figure 2.14). Received data is stored in the respective buffers and read by
the process in transmission order and without data loss.
Figure 2.14: Connected TCP sockets [51]
TCP/IP sockets traverse a stack of abstraction layers, such as the
transport and network layers, which adds overhead to communication even on the
same host. Loopback communication over TCP/IP typically performs slower than
local-only IPCs. However, on some hosts, such as Sun’s UltraSPARC systems (specialized for
network-intensive workloads), system optimizations allow faster loopback communication [53].
Unix Domain Sockets
Unix domain sockets (UDS) provide a communication mechanism specific to the local host,
with a programming style similar to that of the Internet sockets described above, but for
data transfer between processes executing on the same host operating system. The address
of a UDS is an abstract name or a pathname in the file system. Abstract names
are managed by the kernel and do not require knowledge of the file system organization [55].
If the address is a pathname, the system creates a file system entry of
socket type (S_IFSOCK) with the defined user access permissions [51]. UDS delivers
higher throughput than other data transfer IPCs [53].
UDS supports both the stream and datagram types. All UDS sockets are
reliable, but the order of messages is not guaranteed in the default datagram type
(SOCK_DGRAM). UDS also supports an ordered datagram type (SOCK_SEQPACKET) [55].
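A small sketch of the preserved message boundaries of datagram sockets, using a connected pair of Unix domain sockets created with socketpair(). The function name datagram_sizes is hypothetical. The sketch assumes Linux, where unix(7) documents Unix domain datagrams as reliable and not reordered; portable code should not rely on that order, as noted above.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

// Sends two datagrams through a Unix domain socket pair and returns the
// size reported by each recv() call (empty vector on error).
std::vector<long> datagram_sizes() {
    std::vector<long> sizes;
    int fd[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fd) != 0) return sizes;

    // Two separate messages of 2 and 3 bytes, both sent before any receive.
    bool sent = send(fd[0], "ab", 2, 0) == 2 && send(fd[0], "cde", 3, 0) == 3;

    char buf[16];
    for (int i = 0; sent && i < 2; ++i) {
        // The buffer could hold both messages, yet each recv() returns
        // exactly one datagram.
        ssize_t n = recv(fd[1], buf, sizeof buf, 0);
        if (n < 0) {
            sizes.clear();
            break;
        }
        sizes.push_back(static_cast<long>(n));
    }
    close(fd[0]);
    close(fd[1]);
    return sizes;
}
```

With SOCK_STREAM in place of SOCK_DGRAM, the receiver sees a byte stream, and a single recv() may return bytes from both sends.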
2.5.3 Message Passing Interface (MPI)
Message Passing Interface (MPI) is the industry standard for message-passing programs [2].
The specification was written targeting efficiency, flexibility, and portability. The standard
consolidated research advances into novel features that extend existing practice to a new
generation of applications [56]. However, MPI’s process/group model, which assumes
similarity between all processes, is not always appropriate for heterogeneous systems [57].
MPI is not a new language; the standard defines a common interface for writing
message-passing applications. There are many implementations, many of them open source.
Each one supports a set of backend channels and network connections. The two
major open-source implementations, Open MPI and MPICH, support local and network
connections.
Code using MPI must include the MPI header and link against an implementation. The
implementation is responsible for launching the multiple instances of the same binary. So, the
selection of a library implementation is related to the target platform where the solution
will execute. Each instance has a rank number (MPI_Comm_rank) from 0 to the total
number of processes (MPI_Comm_size) minus one, and instances use this rank to specialize
their behavior (Listing 2.8).
    #include <mpi.h>
    #include <stdio.h>  /* printf, snprintf */
    #include <string.h> /* strlen */

    int main(void) {
      char m[100];
      int comm_sz; /* Number of processes */
      int my_rank; /* My process rank */

      MPI_Init(NULL, NULL);
      MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      if (my_rank != 0) {
        /* Every rank other than 0 sends its greeting to rank 0. */
        snprintf(m, sizeof m, "Process %d of %d!", my_rank, comm_sz);
        MPI_Send(m, (int)strlen(m) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      } else {
        /* Rank 0 prints its own greeting, then one per sender in rank order. */
        printf("Process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
          MPI_Recv(m, 100, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("%s\n", m);
        }
      }
      MPI_Finalize();
      return 0;
    } /* main */
Listing 2.8: MPI program that prints greetings from the processes [2]
2.6 Parallel SystemC Simulations
Transaction Level Modeling (TLM) of systems-on-chip in SystemC is commonly used
in the industry to provide an early simulation environment [18]. However, the standard
implementation requires determinism and reproducibility of simulations, which keeps the
sequential implementation as the reference. Parallelization of SystemC simulations is a major
research topic [18] because of the increasing size and complexity of models and the
multiplication of computation cores in recent machines and clusters: “The parallelization of
SystemC simulation is not straightforward, and is a major research concern.” Becker et al.
[18] - 2016
Becker et al. [18] present an industrial context based on Loosely Timed TLM (TLM/LT)
models. However, most SystemC parallelizations are limited to cycle-accurate models
or do not cover the characteristics of actual industrial models. PDES solutions explicitly
target RTL platforms because they rely on low-level signal communication, which is not used
in TLM models. Parallelization inside cycles, using the delta cycle or the timed cycle,
requires a barrier at the end of the cycle to synchronize and targets only simplistic models.
Solutions that use dependency analysis, besides being limited to parallelism inside cycles,
cannot perform the static analysis on complex, industrial-size models. Relaxing
synchronization can reach good performance on MPSoCs and NoCs, but cannot use a large number
of cores to reach the best speedup. One interesting proposal is an alternative transaction
specified by a time duration, which gives the flexibility to compute the task at any time
between the request and the result time. However, even with substantial performance gains,
the effort required to adopt a solution may not pay off.
Entirely transparent solutions do not require any modification of the simulator’s code;
the solution is responsible for automatically applying its optimizations. In this case, the
designer only needs to adopt a new implementation of SystemC or an extension library.
However, many of the proposed solutions require code modifications. Some of them only
require changes in the elaboration phase, when connecting the modules, so the solution
is transparent to the modules, which are untouched and keep their original behavior. Other
solutions require a partial rewrite of the modules’ code to give more information to the
framework, use a new API, or meet an imposed limitation.
This section reviews existing approaches to parallel SystemC simulations. They
are organized by technique: parallelization inside cycles (Section 2.6.1),
dependency analysis (Section 2.6.2), and distributed time and relaxed synchronization
(Section 2.6.3). Section 2.6.4 reviews the parallel simulation approaches that are not based on
SystemC.
2.6.1 Parallelization Inside Cycles
The first attempts at parallelizing SystemC simulations come from the conservative
approach of PDES. The solutions in this section range from conservative to optimistic
approaches to parallelize the evaluation within the same cycle, either a delta cycle or a
timed cycle.
Trams [3] proposed a C++ class library to extend the SystemC reference implementation.
It requires manually splitting the SystemC processes by RTL module and
explicitly connecting the signals between modules using its TCP/IP API in the elaboration
phase (Listing 2.9).
This model frees the programmer from knowing the synchronization details and is
transparent to the SystemC modules. However, it is not an entirely transparent solution:
the designer must explicitly adapt the system elaboration phase to the proposed library.
After the connections are declared, elaboration ends with a call to a function that waits
for all inbound and outbound connections.
The model follows the concept of conservative PDES, where channels connect the
logical processes (partitions). Each partition waits for all earlier messages to arrive before
continuing the simulation and uses null messages to keep timing synchronized.
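The role of null messages in a conservative scheme can be sketched generically as follows. This is not the library of Trams [3]; the types Message and Channel and the functions safe_time and process_safe_events are hypothetical names. The sketch assumes that each channel delivers messages in non-decreasing timestamp order, so the timestamp of the last message on a channel is a lower bound for everything that can still arrive on it. A full implementation would also merge the channels in global timestamp order.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <limits>
#include <vector>

struct Message {
    std::uint64_t timestamp;
    bool is_null;  // a null message carries only a time promise, no event
};

// One input channel of a logical process (LP).
struct Channel {
    std::deque<Message> queue;
    std::uint64_t clock = 0;  // timestamp of the last message received

    void deliver(const Message& m) {
        queue.push_back(m);
        clock = m.timestamp;
    }
};

// No message older than the smallest channel clock can still arrive.
std::uint64_t safe_time(const std::vector<Channel>& inputs) {
    std::uint64_t t = std::numeric_limits<std::uint64_t>::max();
    for (const Channel& c : inputs) t = std::min(t, c.clock);
    return t;
}

// Consumes every queued message up to the safe time and returns how many
// real (non-null) events were processed.
int process_safe_events(std::vector<Channel>& inputs) {
    const std::uint64_t limit = safe_time(inputs);
    int processed = 0;
    for (Channel& c : inputs) {
        while (!c.queue.empty() && c.queue.front().timestamp <= limit) {
            if (!c.queue.front().is_null) ++processed;
            c.queue.pop_front();
        }
    }
    return processed;
}
```

An LP with a silent input channel cannot advance at all; a null message on that channel raises the channel clock and unblocks the events waiting on the other channels.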
    int sc_main(int argc, char **argv) {
      int st;
      sc_signal<double> reg_out;
      sc_signal<double> reg_in;
      sc_signal<double> const_out;

      // Instantiate an outbound TCP/IP sync module (designator=1)
      // to connect localhost:10011.
      sc_dfsync outbound_module(&st, TCP, SYNCOUT, 1, "localhost", 10011);
      // Attach our two outbound signals and specify periods (in us).
      outbound_module.attach(1, 100, &reg_out);
      outbound_module.attach(2, 1000, &const_out);

      sc_dfsync inbound_module(&st, TCP, SYNC_IN, 1);  // Inbound module.
      inbound_module.attach(1, &reg_in);  // Attach only one signal.

      sc_dfsync_conall(TCP, 10010);  // Outbound connect to port 10010.
      /* ... */
      sc_start();  // Start the simulation
      return 0;
    }
Listing 2.9: Connection of two RTL routers using Trams [3]
Galiano et al. [19] compare two proposals from the literature that do not modify the SystemC
kernel: an MPI implementation and the work of Trams [3]. Their tests using two Reduced
Instruction Set Computer (RISC) processor platforms show slight differences between
the two implementations; MPI achieved lower simulation times for all CPU loads tested.
Tests of the MPI approach alone, on platforms containing more cores, showed that high guest
CPU loads yield better speedups.
Galiano et al. [19] highlight that the MPI solution does not fit real industry models
that exchange complex C++ data types between distributed processes. Their work adopted
serialization of this complex data using the Boost C++ library and changed the communication
to Boost’s MPI library. The new solution, supporting complex data, had extra
overhead over the raw MPI solution while still performing better than the original SystemC
results.
Cox [58] explores PDES techniques and presents a distributed version of the SystemC
environment, called RITSim. The proposal is committed to transparency and compatibility
with previously written SystemC modules but requires minimal changes in the modules’
source code. The author modifies the SystemC kernel but highlights
that such modifications compromise the maintainability of the new kernel. So, the necessary
modifications avoid changing the core of the SystemC reference implementation in order to
accommodate future improvements to it.
RITSim uses PDES, where each LP contains a top-level module that encapsulates
RTL SC_MODULEs and restricts data sharing with modules outside the top-level domain
(Figure 2.15a). All communication with external top-level modules must use a custom
distributed sc_signal, which the kernel translates into MPI communication with the
target top-level module running in another process. The message contains the channel
name, the timestamp, and the binary representation of the new value without endianness
verification, so the LPs cannot execute on machines with different endianness. When a
top-level module receives the MPI message, it translates it back into a SystemC event in the
target domain.
RITSim employs a purely conservative approach; only safe events are processed. A
global synchronization manages the earliest input time, a global clock that holds the smallest
timestamp for which an LP may expect to receive a message. RITSim offers automatic
partitioning into LPs, but it still requires manual grouping into the top-level modules. The
user must replace the sc_start call with sc_start_distributed, which creates several
configuration files based on the elaboration phase and uses a Perl script to execute them
in a distributed fashion with MPI (Figure 2.15b).
(a) Sample model hierarchy with top-level module (b) Automated distribution process and output file
Figure 2.15: Top-level module and automated partition from Cox [58]
Chopard et al. [59] enable the partitioning of SystemC processes with minimal
changes to the original kernel and, therefore, to the syntax of the modeling language itself.
In this approach, a copy of the scheduler runs on each processing node and simulates a
subset of the application modules. Data consistency is guaranteed through events exchanged
between processes and exploits the margin of delta events specified in SystemC: the
execution order of processes within a delta cycle is not predefined.
To ensure synchronization, a node defined as the master is responsible for receiving from
each node a time annotation of its next expected event and for computing the correct update
of the simulated time. The other nodes update their local time counter as
soon as they receive this information from the master node. A new module was proposed
to gather all the modules that are managed by one LP, similar to the top-level module
proposed by Cox [58].
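The master-driven time update described above can be sketched as a minimum reduction over the next-event times reported by the nodes. This is a single-process illustration under simplifying assumptions, not the implementation of Chopard et al. [59]: the names SimNode, NO_EVENT, and run_master are hypothetical, and the message exchange between nodes is replaced by direct function calls.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <vector>

// Sentinel reported by a node that has no pending event.
const std::uint64_t NO_EVENT = std::numeric_limits<std::uint64_t>::max();

struct SimNode {
    // Pending event times, smallest on top.
    std::priority_queue<std::uint64_t, std::vector<std::uint64_t>,
                        std::greater<std::uint64_t>> pending;
    std::uint64_t local_time = 0;

    // Time annotation sent to the master.
    std::uint64_t next_event() const {
        return pending.empty() ? NO_EVENT : pending.top();
    }

    // Applies the time decided by the master and consumes events due at it.
    void advance_to(std::uint64_t t) {
        local_time = t;
        while (!pending.empty() && pending.top() == t) pending.pop();
    }
};

// Master loop: gather next-event times, pick the minimum, broadcast it.
// Returns the sequence of global simulated times.
std::vector<std::uint64_t> run_master(std::vector<SimNode>& nodes) {
    std::vector<std::uint64_t> timeline;
    for (;;) {
        std::uint64_t t = NO_EVENT;
        for (const SimNode& n : nodes) t = std::min(t, n.next_event());
        if (t == NO_EVENT) break;  // no node has work left
        for (SimNode& n : nodes) n.advance_to(t);
        timeline.push_back(t);
    }
    return timeline;
}
```

Every node ends each round with the same local time, which is the property the master exists to enforce; the cost is one gather and one broadcast per time step.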
Combes et al. [60] continue the work of Chopard et al. [59], but focus on minimizing
the synchronization tasks. Their work uses a top-level module similar to the one described by
Cox [58], which explicitly defines the module partitions. The changes from the previous work
relax synchronization. The new work uses asynchronous sending and receiving of messages
with a buffer of length one: if a message was sent but not yet consumed
by the receiver, the sender cannot send another message until the buffer is freed. The
major contribution is the decentralized time synchronization, using message passing
and shared variables with critical sections. If an LP has no more active processes and
has sent all its requests, it waits for messages from the other LPs.
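The length-one buffer rule can be illustrated with a minimal, single-threaded sketch. The class OneSlotBuffer and its methods are hypothetical names, and the sketch deliberately omits the locking that a buffer shared between real LPs would need.

```cpp
#include <string>
#include <utility>

// A message buffer with room for exactly one message.
class OneSlotBuffer {
public:
    // Fails when the previous message has not been consumed yet.
    bool try_send(const std::string& msg) {
        if (full_) return false;
        slot_ = msg;
        full_ = true;
        return true;
    }

    // Fails when there is nothing to consume; otherwise frees the slot.
    bool try_receive(std::string& out) {
        if (!full_) return false;
        out = std::move(slot_);
        slot_.clear();
        full_ = false;
        return true;
    }

private:
    std::string slot_;
    bool full_ = false;
};
```

A sender whose try_send fails can keep simulating local processes and retry later, which is what makes the exchange asynchronous despite the small buffer.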
Ezudheen et al. [24] developed a new parallel SystemC kernel that schedules multiple
runnable processes at the same time within a delta cycle, following the approach
of Chopard et al. [59]. The new kernel changes the official scheduler (Figure 2.16a) to
execute all runnable processes in chunks mapped to system threads, each assigned to a
dedicated CPU (Figure 2.16b). Their work explored three strategies for thread parallelization:
work sharing, work stealing, and manual grouping. The work sharing approach creates
threads in order to evaluate runnable processes that will migrate to other processors. Work
stealing, in contrast, makes under-utilized processors steal threads from processors
with more than one active thread. Also, users can manually group the modules to run in
one thread allocated to the same core every clock cycle.
Manual partitioning uses a new API method to set the core affinity of a module.
Each thread runs all modules with the same affinity on the specified core. The parallelism
is managed with OpenMP directives. In the evaluation phase, the processes of each group
execute serially on their core, independently of the other groups. The work concludes
that, although automatic grouping is preferable to manual grouping, manual
grouping shows a considerable improvement in speedup over the other techniques, especially
with more than eight cores.
(a) SystemC scheduler (b) Ezudheen et al.’s parallel scheduler
Figure 2.16: Modification of SystemC scheduler by Ezudheen et al. [24]
Schumacher et al. [61] present a conservative synchronous parallel approach for
cycle-accurate simulation of MPSoCs using a new SystemC framework called parSC. Unlike
previous works, parSC does not reduce the number of messages exchanged; it reduces
the latency and accepts a higher number of messages.
Within parSC, all SystemC processes share the same global time, and the processes
activated in the same delta cycle execute in parallel. The simulation uses only one OS process,
in which the master thread conducts the fall-back sequential simulation. The master thread
decides, at the beginning of each delta cycle, the amount of work to offload to the parallel
worker threads (Figure 2.17). The SystemC processes must ensure that any access to shared data
in the same delta cycle is atomic. Designers must change their modules to use the
synchronization primitives (mutual exclusion, gate, and barrier).
Figure 2.17: Comparing sequential and parallel simulation for Schumacher et al. [61]
Schumacher et al. [62] revisit parSC to address the problem of reviewing and integrating
legacy models into the new simulator. This work presents a methodology to integrate unmodified
SystemC modules into parSC. The legacy modules do not consider thread safety, since the
SystemC reference implementation executes all processes atomically in sequential order.
parSC adopts an approach similar to previous works to execute legacy TLM-2.0 modules.
The proposed solution requires that these legacy models encapsulate their data and that
the designer explicitly create process groups in the elaboration phase, each of which parSC
executes sequentially in one worker OS thread. Each group preserves the original SystemC
behavior for its processes and manages the access to the group’s state.
If a SystemC process needs to use IMC with another group (Figure 2.18a), the process
is temporarily reassigned and rescheduled to the target process group (Figure 2.18b). To
manage this reassignment, every TLM-2.0 socket connection between groups must add a
wrapper module that intercepts the IMC between the original sockets.
(a) Simple example system (b) Thread reassignment to IMC
Figure 2.18: Dynamic management of processes groups by Schumacher et al. [62]
Huang et al. [25] propose efficiency gains through the geographical distribution of TLM
SystemC models. The library for distributing SystemC simulations, called SCD, repeatedly
starts the simulation phase for only one delta cycle and returns to main. The
proposed library decouples communication and synchronization from the SystemC simulation.
The SCD library adopts a master-slave approach to manage the global simulation
state. The master node is connected to all other simulators using the Internet Protocol
and collects the local state information from all slaves to decide whether it should advance
the global clock. If an error occurs during the simulation, the master is notified and then all
slaves fail immediately.
SCD was validated in an MPSoC framework that automatically generates
the SystemC code with support for the distributed library. The validation runs a video decoding
application without details of the guest operating system. The results showed a speedup of
4.5x with 20 simulators over five hosts.
Roth et al. [63] present a framework to investigate SystemC parallelization on an
experimental processor with 48 homogeneous non-cache-coherent cores. This framework
explores implementations with different synchronization schemes, distributed and shared
memory, and multiple distinct address spaces. Each core executes a dedicated Linux instance
with both private and shared memory regions. On top of each core and OS, an instance of the
PDES framework executes as a master or worker (LP) node. The framework is implemented as
a hierarchical architecture with three abstraction levels to provide reusable code: base
application, base simulation, and SystemC level. The base application provides the system
abstraction with memory management, message passing routines, and memory
synchronization. The base simulation provides the PDES architecture and allows synchronous
and asynchronous time management. The last level implements the SystemC standard
using the lower abstraction layers.
Roth et al. [64] extend the framework presented in Roth et al. [63] to simulate
NoC-based MPSoCs on symmetric multiprocessing (SMP) processors. The new methodology
uses a TLM module wrapper to communicate between LPs. Results demonstrate
that the approach can provide significant speedups of two orders of magnitude
versus sequential RTL simulation while exhibiting a moderate loss in accuracy.
Ventroux et al. [65] present the Rapid Virtual prototyping Emulation System (RAVES),
a special-purpose architecture to accelerate SystemC simulations. They also implemented
a lightweight and optimized parallel SystemC kernel using user-level threads to
exploit the RAVES hardware acceleration. Moreover, the new SystemC kernel does not execute
on a Linux OS; it runs on an optimized micro-kernel.
The RAVES kernel executes only the evaluation phase in parallel, from the global
SC_PROCESS queue, on the proposed hardware. The other SystemC phases remain sequential
and can reach 70% of the total execution time on Intel architecture [65]. The proposed
kernel requires dynamic allocation of all objects, but the authors do not explain the
motivation for this requirement.
The RAVES architecture (Figure 2.19) is composed of a kernel unit (SKU) and a set of
homogeneous cores (APC). The SKU executes the simulation manager and
the sequential phases and has a hardware accelerator (SKA) to reduce the kernel and
synchronization overheads. The APC cores execute the parallel SystemC processes in
the evaluation phases.
Figure 2.19: RAVES architecture [65]
Ainey et al. [66] developed another kernel, Parallel SystemCASS, optimized for
cycle-accurate simulations of designs containing state machines. Their work executes
all state machine transitions of the same cycle in parallel using OpenMP task
scheduling. In their tests, static schedulers give better results than dynamic ones under a
balanced workload. A dynamic scheduler was proposed that uses the past executions of the
same simulation during the verification process.
Ziyu et al. [67] propose a parallel simulation environment built on a large-scale
simulation platform called ArchSim, which provides tools to manage the partitioning and
the message exchange between partitions. ArchSim (Figure 2.20) is composed of a
master node (called global server, GS), slave execution nodes (called local server agents,
LSA), and subsystems of modules (called entities). The execution nodes construct,
manage, and schedule the subsystems, which are composed of SystemC modules. The platform
has a communication infrastructure and a virtual time synchronization protocol to support a
large number of subsystems.
Built on ArchSim, ArchSC has a modified version of the SystemC scheduler. The
environment encapsulates the signal communication between different subsystems
(Figure 2.21a) in TCP/IP packets (Figure 2.21b). The architect is responsible for grouping
the modules into subsystems, and it is very important that a simulation be split up into many
subsystems.
Figure 2.20: The architecture of the ArchSim simulation platform [67]
(a) Original RTL Simulation (b) Parallel Simulation
Figure 2.21: An RTL platform to parallel simulation using ArchSC [67]
Nanjundappa et al. [68] propose SCGPSim, which includes a source-to-source translator
that transforms RTL SystemC models into NVIDIA CUDA code. The translator uses a
semantics-preserving transformation to obtain the same behavior as the original SystemC
model. The system must follow a specific layout with one initial producer (called Stimulus)
and one final consumer (called Output) [69], very similar to a traditional pipeline.
The simulation source is parsed to generate an XML file with the Abstract Syntax Tree (AST)
of classes, members, and functions (Figure 2.22a). Then, all SystemC-related items are
extracted from the AST, and each SC_MODULE/SC_THREAD is translated into one CUDA thread.
All CUDA threads execute at every simulation cycle, but when a thread does not
receive its sensitive signals, it goes straight to the end-of-cycle barrier (Figure 2.22b).
Each CUDA thread stores all the signals it uses in local variables. The synchronization between
multiple copies of the same signal occurs at the end of every cycle. Any write to standard
output is stored in an array and displayed at the end of the simulation to reduce memory
transfers between CPU and GPU.
(a) Design Flow (b) Parallel SystemC Kernel
Figure 2.22: SCGPSim [68]
Sinha et al. [70] extend the work of Nanjundappa et al. [68] to support mixed
abstraction levels (RTL and TLM) and to execute in parallel across Central Processing Units
(CPUs) and Graphics Processing Units (GPUs). The SystemC processes are manually
identified as suitable or not for GPU execution. Then, each set of SystemC processes,
CPU-suitable and GPU-suitable, is compiled for its target architecture (Figure 2.23). All
events are managed by the main SystemC scheduler, which uses a conservative approach.
Reinforcing this approach, Nanjundappa et al. [69] showed that a GPU-only solution can
suffer from branch divergence and that there exists a mapping of SystemC processes that
performs better on GPUs.
Figure 2.23: Design flow overview of parallel GPU-CPU [70]
2.6.2 Dependency Analysis
Unlike the previous section, where the works assume hypotheses about the model or require
that designers manually create critical sections, this section presents works that
automatically prevent race conditions using static or dynamic dependency analysis of the model.
Vinco et al. [71] propose SAGA, another SystemC implementation for GPUs that executes
RTL models. Unlike Nanjundappa et al. [68], they do not restrict the execution
to pipeline simulations. SAGA treats all processes active in the same delta cycle as
concurrent tasks but uses a static dependency graph to schedule them. The dependency graph
is constructed from the input and output signals of each SystemC process (Figure 2.24)
and drives the static scheduling of the processes. To reduce synchronization, some
tasks can be duplicated on multiple execution threads.
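How a dependency graph built from input and output signals can drive a static schedule is sketched below. This is a generic levelization under simplifying assumptions, not the SAGA algorithm of [71]: the names Process, depends_on, and schedule_levels are hypothetical, a process depends on every other process that writes one of its input signals, and the graph is assumed to be acyclic.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Process {
    std::string name;
    std::set<std::string> inputs;   // signals the process reads
    std::set<std::string> outputs;  // signals the process writes
};

// True if `reader` consumes at least one signal produced by `writer`.
bool depends_on(const Process& reader, const Process& writer) {
    for (const std::string& s : writer.outputs)
        if (reader.inputs.count(s)) return true;
    return false;
}

// Assigns a level to each process: 0 if it depends on no other process,
// otherwise 1 + the highest level among its producers. Processes on the
// same level have no dependency between them and may run concurrently.
// Returns an empty vector if the dependencies form a cycle.
std::vector<int> schedule_levels(const std::vector<Process>& procs) {
    const std::size_t n = procs.size();
    std::vector<int> level(n, -1);
    std::size_t done = 0;
    while (done < n) {
        bool progress = false;
        for (std::size_t i = 0; i < n; ++i) {
            if (level[i] >= 0) continue;
            int lvl = 0;
            bool ready = true;
            for (std::size_t j = 0; j < n && ready; ++j) {
                if (j == i || !depends_on(procs[i], procs[j])) continue;
                if (level[j] < 0) ready = false;  // producer not placed yet
                else if (level[j] + 1 > lvl) lvl = level[j] + 1;
            }
            if (ready) {
                level[i] = lvl;
                ++done;
                progress = true;
            }
        }
        if (!progress) return std::vector<int>();  // cyclic dependency
    }
    return level;
}
```

A barrier is only needed between consecutive levels, which is why duplicating a cheap producer on several threads, as mentioned above, can remove a synchronization point.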
Figure 2.24: SAGA methodology steps [71]
Chen et al. [72] analyze the SystemC threads to detect potential conflicts and relax
the global in-order event and timing update. The static analysis detects read-after-write
(RAW), write-after-read (WAR), and write-after-write (WAW) hazards on shared variables.
At run time, the extended kernel uses the analysis results to run early the parallel
candidate threads that do not conflict. Each thread maintains a local timestamp and can
execute ahead of the other running threads.
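The hazard classification above can be illustrated with a minimal sketch. It is not the analysis of Chen et al. [72]; it only assumes that each thread segment has been summarized as a set of variables read and a set of variables written. The names AccessSet, hazards, and can_run_in_parallel are hypothetical.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of one thread segment: shared variables it reads
// and shared variables it writes.
struct AccessSet {
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True when the two sets share at least one variable name.
static bool intersects(const std::set<std::string>& a,
                       const std::set<std::string>& b) {
    for (const auto& v : a)
        if (b.count(v)) return true;
    return false;
}

// Hazards that appear if `second` is no longer forced to run strictly
// after `first` (that is, if the in-order execution is relaxed).
std::vector<std::string> hazards(const AccessSet& first,
                                 const AccessSet& second) {
    std::vector<std::string> found;
    if (intersects(first.writes, second.reads))  found.push_back("RAW");
    if (intersects(first.reads,  second.writes)) found.push_back("WAR");
    if (intersects(first.writes, second.writes)) found.push_back("WAW");
    return found;
}

// Two segments are parallel candidates only when no hazard exists.
bool can_run_in_parallel(const AccessSet& a, const AccessSet& b) {
    return hazards(a, b).empty();
}
```

For example, a segment that writes y and a segment that reads y produce a RAW hazard and must keep their original order, while two segments over disjoint variables are free to run in parallel.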
Reder et al. [73] use static and dynamic analysis to feed the parallel scheduler presented
by Roth et al. [64] and extended by Roth et al. [63] (Figure 2.25). Using LLVM, they
generate the AST and extract the SystemC-specific information (processes and threads,
their static sensitivity lists, and calls to the scheduler) into the static abstract
representation (SAR). From the SAR, they produce an extended version of the model,
introducing module wrappers, helper routines, and header adaptations. For each module
defined in the model, a module wrapper is created to abstract its location from the other
modules. The extended model can be compiled using the SystemC reference implementation
or with the previously proposed framework. A newly presented kernel implementation
generates the dynamic abstract representation (DAR) from the extended model. Finally,
the partitions are extracted from the DAR for use by the parallel scheduler.
Figure 2.25: Reder et al. [73] tool flow integrating static and dynamic model analysis.
2.6.3 Distributed Time / Relaxing Synchronizations
This section reviews the literature that explicitly shifts from a globally synchronized time
to a distributed time, or even to relaxed timing. Both conservative and optimistic PDES
approaches are used, reducing the need for a strong global synchronization.
Viaud et al. [74] show a model that applies a loosely conservative PDES to a collection of
SC_THREADs using TLM with its own timing extension. This approach achieves a speedup
of 50x versus the Bus Cycle Accurate (BCA) model of the same multi-core hardware
architecture. The accuracy of the solution is evaluated by comparing, for TTY interrupts,
the cycle at which the same event occurs in the simulation using this approach and in the
reference BCA model. For simulations of 1,000,000 cycles, the loss of timing
accuracy was less than 800 cycles (below 10^-3).
The component modules must use the proposed TLM with timing extension in an
SC_THREAD. Threads represent the hardware components, each one with its own local
simulation clock. No global simulation clock is defined; the local clocks advance together
with the communication between the modules.
Components can run until an upper time barrier, the look-ahead time, without any
synchronization. If a component requests a blocking operation, it is unscheduled and
waits for the answer. A poll of interrupts serializes the syscalls according to their timing
and accepts any delay of a pending interrupt request.
Mello et al. [75] propose a modification to the SystemC kernel to improve the performance
of simulators based on TLM-2.0 on SMP workstations. This work extends the work of
Viaud et al. [74] and proposes TLM-DT, a typical conservative PDES that regularly uses
null messages to keep the simulation time advancing. It adds a new coding style to the two
SystemC timing accuracy options: Approximately Timed (AT) and Loosely Timed (LT).
In this new context, each SC_THREAD has its local time, and the centralized clock is unused.
The timing information and the current local time are part of every message's content. Only
three synchronization primitives are allowed: waiting for events, waiting for delta events,
and notifying delta events. Modules cannot generate timed events, only delta events, so the
global SystemC clock never advances.
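The role of null messages in a conservative PDES can be illustrated with a minimal sketch. It is not the TLM-DT implementation; it assumes a generic logical process with one clock per input channel and a fixed look-ahead, and all names (LogicalProcess, safe_time, null_message) are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using SimTime = std::uint64_t;

// Hypothetical logical process (LP) of a conservative PDES.
struct LogicalProcess {
    SimTime local_time = 0;
    SimTime lookahead = 0;              // minimum delay of any outgoing message
    std::vector<SimTime> channel_time;  // last timestamp seen on each input

    // Time up to which no earlier message can still arrive on any input.
    SimTime safe_time() const {
        if (channel_time.empty()) return std::numeric_limits<SimTime>::max();
        return *std::min_element(channel_time.begin(), channel_time.end());
    }

    // Any message, real or null, raises the clock of its input channel.
    void receive(std::size_t channel, SimTime timestamp) {
        channel_time[channel] = std::max(channel_time[channel], timestamp);
    }

    // Advances the local time toward `target`, never beyond the safe time.
    // Returns true when the LP made progress.
    bool advance(SimTime target) {
        const SimTime next = std::min(target, safe_time());
        if (next <= local_time) return false;
        local_time = next;
        return true;
    }

    // Timestamp of a null message: a promise to send nothing earlier.
    SimTime null_message() const { return local_time + lookahead; }
};
```

An LP whose inputs are all at time zero cannot advance. Once a neighbor sends a null message stamped with its local time plus look-ahead, the receiver may safely advance up to that timestamp, which is how regular null messages keep the distributed time moving without a global clock.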
Figure 2.26a shows one example of this approach: an MPSoC with many clusters,
each one with initiators, targets, and local interconnect modules. A global interconnection
module manages the communication between the clusters. Each module has at least one
SC_THREAD; initiators and targets use their SC_THREADs to request and answer, respectively.
The interconnection modules must explicitly assist the timing synchronization by following
the conservative PDES algorithm.
In the same work, Mello et al. [75] propose SystemC-SMP, a simulation engine
that takes advantage of TLM-DT to parallelize modules written in this style on SMP
workstations. This engine uses a gang-scheduling algorithm that groups related neighboring
SC_THREADs on the same CPU, trying to minimize the communication between physical
CPUs. Each local scheduler is responsible for scheduling the SC_THREADs allocated to
the same CPU on this new kernel. Each local scheduler runs in one POSIX thread
(Figure 2.26b), and the SC_THREADs are user-level threads (QuickThreads), as in the
reference implementation.
SystemC-SMP also offers a manual mapping of the TLM-DT modules to SystemC-SMP
local schedulers, trying to exploit cache locality for communication. Pessoa
et al. [76] explore the locality characteristics of SystemC-SMP with TLM-DT [75]
running a simulated MPSoC. The best speedup in parallel simulations is achieved when
local transactions represent more than 80% of the total, which reduces the contention in
the global interconnect and increases the cache locality.
(a) A TLM-DT system example (b) Simulator Architecture
Figure 2.26: SystemC-SMP with TLM-DT modules [75]
Peeters et al. [77] present an approach for distributing simulators among cores in a
cluster using a hybrid synchronization policy over MPI. Without modifying the SystemC
kernel, the work targets the asynchronous communication between modules of an MPSoC
model. Functional consistency is preserved by the memory accesses, and temporal
consistency uses explicit synchronization. Inspired by Galiano et al. [19] and Mello et al.
[75], the work partitions the system model into clusters (Figure 2.27) and evaluates them
in separate instances of the simulation kernel. The communication between clusters uses
wrappers that implement MPI message passing as SystemC modules (Figure 2.28). Two
instances must synchronize when they have some dependence (data sharing, for example).
Thus, two components that have no dependencies do not need to synchronize with
each other until the explicit global synchronization that occurs at regular intervals in the
simulation.
(a) Non-distributed model example (b) Example model parted into clusters
Figure 2.27: Peeters et al. [77] example.
Figure 2.28: Peeters et al. [77] wrapper.
Jones [78] also describes an experiment executing a SoC simulation on SMP machines
using clusters of modules on multiple kernel instances. However, each instance has
its own time and does not synchronize automatically with the others. An interface is
provided for coarse-grain and fine-grain timing constraints. The system is partitioned
by explicitly declaring an affinity number in each module's constructor. There is a
global synchronization that limits the time separation between any two schedulers, and
each local simulation can introduce more constraints. The local simulation receives an
explicit abstraction of atomic sections that exposes the mutual exclusion of the underlying
threads library. Modules must specify a synchronization policy for each transaction, from a
fully conservative to a fully optimistic approach.
Sauer et al. [79] implement CoMix, an infrastructure to connect clusters of SystemC
modules in a loosely-timed simulation. Each cluster runs in one SystemC instance but,
unlike other approaches, all modules of one cluster connect to the same wrapper
component (Figure 2.29). The wrapper component uses a TCP/IP connection to every
other cluster with which it exchanges messages. During initialization, all instances execute
a discovery protocol in which the first instance is responsible for publishing the IP address
and port of the others. CoMix also implements the common global synchronization at
regular intervals.
Figure 2.29: Sauer et al. [79] CoMix manual partition.
Weinstock et al. [80] present SCope, a new SystemC kernel to exploit the inherent
parallelism of multi-processor simulations using only blocking communication. The proposal
uses one dedicated thread for each simulated processor and implements a look-ahead
mechanism to reduce synchronization between them. The look-ahead lets each thread
execute for a specified amount of time without synchronizing. The partition is defined by
the affinity declared in each module, and inherited modules share the same affinity. New
primitive calls are introduced to notify events between partitions and must be called
explicitly in the module's code. Weinstock et al. [81] extend this work to include support
for exclusive memory access in simulated processors.
Weinstock et al. [82] modify SCope to support two more simulation modes:
deterministic and fast. The new modes dynamically adjust the delay parameter to satisfy
the look-ahead constraints, which reduces the timing accuracy compared to sequential
simulation. The deterministic mode includes a timing error annotation when exchanging
messages; the requester uses this information to compensate for the error in its simulation.
The fast mode uses the Direct Memory Interface (DMI) between threads without a proper
guarantee of memory access ordering.
Weinstock et al. [83] present SystemC-Link, a new approach that differs from their
previous work but is also limited to blocking communication. The new simulation framework
consists of segments interconnected by one-way communication channels. Each segment
has its own SystemC kernel and models, and a global controller manages the time advance
in all segments. To connect two segments, TLM sockets are connected using a wrapper that
implements two communication channels over shared memory.
Ventroux et al. [84] propose SCale, a new parallel SystemC kernel that requires
instrumentation of the standard SystemC model. The processes can use any communication
layer (TLM, IMC, global variables), using the proposed primitives to control ordering
errors. SCale simulations can record a global process ordering to guarantee
deterministic repetition of future executions. Similar to Peeters et al. [77], designers must
define the affinity of each process in the module's model. TLM communications are
temporally decoupled and use global synchronization at regular time intervals. The proposed
kernel executes the evaluation of SC_THREADs in parallel (Figure 2.30) and, at the end of
the phase, checks for ordering conflicts and sequentially evaluates possible immediate and
delta notifications.
Figure 2.30: SCale kernel diagram [84].
Funchal et al. [85] tried a different approach: they discussed the cooperative scheduler
of SystemC and presented jTLM, a Java SystemC implementation with cooperative
and preemptive execution modes for use with SystemC TLM models. The preemptive
mode is non-reproducible because of the host OS scheduler and can expose previously
unnoticed bugs in modules. Each SC_PROCESS runs in one Java thread, managed by the
OS scheduler, but only actions at the current simulated time are allowed to execute.
Based on the results of jTLM [85], Moy [4] presents a C++ library that extends SystemC
Loosely Timed systems to exploit tasks with a time duration. The new set of primitives
allows the designer to specify the amount of simulation time that a computation takes.
The designer must adapt the source code to use the new primitives and cannot call wait()
or notify() from inside these tasks, and variables shared between SC_THREADs may create
race conditions. The duration information allows the modified scheduler to execute the
computation at any time between the beginning and the end of the period. This ensures
that the task has no dependency on the rest of the platform and can execute in parallel.
When the SystemC simulation calls during(), it launches one system thread with
this computation. The SystemC main thread and the computation thread join at
the end time of the duration. All other modules continue their execution, and if SystemC
reaches the end time of any executing thread, it must wait for the end of the computation.
For example, consider a simulation with two modules P and Q (Listing 2.10), each one with
two computations. The code tells the simulator that Q1 is independent of
P1, but P2 is time-dependent on Q1 (Figure 2.31).
extern void P1(), P2(), Q1(), Q2();
struct P : sc_module, sc_during {
    void compute() {
        during(sc_time(25, SC_MS), P1);
        during(sc_time(12, SC_MS), P2);
    }
    SC_CTOR(P) { SC_THREAD(compute); }
};
struct Q : sc_module, sc_during {
    void compute() {
        wait(10, SC_MS);
        during(sc_time(13, SC_MS), Q1);
        during(sc_time(11, SC_MS), Q2);
    }
    SC_CTOR(Q) { SC_THREAD(compute); }
};
int sc_main(int, char **) {
    P p("p");
    Q q("q");
    sc_start();
    return 1;
}
Listing 2.10: Parallelism With Duration in SystemC [4]
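The timing relations of Listing 2.10 can be checked with a small standalone sketch that does not depend on SystemC or on the library of Moy [4]. It only lays out the declared durations on the simulated time axis, assuming that consecutive tasks of one thread follow each other without gaps; the names Task, schedule, and may_overlap are hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simulated-time interval [begin, end) of one task, in milliseconds.
struct Task {
    std::string name;
    int begin;
    int end;
};

// Lays out the consecutive tasks of one thread, starting after `offset`
// milliseconds (the initial wait), from each task's declared duration.
std::vector<Task> schedule(
        int offset,
        const std::vector<std::pair<std::string, int>>& durations) {
    std::vector<Task> tasks;
    int now = offset;
    for (const auto& d : durations) {
        tasks.push_back({d.first, now, now + d.second});
        now += d.second;
    }
    return tasks;
}

// Two tasks may execute in parallel only if their simulated intervals
// overlap; otherwise one of them must finish before the other starts.
bool may_overlap(const Task& a, const Task& b) {
    return a.begin < b.end && b.begin < a.end;
}
```

With the durations of Listing 2.10, P1 covers [0, 25), P2 covers [25, 37), Q1 covers [10, 23), and Q2 covers [23, 34). P1 and Q1 overlap and may run in parallel, while P2 starts only after Q1 has ended.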
Figure 2.31: Execution of a Program with During Tasks [4]
2.6.4 Without SystemC
Not all simulations are restricted to the SystemC universe. Some works use TLM
without the SystemC standard, which gives great opportunities to optimize the
executable model but removes the compatibility with other SystemC modeling levels, such
as RTL. On the other hand, there are works on parallel simulation using other abstractions
and techniques. This section presents works that do not use the SystemC standard.
Radetzki et al. [86] present an adaptive modeling approach that allows dynamic changes
of the abstraction level at runtime in a sequential simulation. This work uses the bus
arbitration to decide when to relax the timing accuracy of TLM transactions. A specialized
SystemC port abstracts the bus operations, providing high-level put and get operations
to the bus.
Salimi Khaligh et al. [87] developed a simulation kernel optimized for the
accuracy-adaptive TLM [86]. This kernel exploits the natural coarse-grained concurrency of
many TLM models by removing the support for non-TLM models; there is no SystemC support.
The kernel allows sequential or parallel simulation. In parallel mode, each
manual partition executes its own kernel instance, and one centralized global timestamp
is used to keep all partitions at the same timestamp. The communication between partitions
uses MPI without specific changes in the code.
Khaligh et al. [88] extend their previous work, adding dynamic load balancing to
their earlier concept but switching from MPI with multiple processes to a shared memory
(SHM) multi-threaded implementation. The solution adopts the work sharing approach, where
an underutilized worker (LP) can steal jobs from others. The load balancing occurs at
time synchronization barriers, when no worker is running, which avoids the need for
dedicated mutual exclusion.
2.6.5 Parallelization Review
The related works (Table 2.1) do not solve the problem of running many components,
not only cores, on a full modern infrastructure with distributed multi-cores without
requiring significant changes in the source code. Some solutions were proposed to reduce the
simulation time, many of them modifying the SystemC kernel or requiring changes to the
component models to support their solutions. Solutions that demand any modification to the
models require a strong set of regression tests to avoid introducing new and unknown bugs
when adopting them. Also, changes in the kernel mean a project fork that produces a library
that has not been fully studied.
Table 2.1: Related work

Work                     Year  Parallel         Partitioning   Scope         SC Kernel      Timing         Source code
Cox [58]                 2005  Distributed      Automatic      Generic       RITSim         Centralized    Unavailable²
Viaud et al. [74]        2006  Multi-threads    Manual         SoCs³         Modified       Loosely        Unavailable²
Huang et al. [25]        2008  Distributed      Manual         SoCs          Original       Centralized    Unavailable²
Combes et al. [60]       2008  Distributed      Manual         SoCs          SystemC-MPI    Loosely        Unavailable²
Galiano et al. [19]      2009  Distributed      Manual         SoCs          Original       Centralized    Unavailable²
Ezudheen et al. [24]     2009  Multi-threads    Automatic      Generic       Modified       Centralized    Unavailable²
Ziyu et al. [67]         2009  Distributed      Manual         SoCs          ArchSC         Centralized    Unavailable²
Nanjundappa et al. [68]  2010  GP-GPU           Pre-processed  Generic       CUDA           Centralized    Partial Public
Pessoa et al. [76]       2010  Multi-threads    Runtime        SoCs          SystemC-SMP    Code Style     Unavailable²
Peeters et al. [77]      2011  Distributed      Manual         Generic       Original       Manual         Unavailable²
Sinha et al. [70]        2012  CPU and GPU      Manual         Generic       Modified       Loosely        Unavailable²
Chen et al. [72]         2012  Multi-threads    Pre-processed  Generic       SpecC²-based   Partial        Unavailable²
Vinco et al. [71]        2012  GP-GPU           Pre-processed  Generic       HIFSuite       Pre-processed  Unavailable
Moy [4]                  2013  Multi-threads    Manual         SoCs          Original       Loosely        Public
Schumacher et al. [62]   2013  Multi-threads    Manual         SoCs³         legaSCi        Centralized    Unavailable²
Roth et al. [64]         2013  Multi-processes  Manual         SoCs          Based          Loosely        Unavailable²
Ventroux et al. [84]     2016  Multi-threads    Annotated      Generic       SCale          Centralized    Unavailable²
This Dissertation        2017  Distributed      Manual         TLM2 modules  Original       Manual         Public

² The text does not specify license and location code. Author does not respond in time contact by email.
³ The text define a generic support, but only show results with SoCs simulations.
2.7 ESL and Simulators Parallelizations
Beyond the scope of this dissertation, the challenge of parallel simulation is not limited
to SystemC. This section covers other proposed solutions for parallel simulation. Many
of the techniques here were already used with SystemC and presented in previous sections.
Lungeanu et al. [89] implement an MPI PDES simulation of digital and analog components
using conservative approaches, with and without look-ahead, and an optimistic approach.
Matsuo et al. [90] present a distributed many-core simulation using a software
Distributed Shared Memory (DSM) technique. One host simulates the memory model; this
instance is responsible for ordering and timing. Each other host executes one processor
model with a cache managed by the DSM.
Recent works achieved significant results using Pin [20] as the core of their DBT multi-core
simulators. This limits these approaches to Intel host and guest systems. Fu et al.
[23] present PriME, which uses MPI for communication between simulated cores. Although
PriME executes multiple guest processes and threads in a distributed solution, it requires
that a multi-threaded application execute all its threads on the same host to share the
address space.
Using TCP/IP, Miller et al. [15] present Graphite, which includes a shared simulated single
address space, a consistent OS interface with syscalls, and interception of the threading
interface for multi-threaded simulation. Carlson et al. [21] modified Graphite into a new
framework called Sniper. The main change was replacing the detailed cycle-accurate
simulation from Graphite with interval simulation. Interval simulation abstracts the core
performance without tracking individual instructions through the core's pipeline stages.
Sanchez et al. [22] present ZSim, a similar approach that removes the distributed aspect
and uses SHM. Each simulation period executes in two distinct phases: bound and
weave. The bound phase uses barrier synchronization to execute all guest
system partitions in parallel, in random order, for an interval of cycles. During the
execution, this phase records all memory accesses that go beyond the private levels of the
cache hierarchy. The weave phase executes an event-driven simulation of the event traces
recorded for each domain in the bound phase. The synchronization is limited to event
dependencies across domains, relying on the observation that path-altering interference is
rare in simulations of many cores that use the cache hierarchy for communication.
Chapter 3
Concurrent TLM-2.0
The TLM-2.0 standard defines a protocol for transaction-level simulation using
asynchronous message passing, with native support for models using memory-based buses and
extensible to other models. Many software solutions use message passing for communication,
and when these messages are asynchronous, they offer opportunities to improve
performance with parallel execution. A requester can send a message and concurrently
do some work until the answer arrives.
The fraction of execution time used by the SystemC kernel is around half regardless of
the number of simulated components [26]. To increase the parallelism and reduce the total
time spent, a simulation can be partitioned into multiple simulations, each containing only
a subset of the original components, with a communication mechanism between them. Many of
the proposed works start with this same approach [3, 19, 59, 60, 74, 77], even works that
implement a new kernel [58, 63, 64, 67, 72, 71, 73, 75, 76, 78, 79, 80, 82, 83, 87, 88].
The requirements to adopt one of the related solutions include code changes, many
of them in SystemC or in the modules' source code. This chapter presents YAPSC, a code
annotation API that allows concurrent communication between modules using asynchronous
TLM-2.0 over distinct processes. The YAPSC approach modifies neither the SystemC
kernel nor the modules. The designer should use the YAPSC API to declare, initialize,
and connect modules.
Listing 3.1 shows the sc_main with the elaboration and execution of a model composed
of one memory and one processor (Figure 3.1). The blank lines align the line numbers
with those of Listing 3.2, a concurrent version of the same model (Figure 3.2) written
using YAPSC.
Figure 3.1: TLM-2.0 communication between core and memory (Similar to Figure 2.7).
Figure 3.2: TLM-2.0 communication between core and memory with YAPSC executing
on 2 distinct domains.
In Listing 3.1, lines 6 and 14 declare and instantiate the memory and processor modules.
The configuration of these modules is done in lines 7 and 15-19. At the end of the
elaboration phase, line 21 explicitly binds the processor's initiator socket to the
memory's target socket. The simulation loop executes in line 24, and possible errors are
detected and handled in lines 25-29 and 31-35, respectively.
The YAPSC changes in Listing 3.2 show the minimal API required:
YAPSC_INIT in line 3 initializes the library and detects the back-end and domain;
YAPSC_MODULE in lines 5-10 limits the execution of a block to the specific domain of one module;
YAPSC_TARGET_SOCKET in line 9 registers a target socket with a name;
YAPSC_INITIATOR_SOCKET in line 21 registers an initiator socket to connect to a target
socket by module and socket name;
YAPSC_START in line 24 binds the local sockets and starts the simulation;
YAPSC_FINALIZE in line 39 finalizes the back-end structure.
The next sections describe the behavior of each one of these API calls and of the
YAPSC_PAYLOAD_EXTENSION call.
1 int sc_main(int argc, char *argv[]) {
2 char **binary_argv = &argv[1];
3
4
5
6 tlm_memory_nb *mem = new tlm_memory_nb("MEM", 0, MSIZE - 1);
7 load_elf(binary_argv[0], 4, false, mem);
8
9
10
11
12
13
14 mips *p = new mips("Core");
15 p->ISA.RB[29] = MSIZE;
16 p->ac_heap_ptr = ac_heap_ptr;
17 p->ac_start_addr = ac_start_addr;
18
19 p->init();
20
21 p->MEM.LOCAL_init_socket.bind(mem->socket);
22
23 try {
24 sc_start();
25 } catch (std::exception& e) {
26 SC_REPORT_WARNING(MSGID, e.what());
27 } catch (...) {
28 SC_REPORT_ERROR(MSGID,"Caught exception during simulation.");
29 }
30
31 if (not sc_end_of_simulation_invoked()) {
32 SC_REPORT_INFO(MSGID,"Stopped without explicit sc_stop()");
33 sc_stop();
34 return 1;
35 }
36
37 delete (p);
38 delete (mem);
39
40 return 0;
41 }
Listing 3.1: Processor and Memory using TLM-2.0
1 int sc_main(int argc, char *argv[]) {
2 char **binary_argv = &argv[1];
3 YAPSC_INIT(&argc, &argv);
4 sc_memory::tlm_memory_nb *mem = NULL;
5 YAPSC_MODULE("MEM") {
6 mem = new tlm_memory_nb("MEM", 0, MSIZE - 1);
7 load_elf(binary_argv[0], 4, false, mem);
8
9 YAPSC_TARGET_SOCKET("MEM", "socket", mem->socket);
10 }
11
12 mips *p = NULL;
13 YAPSC_MODULE("Core") {
14 p = new mips("Core");
15 p->ISA.RB[29] = MSIZE;
16 p->ac_heap_ptr = ac_heap_ptr;
17 p->ac_start_addr = ac_start_addr;
18
19 p->init();
20
21 YAPSC_INITIATOR_SOCKET(p->MEM.LOCAL_init_socket, "MEM", "socket");
22 }
23
24 YAPSC_START();
25
26
27
28
29
30
31
32
33
34
35
36
37 YAPSC_MODULE("Core"){ delete (p); }
38 YAPSC_MODULE("MEM"){ delete (mem); }
39 YAPSC_FINALIZE();
40 return 0;
41 }
Listing 3.2: Processor and Memory using TLM-2.0 with YAPSC API
3.1 YAPSC_INIT
YAPSC_INIT initializes the library, following the signature used by the MPI initialization
routine. All YAPSC programs must contain a call to YAPSC_INIT, and this routine must
be called before any other YAPSC routine. YAPSC_INIT accepts the addresses of the C/C++
argc and argv arguments of main, but neither modifies, interprets, nor distributes them.
YAPSC_INIT determines at run time the back-end that connects the LPs of the parallel
simulation, using the environment variable YAPSC_DOMAIN. The content of
YAPSC_DOMAIN is a simplified version of a Uniform Resource Identifier (URI) with the form
scheme:path[#fragment].
The scheme field consists of a sequence of case-insensitive characters beginning with a
letter and followed by any combination of letters, digits, periods, or hyphens. The scheme
defines the back-end to be used, and the accepted values are shm, tcp, unix, and mpi,
which select the SHM, TCP/IP, UDS, and MPI back-ends, respectively.
The path field is the identification name of the parallel simulation and must follow
the syntax required by the scheme. The path is used by the back-end to connect all execution
domains, except for the MPI back-end, which has its own isolation. In this case, the name is
used only for logging and can be omitted.
The fragment field is the domain number. It is required for all schemes except for the
MPI back-end, which numbers its instances automatically and ignores the fragment
value. A fragment value of zero represents the controller domain, which does
not start the SystemC scheduler.
If there is no valid value for YAPSC_DOMAIN, either defined in the environment
variable or detected automatically, concurrent execution is disabled and YAPSC
executes a sequential SystemC simulation with all components.
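The rules above for the YAPSC_DOMAIN value can be summarized in a minimal parsing sketch. This is not the YAPSC implementation: the type DomainSpec and the function parse_domain are hypothetical, the path is only checked for being non-empty, and C++17 is assumed for std::optional.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical parsed form of a YAPSC_DOMAIN value.
struct DomainSpec {
    std::string scheme;     // back-end: shm, tcp, unix, or mpi (lowercased)
    std::string path;       // simulation name
    unsigned fragment = 0;  // domain number; zero is the controller
};

// Parses "scheme:path[#fragment]". An empty result stands for an invalid
// value, where the text above falls back to a sequential simulation.
std::optional<DomainSpec> parse_domain(const std::string& value) {
    const std::string::size_type colon = value.find(':');
    if (colon == std::string::npos || colon == 0) return std::nullopt;

    DomainSpec spec;
    for (char c : value.substr(0, colon)) {
        const unsigned char u = static_cast<unsigned char>(c);
        if (!std::isalnum(u) && c != '.' && c != '-') return std::nullopt;
        spec.scheme += static_cast<char>(std::tolower(u));
    }
    if (!std::isalpha(static_cast<unsigned char>(spec.scheme[0])))
        return std::nullopt;
    if (spec.scheme != "shm" && spec.scheme != "tcp" &&
        spec.scheme != "unix" && spec.scheme != "mpi")
        return std::nullopt;

    const std::string::size_type hash = value.find('#', colon + 1);
    spec.path = (hash == std::string::npos)
                    ? value.substr(colon + 1)
                    : value.substr(colon + 1, hash - colon - 1);

    // The MPI back-end numbers its own instances; path and fragment are
    // optional there and the fragment is ignored.
    if (spec.scheme == "mpi") return spec;

    if (spec.path.empty() || hash == std::string::npos) return std::nullopt;
    const std::string digits = value.substr(hash + 1);
    if (digits.empty() || digits.size() > 9) return std::nullopt;
    for (char c : digits)
        if (!std::isdigit(static_cast<unsigned char>(c))) return std::nullopt;
    spec.fragment = static_cast<unsigned>(std::stoul(digits));
    return spec;
}
```

Under these assumptions, a value such as tcp:noc2x2#3 (an illustrative simulation name) selects the TCP/IP back-end and worker domain 3, shm:sim#0 selects the controller of an SHM simulation, and shm:sim is rejected because the fragment is missing.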
3.1.1 Domains
All the proposed back-ends consist of a PDES with one controller and many LPs, each
executing its own instance of the SystemC kernel. The domains are identified by unique
non-negative integers starting from zero, where zero represents the controller and the
following numbers represent the worker domains.
Each YAPSC worker domain registers its location with the controller and, when required,
consults it to discover the location of other domains. Each LP consists of a SystemC
kernel plus a defined set of SC_MODULEs, a subset of the original simulation. The
SystemC kernel registers every module initialization, so foreign modules must not be
initialized. The early domain specialization ensures that an LP only initializes its
domain-specific SC_MODULEs.
After connecting to the controller, each LP queries its list of native SC_MODULEs, and
only these modules are instantiated and initialized.
3.1.2 Controller
The controller, represented by domain number zero, is the non-SystemC domain that
manages any centralized resources and resolves names for the simulation. In all back-ends,
LPs must wait for the controller to finish its initialization before connecting to it, and
only then register themselves. The elaboration phase only starts when each worker has
received its list of native SC_MODULEs from the controller.
The controller messages (Listing 3.3) are serialized with the protobuf library into a
binary wire format that is compact, endianness-safe, and forward- and backward-compatible.
package yapsc;

message Message {
    enum MessageType {
        SYSCALL = 0;
        BARRIER = 1;
        DOMAIN_REGISTRY = 2;
        TARGET_SOCKET_REGISTRY = 11;
        TARGET_SOCKET_QUERY = 12;
        END = 15;
    }
    MessageType type = 1;
    uint32 Domain = 2;
    int32 Status = 3;
    string SocketName = 4;
    string Address = 5;
    uint32 Port = 6;
    repeated string Names = 14;
    Syscall Syscall = 15;
}
Listing 3.3: Controller Message Description
The controller starts by loading the list of modules and their respective domain numbers,
that is, the partition of all SC_MODULEs. It then initializes the back-end according to its
specific requirements.
The simulation name (path field in the YAPSC_DOMAIN variable) is used directly to call the
SHM and UDS open routines of the OS. The TCP/IP back-end adopts an approach similar to
that of Sauer et al. [79] to share and discover the addresses and ports of the controller
and worker processes by publishing the simulation name. The controller domain uses multicast
Domain Name System (mDNS) to broadcast its YAPSC TCP/IP address and port under the defined
identification name. The LPs must connect, register themselves, and look up the addresses
and ports of other LPs through the controller.
The controller is also the barrier that starts and ends the simulation. All LPs must
wait until all local and remote sockets are bound in all domains. The simulation ends
when any LP leaves the simulation loop, and the remaining LPs are notified by
the controller.
3.1.3 Partitioning
In YAPSC, modules are always identified by their unique names. The SystemC kernel
emits a warning when two or more modules share the same name, which is forbidden in
YAPSC. Each LP loads the list of its modules from the controller and instantiates,
initializes, and executes only these modules.
Ezudheen et al. [24] showed that, although automatic grouping is commonly preferable,
manual partitioning shows considerable improvement in speedup over other techniques.
The partition of SC_MODULEs in YAPSC is specified in the partition.yapsc file, which
can be written manually or generated by an automatic tool. Keeping the partition separate
from the source code allows designers to explicitly use their knowledge of the entire
simulated system in a manual partition, without the need to modify the module's code as in
the related works presented in the previous chapter [91, 84, 80, 82, 81, 83]. Each declared
module receives a non-zero domain number that identifies its domain.
For example, a simple 2×2 mesh with NoC (Figure 3.3) can be modeled using three
cores and one memory. All requests are routed by router modules to the target: memory,
if it is an incomplete request, or the requester core, if a completed one.
Considering the example of Figure 3.3, where all modules execute in the same process
using the sequential SystemC, the simulation can be split into concurrent processes by tiles
(Figure 3.4) or by NoC and its components (Figure 3.5), using the description files in
Listings 3.4 and 3.5, respectively.
PU_A 1
PU_B 2
MEM 3
PU_D 4
Router_A 1
Router_B 2
Router_C 3
Router_D 4
Listing 3.4: Partition description file to Figure 3.4
PU_A 1
PU_B 2
MEM 3
PU_D 4
Router_A 5
Router_B 5
Router_C 5
Router_D 5
Listing 3.5: Partition description file to Figure 3.5
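A description file in the format of Listings 3.4 and 3.5 could be loaded as sketched below. The exact grammar of partition.yapsc is an assumption here (one module name and one non-zero domain number per line, separated by whitespace), and the names Partition, load_partition, and modules_of are hypothetical.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Maps each unique module name to its (non-zero) domain number.
using Partition = std::map<std::string, unsigned>;

// Reads "<module name> <domain number>" pairs, one per line.
// Throws on a missing or zero domain and on a repeated module name,
// since module names must be unique.
Partition load_partition(std::istream& in) {
    Partition partition;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string module;
        long domain = 0;
        if (!(fields >> module)) continue;  // skip blank lines
        if (!(fields >> domain) || domain <= 0)
            throw std::runtime_error("invalid domain for module " + module);
        if (!partition.emplace(module, static_cast<unsigned>(domain)).second)
            throw std::runtime_error("duplicate module name " + module);
    }
    return partition;
}

// Names of the modules that one worker domain should instantiate.
std::vector<std::string> modules_of(const Partition& partition,
                                    unsigned domain) {
    std::vector<std::string> names;
    for (const auto& entry : partition)
        if (entry.second == domain) names.push_back(entry.first);
    return names;
}
```

Applied to the content of Listing 3.5, modules_of returns the four routers for domain 5 and a single processing unit or the memory for each of the domains 1 to 4.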
Figure 3.3: Diagram of 2×2 NoC example
Figure 3.4: 2×2 NoC example divided by tiles
Figure 3.5: 2×2 NoC example divided by components
3.2 YAPSC_MODULE
The domains should only instantiate and initialize their native SC_MODULEs.
YAPSC_MODULE delimits a code block where a module can be safely instantiated and
initialized. YAPSC ignores the code related to foreign YAPSC_MODULEs. So, the same code
contains the initialization of all modules, but each domain only executes the
initialization of its own modules.
In end of YAPSC_INIT the domain execution receives the list of its native SC_MODULEs.
Any call to YAPSC_MODULE with a different module name is skipped.
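One way to obtain this skipping behavior is a macro that expands to a conditional, so that the braces written after it become the body of the condition. The sketch below assumes this technique only for illustration; SKETCH_MODULE, native_modules, and elaborate are hypothetical names, and the real YAPSC_MODULE macro may be implemented differently.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the list of native modules that a domain
// receives at the end of YAPSC_INIT.
std::set<std::string> &native_modules() {
    static std::set<std::string> names;
    return names;
}

// Sketch only: expands to an `if`, so the block that follows runs only for
// native modules. This is an assumption, not the actual YAPSC macro.
#define SKETCH_MODULE(name) if (native_modules().count(name) != 0)

// Runs the same elaboration code for a domain that owns `owned` and returns
// how many module blocks were actually executed.
int elaborate(const std::set<std::string> &owned) {
    native_modules() = owned;
    int instantiated = 0;
    SKETCH_MODULE("MEM") { ++instantiated; }   // the memory would be created here
    SKETCH_MODULE("Core") { ++instantiated; }  // the core would be created here
    return instantiated;
}
```

The same elaborate code executes one block in a domain that owns only MEM and both blocks in a domain that owns MEM and Core.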
3.3 YAPSC_TARGET_SOCKET
The TLM-2.0 standard requires all sockets to be bound by the end of elaboration.
YAPSC postpones the bind phase to the end of elaboration, just before the call to
sc_start. Every TLM-2.0 target socket receives a unique identification pair composed of
the module’s name and a socket name.
The back-end does not need to register all target socket names in the controller. Only
the target sockets without a local binding are exported and published to the controller.
3.4 YAPSC_INITIATOR_SOCKET
The initiator socket bind is commonly executed during the initialization of the socket’s
module. With named target sockets, the initiator socket only indicates the module and
socket names of the target socket. The identified target may already be defined locally,
may be defined locally later, or may be a foreign target socket.
The postponed bind phase allows designers to write better-organized code without
following any order of precedence based on socket connections. In the bind phase, YAPSC
searches for all indicated target sockets locally before querying the controller for the
location of a foreign target socket.
3.5 YAPSC_START
Starting the simulation kernel involves many exception checks to detect whether the
simulation runs correctly until its explicit end. YAPSC reduces the code required to
launch the simulation, includes a binding phase before calling sc_start, and manages
the end of simulation.
The bind phase is the last part of the elaboration phase in YAPSC. In this phase,
all named binds are processed, and those connected to foreign sockets use the required
communication wrappers.
First, YAPSC tries to bind all initiator sockets using the registered target’s module
and socket names. If the target’s module name is not in the list of native SC_MODULEs,
the initiator socket is skipped. Initiator and target sockets of the same domain are
bound with the standard TLM-2.0 bind routine using the registered object references.
All target sockets that were not bound in the previous step receive a YAPSC remote
wrapper module (RW) specialized for the current back-end. Every RW is registered and
published in the controller using the messages defined in the previous sections.
After that, each remaining initiator socket is bound to a YAPSC proxy wrapper module
(PW) with the target socket’s identification. Each PW queries the controller for the
connection information of its target socket and connects to it using the specific
back-end connection.
All LPs wait at the controller barrier to ensure that all local bindings and wrapper
connections are made. After the barrier, YAPSC calls sc_start, marking the end of
elaboration and the start of the simulation phase. Whenever an exit from the simulation
kernel is detected, the controller is notified so that it can finish all other LPs.
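The three binding steps can be summarized as a classification of socket identification pairs. The following sketch is a simplified model of that classification and not the YAPSC implementation: SocketId, BindPlan, and plan_binding are hypothetical names, and the sockets are reduced to their (module, socket) name pairs.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// (module name, socket name), as in the identification pair of Section 3.3.
using SocketId = std::pair<std::string, std::string>;

// Hypothetical result type for this sketch.
struct BindPlan {
    std::vector<SocketId> local;   // bound directly with the TLM-2.0 bind
    std::vector<SocketId> remote;  // local targets that receive a remote wrapper (RW)
    std::vector<SocketId> proxy;   // foreign targets reached through a proxy wrapper (PW)
};

// `targets`: target sockets registered in this domain.
// `wanted`: target pairs requested by the initiator sockets of this domain.
BindPlan plan_binding(const std::set<SocketId> &targets,
                      const std::vector<SocketId> &wanted) {
    BindPlan plan;
    std::set<SocketId> bound;
    for (const SocketId &id : wanted) {
        if (targets.count(id) != 0) {
            plan.local.push_back(id);  // step 1: target found in the same domain
            bound.insert(id);
        } else {
            plan.proxy.push_back(id);  // step 3: target lives in another domain
        }
    }
    for (const SocketId &id : targets)
        if (bound.count(id) == 0) plan.remote.push_back(id);  // step 2: export it
    return plan;
}
```

In this model, a target that no local initiator requests is exported through an RW, and an initiator whose target is unknown locally is bound to a PW.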
3.5.1 TLM-2.0 Wrappers
YAPSC uses SC_MODULE wrappers to bind the TLM-2.0 sockets that connect to external
domains. Two wrappers are proposed following the naming of Peeters et al. [77]
(Figure 3.6): the YAPSC proxy wrapper module (PW) and the YAPSC remote wrapper
module (RW). Unlike Peeters et al. [77], the MPI middleware is replaced by
run-time back-end selection. Moreover, for each supported back-end, a specialized
implementation of both wrappers is provided.
Figure 3.6: Peeters et al. [77] wrappers (same as Figure 2.28).
Figure 3.7: YAPSC wrappers (same as Figure 3.2).
The PW mimics the behavior of a generic target, as if the target were directly
connected to the initiator. The RW forwards the payload to the target module, which is
in a different domain from the initiator, and performs the inverse process with the
response payloads. Both wrappers are implemented as SC_MODULEs and only dispatch a
transaction in the evaluation phase of the SystemC simulation.
Every RW receives the identification pair of the wrapped target socket, and the same
pair is the identification used by the PW to discover it through the controller. The RW
is responsible for starting the server in the socket-based back-ends and registering the
system socket in the controller. Each PW must wait for its RW, connect to it, and notify
the controller of the successful connection.
The back-end connection between the wrappers must be bidirectional: both PW and RW
send and receive payloads. PWs send incomplete payloads to the RW and expect to receive
completed payloads from the connected RW.
A bidirectional TLM-2.0 connection requires two pairs of initiator and target sockets,
as seen in the example of Figure 3.3. Each pair of initiator and target sockets in different
domains should receive the respective pair of wrappers. Figure 3.8 shows the connection
using YAPSC wrappers on top routers from the example of Figure 3.4.
Figure 3.8: YAPSC connection of the top routers of example with 2×2 NoC (Figure 3.4)
3.5.2 Payload Managing
In usual SystemC TLM-2.0 simulations, the communication between SC_MODULEs uses
DMI to transmit the payload reference. A payload is an object with the request
information, but it does not contain all information directly: the data is referenced
through a pointer into the process’s memory, and any extension is a referenced object
(Figure 3.9). DMI requires that all SC_MODULEs that access the payload data share the
same memory address space, commonly the entire memory of a process. However, when the
simulation is split into distinct processes, the payload object and all referenced
content (data and payload extensions) need to migrate across domains. YAPSC avoids
memory leaks and ensures that the payload seen by each simulated SC_MODULE is valid even
in a different domain.
Figure 3.9: Diagram of TLM payload with YAPSC extension (Similar to Figure 2.6)
All payloads are created as incomplete requests by one SC_MODULE, are processed
by at least one other module, and return to the initial module. So, the life cycle
of a payload starts and ends in the same SC_MODULE, and this module allocates and
frees the payload object’s memory. YAPSC keeps payloads allocated in their original
domains, because the creator module cannot receive a different payload object as the
answer to its request. Domains processing a foreign payload need to keep a copy of the
original payload with its memory pointers adjusted to local memory and a custom YAPSC
foreign extension. The extension contains the information needed to detect the native
domain and recover the original payload object. The domain identification integer and
the original payload pointer accompany the payload in every transfer between domains.
When a payload leaves its native domain, the PW module serializes and transmits
the payload over the back-end channel with the current domain identification and the
payload pointer attached. The receiving RW creates a copy of the payload from the
decoded message, allocating temporary memory of the required length for the data and for
any payload extension. The RW must also create and attach an extension with the native
domain information to the payload.
The presence of the YAPSC extension shows that a payload is not native to the current
domain, and this information should propagate until the payload returns to its original
domain. Foreign payloads are freed by the wrapper module when they leave the current
domain. If the received message carries the identification of the current domain, the
wrapper uses the original payload reference to access the existing payload and update
its values without allocating more memory. The payload in its native domain never
has the YAPSC extension.
Completed payloads follow the backward direction, from target to initiator sockets.
The behavior of the RW in the backward direction is similar to that of the PW in the
forward direction. In addition, the RW deletes the temporary copy of the transitional
payload.
In the example of Figure 3.6, the core module creates a payload with a memory
read request, including a pointer to the place where the requested data must be copied.
The SC_MODULE sends the payload reference to the connected target socket, without any
distinction between the PW and the memory module. The PW first detects that the
payload is native to this domain and annotates it with the current domain identifier (ID)
and the pointer to the payload in this memory space. The payload is serialized and
transmitted using the back-end.
The RW receives the serialized message, detects that it is a foreign payload, and
allocates a temporary copy of the original payload, with all pointers replaced by
pointers to local memory and with the YAPSC extension holding the original domain ID and
payload pointer. After creating the transitional payload, the RW calls the target socket
of the memory module with the new payload. The memory copies the requested memory region
to the memory pointed to by the foreign payload and changes the response status to
TLM_OK_RESPONSE.
The completed transitional payload with the answer is passed to the backward
method of the RW’s initiator socket. The RW recognizes that the payload is not native
and deletes it after successful serialization and transmission to the PW. When the PW
receives a payload that is returning to its domain, no new object is created. Instead,
the wrapper uses the original payload reference to update the object already created by
the core module, including the memory it points to. The core module receives an updated
payload with its request completed, exactly as it would in a sequential simulation
directly from the memory module.
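The round trip described in this section can be modeled without SystemC. The sketch below is a simplified illustration under stated assumptions: Payload and Message are reduced stand-ins (not the TLM-2.0 generic payload nor the protobuf message), the foreign extension is folded into three fields of Payload, and leave_domain and enter_domain are hypothetical names for the work done by the wrappers at each domain border.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a payload; not the TLM-2.0 generic payload.
struct Payload {
    std::uint64_t address = 0;
    std::vector<unsigned char> data;
    bool completed = false;
    // Stand-in for the YAPSC foreign extension, set only on temporary copies.
    bool foreign = false;
    int native_domain = 0;
    std::uintptr_t native_pointer = 0;
};

// Simplified stand-in for the serialized message sent over the back-end.
struct Message {
    int domain;
    std::uintptr_t pointer;
    std::uint64_t address;
    std::vector<unsigned char> data;
    bool completed;
};

// Leaving a domain: a native payload is tagged with this domain and its own
// address; a foreign copy keeps its original tag and is freed afterwards.
Message leave_domain(Payload *p, int current_domain) {
    Message m;
    m.domain = p->foreign ? p->native_domain : current_domain;
    m.pointer = p->foreign ? p->native_pointer : reinterpret_cast<std::uintptr_t>(p);
    m.address = p->address;
    m.data = p->data;
    m.completed = p->completed;
    if (p->foreign) delete p;  // temporary copies do not outlive the transfer
    return m;
}

// Entering a domain: a message that comes home updates the original object;
// any other message produces a temporary copy carrying the foreign tag.
Payload *enter_domain(const Message &m, int current_domain) {
    if (m.domain == current_domain) {
        Payload *original = reinterpret_cast<Payload *>(m.pointer);
        original->data = m.data;
        original->completed = m.completed;
        return original;
    }
    Payload *copy = new Payload;
    copy->address = m.address;
    copy->data = m.data;
    copy->completed = m.completed;
    copy->foreign = true;
    copy->native_domain = m.domain;
    copy->native_pointer = m.pointer;
    return copy;
}
```

In this model, a request created in domain 1 becomes a tagged copy in domain 2, the copy is released when the answer leaves domain 2, and the answer is written back into the very object the creator still holds.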
3.5.3 Serialization
Communication between processes requires an IPC mechanism to transmit the data. The
need for serialization depends on the back-end implementation: except for SHM, the
other back-ends (MPI, UDS, and TCP/IP) need serialization.
By default, YAPSC uses protobuf to serialize data for its back-ends into a binary wire
format that is compact, endianness-safe, and forward- and backward-compatible across
versions. The library uses the same serialization for payloads and for controller
communication (Section 3.1.2).
The payload message (Listing 3.6) supports all attributes implemented by the generic
payload defined in the reference SystemC implementation. Many of them are not used in
all simulations and are not transmitted, which reduces the communicated data.
package yapsc;

message Payload {
  int32 Command = 1;
  int32 ResponseStatus = 2;
  uint64 Address = 3;

  uint64 StreamingWidth = 11;
  bool DMIAllowed = 12;
  int32 GPOption = 13;

  message Memory {
    uint64 Address = 1;
    bytes Data = 2;
    uint64 Length = 3;
  }
  Memory Data = 4;
  Memory ByteEnable = 5;

  message Extension {
    int64 ID = 1;
    bytes Content = 2;
  }
  repeated Extension Extensions = 15;

  int32 Domain = 14;
  // The source listing also assigns 15 here; protobuf rejects duplicate field
  // numbers, so 16 is used as an assumed free number.
  uint64 Pointer = 16;
}
Listing 3.6: Payload Message Description
YAPSC neither expects nor supports subclasses of the generic payload. Designers must
use payload extensions to add custom information to their TLM-2.0 message exchanges.
YAPSC checks all registered payload extensions to include them in the serialized
message.
3.5.4 Syscalls
Simulations can expose the host OS to guest systems; this is the case for simulations
using ArchC. The simulated cores execute guest software that interacts with the host
system, for example with the filesystem. All syscalls must effectively execute in the
same process to share the same OS process environment. For example, a file descriptor
returned by the open syscall must be consistent in all domains of the simulation for the
read, write, lseek, and close syscalls.
YAPSC synchronizes all syscalls through the controller domain. As shown in
Section 3.1.2, the controller messages (Listing 3.3) already define the type for syscall
execution and the optional field with the arguments of the syscall. All guest syscall
events must call the yapsc_sys syscall wrapper routines to execute them synchronously in
the controller. The syscall messages encapsulate the syscall and its return value using
protobuf (Listing 3.7).
package yapsc;

message Syscall {
  enum SyscallType {
    UNDEFINED = 0;
    EXIT = 1;
    CREATE = 8;
    OPEN = 5;
    CLOSE = 6;
    READ = 3;
    WRITE = 4;
    LSEEK = 19;
    ISATTY = 1000;
    SBRK = 1001;
  }
  SyscallType type = 1;
  int32 FileDescriptor = 2;
  bytes Buffer = 3;
  string FilePath = 4;
  int32 Flags = 5;
  int32 Mode = 6;
  int32 Offset = 7;
}
Listing 3.7: Syscall Message Description: included in Listing 3.3
This support requires explicit code changes in modules to adopt the YAPSC syscall
wrapper functions. A designer who wants to use the host’s syscalls needs to replace the
calls to the standard functions with the YAPSC versions of these functions (e.g.,
replace the call to write with yapsc_syscall_write). However, none of the related works
shows any support for syscall execution in guest simulations running in multiple
processes.
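The reason for executing every syscall in one place can be illustrated with a small model of a controller-side executor. This is a sketch under stated assumptions, not the YAPSC code: SyscallRequest mirrors only a few fields of Listing 3.7, SyscallController is a hypothetical class, and an in-memory map stands in for the host OS so that the example stays self-contained.

```cpp
#include <cassert>
#include <map>
#include <string>

// Reduced mirror of the Syscall message (Listing 3.7); names are illustrative.
enum class SyscallType { OPEN, WRITE, CLOSE };

struct SyscallRequest {
    SyscallType type;
    int fd;
    std::string path;
    std::string buffer;
};

// Hypothetical controller-side executor. Every domain sends its requests
// here, so there is a single descriptor table for the whole simulation.
class SyscallController {
public:
    int execute(const SyscallRequest &req) {
        switch (req.type) {
        case SyscallType::OPEN:
            open_files_[next_fd_] = req.path;
            return next_fd_++;
        case SyscallType::WRITE: {
            auto it = open_files_.find(req.fd);
            if (it == open_files_.end()) return -1;  // unknown descriptor
            contents_[it->second] += req.buffer;
            return static_cast<int>(req.buffer.size());
        }
        case SyscallType::CLOSE:
            return open_files_.erase(req.fd) != 0 ? 0 : -1;
        }
        return -1;
    }

    const std::string &contents(const std::string &path) { return contents_[path]; }

private:
    int next_fd_ = 3;  // 0, 1, and 2 are left for the standard streams
    std::map<int, std::string> open_files_;
    std::map<std::string, std::string> contents_;  // stands in for the filesystem
};
```

Because the descriptor table lives only in the controller, a descriptor obtained by one domain remains valid when another domain later writes to it or closes it.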
3.5.5 End of Simulation
In sequential SystemC simulations, any SC_MODULE can request the end of simulation
from its SC_PROCESSes. This request makes SystemC leave its kernel’s simulation loop.
However, executing many instances of SystemC as one global simulation requires managing
the start and the end of simulation. One LP cannot stop the simulation without notifying
the others to finish as well.
YAPSC controls the execution of sc_start to ensure the barrier before the start and
to monitor the end status. When detected, the end of simulation is reported to the
controller together with its reason. The controller domain notifies all domains to
abort their simulations for the same reason. The controller only exits after all LPs
finish their executions.
In the case of the exit syscall in ArchC simulations, the controller domain is notified
early, by the syscall itself, about the end of simulation. This forces all simulations,
including the caller, to finish. No answer is sent in this case.
A disconnection between domains is also a valid reason to end the simulation with an
error. When a back-end detects one, the PW or RW must notify the controller to abort
the simulation.
3.6 YAPSC_FINALIZE
YAPSC_FINALIZE finalizes the library, following the signature used by the MPI
finalization method. This routine cleans up all YAPSC configuration, frees the YAPSC
objects allocated in memory, and closes all remaining connections. Once this routine is
called, no YAPSC routine (not even YAPSC_INIT) can be called.
3.7 YAPSC_PAYLOAD_EXTENSION
SystemC assigns an integer identification to each payload extension declared in the
source code. Each payload accepts one instance of each extension and maps them by this ID.
One example of an extension is the routing information of a NoC, containing the positions
of the origin and the target in the mesh network. An extension may or may not be relevant
to all foreign modules. Moreover, it can also be added to a transitional payload in
another domain.
YAPSC_PAYLOAD_EXTENSION registers the payload extension type using the required methods
to encode it to and decode it from a byte stream. It does not require protobuf: the
designer can use any desired method to convert the custom extension to and from the byte
stream. The decoder function can also receive a reference to an original extension, so
that it updates the changed values instead of creating another object instance.
At the serialization step, YAPSC verifies whether the payload has any of the known
extensions. All known extensions of a payload are propagated, including the extensions
added to temporary payloads outside their native domains.
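A registry of encode and decode functions indexed by the extension ID is one way to picture this mechanism. The sketch below is only an illustration: RouteExtension, ExtensionCodec, extension_registry, and register_route_extension are hypothetical names, the signature of the real YAPSC_PAYLOAD_EXTENSION is not reproduced, and a plain text encoding is chosen here simply because the designer is free to pick any byte format.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical extension: NoC routing information (origin and target position).
struct RouteExtension {
    int origin;
    int target;
};

// Hypothetical codec pair; the byte format is the designer's choice.
struct ExtensionCodec {
    std::function<std::string(const void *)> encode;
    // Decodes into an existing object, updating its values in place.
    std::function<void(const std::string &, void *)> decode;
};

// Hypothetical registry, indexed by the integer ID of the extension.
std::map<int, ExtensionCodec> &extension_registry() {
    static std::map<int, ExtensionCodec> registry;
    return registry;
}

void register_route_extension(int id) {
    ExtensionCodec codec;
    codec.encode = [](const void *ext) -> std::string {
        const RouteExtension *route = static_cast<const RouteExtension *>(ext);
        return std::to_string(route->origin) + "," + std::to_string(route->target);
    };
    codec.decode = [](const std::string &bytes, void *ext) {
        RouteExtension *route = static_cast<RouteExtension *>(ext);
        const std::size_t comma = bytes.find(',');
        route->origin = std::stoi(bytes.substr(0, comma));
        route->target = std::stoi(bytes.substr(comma + 1));
    };
    extension_registry()[id] = codec;
}
```

At serialization time, a wrapper in this model would look up each extension ID present in the payload, call encode for the IDs found in the registry, and hand the bytes to decode on the receiving side.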
3.8 Sequential SystemC
YAPSC supports compiling and executing simulation code without YAPSC, which allows
the simulation behavior to be compared with the sequential execution. Moreover, a
YAPSC simulation always has a fallback execution that requires no library other than
SystemC.
The main header of YAPSC can be copied into the simulation source code, and this header
translates the proposed API to traditional SystemC when the YAPSC installation is not
found. So, any simulation using YAPSC is essentially a SystemC simulation with helper
methods, similar to code annotations, to bind sockets by name.
YAPSC_INIT declares and initializes two hash maps to index initiator and target sockets
by name. All blocks delimited with YAPSC_MODULE are executed, and the calls to
YAPSC_INITIATOR_SOCKET and YAPSC_TARGET_SOCKET only store the socket in the hash
maps. Finally, YAPSC_START binds the sockets in the two hash maps and calls sc_start.
The YAPSC_PAYLOAD_EXTENSION and YAPSC_FINALIZE methods are ignored in the sequential
execution.
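The two hash maps of the sequential fallback can be sketched with stand-in socket types, so that the example compiles without SystemC. TargetSocket, InitiatorSocket, and NameBinder are hypothetical names for this illustration, and the key format "module.socket" is an assumption.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-ins for TLM-2.0 sockets; only the binding relation is modeled.
struct TargetSocket {
    std::string owner;
};

struct InitiatorSocket {
    TargetSocket *bound = nullptr;
    void bind(TargetSocket &target) { bound = &target; }
};

// Hypothetical model of name-based binding in the sequential fallback.
class NameBinder {
public:
    // Role of YAPSC_TARGET_SOCKET in this model: only stores the socket.
    void target(const std::string &module, const std::string &socket, TargetSocket &t) {
        targets_[module + "." + socket] = &t;
    }
    // Role of YAPSC_INITIATOR_SOCKET in this model: only stores the request.
    void initiator(InitiatorSocket &i, const std::string &module,
                   const std::string &socket) {
        initiators_.emplace(module + "." + socket, &i);
    }
    // Role of YAPSC_START in this model: binds by name, returns the unbound count.
    int bind_all() {
        int unbound = 0;
        for (const auto &entry : initiators_) {
            auto found = targets_.find(entry.first);
            if (found == targets_.end()) {
                ++unbound;
                continue;
            }
            entry.second->bind(*found->second);
        }
        return unbound;
    }

private:
    std::unordered_map<std::string, TargetSocket *> targets_;
    std::unordered_multimap<std::string, InitiatorSocket *> initiators_;
};
```

Because binding happens only in bind_all, an initiator may name a target that is registered later, which is the ordering freedom described for the postponed bind phase.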
Chapter 4
Experiments
This chapter describes the experiments conducted on the wrapper modules and the YAPSC
functionalities, presenting their respective results. Section 4.1 evaluates the
implemented wrappers using three different IPCs against the traditional SystemC
communication. Section 4.2 demonstrates the functionality of YAPSC running a processor
and memory model using ArchC.
4.1 Characterization
Communication is the key issue when splitting a simulation into multiple processes. In
the traditional serial simulation, a SystemC module communicates by passing the payload
object through a method call. Including two modules between the initiator and the
target at least triples the latency of communication in each direction: initiator
to first module, first to second module, and second module to target. An even higher
overhead is expected when adding serialization and IPC.
Figure 4.1: Producer-Consumer simulation (Similar to Figure 3.7).
All latency experiments use the same synthetic modules: one consumer module and
two implementations of the producer module. The first producer (Listing 4.2) waits for
the answer to the previous request before sending another payload request. Payloads
received in nb_transport_bw notify the event transaction_done to allow the module to
send another payload. The second producer (Listing 4.3) explores asynchronous
communication and removes transaction_done, so it sends one payload request at each
simulation time without waiting for the answers. The consumer module (Listing 4.1)
verifies the content of the received payloads and changes the payload response status to
answered. The simulation finishes when the producer receives all replies.
SC_MODULE(consumer) {
  consumer(sc_core::sc_module_name module_name, unsigned int recv);
  ~consumer();
  tlm_utils::simple_target_socket<consumer> tsocket;
  tlm::tlm_sync_enum nb_transport_fw(tlm::tlm_generic_payload &p,
                                     tlm::tlm_phase &P,
                                     sc_core::sc_time &d);
  tlm_utils::peq_with_get<tlm::tlm_generic_payload> peq;
  void proc();
};
Listing 4.1: Consumer module
SC_MODULE(producer) {
  producer(sc_core::sc_module_name module_name, unsigned int send);
  ~producer();
  tlm_utils::simple_initiator_socket<producer> isocket;
  tlm::tlm_sync_enum nb_transport_bw(tlm::tlm_generic_payload &p,
                                     tlm::tlm_phase &P,
                                     sc_core::sc_time &d);
  tlm_utils::peq_with_get<tlm::tlm_generic_payload> peq;
  sc_core::sc_event transaction_done;
  void proc();
};
Listing 4.2: Producer module with sc_event
SC_MODULE(producer) {
  producer(sc_core::sc_module_name module_name, unsigned int send);
  ~producer();
  tlm_utils::simple_initiator_socket<producer> isocket;
  tlm::tlm_sync_enum nb_transport_bw(tlm::tlm_generic_payload &p,
                                     tlm::tlm_phase &P,
                                     sc_core::sc_time &d);
  tlm_utils::peq_with_get<tlm::tlm_generic_payload> peq;
  void proc();
};
Listing 4.3: Producer module without sc_event
The latency experiments do not use the YAPSC annotations. So no controller is
created to manage the simulation, which reduces the time spent in YAPSC control routines.
The code of the SystemC (Listing 4.4), UDS (Listing 4.5), MPI (Listing 4.6), and TCP/IP
(Listing 4.7) experiments is very similar, differing only in back-end-specific details.
int sc_main(int argc, char *argv[]) {
  int REPT;
  srand(time(NULL));
  REPT = 1000 + rand() % 1000000;

  consumer *c; producer *p;

  {
    c = new consumer("consumer", REPT);
  } {
    p = new producer("producer", REPT);

    p->isocket.bind(c->tsocket);
  }

  sc_start();
  {
    delete (c);
  } {
    delete (p);
  }

  return 0;
}
Listing 4.4: Latency experiment: SystemC, standard back-end
int sc_main(int argc, char *argv[]) {
  int REPT;
  srand(time(NULL));
  REPT = 1000 + rand() % 1000000;

  char name[32];
  gen_random(name, 31);

  pid_t pid = fork();
  pid = !!pid;

  yapsc::proxy_uds *PW; yapsc::remote_uds *RW;
  consumer *c; producer *p;

  if (pid) {
    c = new consumer("consumer", REPT);

    RW = new yapsc::remote_uds("RW", name);
    RW->isocket.bind(c->tsocket);
  } else {
    p = new producer("producer", REPT);

    PW = new yapsc::proxy_uds("PW", name);
    p->isocket.bind(PW->tsocket);
  }

  sc_start();
  if (pid) {
    delete (c);
    delete (RW);
  } else {
    delete (p);
    delete (PW);
  }

  google::protobuf::ShutdownProtobufLibrary();
  return 0;
}
Listing 4.5: Latency experiment: UDS back-end
int sc_main(int argc, char *argv[]) {
  int REPT;
  srand(time(NULL));
  REPT = 1000 + rand() % 1000000;

  int ret, my_rank, comm_sz, tag;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
  tag = 133;

  yapsc::proxy_mpi *PW; yapsc::remote_mpi *RW;
  consumer *c; producer *p;

  if (!my_rank) {
    c = new consumer("consumer", REPT);

    RW = new yapsc::remote_mpi("RW", MPI_COMM_WORLD, tag);
    RW->isocket.bind(c->tsocket);
  } else {
    p = new producer("producer", REPT);

    PW = new yapsc::proxy_mpi("PW", MPI_COMM_WORLD, tag+1, 1, tag);
    p->isocket.bind(PW->tsocket);
  }

  sc_start();
  if (!my_rank) {
    delete (c);
    delete (RW);
  } else {
    delete (p);
    delete (PW);
  }
  MPI_Finalize();
  google::protobuf::ShutdownProtobufLibrary();
  return 0;
}
Listing 4.6: Latency experiment: MPI back-end
int sc_main(int argc, char *argv[]) {
  int REPT;
  srand(time(NULL));
  REPT = 1000 + rand() % 1000000;

  char name[32];
  gen_random(name, 31);

  pid_t pid = fork();
  pid = !!pid;

  yapsc::proxy_tcp *PW; yapsc::remote_tcp *RW;
  consumer *c; producer *p;

  if (pid) {
    c = new consumer("consumer", REPT);

    RW = new yapsc::remote_tcp("RW");
    RW->isocket.bind(c->tsocket);
    yapsc::tcp_zeroconf_publish_init(name, RW->port_get());
  } else {
    p = new producer("producer", REPT);

    int port = 0; char address[64];
    yapsc::tcp_zeroconf_discover(name, address, &port);
    PW = new yapsc::proxy_tcp("PW", address, port);
    p->isocket.bind(PW->tsocket);
  }

  sc_start();
  if (pid) {
    yapsc::tcp_zeroconf_publish_shutdown();
    delete (c);
    delete (RW);
  } else {
    delete (p);
    delete (PW);
  }

  google::protobuf::ShutdownProtobufLibrary();
  return 0;
}
Listing 4.7: Latency experiment: TCP back-end
The first set of experiments sends one payload and waits for the answer before sending
another one. Sending only one payload at a time is the current behavior of ArchC
processors and is common to many TLM-2.0 models. This set measures the latency of
communication without any buffering in the middle. When supported, the back-ends are
executed both locally and remotely.
Figure 4.2: Latency (seconds) waiting payload answer (back-ends: SystemC, UDS, MPI,
MPI remote, TCP, TCP remote)
Table 4.1: Localhost latency (seconds) waiting payload answer
Type    Instance  Repetitions  Latency (s)  Std Deviation
local   SystemC   141          1.07e−06     1.92e−07
local   UDS       140          1.32e−05     1.10e−06
local   MPI       143          1.23e−05     1.13e−06
remote  MPI       35           1.90e−04     5.17e−06
local   TCP       141          2.24e−05     1.10e−06
remote  TCP       51           1.74e−04     5.14e−06
Table 4.1 shows that payload exchange using the IPC back-ends on localhost is around
11× slower than SystemC. Executing locally, MPI uses an internal SHM implementation
and outperforms UDS, while TCP/IP does not show a loopback optimization similar to that
of Sun’s UltraSPARC in [53].
Communicating between two different machines over gigabit Ethernet, the MPI and TCP/IP
back-ends are around 163× slower than SystemC, around 15× slower than the local IPCs,
and around 8× slower than loopback TCP/IP.
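The slowdown factors quoted above are ratios of mean latencies in Table 4.1. The sketch below only repeats that arithmetic; the constant names and the slowdown helper are illustrative. Which pair of rows the text uses for each factor is an assumption: the values from Table 4.1 give about 11.5× for local MPI, 12.3× for UDS, and 20.9× for local TCP against SystemC, about 162.6× for remote TCP against SystemC, about 15.4× for remote MPI against local MPI, and about 7.8× for remote TCP against loopback TCP.

```cpp
#include <cassert>

// Mean latencies in seconds, copied from Table 4.1.
const double kSystemC = 1.07e-06;
const double kUdsLocal = 1.32e-05;
const double kMpiLocal = 1.23e-05;
const double kMpiRemote = 1.90e-04;
const double kTcpLocal = 2.24e-05;
const double kTcpRemote = 1.74e-04;

// How many times slower `latency` is than `reference`.
double slowdown(double latency, double reference) {
    assert(reference > 0.0);
    return latency / reference;
}
```

For example, slowdown(kTcpRemote, kSystemC) evaluates to roughly 162.6, the value rounded to 163× in the text.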
This experiment agrees with the results of Mello et al. [75], where the best results are
achieved when local communication represents more than 80% of the total processing
time. Communication across domains is slow, and simulations should use cache
structures to minimize the communication with other domains.
The second set of experiments explores asynchronous communication: while the first set
waits for each reply, this one sends all requests and processes the replies when they
arrive. For example, a processor cache module can request cache lines before needing
them.
Figure 4.3: Latency (seconds) without wait payload answer (back-ends: SystemC, UDS, MPI,
MPI remote, TCP, TCP remote)
Table 4.2: Latency (seconds) without wait payload answer
Type    Instance  N    Latency (s)  Std Deviation
local   SystemC   143  1.09e−06     1.00e−07
local   UDS       141  8.16e−06     9.50e−07
local   MPI       169  1.19e−05     2.01e−06
remote  MPI       23   1.54e−05     4.07e−06
local   TCP       141  1.27e−05     8.89e−07
remote  TCP       290  1.05e−05     8.45e−07
Table 4.2 shows the mean latency of a complete communication between the producer
and the consumer. With the asynchronous behavior, SystemC spends the same amount of
time, while all IPC back-ends consistently reduce their communication time.
Asynchronous local communication is around 8× slower than SystemC. The contention
policies of TCP/IP on the asynchronous remote connection reduce the slowdown
compared to SystemC. However, MPI shows a higher overhead with asynchronous messages
than the other back-ends because of its abstraction layer over the raw IPCs.
4.2 Exploration
This section shows a processor-memory simulation written with YAPSC, exploring the
presence or absence of a cache and the YAPSC back-ends. The simulation with guest
software uses the syscall execution of YAPSC.
The processor-memory model (Listing 4.9) was executed using a MIPS ArchC processor
in two versions, with and without cache. In each configuration, the model was
run with two LP domains (Listing 4.8) using each of the YAPSC back-ends. The
reference for each experiment is the simulation with sequential SystemC with YAPSC.
Figure 4.4: YAPSC simulation of processor and memory in 2 domains (similar to
Figure 3.2).
Core 1
MEM 2
Listing 4.8: Partition description file to Figure 4.4
int sc_main(int argc, char *argv[]) {
  char **binary_argv = &(argv[1]);
  YAPSC_INIT(&argc, &argv);
  sc_memory::tlm_memory_nb *mem = NULL;
  YAPSC_MODULE("MEM") {
    mem = new tlm_memory_nb("MEM", 0, MEM_SIZE - 1);
    load_elf(binary_argv[0], 4, false, mem);

    YAPSC_TARGET_SOCKET("MEM", "socket", mem->socket);
  }

  mips *p = NULL;
  YAPSC_MODULE("Core") {
    p = new mips("Core");
    p->ISA.RB[29] = MEM_SIZE;
    p->ac_heap_ptr = ac_heap_ptr;
    p->ac_start_addr = ac_start_addr;
    yapsc_syscall_sbrk(p->ac_heap_ptr);
    p->init();

    YAPSC_INITIATOR_SOCKET(p->MEM.LOCAL_init_socket, "MEM", "socket");
  }

  YAPSC_START();

  YAPSC_MODULE("Core") { delete (p); }
  YAPSC_MODULE("MEM") { delete (mem); }
  YAPSC_FINALIZE();
  return 0;
}
Listing 4.9: YAPSC simulation of processor and memory
The results use the exit and helloworld synthetic benchmarks. exit is a simple main
that only starts and returns success, to measure the start and end of a simulation,
including the exit syscall. helloworld is similar to exit but includes a write syscall
to standard output. The remaining benchmarks are members of MiBench [92] and of
MPSoCBench [93].
Figure 4.5: YAPSC simulation time (seconds) of processor without cache (benchmarks: exit,
helloworld, stringsearch_small, qsort_small; series: MPI, MPI remote, SystemC, TCP,
TCP remote, UDS)
Table 4.3: YAPSC simulation time (seconds) of processor without cache
Benchmark     SystemC   UDS       MPI       MPI remote  TCP       TCP remote
exit          3.54e−04  1.46e−03  9.26e−04  7.41e−03    1.88e−03  6.46e−03
helloworld    2.24e−03  5.17e−02  3.88e−02  5.60e−01    7.08e−02  4.73e−01
stringsearch  1.14e−01  3.16e+00  3.61e+00  4.85e+01    6.01e+00  3.90e+01
qsort         8.46e+00  2.40e+02  2.73e+02  3.66e+03    4.56e+02  3.21e+03
The impact of the YAPSC communication back-ends on processor simulation (Table 4.3)
is compatible with the characterization seen in Section 4.1 for the waiting behavior. As
described earlier, ArchC processors use TLM-2.0 as blocking communication. When the
processor needs a memory value, it requests the value and waits for the answer. Real
processors can request the memory value from the memory hierarchy before using it.
The important aspect of these experiments is the communication cost. Adopting
caches, as real processors do to reduce their communication latency, cuts out redundant
communication between processor and memory modules. The next experiment explores the
same benchmarks executing on processors with embedded instruction and data caches.
The purpose is not to calibrate a cache for the MIPS processor or the benchmarks, only
to explore the impact of reducing the message exchange.
The inclusion of a cache in the processor models reduces the overall communication cost
(Table 4.4), including the TLM-2.0 communication of the SystemC back-end. The ArchC cache
does not use TLM-2.0 internally, so it does not impact the simulation execution the way
an intermediary module between processor and memory would. The simulation with processor
cache using the SystemC back-end spends around half of the simulation time of the
simulation without cache. The YAPSC parallel back-ends also reduce their simulation
time, achieving a speedup of about 17× for stringsearch with the TCP/IP back-end when
executing on different hosts (Tables 4.3 and 4.4).
Figure 4.6: YAPSC simulation time (seconds) of processor with cache (benchmarks: exit,
helloworld, stringsearch_small, qsort_small; series: MPI, MPI remote, SystemC, TCP,
TCP remote, UDS)
Table 4.4: YAPSC simulation time (seconds) of processor with cache
Benchmark     SystemC   UDS       MPI       MPI remote  TCP       TCP remote
exit          2.40e−04  8.98e−04  1.18e−03  1.15e−02    1.24e−03  1.37e−02
helloworld    1.42e−03  3.97e−02  3.95e−02  4.69e−01    8.25e−02  4.07e−01
stringsearch  7.01e−02  7.45e−01  1.38e+00  7.84e+00    1.54e+00  2.28e+00
qsort         4.26e+00  8.87e+01  1.33e+02  1.30e+03    1.71e+02  6.58e+02
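The effect of the cache on the remote TCP/IP back-end can be checked by dividing the TCP remote column of Table 4.3 by the same column of Table 4.4. The sketch below only repeats that arithmetic with illustrative names. With the tabulated values, the ratios are about 0.47 for exit (slower with cache), 1.16 for helloworld, 17.1 for stringsearch, and 4.9 for qsort, so the factor closest to the quoted 17× is the one for stringsearch.

```cpp
#include <cassert>
#include <cstddef>

// TCP remote simulation times in seconds, in the order
// exit, helloworld, stringsearch, qsort.
const double kWithoutCache[4] = {6.46e-03, 4.73e-01, 3.90e+01, 3.21e+03};  // Table 4.3
const double kWithCache[4] = {1.37e-02, 4.07e-01, 2.28e+00, 6.58e+02};     // Table 4.4

// Speedup obtained by enabling the cache for one benchmark (index 0 to 3).
double cache_speedup(std::size_t benchmark) {
    assert(benchmark < 4);
    return kWithoutCache[benchmark] / kWithCache[benchmark];
}
```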
Chapter 5
Conclusions
We offer a general technique to split SystemC simulations into different processes, not
only for simulations of processors of one specific architecture. Communication through
the proposed back-ends is transparent to TLM-2.0 modules, which need no change
in their source for common simulations and only a few changes when the simulation
requires syscall synchronization between modules. Our solution makes it possible
to run the same modules in a distributed environment, locally, or even as the original
SystemC simulation. YAPSC can be extended with more channels and with more parallel
techniques in the future. Moreover, this approach does not require a special SystemC
kernel; it uses the kernel that follows the standard and is widely used by industry
and academia.
We highlight the contribution of payload serialization, which is rarely described in
related works. The proposed work handles payload extensions, which we did not find
in the literature. Syscall execution is an important feature of ArchC that is not
covered by past works.
YAPSC is proposed as:
Parallel Each YAPSC domain executes a set of SystemC modules in parallel, on the same
host or in a distributed environment;
Simple The YAPSC API works like a code annotation that can help the designer better
organize the code;
Modular The client, controller, proxy, and remote components are modular. Each one
can be specialized for new back-ends and can be added to or removed from a simulation
without any code changes;
Validation Friendly YAPSC runs on an unmodified SystemC kernel. Moreover, all of YAPSC
can be disabled at compilation time without modifying the model code.
5.1 Publications
This dissertation extends the work of the same authors, Falcão et al. [17] (Appendix A),
presented in Portuguese, which proposed an SC_MODULE that encapsulates TLM-2.0
communication with IPC over TCP/IP, for a SystemC simulation split into many processes.
This past work requires the designer to write a new sc_main that explicitly creates
and connects the communication wrappers. The only wrapper was the TCP/IP one, which has
both initiator and target sockets, so there is no differentiation between proxy and
remote wrappers. This approach requires dummy modules to bind the unused sockets.
The serialization was not endianness-safe and was implemented without any external
library. Support for payload extensions was already present, but was limited to the
extensions included in MPSoCBench [93].
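To illustrate what endianness-safe serialization means here, the following is a minimal sketch in plain C++, not the YAPSC implementation. It assumes a 64-bit field (such as a payload address) is sent in big-endian byte order; the helper names encode_u64 and decode_u64 are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: encode a 64-bit value (e.g. a payload address) as
// 8 bytes in big-endian order, independent of the host byte order.
std::array<std::uint8_t, 8> encode_u64(std::uint64_t value) {
    std::array<std::uint8_t, 8> bytes{};
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Byte 0 receives the most significant 8 bits.
        bytes[i] = static_cast<std::uint8_t>(value >> (8 * (7 - i)));
    }
    return bytes;
}

// Inverse of encode_u64: rebuild the value from big-endian bytes.
std::uint64_t decode_u64(const std::array<std::uint8_t, 8>& bytes) {
    std::uint64_t value = 0;
    for (std::uint8_t byte : bytes) {
        value = (value << 8) | byte;
    }
    return value;
}
```

Because the bytes are produced with shifts rather than by copying the in-memory representation, a sender and a receiver with different byte orders should decode the same value.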
The payload memory management and the syscall execution used by YAPSC are
reimplementations of those in this previous work.
5.2 Future Work
5.2.1 MPSoCBench
Both works, the previous one and this dissertation, target speeding up the simulation of
MPSoCs on SMP machines and clusters. YAPSC is simple and flexible enough to be adopted by
MPSoCBench as the default generated platform code, allowing parallel simulations without
removing support for the original SystemC.
Using YAPSC, MPSoCBench can run hundreds of processes. An initial investigation in the
previous work shows that a promising approach is to group the routers in one domain and
place each tile in its own domain.
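The grouping suggested above can be sketched as a small function that builds a module-to-domain mapping. This is only an illustration: the module names, the numbering scheme, and the assumption that domain 0 is left to the controller are hypothetical, and the output is not the YAPSC configuration file format.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mapping from module names to domain ids: all routers share
// one domain and every tile gets a domain of its own.
std::vector<std::pair<std::string, std::size_t>>
assign_domains(std::size_t routers, std::size_t tiles) {
    std::vector<std::pair<std::string, std::size_t>> mapping;
    const std::size_t router_domain = 1;  // assumption: domain 0 is the controller
    for (std::size_t r = 0; r < routers; ++r) {
        mapping.emplace_back("router" + std::to_string(r), router_domain);
    }
    for (std::size_t t = 0; t < tiles; ++t) {
        // Tiles are numbered after the shared router domain.
        mapping.emplace_back("tile" + std::to_string(t), router_domain + 1 + t);
    }
    return mapping;
}
```

With 2 routers and 3 tiles, this sketch yields one shared router domain and three tile domains, so the number of processes grows with the number of tiles rather than with the number of routers.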
5.2.2 Timing
YAPSC, in this version, uses only the payload timing information. Although this reduces
the synchronization between domains, many related works [80, 82, 72, 74] use a
look-ahead time to limit how far one LP can advance ahead of the others. The common
solution is to create a barrier for all LPs at each quantum, either preset or dynamic.
Timing synchronization can be added to YAPSC with a simple SC_MODULE that communicates
with the controller. This module should sleep for the desired quantum and call the
barrier synchronization when evaluated by the SystemC kernel. This addition will not
require changes to the YAPSC API.
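The effect of a preset quantum can be sketched without SystemC. The model below is an assumption-laden illustration, not the proposed SC_MODULE: each LP is reduced to a local clock, the LPs are advanced one after another instead of concurrently, and the names LogicalProcess and run_with_quantum are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical model of one logical process (LP): only its local clock.
struct LogicalProcess {
    std::uint64_t local_time = 0;  // local simulated time, arbitrary units
    std::uint64_t step = 1;        // time consumed per activation
};

// Advance all LPs to end_time, never letting any LP pass the end of the
// current quantum. Returns the number of barrier synchronizations.
std::uint64_t run_with_quantum(std::vector<LogicalProcess>& lps,
                               std::uint64_t quantum,
                               std::uint64_t end_time) {
    if (quantum == 0) return 0;  // a zero quantum would never make progress
    std::uint64_t barriers = 0;
    std::uint64_t limit = 0;
    while (limit < end_time) {
        // End of the current quantum, clamped to the end of the simulation.
        limit = std::min(limit + quantum, end_time);
        for (LogicalProcess& lp : lps) {
            // Treat a zero step as 1 so that every LP makes progress.
            const std::uint64_t step = std::max<std::uint64_t>(lp.step, 1);
            while (lp.local_time < limit) {
                lp.local_time = std::min(lp.local_time + step, limit);
            }
        }
        ++barriers;  // every LP reached the limit: one barrier per quantum
    }
    return barriers;
}
```

A smaller quantum keeps the LPs closer together in simulated time but increases the number of barriers, which is the trade-off behind choosing a preset or dynamic quantum.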
5.2.3 Offload of simulations
YAPSC is already distributed and can run in any Linux cloud. However, the automatic
detection using mDNS is limited to local networks. The TCP/IP back-end should support
direct IP and port setup in addition to automatic detection.
YAPSC allows the controller (domain 0) to run in a specific location, such as a
notebook, while all LPs (executing domains) run on servers. The simulation will execute
as a guest of the notebook and work on its filesystem using the syscall emulation.
5.2.4 Diversity of communication channels
The results show that, despite MPI's flexibility and optimizations, TCP/IP and UDS
perform better in specific cases. A mixed solution with TCP/IP and UDS can perform
better than MPI in all scenarios. In that case, MPI should be moved out into a plug-in,
removing the requirement of linking with an MPI implementation.
A plug-in system will help support more back-ends. This can optimize YAPSC for setups
other than those initially planned, including mixed back-ends.
5.2.5 New techniques
YAPSC adopts an external configuration file to allow easy setup of the partitions.
This file is not intended to be written manually for big simulations; in these
scenarios, tools must write the configuration automatically.
A few related works [66, 71, 72, 73] include static or dynamic analysis tools in their
solutions. Past simulations can be used to profile the modules and their communication
behavior, so that a tool can generate the partitions automatically. The feedback should
not be limited to simulations of the same platform; previous analyses of shared modules
should also be reused.
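As one possible reading of profile-driven partitioning, the sketch below merges the module pairs that exchanged the most transactions until the desired number of domains remains. It is a simple greedy heuristic chosen for illustration, not a technique taken from the cited works; it ignores load balance, and the names Edge and partition_by_traffic are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical profile record: modules a and b exchanged this many transactions.
struct Edge {
    std::size_t a;
    std::size_t b;
    unsigned long long transactions;
};

// Union-find lookup with path halving.
std::size_t find_root(std::vector<std::size_t>& parent, std::size_t x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

// Returns one label per module; modules with equal labels share a domain.
// Labels are representative module indices, not consecutive domain ids.
std::vector<std::size_t> partition_by_traffic(std::size_t modules,
                                              std::vector<Edge> edges,
                                              std::size_t domains) {
    std::vector<std::size_t> parent(modules);
    std::iota(parent.begin(), parent.end(), std::size_t{0});
    // Visit the heaviest communication links first.
    std::sort(edges.begin(), edges.end(), [](const Edge& x, const Edge& y) {
        return x.transactions > y.transactions;
    });
    std::size_t groups = modules;
    for (const Edge& e : edges) {
        if (groups <= domains) break;                     // enough merging done
        if (e.a >= modules || e.b >= modules) continue;   // skip invalid records
        const std::size_t ra = find_root(parent, e.a);
        const std::size_t rb = find_root(parent, e.b);
        if (ra != rb) {
            parent[rb] = ra;  // co-locate the two groups in one domain
            --groups;
        }
    }
    std::vector<std::size_t> label(modules);
    for (std::size_t i = 0; i < modules; ++i) {
        label[i] = find_root(parent, i);
    }
    return label;
}
```

A real tool would likely also weigh the computational cost of each module, since co-locating heavily communicating modules reduces IPC traffic but can leave one domain with most of the work.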
5.3 Source Code
The YAPSC project is open source and hosted at http://tiagofalcao.github.io/yapsc.
The source code, unit and integration tests, benchmarks, and examples are at
https://github.com/tiagofalcao/yapsc.
Bibliography
[1] S Rigo, R Azevedo, and L Santos. Electronic System Level Design. Springer, 2011,
p. 146.
[2] Peter Pacheco. An Introduction to Parallel Programming: Errata. Elsevier, 2012.
[3] Mario Trams. “Conservative distributed discrete event simulation with systemc us-
ing explicit lookahead”. In: Digital Force White Paper (2004).
[4] Matthieu Moy. “Parallel programming with SystemC for loosely timed models: A
non-intrusive approach”. In: Design, Automation Test in Europe Conference Exhi-
bition (DATE), 2013 (Mar. 2013), pp. 9–14. issn: 1530-1591. doi: 10.7873/DATE.
2013.017.
[5] “IEEE Std 1666™ Standard SystemC Language Reference Manual”. In: IEEE
Computer Society, Dec. 2005.
[6] Gordon E. Moore. Progress In Digital Integrated Electronics. 1975.
doi: 10.1109/N-SSC.2006.4804410.
[7] Gordon E. Moore. “Lithography and the Future of Moore’s Law, Copyright 1995
IEEE. Reprinted with permission. Proc. SPIE Vol. 2437, pp. 2–17”. In: IEEE Solid-
State Circuits Newsletter 20.3 (Sept. 2006), pp. 37–42. issn: 1098-4232. doi: 10.
1109/N-SSC.2006.4785861.
[8] Gordon E. Moore. “Cramming more components onto integrated circuits”. In:
Solid-State Circuits Newsletter, IEEE 38.8 (1975), pp. 33–35. doi: 10.1109/N-
SSC.2006.4785860.
[9] Andrew Danowitz et al. “CPU DB: Recording Microprocessor History”. In: Queue
10.4 (Apr. 2012), p. 10. issn: 15427730. doi: 10.1145/2181796.2181798.
[10] Mark Bohr. “A 30 Year Retrospective on Dennard’s MOSFET Scaling Paper”. In:
IEEE Solid-State Circuits Newsletter 12.1 (2007), pp. 11–13. doi: 10.1109/N-
SSC.2007.4785534.
[11] Robert H. Dennard et al. “Design of ion-implanted MOSFET’s with very
small physical dimensions”. In: IEEE Journal of Solid-State Circuits 9.5 (Oct. 1974),
pp. 256–268. doi: 10.1109/JSSC.1974.1050511.
[12] John L Hennessy and David A Patterson. “Computer architecture: a quantitative
approach”. In: Elsevier (Sept. 2012), p. 1357. issn: 00262692. doi: 10.1.1.115.
1881.
[13] Gene M. Amdahl. “Validity of the Single Processor Approach to Achieving Large
Scale Computing Capabilities”. In: AFIPS Spring Joint Computer Conference, 1967.
AFIPS ’67 (Spring). Proceedings of the. Vol. 30. New York, New York, USA: ACM
Press, 1967, pp. 483–485. isbn: 1558605398. doi: 10.1145/1465482.1465560.
[14] A.P.Godse and D.A.Godse. Microprocessor. 2009. isbn: 8184317069.
[15] Jason E. Miller et al. “Graphite: A distributed parallel simulator for multicores”.
English. In: HPCA - 16 2010 The Sixteenth International Symposium on High-
Performance Computer Architecture. January. IEEE, Jan. 2010, pp. 1–12. isbn:
978-1-4244-5658-1. doi: 10.1109/HPCA.2010.5416635.
[16] RM Fujimoto. “Parallel discrete event simulation”. In: Proceedings of the 21st con-
ference on Winter simulation 33.10 (Oct. 1989), pp. 19–28. issn: 00010782. doi:
10.1145/76738.76741.
[17] Tiago Falcão, Liana Duenha, and Rodolfo Azevedo. “Como executar sua simulação
em múltiplos processadores sem modificar nem seus módulos nem o SystemC”. In:
Proceedings of the 16th Simpósio de Sistemas Computacionais de Alto Desempenho
- WSCAD 2015. SBC, 2015.
[18] Denis Becker, Matthieu Moy, and Jérôme Cornet. “Parallel Simulation of Loosely
Timed SystemC/TLM Programs: Challenges Raised by an Industrial Case Study”.
In: Electronics 5.2 (2016), p. 22. issn: 2079-9292. doi: 10.3390/electronics5020022.
[19] V Galiano et al. “Distributing SystemC Structures in Parallel Simulations”. In: Pro-
ceedings of SpringSim. SpringSim ’09. San Diego, CA, USA: Society for Computer
Simulation International, 2009, 173:1–173:8.
[20] Chi-Keung Luk et al. “Pin”. In: Proceedings of the 2005 ACM SIGPLAN conference
on Programming language design and implementation - PLDI ’05. Vol. 40. PLDI
’05 6. New York, New York, USA: ACM Press, June 2005, p. 190. isbn: 1595930566.
doi: 10.1145/1065010.1065034.
[21] T.E. Carlson, W. Heirman, and L. Eeckhout. High Performance Computing, Net-
working, Storage and Analysis (SC), 2011 International Conference for. 2011. doi:
10.1145/2063384.2063454.
[22] Daniel Sanchez and Christos Kozyrakis. “ZSim”. In: Proceedings of the 40th Annual
International Symposium on Computer Architecture - ISCA ’13. Vol. 41. ISCA ’13
3. New York, NY, USA: ACM, July 2013, pp. 475–486. isbn: 9781450320795. doi:
10.1145/2485922.2485963.
[23] Yaosheng Fu and David Wentzlaff. “PriME: A parallel and distributed simulator
for thousand-core chips”. In: ISPASS 2014 - IEEE International Symposium on
Performance Analysis of Systems and Software. IEEE, Mar. 2014, pp. 116–125.
isbn: 9781479936052. doi: 10.1109/ISPASS.2014.6844467.
[24] P. Ezudheen et al. “Parallelizing SystemC Kernel for Fast Hardware Simulation
on SMP Machines”. In: 2009 ACM/IEEE/SCS 23rd Workshop on Principles of
Advanced and Distributed Simulation. IEEE, June 2009, pp. 80–87. isbn: 978-0-
7695-3713-9. doi: 10.1109/PADS.2009.25.
[25] Kai Huang et al. “Scalably distributed systemC simulation for embedded applica-
tions”. In: SIES’2008 - 3rd International Symposium on Industrial Embedded Sys-
tems. IEEE, June 2008, pp. 271–274. isbn: 9781424419951. doi: 10.1109/SIES.
2008.4577715.
[26] Liana Duenha and Rodolfo Azevedo. “Profiling High
Level Abstraction Simulators of Multiprocessor Systems”. In: Proceedings of the
Second Workshop on Circuits and Systems Design - WCAS 2012 (Sept. 2012).
[27] Luiz Santos et al. Electronic System Level Design. 2011. doi: 10.1007/978-1-
4020-9940-3.
[28] Sandro Rigo et al. “ArchC: A SystemC-based architecture description language”. In:
Proceedings - Symposium on Computer Architecture and High Performance Com-
puting. IEEE. 2004, pp. 66–73. isbn: 0769522408. doi: 10.1109/SBAC-PAD.2004.8.
[29] Rodolfo Azevedo et al. “The ArchC architecture description language and tools”.
In: International Journal of Parallel Programming 33.5 (Oct. 2005), pp. 453–484.
issn: 08857458. doi: 10.1007/s10766-005-7301-0.
[30] Pablo Viana et al. “Modeling and simulating memory hierarchies in a platform-based
design methodology”. In: Proceedings - Design, Automation and Test in Europe
Conference and Exhibition. Vol. 1. Feb. 2004, pp. 734–735. isbn: 0769520855. doi:
10.1109/DATE.2004.1268953.
[31] E Borin et al. “Fast instruction set customization”. In: Embedded Systems for Real-
Time Multimedia, 2004. ESTImedia 2004. 2nd Workshop on. IEEE, Sept. 2004,
pp. 53–58. isbn: 0-7803-8631-0. doi: 10.1109/ESTMED.2004.1359704.
[32] João Moreira et al. “Using multiple abstraction levels to speedup an MPSoC virtual
platform simulator”. In: Proceedings of the International Workshop on Rapid System
Prototyping. IEEE, May 2011, pp. 99–105. isbn: 9781457706585. doi: 10.1109/RSP.
2011.5929982.
[33] Maxiwell Salvador Garcia, Rodolfo Azevedo, and Sandro Rigo. “Optimizing a re-
targetable compiled simulator to achieve near-native performance”. In: Proceedings
- 11th Symposium on Computing Systems, WSCAD-SCC 2010. IEEE, Oct. 2010,
pp. 33–39. isbn: 9780769542744. doi: 10.1109/WSCAD-SCC.2010.17.
[34] Maxiwell Garcia, Rodolfo Azevedo, and Sandro Rigo. “Optimizing simulation in
multiprocessor platforms using dynamic-compiled simulation”. In: Proceedings - 13th
Symposium on Computing Systems, WSCAD-SSC 2012. IEEE, Oct. 2012, pp. 80–
87. isbn: 9780769548470. doi: 10.1109/WSCAD-SSC.2012.39.
[35] Sandro Rigo et al. “Teaching computer architecture using an architecture descrip-
tion language”. In: Proceedings of the 2004 workshop on Computer architecture ed-
ucation held in conjunction with the 31st International Symposium on Computer
Architecture - WCAE ’04. New York, New York, USA: ACM Press, 2004, 6–es.
isbn: 9781450347334. doi: 10.1145/1275571.1275580.
[36] Rafael Auler, Paulo Cesar Centoducatte, and Edson Borin. “ACCGen: An automatic
ArchC compiler generator”. In: Proceedings - Symposium on Computer Architecture
and High Performance Computing. IEEE, Oct. 2012, pp. 278–285. isbn: 978-0-7695-
4907-1. doi: 10.1109/SBAC-PAD.2012.33.
[37] Alexandro Baldassin, Paulo Cesar Centoducatte, and Sandro Rigo. “Extending the
ArchC language for automatic generation of assemblers”. In: Proceedings - Sym-
posium on Computer Architecture and High Performance Computing. IEEE, 2005,
pp. 60–68. isbn: 076952446X. doi: 10.1109/CAHPC.2005.25.
[38] Alexandro Baldassin et al. “An open-source binary utility generator”. In: ACM
Transactions on Design Automation of Electronic Systems 13.2 (Apr. 2008), pp. 1–
17. issn: 10844309. doi: 10.1145/1344418.1344423.
[39] Charly Bechara, Nicolas Ventroux, and Daniel Etiemble. “Towards a parameteriz-
able cycle-accurate ISS in ArchC”. In: 2010 ACS/IEEE International Conference
on Computer Systems and Applications, AICCSA 2010. IEEE, May 2010, pp. 1–7.
isbn: 9781424477159. doi: 10.1109/AICCSA.2010.5586945.
[40] Bruno Albertini et al. “A computational reflection mechanism to support platform
debugging in SystemC”. In: 2007 5th IEEE/ACM/IFIP International Conference
on Hardware/Software Codesign and System Synthesis (CODES+ISSS). Sept. 2007,
pp. 81–86. isbn: 978-1-5959-3824-4. doi: 10.1145/1289816.1289838.
[41] Cristiano Araujo et al. “Platform designer: An approach for modeling multiprocessor
platforms based on SystemC”. In: Design Automation for Embedded Systems 10.4
(2005), pp. 253–283. issn: 09295585. doi: 10.1007/s10617-006-0654-9.
[42] Guido Araujo, Sandro Rigo, and Rodolfo Azevedo. “Processor Design with ArchC”.
In: Processor Description Languages. Vol. 1. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc, 2008, pp. 275–294. isbn: 9780123742872.
doi: http://dx.doi.org/10.1016/B978-012374287-2.50014-8.
[43] P. Viana et al. “Exploring memory hierarchy with ArchC”. In: Proceedings - Sympo-
sium on Computer Architecture and High Performance Computing. Vol. 2003-Janua.
Nov. 2003, pp. 2–9. isbn: 0769520464. doi: 10.1109/CAHPC.2003.1250315.
[44] Marcus Bartholomeu et al. Emulating Operating System Calls in Retargetable ISA
Simulators. 2003.
[45] Felipe Klein et al. “An efficient framework for high-level power exploration”. In:
Midwest Symposium on Circuits and Systems. IEEE, Aug. 2007, pp. 1046–1049.
isbn: 1424411769. doi: 10.1109/MWSCAS.2007.4488741.
[46] T. Gupta et al. “High level power and energy exploration using ArchC”. In: Proceed-
ings - 22nd International Symposium on Computer Architecture and High Perfor-
mance Computing, SBAC-PAD 2010. Oct. 2010, pp. 25–32. isbn: 9780769542164.
doi: 10.1109/SBAC-PAD.2010.13.
[47] M Guedes et al. “An ArchC approach for automatic energy consumption characteri-
zation of processors”. In: 2012 23rd IEEE International Symposium on Rapid System
Prototyping (RSP). Oct. 2012, pp. 57–63. doi: 10.1109/RSP.2012.6380691.
[48] Marcelo Guedes et al. “An automatic energy consumption characterization of pro-
cessors using ArchC”. In: Journal of Systems Architecture 59.8 (Oct. 2013), pp. 603–
614. issn: 13837621. doi: 10.1016/j.sysarc.2013.05.025.
[49] Liana Duenha et al. “MPSoCBench: A toolset for MPSoC system level evalua-
tion”. In: Proceedings of the International Conference on Embedded Computer Sys-
tems: Architectures, Modeling, and Simulation (SAMOS XIV). INSPEC Number:
14564763. IEEE, July 2014, pp. 164–171. isbn: 978-1-4799-3770-7. doi: 10.1109/
SAMOS.2014.6893208.
[50] Luiz Santos et al. Electronic System Level Design. Springer Science & Business
Media, 2011, pp. 3–10. isbn: 978-1-4020-9940-3. doi: 10.1007/978-1-4020-9940-
3.
[51] Michael Kerrisk. The Linux Programming Interface. No Starch Press, 2010, p. 1552.
isbn: 978-1-59327-220-3.
[52] W Richard Stevens and Stephen A Rago. Advanced Programming in the UNIX
Environment. 3rd. Addison-Wesley Professional, 2013. isbn: 9780321637734.
[53] Kwame Wright and H Kang. “Performance analysis of various mechanisms for inter-
process communication”. In: Operating Systems and Networks Lab, Dept. of . . .
(2007).
[54] Bill Nitzberg and Virginia Lo. “Distributed Shared Memory: A Survey of Issues and
Algorithms”. In: Computer (1991), pp. 42–50.
[55] Yuru Shao et al. “The Misuse of Android Unix Domain Socket and Security Impli-
cations”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and
Communications Security. CCS ’16. New York, NY, USA: ACM, 2016, pp. 80–91.
isbn: 9781450341394. doi: 10.1145/2976749.2978297.
[56] E. Lusk. “MPI in 2002: has it been ten years already?” In: Cluster Computing,
2002. Proceedings. 2002 IEEE International Conference on. IEEE Comput. Soc,
2002, p. 435. isbn: 0-7695-1745-5. doi: 10.1109/CLUSTR.2002.1137776.
[57] Jim Holt et al. “Software Standards for the Multicore Era”. In: IEEE Micro 29.3
(May 2009), pp. 40–51. issn: 0272-1732. doi: 10.1109/MM.2009.48.
[58] David R Cox. “RITSim: distributed systemC simulation”. In: Theses. 2005.
[59] B. Chopard, P. Combes, and J. Zory. “A conservative approach to SystemC paral-
lelization”. In: Lecture Notes in Computer Science (including subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 3994 LNCS. May
2006, pp. 653–660. isbn: 3540343857. doi: 10.1007/11758549_89.
[60] Philippe Combes et al. “Relaxing synchronization in a parallel SystemC kernel”.
In: Proceedings of the 2008 International Symposium on Parallel and Distributed
Processing with Applications, ISPA 2008. IEEE, Dec. 2008, pp. 180–187. isbn:
9780769534718. doi: 10.1109/ISPA.2008.124.
[61] Christoph Schumacher et al. “parSC”. In: Proceedings of the eighth IEEE/ACM/IFIP
international conference on Hardware/software codesign and system synthesis - CODES/ISSS
’10. New York, New York, USA: ACM Press, Oct. 2010, p. 241. isbn: 9781605589053.
doi: 10.1145/1878961.1879005.
[62] Christoph Schumacher et al. “legaSCi: Legacy SystemC model integration into
parallel SystemC simulators”. In: Proceedings - IEEE 27th International Parallel
and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013.
IEEE, May 2013, pp. 2188–2193. isbn: 978-0-7695-4979-8. doi: 10.1109/IPDPSW.
2013.34.
[63] Christoph Roth et al. “A Framework for exploration of parallel SystemC simulation
on the single-chip cloud computer”. In: Proceedings of the 5th International ICST
Conference on Simulation Tools and Techniques (Mar. 2012), pp. 202–207. doi:
10.4108/icst.simutools.2012.247751.
[64] Christoph Roth et al. “A SystemC modeling and simulation methodology for fast
and accurate parallel MPSoC simulation”. In: Proceedings of the 26th Symposium on
Integrated Circuits and Systems Design (SBCCI). IEEE, Sept. 2013, pp. 1–6. isbn:
9781479911325. doi: 10.1109/SBCCI.2013.6644853.
[65] N. Ventroux et al. “Highly-parallel special-purpose multicore architecture for Sys-
temC/TLM simulations”. In: 2014 International Conference on Embedded Computer
Systems: Architectures, Modeling, and Simulation (SAMOS XIV). IEEE, July 2014,
pp. 250–257. isbn: 978-1-4799-3770-7. doi: 10.1109/SAMOS.2014.6893218.
[66] Lior Ainey, Avi Efrati, and Shlomo Weiss. “Parallel cycle-accurate SystemC kernel”.
In: 2014 IEEE 28th Convention of Electrical and Electronics Engineers in Israel,
IEEEI 2014. IEEE, Dec. 2014, pp. 1–5. isbn: 9781479959877. doi: 10.1109/EEEI.
2014.7005735.
[67] Hao Ziyu et al. “A Parallel SystemC Environment: ArchSC”. In: Proceedings of 15th
International Conference on Parallel and Distributed Systems. IEEE, Dec. 2009,
pp. 617–623. isbn: 9780769539003. doi: 10.1109/ICPADS.2009.28.
[68] Mahesh Nanjundappa et al. “SCGPSim: A fast SystemC simulator on GPUs”. In:
Proceedings of the Asia and South Pacific Design Automation Conference, ASP-
DAC. ASPDAC ’10. Piscataway, NJ, USA: IEEE Press, Jan. 2010, pp. 149–154.
isbn: 9781424457656. doi: 10.1109/ASPDAC.2010.5419903.
[69] Mahesh Nanjundappa et al. “Accelerating SystemC simulations using GPUs”. In:
Proceedings - IEEE International High-Level Design Validation and Test Workshop,
HLDVT (Nov. 2012), pp. 132–139. issn: 15526674. doi: 10.1109/HLDVT.2012.
6418255.
[70] Rohit Sinha, Aayush Prakash, and Hiren D. Patel. “Parallel simulation of mixed-
abstraction SystemC models on GPUs and multicore CPUs”. In: 17th Asia and
South Pacific Design Automation Conference. IEEE, Jan. 2012, pp. 455–460. isbn:
9781467307727. doi: 10.1109/ASPDAC.2012.6164991.
[71] Sara Vinco et al. “SAGA: SystemC Acceleration on GPU Architectures”. In: Pro-
ceedings of Design Automation Conference (ASP-DAC 2012). DAC ’12. New York,
NY, USA: ACM, June 2012, pp. 115–120. isbn: 978-1-4503-1199-1. doi: 10.1145/
2228360.2228382.
[72] Weiwei Chen et al. “Out-of-order Parallel Simulation for ESL Design”. English. In:
Proceedings of the Conference on Design, Automation and Test in Europe. DATE
’12. San Jose, CA, USA: EDA Consortium, Mar. 2012, pp. 141–146. isbn: 978-3-
9810801-8-6. doi: 10.1109/DATE.2012.6176447.
[73] Simon Reder et al. “Adaptive algorithm and tool flow for accelerating SystemC
on many-core architectures”. In: Microprocessors and Microsystems 39.8 (2015),
pp. 1063–1075. issn: 01419331. doi: 10.1016/j.micpro.2015.06.001.
[74] E. Viaud, F. Pecheux, and A. Greiner. “An Efficient TLM/T Modeling and Simu-
lation Environment Based on Conservative Parallel Discrete Event Principles”. In:
Proceedings of the Design Automation & Test in Europe Conference. DATE ’06.
3001 Leuven, Belgium, Belgium: European Design and Automation Association,
2006, pp. 1–6. isbn: 3-9810801-1-4. doi: 10.1109/DATE.2006.244003.
[75] Aline Mello et al. “Parallel Simulation of SystemC TLM 2.0 Compliant MPSoC
on SMP Workstations”. In: Proceedings of Design, Automation and Test in Europe
- DATE. IEEE, Mar. 2010, pp. 606–609. isbn: 978-3-9810801-6-2. doi: 10.1109/
DATE.2010.5457136.
[76] Isaac Maia Pessoa et al. “Parallel TLM simulation of MPSoC on SMP workstations:
Influence of communication locality”. In: Proceedings of the International Conference
on Microelectronics, ICM. Icm. IEEE, Dec. 2010, pp. 359–362. isbn: 9781612841519.
doi: 10.1109/ICM.2010.5696160.
[77] Julien Peeters et al. “A systemc TLM framework for distributed simulation of com-
plex systems with unpredictable communication”. In: Conference on Design and
Architectures for Signal and Image Processing, DASIP. Nov. 2011, pp. 12–19. isbn:
9781457706196. doi: 10.1109/DASIP.2011.6136847.
[78] Samuel Jones. “Optimistic parallelisation of systemc”. In: Universite Joseph Fourier:
MoSiG DEMIPS, Tech. Rep (2011).
[79] Christian Sauer, Hans-Martin Bluethgen, and Hans-Peter Loeb. “Distributed, loosely-
synchronized systemC/TLM simulations of many-processor platforms”. In: Proceed-
ings of the 2014 Forum on Specification and Design Languages (FDL). Vol. 978-2-
9530. IEEE, Oct. 2014, pp. 1–8. isbn: 978-2-9530504-9-3. doi: 10.1109/FDL.2014.
7119360.
[80] Jan Henrik Weinstock et al. “Time-decoupled parallel SystemC simulation”. In: De-
sign, Automation & Test in Europe Conference & Exhibition (DATE), 2014 (Mar.
2014), p. 191. issn: 15301591. doi: 10.7873/DATE2014.204.
[81] Jan Henrik Weinstock, Rainer Leupers, and Gerd Ascheid. “Modeling Exclusive
Memory Access for a Time-Decoupled Parallel SystemC Simulator”. In: Proceedings
of the 18th International Workshop on Software and Compilers for Embedded Sys-
tems - SCOPES ’15. New York, New York, USA: ACM Press, June 2015, pp. 129–
132. isbn: 9781450335935. doi: 10.1145/2764967.2771929.
[82] Jan Henrik Weinstock, Rainer Leupers, and Gerd Ascheid. “Parallel SystemC sim-
ulation for ESL design using flexible time decoupling”. In: 2015 International Con-
ference on Embedded Computer Systems: Architectures, Modeling, and Simulation
(SAMOS). IEEE, July 2015, pp. 378–383. isbn: 978-1-4673-7311-1. doi: 10.1109/
SAMOS.2015.7363702.
[83] Jan Henrik Weinstock et al. SystemC-Link : Parallel SystemC Simulation using
Time-Decoupled Segments. 2016.
[84] Nicolas Ventroux and Tanguy Sassolas. “A new parallel SystemC kernel leverag-
ing manycore architectures”. English. In: 2016 Design, Automation Test in Europe
Conference Exhibition (DATE) (Mar. 2016), pp. 487–492.
[85] Giovanni Funchal and Matthieu Moy. “jTLM: an experimentation framework for the
simulation of transaction-level models of systems-on-chip”. In: Design, Automation
& Test in Europe Conference & Exhibition (DATE), 2011. March 2011. Mar. 2011,
pp. 1–4. isbn: 9783981080179. doi: 10.1109/DATE.2011.5763309.
[86] M. Radetzki and R. Salimi Khaligh. “Accuracy-adaptive simulation of transaction
level models”. In: Proceedings -Design, Automation and Test in Europe, DATE. 2008,
pp. 788–791. isbn: 9783981080. doi: 10.1109/DATE.2008.4484912.
[87] R Salimi Khaligh and M Radetzki. “Modeling constructs and kernel for parallel
simulation of accuracy adaptive TLMs”. In: 2010 Design, Automation & Test in
Europe Conference & Exhibition (DATE 2010). 2010, pp. 1183–1188. isbn: 978-3-
9810801-6-2. doi: 10.1109/DATE.2010.5456987.
[88] R.S. Khaligh and Martin Radetzki. A dynamic load balancing method for parallel
simulation of accuracy adaptive TLMs. 2010. doi: 10.1049/ic.2010.0141.
[89] D Lungeanu and C J R Shi. “Distributed event-driven simulation of VHDL-SPICE
mixed-signal circuits”. In: Proceedings of International Conference on Computer
Design - ICCD 2001. Sept. 2001, pp. 302–307.
[90] H. Matsuo et al. “Shaman: A distributed simulator for shared memory multiproces-
sors”. In: Proceedings - IEEE Computer Society’s Annual International Symposium
on Modeling, Analysis, and Simulation of Computer and Telecommunications Sys-
tems, MASCOTS. Vol. 2002-Janua. IEEE Comput. Soc, 2002, pp. 347–355. isbn:
0769518400. doi: 10.1109/MASCOT.2002.1167095.
[91] N. Ventroux et al. “SESAM: An MPSoC Simulation Environment for Dynamic
Application Processing”. English. In: 2010 10th IEEE International Conference on
Computer and Information Technology. IEEE, June 2010, pp. 1880–1886. isbn: 978-
1-4244-7547-6. doi: 10.1109/CIT.2010.322.
[92] M R Guthaus et al. “MiBench: A free, commercially representative embedded bench-
mark suite”. In: Proceedings of IEEE 4th Annual Workshop on Workload Character-
ization, held in conjunction with The 34th Annual IEEE/ACM. Dec. 2001, pp. 3–
14.
[93] Liana Duenha and Rodolfo Azevedo. “MPSoCBench : A Benchmark Suite for Eval-
uating”. In: (2013).
[94] Chris A. Mack. “Fifty years of Moore’s law”. In: IEEE Transactions on Semiconductor
Manufacturing 24.2 (May 2011), pp. 202–207. doi: 10.1109/TSM.2010.2096437.
[95] John Aynsley. OSCI TLM-2.0 language reference manual. JA32. Open SystemC
Initiative. 2009.
[96] I Maia, A Greiner, and F Pecheux. “SystemC SMP: A parallel approach to speed
up Timed TLM simulation”. In: SoC-SIP-System on Chip-System in Package. 2006.
Appendix A
How to run your simulation on
multiple processors without modifying
either your modules or SystemC
• Tiago R. C. Falcão
Instituto de Computação
Universidade Estadual de Campinas (Unicamp)
• Liana Duenha
Faculdade de Computação
Universidade Federal do Mato Grosso do Sul (UFMS)
• Rodolfo Jardim de Azevedo
Instituto de Computação
Universidade Estadual de Campinas (Unicamp)
A.1 Abstract
Simulation is one of the main stages in the validation process in systems design; in this
stage, system architects can verify the correctness, behavior, and performance of the tar-
get system. SystemC is a System-level Description Language (SLDL), a C++ language
extension that supports different abstraction levels. The downside is its sequential simula-
tion model that does not take advantage of the parallel processing capabilities. This paper
proposes a generic technique that allows the simulation of a set of SystemC components
by encapsulating each one in a process, which can be scheduled over cores or distributed
on a cluster. The main advantage of this approach is that it parallelizes SystemC-TLM2
simulators using the original SystemC Kernel and models.
A.2 Summary
Simulation is an important stage in the development of computing systems, through which
the systems under development are tested and have their behavior and performance
evaluated. SystemC is a System-Level Description Language (SLDL), an extension of the
C++ language with support for different abstraction levels. SystemC simulates the whole
system sequentially, without exploiting the potential for parallel processing. This
paper proposes a generic approach that allows each component to be simulated in a
separate process, which can be scheduled on different processors, local or distributed.
The main advantage is parallelizing simulators described in SystemC without the need to
modify the models or SystemC.
A.3 Introduction
More components will continue to be added to a chip, even considering possible revisions
of Moore's Law [8], as happened in the past, when the period for doubling the number of
components on a chip was changed from one to two years. Today, the market works with an
estimate of 18 months [94]. The fact is that the number of components on a chip grows
quickly and will keep growing. In processors, this means more cores and special-purpose
functionality, to the point that this accumulation of functions justifies calling them
Systems on Chip (SoC). This growing complexity requires adaptations in the simulation
process of these systems, so that the behavior of the SoC can be correctly understood in
the early stages of the design, before it is physically produced.
System Level Description Languages (SLDLs) provide a set of libraries, data types,
features, a simulation kernel, and hardware description components for modeling and
simulating a system at a high level of abstraction. SystemC [5] is an SLDL with a
description model based on the C++ language, used for modeling and verifying systems at
different abstraction levels. Timing can be modeled loosely (loosely-timed), with better
performance, or approximately (approximately-timed), closer to real hardware, using
event-based synchronization and time annotation on transactions. Recently, SystemC has
become a popular choice among designers of SoCs and embedded processors [24].
SystemC-TLM2 is a set of features for modeling the communication layer between system
components and is an integral part of the SystemC standard [95]. TLM2 introduces the
concept of temporal decoupling, in which the threads of all modules have local clocks
that can run ahead of the global simulation time, which reduces bottlenecks and provides
better performance in exchange for a loss of timing precision. Communication uses
message passing; the messages, called payloads, encapsulate the request data itself and
extensions that can be specialized by the user.
We can simulate a simple system composed of a processor module and a memory connected
directly through TLM2 (Figure A.1). Each read/write must be encapsulated in a payload
that contains all the request data and sent from the core, through its initiator socket,
to the target socket of the memory component. Asynchronously, the payload is processed
by the memory, which must read or write at the position indicated by the message, and
answered through the same path.
Figure A.1: TLM2 communication between core and memory components
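The processor-to-memory exchange can be sketched in plain C++ without the SystemC library. The types below are simplified stand-ins, not the TLM-2.0 API (which provides tlm::tlm_generic_payload and transport calls such as b_transport): Payload, Memory, and transport are hypothetical names, memory is addressed in 32-bit words, and the call is synchronous for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Command { Read, Write };
enum class Response { Ok, AddressError };

// Stand-in for a TLM2 payload: the whole request travels in one object.
struct Payload {
    Command command = Command::Read;
    std::uint64_t address = 0;                   // word index in this sketch
    std::uint32_t data = 0;                      // value to write, or value read back
    Response response = Response::AddressError;  // filled in by the target
};

// Stand-in for the memory target: serves the payload and sets the response,
// which the initiator reads back from the same object.
class Memory {
public:
    explicit Memory(std::size_t words) : storage_(words, 0) {}

    void transport(Payload& p) {
        if (p.address >= storage_.size()) {
            p.response = Response::AddressError;
            return;
        }
        if (p.command == Command::Write) {
            storage_[p.address] = p.data;
        } else {
            p.data = storage_[p.address];
        }
        p.response = Response::Ok;
    }

private:
    std::vector<std::uint32_t> storage_;
};
```

A processor model would fill in a Payload for each load or store, pass it to transport, and then inspect response and data, which mirrors the request and reply travelling over the same path.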
A limiting factor in accelerating SystemC simulators is that the simulation kernel is
sequential and based on discrete events. The scheduler selects one process at a time,
which runs until it finishes or until it calls the wait function, so that the process
itself suspends its execution and returns to the scheduler queue. In simulations of
systems with several processors, the SystemC kernel takes approximately 50% of the
simulation time, and only 30% goes to the behavior of all processing cores [26]. SystemC
schedules the execution of one module at a time, even if the host machine supports
parallel execution of multiple processes [24]. For example, a virtual platform with 16
processors will be simulated on only one processing core of the real machine, even if it
has multiple processors. Why not use all the cores, or even a cluster of servers?
Current research on improving the performance of simulators written in SystemC focuses
on distributing the simulation across different processors, and the best gains have been
obtained with manual partitioning of the simulation [24, 25, 75, 59]. These techniques
do not modify the standard SystemC scheduler.
This paper proposes a solution that is:
Parallel It tries to use as much as possible of the processing power available on the
computer system where the simulation runs;
Simple A working proof of concept that also points to possible future optimizations
specific to a given modeled system;
Modular A SystemC module that can be added or removed without modifying the existing
modules and that requires minimal configuration to do so;
Validatable It does not modify the SystemC kernel, which is widely used and properly
validated.
A.3.1 Related Work
The TLM model with distributed time (TLM-DT) [96] proposes modifying the SystemC kernel
to reduce simulation time for systems modeled with transaction-level modeling (TLM).
Following the principle of parallel discrete event simulation (PDES), each process has
its own notion of "local time". This approach uses a parallel simulation engine, called
SystemC-SMP, that can exploit the computing power of multiprocessor machines. In this
model, each process is explicitly suspended while it waits for an event. This kind of
mutual exclusion mechanism lets the scheduler run another process. When a suspended
process is notified of the expected event, its local clock is updated and the process
resumes execution.
As an extension of that work, a modeling strategy was presented for SoCs with a massive
number of processors (MPSoCs) based on shared memory, using the TLM-DT approach and the
SystemC-SMP engine [75]. From the point of view of the simulation kernel, the TLM-DT
platform is seen as a set of SC_THREADs that know how to communicate and synchronize
with one another. However, the mapping of these threads must be explicitly controlled by
the system designer through configuration directives. The limitation of this approach is
that every system must be adapted to this modified version of SystemC, which is not a
trivial task.
Chopard et al. [59] allow SystemC processes to be partitioned with minimal changes to
the original SystemC kernel and to the syntax of the modeling language itself. In this
approach, a copy of the scheduler runs on each processing node and simulates a subset of
the modules of the modeled system. Data consistency is guaranteed through the events
exchanged between processes. To ensure synchronization, one node, designated as the
master, receives the time annotation of the next expected event from each node and
updates the global simulation time. The other nodes can then update their local clocks
as soon as they receive the timing information from the master node.
Based on the same concepts, Huang et al. [25] proposed efficiency gains from the
geographic distribution of SystemC models. In this context, the mechanism consists of a
set of networked Linux systems that cooperate to compute the same simulation. The main
contribution of that work is a library for distributed SystemC simulations, called SCD
(SystemC Distribution), which parallelizes simulations without modifying the original
library.
ArchSC [67] is a parallel SystemC environment built on top of a large-scale, parallel
system-level simulation platform called ArchSim, which provides tools to manage
partitioning and message exchange between modules. The system is divided and distributed
over multiple processors; each subsystem is an independent simulator and therefore has
an independent scheduler. That work does not apply to TLM.
Roth et al. [64] present a SystemC-TLM-based methodology to accelerate the simulation of
MPSoCs with a Network on Chip. It combines two advantages, modeling at different
abstraction levels and parallel execution on multi-core machines, and integrates the
parallel discrete event modeling paradigm with the concept of minimalist schedulers. The
reported results show that the approach achieves a significant improvement of two orders
of magnitude over sequential RTL execution, while preserving analyzability and showing a
moderate loss of accuracy.
Peeters et al. [77] present an approach for distributing simulations over the processors
of a cluster with a hybrid synchronization policy, using the Message Passing Interface
(MPI). Two instances must synchronize only when they have some dependency (for example,
shared data). Thus, two components without dependencies do not need to synchronize with
each other. However, global synchronization still takes place at certain points during
the simulation.
Another parallelization methodology uses a mixed abstraction on CPUs and GPUs to achieve
better simulation performance [70]. Given a system in SystemC, part of the models is
adapted to run on CUDA GPUs. Consequently, a new SystemC is implemented. Based on the
same methodology, Vinco et al. [71] present a simulation approach for RTL-SystemC on
GPUs that includes static scheduling techniques to reduce the number of synchronization
events. A dependency graph is built from the input and output signals of each SystemC
process, and the static scheduler is then generated from it.
The most recent works, Sniper [21], Graphite [15], ZSim [22] and PriME [23], achieve
impressive results using Pin [20] as the core of their dynamic binary translation (DBT);
as a result, they work only on Intel architectures, for both the simulated and the host
architecture.
The related works do not solve the problem of simulating multiple components, not only
processors, on a modern infrastructure with multiple distributed processors without
requiring significant changes to the source code. Many of the proposed solutions reduce
simulation time, but many rely on a modified version of SystemC or require the
components of the modeled system to be written explicitly for the approach. Any
modification to the models requires a large set of regression tests to keep new errors
from being introduced by the necessary adaptations. Moreover, modifying SystemC means
breaking away from the standard implementation and produces code that lacks the study
and verification the original has received.
A.4 TLM2 over TCP
The TLM2 standard defines a protocol for transaction-level simulation through
asynchronous message exchange; it natively supports memory buses and can be extended to
support other bus models. Message passing is widely used in modern programs and, when
asynchronous, makes parallelization easier: the requester can send a message and keep
working concurrently until the response arrives.
Our proposal goes further and uses the logical partitioning of the modules in TLM2 to
split the simulation into several processes, thereby giving more processor time to the
behavior of each module. Half of the execution time of a SystemC process is spent in the
library kernel, regardless of the number of simulated components [26]. We want only a
subset of the components of the whole system to compete for the other half, so that each
subset pays the cost of the kernel but, overall, the components get more processor time
for their behaviors. As a proof of concept, we adopted TCP sockets, which provide
communication between processes, distributed or not, with ordering and error correction.
(a) Both sockets (b) Initiator (c) Target
Figure A.2: Connections with the TCP module
The TLM2-over-TCP module connects to the existing modules through its initiator and
target sockets (Figure A.2a), so it can be attached to processors, memories, NoC routers,
and so on. The module decides which TLM2 socket to use according to the state of a
payload. Modules usually have only one kind of TLM2 socket, initiator or target; in
those cases, because of SystemC requirements, a dummy module must be connected to the
unused socket. This is a good opportunity to add checking code to these extra modules,
since no payload of the system should ever pass through them. Along with the TCP module,
we provide initiator (Figure A.2b) and target (Figure A.2c) modules.
Normally, the TLM2 components of a SystemC simulation use the shared memory of the
process to pass the payload by reference only. When the simulation is split into
separate processes, however, the payload object and everything it references (data and
extensions) must be recreated in each new context. When a payload leaves its origin
process for the first time, it receives an extension with a unique marker that records
its original context. Every payload received over TCP is checked to determine whether it
is returning to its original process or is foreign. Foreign payloads are allocated and
constructed in the receiving context to avoid any inconsistency in the data structure;
when they leave this transit process, all related memory is released. Packets that
return to their originating process are identified, and only the original data structure
is updated/overwritten with the received values.
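The distinction between returning and foreign payloads can be sketched as a small lookup
table kept by each process. Everything here is hypothetical and written for
illustration only: the `Payload` struct is a stand-in for the TLM2 generic payload, and
`ContextTable` with its methods is an assumed design, not the module's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for a TLM2 payload (assumption, not the real class).
struct Payload {
    std::vector<std::uint8_t> data;
    bool completed = false;
};

class ContextTable {
public:
    explicit ContextTable(std::uint64_t own_marker) : own_marker_(own_marker) {}

    // A local payload leaves the process for the first time: remember where it lives.
    std::uint64_t register_outgoing(Payload* original) {
        assert(original != nullptr);
        const std::uint64_t id = next_id_++;
        outgoing_[id] = original;
        return id;
    }

    // Every payload received from the network is classified by its origin marker.
    Payload* on_receive(std::uint64_t marker, std::uint64_t id,
                        const std::vector<std::uint8_t>& data, bool completed) {
        if (marker == own_marker_) {
            // Returning payload: overwrite the original object in place.
            auto it = outgoing_.find(id);
            if (it == outgoing_.end()) return nullptr;  // unknown identifier
            Payload* original = it->second;
            outgoing_.erase(it);
            original->data = data;
            original->completed = completed;
            return original;
        }
        // Foreign payload: allocate a fresh object owned by this context.
        std::unique_ptr<Payload> copy(new Payload());
        copy->data = data;
        copy->completed = completed;
        Payload* raw = copy.get();
        foreign_[std::make_pair(marker, id)] = std::move(copy);
        return raw;
    }

    // The foreign payload leaves this transit process: release its memory.
    void release_foreign(std::uint64_t marker, std::uint64_t id) {
        foreign_.erase(std::make_pair(marker, id));
    }

    std::size_t foreign_count() const { return foreign_.size(); }

private:
    std::uint64_t own_marker_;
    std::uint64_t next_id_ = 0;
    std::map<std::uint64_t, Payload*> outgoing_;
    std::map<std::pair<std::uint64_t, std::uint64_t>, std::unique_ptr<Payload>> foreign_;
};
```

In this sketch the pair (marker, identifier) is enough to find either the original
object or the temporary copy, so no pointer ever has to cross a process boundary.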
The payload does not directly contain all the information: the data is referenced by a
pointer into the process memory and every extension is a referenced object (Figure
A.3a). When a payload is sent to another process, all of this data must be transmitted
explicitly so that the structure can be rebuilt correctly in the destination process. In
this implementation we adopted a simple packing scheme (Figure A.3b) that creates a
block containing all the required data from the payload and from the tested extensions.
The data region is the only variable-size field, so it is deliberately placed at the end
of the packet and preceded by its length, for fast and simple decoding. Optional fields,
such as extensions, are added manually to the block to be transmitted.
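A minimal sketch of such a block, assuming fixed-width fields written in host byte order
(so both processes must share the same endianness). The struct `BlockHeader`, the field
widths, the field order and the helpers `pack`/`unpack` are assumptions made for this
example; the text only fixes that the variable-size data comes last, preceded by its
length. Extensions and bounds checking are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-size part of a block; widths and order are assumptions.
struct BlockHeader {
    std::uint64_t origin_marker;    // unique marker of the originating process
    std::uint64_t payload_id;       // identifies the payload inside that process
    std::uint32_t command;          // read or write
    std::uint32_t response_status;  // incomplete, completed, error, ...
    std::uint64_t address;          // simulated address
    std::uint32_t data_length;      // size of the data region that follows
};

// Append the raw bytes of one fixed-width field.
template <typename T>
void put(std::vector<std::uint8_t>& out, const T& value) {
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(&value);
    out.insert(out.end(), p, p + sizeof(T));
}

// Read one fixed-width field and move the cursor past it.
template <typename T>
T get(const std::uint8_t*& cursor) {
    T value;
    std::memcpy(&value, cursor, sizeof(T));
    cursor += sizeof(T);
    return value;
}

std::vector<std::uint8_t> pack(const BlockHeader& h, const std::uint8_t* data) {
    assert(data != nullptr || h.data_length == 0);
    std::vector<std::uint8_t> out;
    put(out, h.origin_marker);
    put(out, h.payload_id);
    put(out, h.command);
    put(out, h.response_status);
    put(out, h.address);
    put(out, h.data_length);  // the length immediately precedes the data
    out.insert(out.end(), data, data + h.data_length);
    return out;
}

// Assumes a complete, well-formed block; a real decoder would validate sizes.
BlockHeader unpack(const std::vector<std::uint8_t>& block,
                   std::vector<std::uint8_t>& data) {
    const std::uint8_t* cursor = block.data();
    BlockHeader h;
    h.origin_marker = get<std::uint64_t>(cursor);
    h.payload_id = get<std::uint64_t>(cursor);
    h.command = get<std::uint32_t>(cursor);
    h.response_status = get<std::uint32_t>(cursor);
    h.address = get<std::uint64_t>(cursor);
    h.data_length = get<std::uint32_t>(cursor);
    data.assign(cursor, cursor + h.data_length);
    return h;
}
```

Because every field before the data has a fixed width, the decoder can read the header
in one pass and knows exactly how many bytes of data to expect.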
(a) Packet (b) Block
Figure A.3: The TCP message
Not all payload attributes are propagated over TCP (Figure A.3b). The command, the
response status, the virtual address and the data length are copied from the original
payload. The original data pointer is stored in the context annotation data, together
with the unique marker of the origin process and the payload identifier, since the
memory address it points to is only meaningful in the original context. In the tests
performed, we added an extension specific to the modeled system to store NoC routing
information; it is placed in the transmitted block before the data.
We configured the TCP sockets to send buffered packets as soon as possible in order to
reduce latency; even so, a block is rarely sent alone and complete in a single TCP
packet. The TCP module is responsible for storing the TCP packets, assembling a complete
block, rebuilding the payload and its extensions and, if necessary, allocating memory
for the data. For testing, we can force exactly one block per TCP communication by
padding a TCP packet with zeros.
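Since TCP delivers a byte stream, the receiver has to buffer partial reads until a whole
block is available. The sketch below reduces the framing to its essential part, a 4-byte
length in host byte order followed by that many data bytes; a real block would carry its
fixed fields before the length. `BlockAssembler` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

class BlockAssembler {
public:
    // Append whatever bytes one TCP read returned, however the stream was split.
    void feed(const std::uint8_t* bytes, std::size_t count) {
        assert(bytes != nullptr || count == 0);
        buffer_.insert(buffer_.end(), bytes, bytes + count);
    }

    // Extract the data of the next complete block; returns false if more bytes are needed.
    bool next(std::vector<std::uint8_t>& data) {
        std::uint32_t length = 0;
        if (buffer_.size() < sizeof(length)) return false;  // length field incomplete
        std::memcpy(&length, buffer_.data(), sizeof(length));
        const std::size_t total = sizeof(length) + length;
        if (buffer_.size() < total) return false;            // data still arriving
        data.assign(buffer_.begin() + sizeof(length), buffer_.begin() + total);
        buffer_.erase(buffer_.begin(), buffer_.begin() + total);
        return true;
    }

private:
    std::vector<std::uint8_t> buffer_;
};
```

A receive loop would call `feed` after every socket read and then call `next`
repeatedly until it returns false, since one read may contain several blocks.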
The integration tests create a 2D mesh network of producers and consumers, and each
configuration is tested without (Listing A.1) and with (Listing A.2) the TCP modules.
Several configurations were tested that try to stress the communication channels: simple
ping-pong exchanges, windows of 3 to 10 messages before a wait (SystemC thread
scheduling), and 100 consecutive messages fired from each node to all the others without
waiting for the responses. The tests run on every build of the TCP module source code,
as described in Table A.1, with systems of up to 200 components.
No global barrier is needed to start the simulation: each process calls the
wait_connection() function before starting its simulation, to wait for the return
connections from its neighbors only. Each module can then start producing its requests
and filling the queues: those of the TLM2 protocol, the internal queue of the TCP
module, and those of the TCP/IP stack implementation.

// Binding South with North
southRouter->N_init_socket.bind(northRouter->S_target_socket);

// Binding North with South
northRouter->S_init_socket.bind(southRouter->N_target_socket);
Listing A.1: Excerpt connecting two routers without the TCP module

// Binding South: the south router talks only to its local TCP module
southTCP = new tlm_tcp("SouthTCP");
southRouter->N_init_socket.bind(southTCP->tsocket);
southTCP->isocket.bind(southRouter->N_target_socket);
southTCP->start_server(6801);           // accept the connection from the north side
southTCP->connect("127.0.0.1", 6802);   // connect to the north side's server

// Binding North: the north router talks only to its local TCP module
northTCP = new tlm_tcp("NorthTCP");
northRouter->S_init_socket.bind(northTCP->tsocket);
northTCP->isocket.bind(northRouter->S_target_socket);
northTCP->start_server(6802);           // accept the connection from the south side
northTCP->connect("127.0.0.1", 6801);   // connect to the south side's server
Listing A.2: Excerpt connecting two routers using the TCP module
A.4.1 Example: System with 1 Processor
Figure A.4: TLM2 communication between a core and a memory with the TCP module
The single-processor example shows how to split the simulation into multiple processes
with the TCP module. Figure A.1 shows a simple example of TLM2 communication between a
core and a memory. To split this simulation into separate processes, the TCP module is
used to create a transparent communication layer between the two processes (Figure A.4).
The processor sends a payload, marked as incomplete, carrying a request from its
initiator socket to the connected target socket. It has no prior knowledge of the type
of the connected module; it only knows that a target socket is connected.
Figure A.5: Example of a 2x2 NoC
The TCP module encapsulates the payload in a block and sends it over a pre-established
TCP connection to another TCP module in a separate process. There, the TCP packet is
read and rebuilt as a payload carrying the processor's request.
The memory receives the payload with the request on its target socket and processes it
exactly as if it had come directly from the processor's initiator socket. The response
follows the reverse path, fully transparent to both modules.
If it wishes, the processor can send a read request with a null pointer. The memory
allocates the data and writes it in its own address space, returning the result as a
payload. The TCP module manages this allocation: it frees the buffer in the memory's
process, transfers it to the original context, and fixes the pointer of the original
object.
As described earlier, the processor and the memory need dummy modules connected to the
unused sockets of the TCP modules (Figure A.2). The processor has only an initiator
socket, which is connected to the target socket of the TCP module, so the initiator
socket of that TCP module must be occupied. The same applies to the target socket of the
memory's TCP module.
A.4.2 Example: NoC - 2x2 Mesh with 2 processors
As an example closer to real experiments, a Network on Chip (NoC) was the original
inspiration for the technique and is easily connected with TCP modules. A 2x2 mesh
(Figure A.5) can be modeled with two processors, a memory, and a hardware mutual
exclusion component. All requests are routed by the router modules, which determine the
destination based on the address (memory or another component) and the direction based
on the payload state (incomplete or complete).
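The two routing decisions can be illustrated with a short sketch. The text does not
specify the routing algorithm or the address map, so both are assumptions here:
dimension-order (XY) routing, tile coordinates with y growing towards South, and a
hypothetical map in which addresses below `kLockBase` belong to the memory tile and the
rest to the hardware lock tile.

```cpp
#include <cassert>
#include <cstdint>

enum class Port { Local, North, South, East, West };

struct Tile {
    int x;
    int y;
};

// Hypothetical address map for a 2x2 mesh (assumption, not from the benchmark).
constexpr std::uint64_t kLockBase = 0x20000000;

// Destination is chosen from the address: memory or the other component.
Tile tile_for_address(std::uint64_t address) {
    return address < kLockBase ? Tile{1, 1} : Tile{0, 1};
}

// Direction is chosen from the payload state: an incomplete payload still heads
// for its target, a completed one heads back to its initiator.
// ASSUMPTION: XY routing, first along x, then along y.
Port route(Tile here, Tile initiator, Tile target, bool completed) {
    assert(here.x >= 0 && here.y >= 0);
    const Tile goal = completed ? initiator : target;
    if (goal.x > here.x) return Port::East;
    if (goal.x < here.x) return Port::West;
    if (goal.y > here.y) return Port::South;
    if (goal.y < here.y) return Port::North;
    return Port::Local;  // deliver to the component attached to this router
}
```

A TCP module sitting on any of the four mesh ports would not change this logic, because
the router only sees TLM2 sockets.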
The TCP module can be used to partition this system; one possible partitioning runs each
tile (a router plus another component) in a different process. Here we adopted
low-complexity routers but, as with any TLM2 connection, we could also place TCP modules
between the routers and the components, increasing the number of processes.
If processor 1 requests a read from a memory position, the payload has its route defined
by the first router (the one attached to processor 1) and keeps being routed by the
other routers (and, transparently, by the TCP modules) until it reaches the router of
its destination, the memory. The response can follow the reverse route or any other
route determined by the router logic. The TCP module does not change the behavior, and
the modules are not aware of whether payloads pass through TCP modules.
A.5 Experimental Evaluation
We evaluated the technique with two categories of experiments: artificial ones and a
benchmark for multiprocessor SoCs. This section describes the tests and the results
obtained.
A.5.1 Producer - Consumer
Our first set of experiments uses artificial modules to test scalability and
requirements. The experiments varied the number of instances of a TLM2 producer-consumer
module, which depends only on the libraries of the SystemC-TLM2 standard, to reach a
large number of distinct processes.
For N modules, we create a square mesh of $\lceil\sqrt{N}\rceil^2$ router modules and
fill the last $\lceil\sqrt{N}\rceil^2 - N$ positions with components that have no logic.
Each producer-consumer module is configured with a specific behavior: how many payloads
it must send in a send window without waiting for a response, and how many it must
receive. All generated payloads use a uniform distribution to choose the address to be
requested. From this address a code is generated, for consistency checking, and stored
in the data field, so that the number and consistency of the payloads can be verified in
any module.
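As a worked example of the mesh sizing, the helper below computes the side of the square
mesh, the number of routers and the number of positions left for components without
logic. `MeshSize` and `mesh_for` are hypothetical names; the integer loop avoids
floating-point rounding when taking the square root.

```cpp
#include <cassert>
#include <cstddef>

struct MeshSize {
    std::size_t side;     // ceil(sqrt(N))
    std::size_t routers;  // side * side
    std::size_t fillers;  // routers - N positions filled with logic-free components
};

MeshSize mesh_for(std::size_t n_modules) {
    std::size_t side = 0;
    while (side * side < n_modules) ++side;  // smallest side with side^2 >= N
    assert(side * side >= n_modules);
    return MeshSize{side, side * side, side * side - n_modules};
}
```

For N = 3 this gives a 2x2 mesh with 4 routers and 1 filler, and for N = 9 a 3x3 mesh
with no fillers.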
The configurations in Table A.1 were tested with and without the TCP module; they reach
200 processes and exchange 120000 payloads.
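The totals in Table A.1 follow directly from the communication pattern, the number of
modules N and the payloads per sender P. The enum and function below are hypothetical
helpers that only restate the formulas from the table header rows.

```cpp
#include <cassert>
#include <cstdint>

enum class Pattern { OneToOne, OneToAll, AllToOne, AllToAll };

// Total payloads exchanged for N modules sending P payloads each, per Table A.1.
std::uint64_t total_payloads(Pattern pattern, std::uint64_t n, std::uint64_t p) {
    assert(n > 0);
    switch (pattern) {
        case Pattern::OneToOne: return p;          // P
        case Pattern::OneToAll:                    // N * P
        case Pattern::AllToOne: return n * p;      // N * P
        case Pattern::AllToAll: return n * n * p;  // N^2 * P
    }
    return 0;  // not reached for valid enumerators
}
```

For instance, the all-to-all run with 200 modules and 3 payloads gives
200 * 200 * 3 = 120000, the figure quoted in the text.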
A.5.2 MPSoCBench
The most realistic tests were performed with MPSoCBench [49]. MPSoCBench offers a
parallel benchmark for SoCs with many processors (up to 64) and provides all the
modules: processors in 4 architectures (ARM, PowerPC, MIPS and SPARC), memory, hardware
mutual exclusion, NoC routers and buses.
Based on ArchC [28, 50], the included processors are generated from description files
and simulate the 4 architectures above at the system level. System-level simulation adds
new difficulties when partitioning into multiple processes: ArchC and the software are
designed to share not only the simulated memory but also real resources of the system
and of a process.
Table A.1: Configurations tested for the Producer-Consumer module

Type          Modules  Routers                   Payloads  Total payloads
One to one    N        $\lceil\sqrt{N}\rceil^2$  P         $P$
0→0           1        1                         100       100
0→1           2        4                         100       100
1→0           2        4                         100       100
0→2           3        4                         100       100
2→0           2        4                         100       100
0→3           4        4                         100       100
3→0           4        4                         100       100
0→0           9        9                         100       100
0→2           9        9                         100       100
0→6           9        9                         100       100
0→8           9        9                         100       100
2→0           9        9                         100       100
6→0           9        9                         100       100
8→0           9        9                         100       100
8→8           9        9                         100       100
One to all    N        $\lceil\sqrt{N}\rceil^2$  P         $N \cdot P$
0⇒            9        9                         100       900
4⇒            9        9                         100       900
8⇒            9        9                         100       900
All to one    N        $\lceil\sqrt{N}\rceil^2$  P         $N \cdot P$
⇒0            9        9                         100       900
⇒4            9        9                         100       900
⇒8            9        9                         100       900
All to all    N        $\lceil\sqrt{N}\rceil^2$  P         $N^2 \cdot P$
⇔             3        4                         100       900
⇔             9        9                         100       8100
⇔             144      144                       10        207360
⇔             200      200                       3         120000

System calls (syscalls) are emulated by ArchC, which, like SystemC, is unaware that the
simulator has been split into separate processes. To keep syscalls consistent when the
emulated software runs in several processes on the real machine, we created a syscall
controller that is external to all simulation processes and independent of SystemC, and
that centralizes the execution of syscalls in a single context. The SBRK pseudo-syscall,
used for dynamic allocation, must be resolved centrally by the controller so that the
processors allocate distinct regions of the simulated memory without conflict. All calls
related to file manipulation use the controller to share the same operating system file
descriptor. This identifier can thus be stored in the simulated memory and used by the
emulated software on any processor.
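The reason SBRK has to be centralized can be illustrated with a single break pointer
that every simulated processor must go through. In the text the controller is a separate
process; in this sketch a mutex-protected object stands in for it, and the class
`SbrkController`, its interface and the heap limits are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical central allocator for the simulated heap.
class SbrkController {
public:
    SbrkController(std::uint64_t heap_start, std::uint64_t heap_end)
        : brk_(heap_start), end_(heap_end) {
        assert(heap_start <= heap_end);
    }

    // Reserve `bytes` of simulated memory; on success, `region_start` receives the
    // first address of a region that no other caller has been given.
    bool sbrk(std::uint64_t bytes, std::uint64_t& region_start) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (bytes > end_ - brk_) return false;  // simulated heap exhausted
        region_start = brk_;
        brk_ += bytes;
        return true;
    }

private:
    std::mutex mutex_;
    std::uint64_t brk_;  // current program break in the simulated address space
    std::uint64_t end_;  // one past the last usable simulated heap address
};
```

If each simulation process kept its own break pointer instead, two processors could be
handed the same simulated addresses, which is the conflict the controller avoids.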
Our technique increases the communication cost, which is consistent with the speed
discrepancy between different buses. As in modern processors, MPSoCBench depends on
models with caches (and the corresponding coherence between them) to achieve better
results by reducing excessive communication costs.
A.6 Conclusions
We presented a generic technique that allows SystemC simulations to be partitioned into
different processes and is not limited to simulations of processors of a particular
architecture. Communication through the proposed TCP module is transparent to the other
TLM2 modules, which need no changes to their source code. This leaves open the
possibility of running simulations of the same modules in distributed environments,
locally, or even in pure SystemC without the proposed module. The TCP module does not
require any special or modified version of the SystemC library; it uses the
implementation that follows the IEEE 1666-2011 standard [5] and is widely used in
industry and academia.
