Ratatoskr: An open-source framework for in-depth power, performance and
  area analysis in 3D NoCs by Joseph, Jan Moritz et al.
Ratatoskr: An open-source framework for in-depth power,
performance and area analysis in 3D NoCs
JAN MORITZ JOSEPH, Otto-von-Guericke-Universität Magdeburg, Germany
LENNART BAMBERG, Universität Bremen, Germany
IMAD HAJJAR, Otto-von-Guericke-Universität Magdeburg, Germany
BEHNAM RAZI PERJIKOLAEI, Universität Bremen, Germany
ANNA DREWES, Otto-von-Guericke-Universität Magdeburg, Germany
ALBERTO GARCÍA-ORTIZ, Universität Bremen, Germany
THILO PIONTECK, Otto-von-Guericke-Universität Magdeburg, Germany
In this paper, we introduce Ratatoskr, an open-source framework for in-depth power, performance and area
(PPA) analysis in Networks-on-Chips (NoCs) for 3D-integrated heterogeneous System-on-Chips (SoCs). It
covers several layers of abstraction by providing a RTL NoC hardware implementation, a cycle-accurate NoC
simulator and an application model on transaction level. By this comprehensive approach, Ratatoskr can
provide the following specific PPA analyses: dynamic power of links can be estimated within 2.4% accuracy of
bit-level simulations while maintaining cycle-accurate simulation speed. Router power is determined from
RTL synthesis combined with cycle-accurate simulations. The performance of the whole NoC can be measured
both via cycle-accurate and RTL simulations. The functionality of routers can be verified using the hardware
model. The overall NoC area is estimated at the RTL using a pre-characterization of units at the gate level.
Despite these manifold features, Ratatoskr offers easy two-step user interaction: First, design parameters are
set via a single point-of-entry; second, PPA reports are generated automatically. For both the input
and the output, different levels of abstraction can be chosen for high-level rapid network analysis or low-level
improvement of architectural details. The proposed NoC models reduce total router power up to 32% and
router area by 3% in comparison to a conventional standard router. As a forward-thinking and unique feature
not found in other NoC PPA-measurement tools, Ratatoskr supports heterogeneous 3D integration that is one
of the most promising integration paradigms for upcoming SoCs. Thereby, Ratatoskr lays the groundwork to
design their communication architectures.
Additional Key Words and Phrases: 3D integrated circuits, Network on chip
Authors’ addresses: Jan Moritz Joseph, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, Magdeburg,
Sachsen-Anhalt, 39106, Germany, moritz.joseph@ovgu.de; Lennart Bamberg, Universität Bremen, Otto-Hahn-Allee 1, Bremen,
Bremen, 28359, Germany, bamberg@item.uni-bremen.de; Imad Hajjar, Otto-von-Guericke-Universität Magdeburg,
Universitätsplatz 2, Magdeburg, Sachsen-Anhalt, 39106, Germany, imad.hajjar@st.ovgu.de; Behnam Razi Perjikolaei,
Universität Bremen, Otto-Hahn-Allee 1, Bremen, Bremen, 28359, Germany, raziperj@uni-bremen.de; Anna Drewes,
Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, Magdeburg, Sachsen-Anhalt, 39106, Germany,
anna.drewes@st.ovgu.de; Alberto García-Ortiz, Universität Bremen, Otto-Hahn-Allee 1, Bremen, Bremen, 28359, Germany,
agarcia@item.uni-bremen.de; Thilo Pionteck, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, Magdeburg,
Sachsen-Anhalt, 39106, Germany, thilo.pionteck@ovgu.de.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2020 Association for Computing Machinery.
XXXX-XXXX/2020/1-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
, Vol. 1, No. 1, Article . Publication date: January 2020.
arXiv:1912.05670v2 [cs.AR] 14 Jan 2020
2 J.M. Joseph et al.
ACM Reference Format:
Jan Moritz Joseph, Lennart Bamberg, Imad Hajjar, Behnam Razi Perjikolaei, Anna Drewes, Alberto García-
Ortiz, and Thilo Pionteck. 2020. Ratatoskr: An open-source framework for in-depth power, performance and
area analysis in 3D NoCs. 1, 1 (January 2020), 22 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Networks-on-Chips (NoCs), in which on-chip routers connect components (e.g. processing elements)
and transmit data using packets, are one of the most promising communication architectures for
state-of-the-art chip designs. NoCs have advantages over conventional approaches such as bus
systems: Since the number of components per chip increases constantly, buses become a bottleneck
as more and more participants compete for arbitration. NoCs tackle this issue and offer better latency
and throughput because of their parallel, decentralized nature. This increases system efficiency
and thus reduces power consumption. Therefore, integration of NoCs into whole systems has become a
vital research topic. NoCs are now found both in academic designs, e.g. in the Eyeriss
CNN accelerator [10] or in the bio-inspired visual attention engine [27], and in industrial designs,
e.g. in the 260-core Sunway SW26010 processor [12] or AMD’s Infinity Fabric
[30].
In addition to the prevalent trend towards higher core counts, the performance of systems can
be increased using novel integration methods. One very promising option is 3D heterogeneous
integration: Vertical 3D interconnects reduce delay and power consumption [11]; 3D integration also
tackles fundamental limits of computation by asymptotically reducing computation time from t
to t^0.75 [31]. Furthermore, the footprint shrinks through stacking because components are divided
among layers. Besides these incremental advances, 3D integration allows for one game-changing
paradigm: Integration of heterogeneous technologies. This is a fundamental advantage that is
efficiently available only with 3D integration.¹ In heterogeneous 3D integrated circuits (ICs), dies
with varying electrical characteristics and technologies (analog, mixed-signal, logic and memory)
are stacked and closely interconnected. The key benefit lies in the possibility to optimize the
silicon technology node for the components of each die. Therefore, heterogeneous 3D integration
boosts performance, energy efficiency and robustness, and it allows building truly innovative
architectures for various applications such as high-performance computing, e.g. performance
enhancements by interleaved stacking of memory and processing dies [33]. This paradigm is
also applied in Intel’s new Lakefield architecture using Foveros 3D technology [1]. To build fast and
efficient communication architectures for these systems, it is essential to design NoCs specifically
targeting heterogeneous 3D-ICs.
¹ 2.5D integration allows for heterogeneous integration as well, but is limited by the rather poor performance of the
interconnects through the interposer layer in comparison to a true 3D approach.
As for all components integrated into a chip, an in-depth analysis of power, performance and area
(PPA) of the NoC is imperative to judge its properties. The PPA figures must take heterogeneous 3D
integration into account, as it affects power (e.g. [4]), area and performance, all of which vary
with technology. Here, we propose Ratatoskr: An open-source framework for PPA analysis
of 3D NoCs. It is the first comprehensive framework to precisely determine PPA for NoCs from
gate level to transaction level, i.e. including a router hardware implementation, a cycle-accurate
router model and a transaction-level application model. Furthermore, it supports heterogeneous
3D integration. Such a comprehensive framework cannot be found in the literature so far. While
individual tools exist, they are neither integrated into one single tool flow nor able to
cope with heterogeneous 3D integration: Simulators such as Noxim [15] and Booksim 2.0 [18]
allow for performance evaluation of general-purpose NoCs, but do not include router models
for heterogeneous 3D integration. Plus, neither simulator offers an application model on
transaction level to inject real-world-based traffic patterns out of the box. There exist models for
power consumption of routers, such as ORION [24], but these do not model heterogeneity and
do not provide a model for dynamic power consumption of links [23, Sec. 2]. Hardware models
of routers enable precise area results, but the existing implementations such as OpenSoCFabric
[14] or OpenSMART [28] are not shipped with a simulator for performance comparison and do not
reflect the special properties of heterogeneous 3D integration. Ratatoskr addresses these
shortcomings by providing the following specific features:
• Power estimation:
– Dynamic power estimation of routers and links on a cycle-accurate level.
– Accuracy of dynamic link energy estimation is within 1% of bit-level accurate simulations.
• Performance models:
– Network performance for millions of clock cycles on a cycle-accurate (CA) level using the
NoC simulator.
– Timing of routers from synthesis on the gate level. This also enables simulation of the NoC
on register-transfer level (RTL).
• Area analysis: NoC area from synthesis on gate-level for any standard cell technology.
• Benchmarks:
– Support for realistic application model on transaction-level.
– Conventional synthetic traffic patterns.
• Heterogeneous 3D Integration:
– Heterogeneity yields non-purely synchronous systems, since the same circuit achieves
different maximum clock speeds in mixed-signal and digital technologies. Therefore, the NoC simulation
and router hardware model implement a pseudo-mesochronous router (cf. Ref. [19]).
– Heterogeneity yields different numbers of routers per layer, since (identical) circuits in
mixed-signal and digital technologies have different areas. Therefore, the NoC simulation allows for any
non-regular network topology via XML configuration files.
– With the same approach, NoCs in active interposers can be modeled, as well.
• Usability: Single point-of-entry to set design parameters. The design parameters allow for
rapid testing of different designs.
• Reporting: Automatic generation of detailed reports from network-scale performance to
buffer utilization.
• Open-source: The source code of the framework is available from https://github.com/jmjos/ratatoskr.
The remainder of this paper is structured as follows: In Section 2 we give a detailed discussion of
related approaches. In Section 3, we introduce the Ratatoskr framework by starting from a user
perspective and highlight the design parameters that can be set. Next, we dig into the technical
details: We explain our models in Section 4 and our implementation of the core components (RTL
NoC model, CA NoC model, link model) in Section 5. After these rather technical sections, we
shift the focus again to the user perspective and show the generated reports, which the framework
provides for PPA analysis in Section 6. In Section 7, we show the results of our framework focusing
on properties that the NoC provides as well as the simulation performance; in Section 8 a discussion
follows. Finally, we conclude the paper in Section 9.
2 RELATED WORK
As already argued, the Ratatoskr framework is the only comprehensive framework for modeling,
simulation and design of 3D NoCs that provides in-depth PPA results. Nonetheless, individual tools
already exist for some of Ratatoskr’s features; we discuss their differentiating features here.
For the cycle-accurate performance models of NoCs, many simulators have been published. An
extensive overview is given in Ref. [9, Table II, p. 3]. Three of them are currently state-of-the-art
and used in many publications:
• Noxim [9] is a NoC simulator implemented in C++ using the class library SystemC. It provides
the capability to model conventional 2D and homogeneous 3D NoCs. Furthermore, optical
links are included. The simulator measures performance (i.e. network, packet and flit latency).
• Booksim 2.0 [17] is a NoC simulator implemented in C++. Similar to Noxim, it also provides
the capability to model conventional 2D and homogeneous 3D NoCs. It measures performance
similar to Noxim.
• Garnet 2.0 [2] targets a different level of abstraction. It is integrated into the Gem5 full-system
simulator and therefore offers very high precision to assess different architectures under real
loads. However, Garnet 2.0 does not allow for a fast evaluation of millions of clock cycles due
to the naturally slow simulation performance of Gem5.
Both Noxim and Booksim 2.0 are of limited use for heterogeneous 3D
integration because the provided cycle-accurate router models do not target its unique
features, specifically non-purely synchronous clocked routers and non-regular topologies. Plus, it
is not possible to model routers with varying parameters per layer within the same NoC without
extensive source code modifications. Therefore, the existing tools cannot be used for performance
analysis in NoCs for heterogeneous 3D systems. Ratatoskr addresses these shortcomings.
Regarding power estimation in NoCs, ORION 3.0 [24] is the state-of-the-art tool for modeling
the power consumption of router components. The results can either be used to characterize single
routers or they can be included in NoC simulators for high-level (i.e. cycle-accurate) power estimation.
While Booksim does not include power-modeling capability, Noxim counts energy-relevant
events during simulation [9, Sec. 5.1]. The figures from ORION can be used as input to compute
the dynamic router power during a simulation run. Since this is currently the best available option
for high-level router power consumption in NoCs, we use the same approach and include Noxim’s
power model. ORION 3.0 also has a basic link energy model (cf. Ref. [24, Eqs. 10, 11]) but it does
not account for pattern-dependent coupling switching effects. Since none of the current NoC
simulators include power models for links, those within Ratatoskr extend the state of the art,
as published in [23]. Please note that the effects of pattern-dependent coupling switching are not
modeled in any power model of NoC routers and this is, to the best of the authors’ knowledge, still
an open research issue. The Ratatoskr framework also allows for power and area measurements
post place and route by using the provided NoC implementation. Although very slow, this is the
most accurate option. Thereby, Ratatoskr covers the full stack of power estimations and can generate
results at different accuracy levels and speeds.
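The event-counting approach to dynamic router power can be sketched in a few lines of Python. This is only an illustration of the principle (energy as the sum of counted events weighted by a per-event energy); the event names and the energy values below are hypothetical and are not taken from Noxim, ORION or Ratatoskr.

```python
from collections import Counter

# Hypothetical per-event energies in picojoules. Real values would come from
# synthesis or a characterization tool, not from this sketch.
ENERGY_PJ = {
    "buffer_write": 1.2,
    "buffer_read": 1.0,
    "crossbar_traversal": 2.5,
    "routing_computation": 0.4,
}


def dynamic_router_energy_pj(events, energy_pj=ENERGY_PJ):
    """Sum the energy over all counted events: E = sum(count[e] * energy[e])."""
    counts = Counter(events)
    return sum(counts[name] * energy_pj[name] for name in counts)


def average_power_mw(energy_pj, cycles, clock_period_ns):
    """Average power is energy divided by time; pJ per ns equals mW."""
    return energy_pj / (cycles * clock_period_ns)


# Ten flits, each causing one event of every kind, observed over 100 cycles.
trace = ["buffer_write", "routing_computation",
         "buffer_read", "crossbar_traversal"] * 10
energy = dynamic_router_energy_pj(trace)
power = average_power_mw(energy, cycles=100, clock_period_ns=1.0)
```

Because only counters are updated during the simulation, such a model adds little overhead to a cycle-accurate run; its accuracy depends entirely on the quality of the per-event energies.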
For area analysis, a hardware implementation of a NoC is required. Among the popular open-source
implementations, Stanford University [3] provides the implementation of the router
architecture from [7]. As another example, the OpenSoCFabric project [14] is quite popular. It allows for
2D mesh and 2D flattened butterfly topologies with wormhole routing and virtual channels. The
implementation is written in Chisel and enables a run-through ASIC flow. While the provided router
is very advanced and rich in features, it does not provide support for 3D integration. OpenSMART
implements a "single-cycle multi-hop asynchronous repeated traversal router" [29, p. 1], which is
state-of-the-art in terms of performance. It allows for non-regular network topologies, but it does
not feature 3D integration. Due to their shortcomings, none of the popular NoC implementations
can be used for our framework.
In general, when building a NoC for heterogeneous 3D integration, routers must be able to provide
two features: First, non-purely synchronous communication must be possible. In Ref. [19], we
discuss this topic in detail and propose a pseudo-mesochronous router architecture that enhances
throughput by 2×, latency by 2.26× and dynamic power consumption by 41% with a small hardware
overhead of 2.01% for synthetic and real-world-based benchmarks using 15nm digital and 45nm
mixed-signal technology nodes compared to a conventional NoC design with homogeneous routers,
i.e. the same routers regardless of the layer’s technology node. Second, reductions of area costs in
the routers in mixed-signal layers are essential. This can be achieved by moving resources from this
layer into faster and more area-efficient layers. We assess this for buffer re-ordering in Ref. [21] and
achieve area reduction of 28% and power savings of 15% at a small 4.6% performance loss for 130nm
and 65nm nodes; for routing algorithms, we show up to 6.5× latency improvements in Ref. [19] for
different technology scenarios. These two main previous works form the foundation on which we
develop the router model within the Ratatoskr framework. The Ratatoskr framework was used to
generate results for many publications, e.g. the aforementioned ones and Ref. [4–6].
[Fig. 1. Tool flow. Design parameters (network properties: topology, VCs, buffers, link dimensions; application properties: Petri net, traffic pattern, colors) feed the RTL NoC (synthesis to gate-level NoC: router power, NoC area, timing), the CA NoC model (simulation: transmission matrices, network performance, NoC power) and the link model (equations: dynamic link energy). The outputs are reports on power, performance and area.]
3 THE RATATOSKR FRAMEWORK
Here, we introduce the Ratatoskr framework from the user perspective before we gradually go into
technical details in the subsequent sections. In this section, we explain the framework’s parts and
their functionality to generate an in-depth PPA analysis and introduce the process to set design
parameters.
3.1 Parts and functionality of the framework
The parts and functionality of Ratatoskr are shown in Fig. 1. Seen from the most abstract perspective,
the tool flow of the framework is tailored towards in-depth PPA generation for the user: Design
parameters are set and the framework generates PPA reports. Using these results, the user can
modify the parameters until design constraints are met.
There are two sets of design parameters, separated by their target: Network properties and
application properties, shown at the top of Fig. 1 in green boxes. The network properties
include the network topology and floorplan, the number of virtual channels (VCs), the buffer
depths, the routing algorithm used and the link dimensions; the application properties define either a
synthetic traffic pattern or an instance of the application model (cf. Sec. 4.1). Setting the parameters
is simple, as only a single point-of-entry file, bin/config.ini, is modified and a Python script,
bin/configure.py, is executed. Together, they set up the whole framework. The detailed options for
parameters are introduced in Sec. 3.2. The Python script bin/plot_network.py opens a GUI and
displays the network with a floorplan.
The actual execution of the framework starts after setting the design parameters using the
aforementioned Python script. In general, there are three parts in Ratatoskr, which reflect the
different levels of abstraction present. The parts are depicted in Fig. 1 in blue boxes; they are:
• RTL NoC: The box on the left-hand side shows the hardware model of the NoC: On RT
level, a NoC is generated using Python scripts for meta-programming of VHDL code from
the network properties. The implemented router is the novel pseudo-mesochronous vertical
high-throughput router as published in our previous work [19]. The properties of this router
are discussed in detail in Sec. 4.2.
The advantage of an RTL NoC over a purely cycle-accurate model is twofold. First, it is synthesizable to
gate level and thus it can be verified with a precise gate-level simulation with back-annotated
delays. Second, the RT level NoC can be synthesized for standard cell libraries. This generates
precise data for NoC power, area and clock speed. Plus, it is possible to get results for
heterogeneous 3D integration with different technology nodes. As a comfortable area-saving
feature, our scripts remove unused parts of the crossbar and the allocator automatically
based on information about the routing algorithm. This allows increasing area efficiency
by up to 30% over conventional routers, as synthesis for a commercial digital technology
demonstrates.
The whole RTL NoC can be simulated using VHDL simulators. We provide processing
elements that inject uniform random traffic as well as a trace-file based traffic generator
for real-world application traffic. While RTL simulation is very precise, it is also very slow
and only a few thousand clock cycles can be simulated realistically. Therefore, we additionally provide a
faster simulation of the NoC on CA level.
• CA NoC Model: The aforementioned NoC simulator on CA level is shown in the blue box
in the middle of Fig. 1. It actually takes both the network and the application parameters
as input. The latter are required to load realistic traffic patterns into the network. The CA
router model copes with the non-purely synchronous transmission of data, which is typical
for heterogeneous 3D systems. This is a novel feature not to be found in the competing
simulators. The simulator can be run using the Python scripts bin/run_simulations.py or
by directly running the executable. Since the simulator is a complex software and the core of
the framework, Sec. 5.1 is dedicated to the technical details.
The simulator generates results for the network performance (e.g. flit latencies) and the
dynamic router power. We use the power model of Noxim, in which power-relevant events
are counted (cf. [15, Sec. 5.1]). As an innovative feature, the simulator stores transmission
matrices. In short, these report the transition probabilities between idle and non-idle states
and the data types transmitted (i.e. modeled by the colors in the application model) of all
links in the network. This feature allows for precise dynamic and pattern-dependent link
energy, as explained in the next paragraph.
• Link model: To generate precise power results for the links, we use our power models for
single (vertical and horizontal) links [4, 6] and integrate them into our framework. This
requires using the aforementioned transmission matrices. It enables precision within 1%
of much slower bit-level simulations and also allows for post-simulation assessment of
different hop-to-hop data codings without a second simulation execution. The power models,
implemented in Python, can be found in the power folder.
After finishing the execution of the framework, PPA results are generated as shown in the
bottom-most part of Fig. 1. Besides the individual results of the framework’s parts, which
have already been described, Ratatoskr also encapsulates the most relevant ones into reports. This
maintains high usability and is in line with our approach to a rapid design space exploration by
iterating design parameters and testing them against constraints.
Table 1. A description of software and hardware configurations using the configuration ini file.
Property                Section    Field                                    Description
Application Properties  Config                                              General parameters such as simulation time and benchmark.
                                   simulationTime                           length of simulation in ns
                                   flitsPerPacket                           maximum length of packets
                                   benchmark                                ["synthetic", "task"] to select application model
                        Task                                                Configuration of task (application) model traffic.
                                   libDir                                   folder path for configuration data
                        Synthetic                                           Configuration of synthetic uniform random traffic.
                                   simDir                                   folder path for temporary data
                                   restarts                                 number of simulations
                                   warmupStart, warmupDuration, warmupRate  length and injection rate in warmup phase
                                   runRateMin, runRateMax, runRateStep      injection rates for the simulation
                                   runStartAfterWarmup, runDuration         timing of main run phase
                                   numCores                                 number of cores used
                        Report                                              Configuration of reports.
                                   bufferReportRouters                      list of routers’ ids to be reported
Network Properties                                                          Hardware configurations like Network and Router properties.
                                   z                                        layer count
                                   x, y                                     list of router counts per layer
                                   routing                                  routing algorithm
                                   clockDelay                               list of clock delays per layer
                                   bufferDepthType                          ["single", "perVC"] single or VC-wise buffer depth
                                   bufferDepth                              value for buffer depth
                                   buffersDepths                            list of buffers depths (one for each VC)
                                   vcCount                                  list of VCs per router port
                                   topologyFile                             file with network topology
                                   flitSize                                 bit-width of flits
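To illustrate what a transmission matrix of a link may look like, the following Python sketch derives one from a per-cycle trace of link states. It assumes that state 0 denotes an idle link and every other state one color (data type) of the application model, and that each matrix entry is the relative frequency of a (previous state, current state) pair. Both the encoding and the normalization are assumptions made for this example; the format stored by the simulator may differ.

```python
def transmission_matrix(states, num_states):
    """Relative frequency of each (previous, current) state pair on one link."""
    if len(states) < 2:
        raise ValueError("need at least two cycles to observe a transition")
    counts = [[0] * num_states for _ in range(num_states)]
    for prev, cur in zip(states, states[1:]):
        counts[prev][cur] += 1
    total = len(states) - 1  # number of observed transitions
    return [[c / total for c in row] for row in counts]


# Hypothetical state encoding: 0 = idle, 1 and 2 = two token colors.
IDLE, RED, PURPLE = 0, 1, 2
link_trace = [IDLE, RED, RED, IDLE, PURPLE, PURPLE, PURPLE, IDLE, RED]
matrix = transmission_matrix(link_trace, num_states=3)
```

A link power model can then weight each entry with the energy of the corresponding transition, which is why different data codings can be assessed after the simulation has finished.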
3.2 Setting design parameters
The design parameters that can be set using the config.ini file are shown in Table 1. As already
introduced, network and application properties are configured separately.
The application configuration has four sections. In Config, the general parameters of the simulation
are given: The simulation time and the maximum length of packets are set for the network;
the selection between synthetic or application model and a directory for application model files are
set for the benchmark. In Task, the folder holding the two files for the application model is given.
These are XML descriptions of the task graph and of the mapping of tasks to processing elements (cf.
Section 4.1). In Synthetic, synthetic traffic patterns are configured: The number of simulation runs,
the number of CPU cores used and the temporary directory are given. Synthetic patterns have multiple phases
and the duration of warm-up and run phases can be set. Furthermore, in Report, the reports can be
configured. A list of routers can be defined that are included in the generation of statistics, which
allows excluding edge cases such as routers at the borders of the network.
The network configuration has no sections. Here, all parameters of NoC and routers are set.
This includes the number of layers in a 3D chip as well as the router count per dimension for
conventional 3D mesh topologies. Note that the dimensions are a list to implement floorplans with
different router counts per layer as found in heterogeneous integration. If topologies other than
mesh are desired, they must be configured via separate Python scripts. Plus, the clock delay is
set by a list, because of varying clock frequencies in heterogeneous 3D systems. Furthermore, the
VC count and buffer depths are configured; this is possible per router or per individual VC. Next, a
file path to the network topology is given. Finally, we set the bit-width of flits.
It is possible to configure the network and the application for the simulator to a further extent by
modifying the intermediate XML files, which the configuration Python scripts generate as input for the
next stage of the framework, especially the simulator. The XML files are described in Section 5.1.
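As a rough illustration of how the per-layer parameters could be expressed and read, the sketch below parses an ini-style text with Python's standard configparser module. The field names follow Table 1, but all values are made up, the list syntax is an assumption, and the section name Hardware for the network fields is introduced here only because configparser requires a section header. The actual bin/config.ini shipped with the framework may be organized differently.

```python
import configparser
import json

# Hypothetical example values; only the field names are taken from Table 1.
EXAMPLE_INI = """
[Config]
simulationTime = 1000
flitsPerPacket = 8
benchmark = synthetic

[Report]
bufferReportRouters = [5, 6, 9, 10]

[Hardware]
z = 2
x = [4, 2]
y = [4, 2]
clockDelay = [1, 2]
bufferDepthType = single
bufferDepth = 4
flitSize = 64
"""


def load_layers(ini_text):
    """Return one dict per layer with its router count and clock delay."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep camelCase field names as written
    parser.read_string(ini_text)
    hw = parser["Hardware"]  # assumed section name, see lead-in
    z = hw.getint("z")
    xs, ys, delays = (json.loads(hw[key]) for key in ("x", "y", "clockDelay"))
    if not (len(xs) == len(ys) == len(delays) == z):
        raise ValueError("per-layer lists must have one entry per layer")
    return [{"routers": x * y, "clockDelay": d}
            for x, y, d in zip(xs, ys, delays)]


layers = load_layers(EXAMPLE_INI)
```

In this example the first layer holds a 4×4 mesh and the second a 2×2 mesh with a slower clock, which mirrors the heterogeneous floorplans that the list-valued parameters are meant to describe.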
4 ARCHITECTURES AND MODELS
4.1 Application model
We start the description of the architecture and models on the top-most abstraction level: The
application model on transaction level. It must be abstract yet accurate and at the same time
account for all properties of typical applications executed on system level. Since we especially target
heterogeneous 3D chips, the relevant application areas must be covered. This includes a broad
spectrum of use cases: Many platforms have been proposed, from high-performance processors
[26] to Vision SoCs for image processing [34] and wireless sensors for IoT (Internet of Things)
[16]. Therefore, the application model must account for (1) the timing of processing, which may
change depending on input data and implementation technology, (2) the dynamic effects of varying
input data (i.e. the statistically expected behavior) and (3) the data types transmitted for precise
power modeling (for pattern-dependent coupling switching activities). We discussed models for
applications in heterogeneous 3D SoCs in [20]. There, we argue in detail that colored statistical
Petri nets with retention time on places are able to model all effects required. The colors are used to
differentiate varying data streams with respect to pattern-dependent coupling switching.
In general, a Petri net is an application model using graphs, in which data flows along edges
are abstracted by tokens transmitted between places (vertices). Our Petri nets model property (1)
by retention times, which delay sending tokens; property (2) is modeled using a statistical net,
in which sending tokens is associated with probabilities; property (3) is modeled by annotating
tokens with colors that reflect the activities along the data flows. An example for such a Petri net is
shown in Fig. 2. There are two places p1 and p2 modeling tasks in the application. The processing
time on these places lies in the given intervals: [4, 7] and [2, 3], respectively. With probability p̂, the
application sends tokens from p1 to itself and with probability 1 − p̂ to p2. The tokens are colored,
here depicted with red squares and purple circles (the shape is used for better accessibility). In the
same figure, we also show an exemplary layer of a NoC with 2×3 mesh topology in green. Each
rectangle represents a tile comprising router, network interface (NI) and processing element (PE).
The Petri net is mapped onto this network topology. The implementation of our application model
and an introduction to its configuration are described in Section 5.1.
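The behavior of the example net can be mimicked with a small Python sketch. It follows a single colored token through the two places, drawing retention times from the intervals [4, 7] and [2, 3] and choosing the successor of p1 with probability p̂. Two simplifications are assumptions of this sketch and not part of the model description: retention times are drawn as uniformly distributed integers, and a token leaving p2 is consumed by a sink called "out".

```python
import random

# Retention-time intervals of the two places, taken from the example.
RETENTION = {"p1": (4, 7), "p2": (2, 3)}


def simulate_token(p_hat, color, rng, max_sends=1000):
    """Follow one colored token; return a list of (time, src, dst, color)."""
    place, time, sends = "p1", 0, []
    while place != "out" and len(sends) < max_sends:
        low, high = RETENTION[place]
        time += rng.randint(low, high)  # retention time delays the send
        if place == "p1":
            # Statistical net: back to p1 with p_hat, else on to p2.
            dst = "p1" if rng.random() < p_hat else "p2"
        else:
            dst = "out"  # assumption: tokens leaving p2 are consumed
        sends.append((time, place, dst, color))
        place = dst
    return sends


events = simulate_token(p_hat=0.3, color="red", rng=random.Random(1))
```

Each returned tuple corresponds to one transaction that would be injected into the NoC between the tiles to which the source and destination places are mapped; the color identifies the data type for the link power model.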
[Fig. 2. Example for mapping of application model to NoC. A Petri net with places p1 : [4, 7] and p2 : [2, 3] and edge probabilities p(·) = p̂, p(·) = 1 − p̂ and p(·) = 1 is mapped to an exemplary 2×3 NoC.]
4.2 Router hardware architecture
We use a lightweight router architecture both on RT and CA level that suits the needs of heterogeneous
3D interconnects. By configuration interfaces, we also enable rapid prototyping. The
technical details of the router are the following: We use wormhole packet switching to reduce
buffer depths, which are rather expensive especially in mixed-signal nodes. Flow control is realized
with credit counters. The number of ports is flexible in this design; therefore, it can be used in networks
with different topologies. This is highly relevant for applications in heterogeneous 3D SoCs since
irregular topologies are to be expected. Both simulator and router support varying flit widths. The
router architecture is shown in Fig. 3. It is split into three major parts. In the input ports, the flits are
stored in buffers. The model implements VCs, and the number of VCs and the buffer depth per VC
can be set individually per router and per port. Thereby, smaller routers, for instance in mixed-signal
layers, can be realized, e.g. by smaller buffer depths or fewer VCs. The central control unit uses the
architecture proposed by [8], building the most lightweight router possible, which allows using the
router in expensive technology nodes as well (such as mixed-signal layers). Requests from the input
ports are processed and acknowledgments are generated if a path is free. The central control
unit has a modular design and consists of routing calculation, VC allocation, port allocation and
switch allocation units. The VC arbiter prioritizes VCs with lower numbers, i.e. the first free VC with
the lowest number is chosen. The allocation is done in an input-first manner; thus, one VC per input
port requests an output port per clock cycle. The next VC may send a request after the previous
VC received an acknowledgment, following round robin. The output port also assigns acknowledgments
using round robin. The switch allocation does not perform maximum matching due to its large
costs. Rather, a separable input-first allocation is used for the switch allocation. It offers lower costs,
but the saturation rate is higher; this saturation issue is actually negligible for communication
networks with only few VCs [8]. Routing decisions are made in the routing computation unit. The
crossbar connects input ports and output ports. It has a MUX-based architecture to minimize the
control signals. We do not connect inputs and outputs between which the routing
algorithm never forwards data (cf. implementation details in Sec. 5.2). Thereby, the size of the crossbar may be reduced
by a significant portion, as explained in the results in Sec. 7.2. To summarize, our router supports
rapid testing of different designs by simply setting the following design parameters, as
shown in Fig. 3:
a) the input port count,
b) the VC count per port,
c) the buffer depth per port and VC and
d) the turns forbidden by the routing algorithm for automatic area-reduction of the crossbar
and the allocator.
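The separable input-first allocation described above can be illustrated with a minimal Python sketch. It is not the implementation shipped with Ratatoskr; the function names, the request dictionary and the pointer arrays are assumptions made for illustration only. Each input port first forwards one VC request (round robin), then each output port acknowledges one input port (round robin).

```python
def round_robin_pick(candidates, start, modulus):
    """Return the candidate that comes first in cyclic order from `start`."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c - start) % modulus)


def separable_input_first(requests, num_ports, num_vcs, in_ptr, out_ptr):
    """One allocation cycle (hypothetical helper, not part of Ratatoskr).

    requests: dict mapping (input_port, vc) -> requested output port.
    in_ptr / out_ptr: round-robin pointers per input / output port,
    updated in place. Returns (input_port, vc, output_port) tuples.
    """
    # Stage 1 (input side): each input port forwards at most one VC request.
    forwarded = {}
    for port in range(num_ports):
        vcs = [vc for (p, vc) in requests if p == port]
        vc = round_robin_pick(vcs, in_ptr[port], num_vcs)
        if vc is not None:
            forwarded[port] = (vc, requests[(port, vc)])

    # Stage 2 (output side): each output port acknowledges at most one input.
    grants = []
    for out in range(num_ports):
        inputs = [p for p, (_, o) in forwarded.items() if o == out]
        winner = round_robin_pick(inputs, out_ptr[out], num_ports)
        if winner is not None:
            vc = forwarded[winner][0]
            grants.append((winner, vc, out))
            in_ptr[winner] = (vc + 1) % num_vcs      # next VC gets its turn
            out_ptr[out] = (winner + 1) % num_ports  # next input gets its turn
    return grants


# Hypothetical example: 4 ports, 2 VCs; inputs 0 and 1 compete for output 2.
in_ptr, out_ptr = [0] * 4, [0] * 4
requests = {(0, 0): 2, (0, 1): 3, (1, 0): 2}
first = separable_input_first(requests, 4, 2, in_ptr, out_ptr)
for port, vc, _ in first:
    del requests[(port, vc)]
second = separable_input_first(requests, 4, 2, in_ptr, out_ptr)
```

In the first cycle of this example, only input 0 is acknowledged and output 3 stays unused although VC 1 of input 0 requests it; this illustrates why a separable allocator does not achieve maximum matching. Both remaining requests are served in the second cycle.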
4.3 Power model
The power models comprise the dynamic energy of the router and of the links in the network. We
take both into consideration, since related work showed that links cause between 17 % in the Teraflop
router [24] and up to 53.9 % in the NOSTRUM 8×8-NoC [32, Table 1, p. 4] of the overall energy
consumption of the network. Extensive work on the power consumption of routers has already
been conducted: ORION 3.0 [25] provides the most detailed models for energy consumption of the
Fig. 3. Router schematic. [Figure: input units for input ports 0 to k, each with VCs 0 to m_k and one buffer of depth n per VC; a central control unit with routing computation, VC allocation and switch allocation; a central crossbar; output ports 0 to k. Annotations: a) configured input port count, b) configured VC count per port, c) configured buffer depth per port and VC, d) small crossbar by removing unused turns.]
individual router's components. The framework abstracts the power consumption for different technology nodes. ORION 3.0 is not directly linked to a NoC simulator. The Noxim simulator integrates power models into NoC simulations, as introduced in [9, Sec. 5.1, pp. 14-15]. In essence, Noxim counts power-relevant events in the router, such as buffer writes, buffer reads and routing calculations. The energy values for each event are gathered from other sources (e.g. bit-level simulations or models such as ORION). Because of this extensive related work, we use the same power models as Noxim within our simulator; the implementation in C++ is briefly introduced in Section 5.3. We do not include a further description of the models here and kindly refer to [9, Sec. 5.1, pp. 14-15].
There has been some research on the power consumption of links. While ORION includes a basic power model, as introduced in Eqs. 10 and 11 in Ref. [25], the effects of pattern dependency are not accounted for. However, we showed in [6] that this can lead to a modeling error of up to 79.77%. One option to mitigate this error would be bit-level accurate link simulations; this, however, is too slow for a simulation of a complete NoC. Another option is to characterize data flows by their typical switching activities, modeled using colors, such that flows with similar coupling and switching properties are annotated with the same color in the application model. This can be used to estimate switching activities precisely. Therefore, we introduce the concept of data-flow matrices M for each link in NoC simulations. The entries of a data-flow matrix M depend on the NoC hardware, i.e. VC count, arbitration, buffer depth and topology, and on the application. For the latter, we use the colored Petri-net application model (cf. Section 4.1), in which each transmission between two tasks in an application is annotated with a color σi from the set of all colors Σ = {σ1, . . . , σn}. A data-flow matrix therefore denotes the transitions between different colors on that link and also denotes whether a link was idle or used. For n colors, each data-flow matrix has the size $M \in [0, 1]^{(2n+3) \times (2n+3)}$. It has a row and a column for each color, both as active (data of that color have been sent) and as idle (the last time the link was active, data of that color were sent), for head flits (active/idle) and for an initial state until data are sent via this link for the first time. In Ref. [20], we introduced a toy example using a simulation of a router without VCs, which is passed by two data streams with different colors and different injection rates of 0.5625 flits/cycle and 0.1875 flits/cycle. This results in a transmission
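matrix such as:
\[
\begin{array}{l|ccccccc}
 & \text{idle} & (\text{head}, \text{data}) & (\text{head}, \text{idle}) & (\sigma_1, \text{data}) & (\sigma_1, \text{idle}) & (\sigma_2, \text{data}) & (\sigma_2, \text{idle}) \\ \hline
\text{idle} & 0.010 & 0.000 & - & - & - & - & - \\
(\text{head}, \text{data}) & - & - & 0.019 & 0.042 & - & 0.014 & - \\
(\text{head}, \text{idle}) & - & - & 0.007 & 0.013 & - & 0.005 & - \\
(\sigma_1, \text{data}) & - & 0.041 & - & 0.326 & 0.126 & - & - \\
(\sigma_1, \text{idle}) & - & 0.013 & - & 0.113 & 0.039 & - & - \\
(\sigma_2, \text{data}) & - & 0.014 & - & - & - & 0.114 & 0.045 \\
(\sigma_2, \text{idle}) & - & 0.006 & - & - & - & 0.040 & 0.014
\end{array}
\]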
To highlight the expressiveness of the matrix: in this example, 32.6% of all observed transitions are a data flit of color σ1 followed by another data flit of the same color (the matrix is read row-wise, i.e. the row denotes the previous state and the column the following one). Based on this information, the power consumption of the network can be calculated: Different colors represent data with different coupling and switching activities. Thus, the power consumption for packets with the same color is similar. This abstraction allows for precise and fast power estimation. At this point, we do not go deeper into the details of the power model, such as the physical formulae, as they have been published in Ref. [6]. However, the implementation of the power model in Python is discussed in Sec. 5.3. The performance and accuracy of our approach are discussed in Sec. 7.3.
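To make the structure of a data-flow matrix concrete, the following Python sketch counts state transitions on a single link and normalizes them over all observed transitions. It is an illustration under assumptions: the per-cycle trace format, the state encoding and the function names are hypothetical and not taken from the Ratatoskr code base.

```python
def link_states(colors):
    """All 2n+3 link states: initial idle state, head flits and n colors."""
    states = ["idle", ("head", "data"), ("head", "idle")]
    for color in colors:
        states += [(color, "data"), (color, "idle")]
    return states


def data_flow_matrix(trace, colors):
    """Build a normalized data-flow matrix from a per-cycle link trace.

    trace: one entry per clock cycle; None if nothing is sent, "head" for
    a head flit, or the color of the data flit that is sent.
    """
    states = link_states(colors)
    index = {state: i for i, state in enumerate(states)}
    counts = [[0] * len(states) for _ in states]
    previous = "idle"   # initial state until data are sent for the first time
    last_sent = None    # what was sent when the link was active last
    for observed in trace:
        if observed is None:
            current = "idle" if last_sent is None else (last_sent, "idle")
        else:
            last_sent = observed
            current = (observed, "data")
        counts[index[previous]][index[current]] += 1  # row: previous, column: current
        previous = current
    total = sum(sum(row) for row in counts)
    matrix = [[count / total for count in row] for row in counts]
    return states, matrix


# Hypothetical trace with two colors "s1" and "s2" over eight clock cycles.
trace = [None, "head", "s1", "s1", None, "head", "s2", None]
states, matrix = data_flow_matrix(trace, ["s1", "s2"])
s1_data = states.index(("s1", "data"))
share_s1_to_s1 = matrix[s1_data][s1_data]  # share of s1-data -> s1-data transitions
```

For two colors, the sketch yields a 7×7 matrix whose entries sum to one, which mirrors the layout of the example matrix above.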
5 IMPLEMENTATION
5.1 NoC simulator using C++ and SystemC
The NoC simulator inside the Ratatoskr framework is implemented in C++14 using the SystemC 2.3.3 class library, which provides a simulation kernel. All design parameters can be set without modifications to the code and therefore without recompiling. The core only needs to change if novel functionality is required; this is easy due to our clear class hierarchy. We start by explaining our software architecture. Next, we go into implementation details of interesting components.
5.1.1 Software architecture. A component diagram of the main parts is depicted in Fig. 4. The dotted arrows represent the dependency between two components (the component at the arrow's source depends on the component at the arrow's end). We did not draw all the components of the system for the sake of brevity. One can see that the Noc class reads the configuration of the simulation from the config.xml file and the configuration of the NoC from network.xml, both located in a configuration folder. The Noc class then hosts all components of the network: First, the RouterVC class implements the router model with VCs. Second, the NetworkInterface class interconnects PEs and routers. Third, the ProcessingElement class implements the PEs by providing a platform to simulate the application model and hosting tasks. All three components are connected by SystemC signals for data, flow control, etc. We abstract these in the SignalContainer class for easy use and maintenance.
The simulator inside Ratatoskr provides two router models, a small standard router based on [8] and a vertical high-throughput link/router [19], as well as four routing algorithms. Furthermore, it is possible to extend the code by adding new classes (e.g. novel router models or other routing algorithms) that implement the abstract classes provided by our solution (see next paragraphs and Fig. 5). To do so, one only has to write a C++ implementation, register it in the constructor of the Router base class and compile.
5.1.2 Implementation details. In this section, we discuss the implementation details and the benefits of this design. Fig. 5 shows the general software architecture of the Ratatoskr framework. The top-level class of the simulator is Noc. All NoC components inherit from the NetworkParticipant class, and Noc contains a vector with all NetworkParticipants.
Fig. 4. Component diagram of the main components. [Figure: the components Noc, RouterVC, NetworkInterface, ProcessingElement and SignalContainer, connected by «use» dependencies, and the Configurations folder containing network.xml and config.xml.]
Fig. 5. Class diagram that shows the modular design of our NoC simulator. [Figure: NetworkParticipant declares initialize() and bind(conn: Connection*, src: SignalContainer*, dst: SignalContainer*); Noc holds networkParticipants: vector<NetworkParticipants*>; NetworkInterface, ProcessingElement and Router are extended by NetworkInterfaceVC, ProcessingElementVC and RouterVC; BaseRouting declares route(src_node_id: int, dst_node_id: int) : int and is extended by XYZRouting, RandomXYZRouting, HeteroXYZRouting and RandomHeteroXYZRouting.]
Each NetworkParticipant provides two virtual functions: initialize, which initializes components, and bind, which binds the signals of connections between components. This directly follows the SystemC modeling style. The NoC simulator provides three types of network participants: Router, ProcessingElement and NetworkInterface. Each of these classes is realized through a concrete class implementing the virtual functions of its base class (namely initialize and bind). Furthermore, the behavioral model of the components is implemented in the concrete classes with the suffix VC. We use modern C++ (C++14) extensively within these models; this does not only enable reader-friendly code but also a more abstract modeling. For instance, within the router there are no further SystemC sub-modules; rather, the communication within the router between VC allocation, arbitration and switch, as introduced in Section 4.2, is realized via data structures from the standard library. For instance, requests and acknowledgments are std::maps that constitute the router architecture in data. The advantage lies in the option to iterate over these data structures.
This flat hierarchy is advantageous both for better software maintenance and higher simulation
performance, as SystemC modules and communication via ports result in context switches during
simulation. The implementation of routing functions is also shown in Fig. 5, on the bottom-right
part. There is a base class BaseRouting, which has a virtual function route; it takes the current
node’s address and the destination’s address as input and calculates an output port. We provide a
set of deterministic, low-overhead routing algorithms for heterogeneous 3D SoCs as published in
[19]. To summarize, we used inheritance and polymorphism programming paradigms to achieve
a highly maintainable and flexible architecture. The user can freely add implementations of new
participants or routing algorithms without breaking or recompiling the code base.
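The extension mechanism via a routing base class can be sketched as follows. The simulator itself is written in C++; this Python sketch only mirrors the idea. The class names BaseRouting and XYZRouting follow Fig. 5, whereas the coordinate-based signature (the original route function takes node ids), the port numbering, the direction conventions (east = +x, north = +y, up = +z) and the registry dictionary are assumptions for illustration.

```python
from abc import ABC, abstractmethod

# Assumed port numbering, following the port order used in the VC usage reports.
LOCAL, EAST, WEST, NORTH, SOUTH, UP, DOWN = range(7)


class BaseRouting(ABC):
    """Abstract routing function: current position and destination -> output port."""

    @abstractmethod
    def route(self, current, destination):
        ...


class XYZRouting(BaseRouting):
    """Dimension-ordered routing: resolve x first, then y, then z."""

    def route(self, current, destination):
        (cx, cy, cz), (dx, dy, dz) = current, destination
        if dx != cx:
            return EAST if dx > cx else WEST
        if dy != cy:
            return NORTH if dy > cy else SOUTH
        if dz != cz:
            return UP if dz > cz else DOWN
        return LOCAL


# Hypothetical registry keyed by the value of the <routing> tag in the XML file.
ROUTING_ALGORITHMS = {"XYZ": XYZRouting}


def walk(routing, source, destination):
    """Follow the routing decisions hop by hop and return the visited ports."""
    step = {EAST: (1, 0, 0), WEST: (-1, 0, 0), NORTH: (0, 1, 0),
            SOUTH: (0, -1, 0), UP: (0, 0, 1), DOWN: (0, 0, -1)}
    position, ports = source, []
    while True:
        port = routing.route(position, destination)
        ports.append(port)
        if port == LOCAL:
            return ports
        position = tuple(p + s for p, s in zip(position, step[port]))


ports = walk(ROUTING_ALGORITHMS["XYZ"](), (0, 0, 0), (2, 1, 1))
```

A new routing algorithm would be added by deriving another class from BaseRouting and registering it, without touching the existing classes.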
5.1.3 Detailed configuration using XML files. Using XML files allows for a very fine-grained configuration of the NoC under test. We provide an example of the hardware description in Fig. 6. The file starts with the definition of routers and PEs using the nodeTypes tag. Nodes are identified using unique ids. First, we show the definition of a router with VCs using dimension-ordered routing and 1 GHz clock frequency. Second, a PE is defined with VCs and 1 GHz clock frequency. Next, the nodes are instantiated in the nodes tag. Each node has a position in 3D space, is associated with one of the nodeTypes and has a unique id. Finally, connections between nodes are defined. Each connection has two ports. The end-points of the ports are connected using the unique node ids; furthermore, the VC count is defined and the buffer depth is set (either for all VCs or VC-wise). Via the connection's data structure, any topology for a heterogeneous 3D SoC can be defined. Since the XML files tend to be rather long, we provide Python classes for their functional definition.
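How such a description can be generated programmatically is sketched below using Python's standard xml.etree.ElementTree module. This is not the set of Python classes shipped with Ratatoskr; the function names, the root element name and the integer grid positions are assumptions. Only tag names that appear in the example of Fig. 6 are used, and the idType tag is omitted because its meaning is not described in the text.

```python
import xml.etree.ElementTree as ET


def add_value(parent, tag, value):
    """Append <tag value="..."/>, the pattern used throughout Fig. 6."""
    return ET.SubElement(parent, tag, value=str(value))


def build_mesh(cols, rows, vc_count=3, buffer_depth=16):
    """Describe a single-layer cols x rows mesh of routers (hypothetical helper)."""
    root = ET.Element("network")  # assumed root element name
    nodes = ET.SubElement(root, "nodes")
    connections = ET.SubElement(root, "connections")

    def node_id(x, y):
        return y * cols + x

    for y in range(rows):
        for x in range(cols):
            node = ET.SubElement(nodes, "node", id=str(node_id(x, y)))
            add_value(node, "xPos", x)
            add_value(node, "yPos", y)
            add_value(node, "zPos", 0)
            add_value(node, "nodeType", 0)  # node type 0: router

    con_id = 0
    for y in range(rows):
        for x in range(cols):
            # Connect each router to its right and upper neighbor, if present.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < cols and ny < rows:
                    con = ET.SubElement(connections, "con", id=str(con_id))
                    ports = ET.SubElement(con, "ports")
                    ends = (node_id(x, y), node_id(nx, ny))
                    for port_id, end in enumerate(ends):
                        port = ET.SubElement(ports, "port", id=str(port_id))
                        add_value(port, "node", end)
                        add_value(port, "bufferDepth", buffer_depth)
                        add_value(port, "vcCount", vc_count)
                    con_id += 1
    return root


mesh = build_mesh(2, 2)
xml_text = ET.tostring(mesh, encoding="unicode")
```

Irregular topologies follow the same pattern: only the loop that emits the con elements changes.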
We also provide an example of the application description using colored statistical Petri nets with retention time on places in the code shown in Fig. 7. One exemplary task, which implements places, is defined there. Tasks have unique ids. Incoming and outgoing data connections are configured with the requires and generates tags. Application data are generated after all required data from the requires tag are available. To model the stochastic property, data can be sent to different destinations; these are grouped into possibilities, each of which is selected with a given probability. The timing of tasks and the retention time on places are modeled using the tags start, duration, repeat, interval and delay. A task is only executed between its start time and the end of its duration. A task also stops execution if it has been repeated as many times as given by the tag repeat. For each repetition, one data send possibility is taken. If tokens are available, a task sends data to the selected destination. It sends data count times. There is a delay in between sending data, in which the task is idle; this implements the retention time. The colored Petri net is realized by defining data types, as shown in the lower part of the code in Fig. 7. We do not provide an exemplary mapping file because of its simple structure, which maps task ids to PE node ids.
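The stochastic selection of send possibilities can be illustrated with a short Python sketch. It deliberately ignores all timing tags (start, duration, interval, delay) and the availability of tokens; the function names and the data layout are assumptions and do not reflect the simulator's internal data structures.

```python
import random


def pick_possibility(possibilities, rng):
    """Select one send possibility according to the given probabilities.

    possibilities: list of (probability, destinations) pairs, where
    destinations is a list of (destination_task_id, count) pairs.
    """
    weights = [probability for probability, _ in possibilities]
    return rng.choices(possibilities, weights=weights, k=1)[0][1]


def run_task(possibilities, repeat, rng):
    """Return the destination task id of every data item the task sends."""
    sent = []
    for _ in range(repeat):                  # one possibility per repetition
        for task_id, count in pick_possibility(possibilities, rng):
            sent.extend([task_id] * count)   # data are sent `count` times
    return sent


rng = random.Random(42)
# Values of the task in Fig. 7: repeat = 2 and one possibility with
# probability 1 that sends count = 3 data items to task 3.
fig7_sends = run_task([(1.0, [(3, 3)])], repeat=2, rng=rng)
# Hypothetical stochastic variant: task 3 with probability 0.25, else task 4.
stochastic_sends = run_task([(0.25, [(3, 1)]), (0.75, [(4, 1)])], repeat=100, rng=rng)
```

With the values from Fig. 7, the task sends six data items to task 3 in total.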
5.2 NoC router using VHDL
The implementation of the router in VHDL is straightforward, following the modular architecture introduced in Sec. 4.2. It can be configured for the aforementioned design parameters by using the Python script hardware/vhdl_writer.py. It takes the config.ini as input and writes the VHDL sources accordingly. The script also generates a directory as input for synthesis, e.g. with Synopsys Design Compiler. There are four network options:
(1) A NoC using a conventional router as introduced in Sec. 4.2.
(2) A NoC using the high-throughput router as introduced in [19].
(3) A NoC with PEs injecting uniform random traffic using the conventional router.
(4) A NoC with PEs injecting uniform random traffic using the high-throughput router.
The two latter options can be used for verification as well as VHDL simulation. The NoC can also be synthesized for FPGA-based prototyping [13].
In addition to the configuration options, we also target low area overhead. Therefore, we remove
turns from the crossbar that are impossible within the routing algorithm to reduce its size. We use a
data structure, in which the possible and impossible turns are stored. Based hereupon, the crossbar
either connects the input to the output port or both to ground. In the latter case, the synthesis
tool optimizes the links and, automatically, the size of the crossbar is reduced without further user
interaction.
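The idea of storing possible and impossible turns can be sketched in Python for conventional XYZ routing. The table layout, the port names and the rule set are assumptions for illustration (the actual data structure lives in the VHDL generator): a flit may keep its direction, turn into a higher dimension or leave at the local port, but it never turns back into a lower dimension and never makes a U-turn.

```python
PORTS = ["local", "east", "west", "north", "south", "up", "down"]
# Dimension in which a flit entering or leaving through a port travels.
DIMENSION = {"east": 0, "west": 0, "north": 1, "south": 1, "up": 2, "down": 2}


def turn_allowed(input_port, output_port):
    """True if XYZ dimension-ordered routing can use this crossbar connection."""
    if input_port == output_port:
        return False  # no U-turns and no local-to-local connection
    if "local" in (input_port, output_port):
        return True   # injection and ejection are always possible
    # Same dimension means going straight; otherwise only x -> y -> z turns.
    return DIMENSION[output_port] >= DIMENSION[input_port]


# Connection table: True = connect input to output, False = tie both to ground.
crossbar = {(i, o): turn_allowed(i, o) for i in PORTS for o in PORTS}
connected = sum(crossbar.values())
total = len(crossbar)
```

Under these assumptions, 30 of the 49 input-output pairs of a seven-port router remain connected. The measured area savings in Sec. 7.2 stem from synthesis and cannot be derived from this count alone.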
We also provide a VHDL traffic generator and receiver based on trace files generated by Python
scripts that are connected to our simulator (or any other high-level tool to generate traffic patterns).
<nodeTypes>
  <nodeType id="0">
    <model value="RouterVC"/>
    <routing value="XYZ"/>
    <clockDelay value="1"/>
  </nodeType>
  <nodeType id="1">
    <model value="ProcessingElementVC"/>
    <clockDelay value="1"/>
  </nodeType>
</nodeTypes>
<nodes>
  <node id="0">
    <xPos value="0"/>
    <yPos value="0"/>
    <zPos value="0"/>
    <nodeType value="0"/>
    <idType value="0"/>
  </node>
  ...
</nodes>
<connections>
  <con id="0">
    <ports>
      <port id="0">
        <node value="0"/>
        <bufferDepth value="16"/>
        <vcCount value="3"/>
      </port>
      <port id="1">
        <node value="8"/>
        <buffersDepths value="10, 20, 30"/>
        <vcCount value="3"/>
      </port>
    </ports>
  </con>
  ...
</connections>
Fig. 6. Example for configuration of a NoC.
<task id="1">
  <start min="0" max="0"/>
  <duration min="100" max="100"/>
  <repeat min="2" max="2"/>
  <requires>
    <requirement id="0">
      <type value="1"/>
      <source value="0"/>
      <count min="1" max="1"/>
    </requirement>
    ...
  </requires>
  <generates>
    <possibility id="0">
      <probability value="1"/>
      <destinations>
        <destination id="0">
          <delay min="0" max="50"/>
          <interval min="10" max="10"/>
          <count min="3" max="3"/>
          <type value="1"/>
          <task value="3"/>
        </destination>
        ...
      </destinations>
    </possibility>
    ...
  </generates>
</task>
<data>
  <dataTypes>
    <dataType id="0">
      <name value="image"/>
    </dataType>
    ...
  </dataTypes>
</data>
Fig. 7. Example for a configuration of an application.
A so-called "Data Generate Unit" then injects the traffic based on the patterns' information on the packet length and the injection time. Furthermore, we ship a "Data Converter Unit". It converts the data received from the network to an end-user-friendly data format. Specifically, it allows for: (1) back-conversion of received data to the original data type for further processing (like noise analysis); (2) comparison of the received data (from hardware simulation) and the generated data (from high-level simulation) for verification and error analysis; (3) reporting of system statistics just as generated from the high-level model. This hardware-level functionality gives the user an extended framework for verification, prototyping and emulation.
5.3 Power models using Python
The dynamic router energy is implemented just as in Noxim. Five events are tracked: a buffer write, a buffer read, popping the head element from a buffer, a routing calculation and a crossbar traversal. The occurrences of these events are counted using a power class, which is a singleton.
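The event-counting scheme can be sketched in a few lines of Python. The class and member names are hypothetical, and the energy values per event are placeholders chosen for illustration only; in practice they would come from sources such as bit-level simulations or ORION.

```python
from collections import Counter
from enum import Enum


class Event(Enum):
    BUFFER_WRITE = "buffer write"
    BUFFER_READ = "buffer read"
    BUFFER_POP = "pop head element"
    ROUTING = "routing calculation"
    CROSSBAR = "crossbar traversal"


class PowerCounter:
    """Singleton that counts power-relevant events (hypothetical sketch)."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.counts = Counter()
        return cls._instance

    def record(self, event, times=1):
        self.counts[event] += times

    def energy(self, energy_per_event):
        """Total dynamic energy: sum over events of count * energy per event."""
        return sum(energy_per_event[e] * n for e, n in self.counts.items())


# Placeholder energies in pJ per event (NOT taken from the paper).
ENERGY_PJ = {Event.BUFFER_WRITE: 1.0, Event.BUFFER_READ: 0.5,
             Event.BUFFER_POP: 0.25, Event.ROUTING: 0.125, Event.CROSSBAR: 2.0}

# A packet with one head flit and three body flits passing one router:
# every flit is written, read, popped and traverses the crossbar,
# while the route is calculated once for the head flit.
counter = PowerCounter()
for event in (Event.BUFFER_WRITE, Event.BUFFER_READ,
              Event.BUFFER_POP, Event.CROSSBAR):
    counter.record(event, times=4)
counter.record(Event.ROUTING)
total_energy_pj = counter.energy(ENERGY_PJ)
```

Because the class is a singleton, every router model can record its events through the same instance without passing a reference around.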
The dynamic link energy is calculated from the aforementioned data-flow matrices, which are recorded during simulation and then fed into a Python implementation of the power model along with the correct parameters for link width and size, the switching activities represented by colors in the application model and the used technology nodes. The data-flow matrices could be generated during data sending in the routers. Although this would slightly increase simulation performance, we implemented a separate Link class. This allows for better maintainability and readability of the code. The links are modeled cycle-accurately and increment the corresponding entry in the data-flow matrices in each clock cycle. Links are modeled unidirectionally because of the non-purely synchronous
from interconnect import Interconnect, Driver, DataStream, DataStreamProb

# define 3D link
phit_width = 16  # transmitted bits per link (incl. flow-control, ECC, etc.)
wire_spacing, wire_width = 0.6e-6, 0.3e-6
TSV_pitch, TSV_radius, TSV_length = 8e-6, 2e-6, 50e-6  # length is constant
metal_layer = 5
ground_ring = False  # structure used to reduce TSV noise (affects power)
KOZ = 2 * TSV_pitch
driver_40nm_d4 = Driver.predefined('comm_45nm', 4)  # comm. 40nm driver / or define own via Driver()

# set parameters of link
interconnect_3D_dig = Interconnect(B=phit_width, wire_spacing=wire_spacing, wire_width=wire_width,
                                   metal_layer=metal_layer, Driver=driver_40nm_d4, TSVs=True,
                                   TSV_radius=TSV_radius, TSV_pitch=TSV_pitch, TSV_length=TSV_length,
                                   KOZ=KOZ, ground_ring=ground_ring)

# get maximum length based on target clock frequency
f_clk_dig = 1e9
l_max_3D_dig = interconnect_3D_dig.max_metal_wire_length(f_clk_dig)

# get area of link
TSV_area = interconnect_3D_dig.area_3D

# define three data streams with different properties
# (ex_samp: data samples, assumed to be provided by the user)
ds1 = DataStream(ex_samp, B=16, is_signed=False)  # 16b data stream from specific samples
ds2 = DataStream.from_stoch(N=1000, B=16, uniform=1, ro=0.4)  # random dist., ro := correlation (1000 samples)
ds3 = DataStream.from_stoch(N=1000, B=16, uniform=0, ro=0.95, mu=2, log2_std=8)  # gaussian

# calculate energy based on the data-flow matrix from the simulator
# (data_flow_matrix: assumed to be loaded beforehand from the csv file written by the simulator)
E_mean = interconnect_3D_dig.E([ds1, ds2, ds3], data_flow_matrix)
Fig. 8. Exemplary usage of the link power model.
interaction between routers. When a simulation stops, the data-flow matrices are written into csv files, which are then read by the aforementioned Python model. The Python scripts are given in the folder power. An interconnect package implements four classes: The class Driver implements the model for the driver of a link, and the class Interconnect implements the physical models for the link itself. The classes DataStream and DataStreamProb implement the properties of data streams (i.e. colors in the application graph). An exemplary usage of the link power model is shown in Fig. 8. First, the properties of a 3D link are defined. Next, the parameters are passed to the link model. As an additional feature, the link model does not only calculate power but can also be used for other physical properties of the links. For instance, one can obtain the maximum length (in mm) based on the target clock frequency (and vice versa); the link area including keep-out zones is also provided for any link design. Then, three data streams are defined. The first is based on samples and the second and third on expected behavior. Finally, the energy is calculated using these data streams and a data-flow matrix as given from the simulator.
6 AVAILABLE PPA REPORTS
Our proposed tool provides the PPA measurements for the NoC in the whole chip. In short, the
following is provided:
Performance
• Mean and median flit delay
• Mean and median packet delay
• Mean and median network delay
Power
• Dynamic router power (Noxim)
• Dynamic link energy
Area
• Router area for a given technology node
• Link area
After simulation, the Ratatoskr framework generates several reports about the aforementioned measurements. First, there is a textual report (report.txt); it contains the most general and basic information, namely the average flit, packet and network latencies. Furthermore, the clock count and delay per layer and all (normalized) data-flow matrices per link are included. Second, report_Links.csv contains flattened link data-flow matrices without normalization for further automated processing.
Third, report_Routers_Power.csv contains the dynamic power of each router. Fourth, the usage of virtual channels is reported. The VCUsage folder contains the VC usage of each router, with csv files named after router IDs. The rows of each file denote the ports of a router (in order: local, east, west, north, south, up and down) and the columns represent the count of VCs used, i.e. in the first column, all VCs are empty and in the last column, all VCs are filled with at least one flit. Fifth and finally, the usage of buffers is reported as well. The BuffUsage folder contains the buffer usage of the routers as csv files, named after router ID and direction. The number of rows of each file is equal to the buffer depth and the columns denote the VC numbers. Thereby, a usage count for each buffer element is given.
When executing the fully automated tool flow using the Python script run_simulation.py, Ratatoskr collects all necessary pieces of information from the previous files and generates a pdf file with a visual summary. It includes three types of plots: the network performance, i.e. the latency over injection rate; the average buffer usage per layer and direction as 3D histograms; and the average VC usage per layer as a histogram. Exemplary plots are given in Fig. 9. In Fig. 9a, the network performance is given for different injection rates for an exemplary network configuration, namely a 4×4×4 NoC with three heterogeneous technologies (cf. footnote 2). Our reports include the standard deviation as well as the mean of the relevant latencies. In Fig. 9b, the VC usage is given for the lower layer (therefore, the downwards direction is never used). One can see that higher VC numbers are used less, which is in line with the router model. Also, the pressure on west and east is higher, which is a consequence of XYZ routing and round-robin VC arbitration. Finally, in Fig. 9c, a histogram is given, which reports for one exemplary direction in a layer the number of times a certain buffer usage and VC usage occurred. These three reports available from Ratatoskr give in-depth insight into the network dynamics and allow for router parameter optimization even on a micro-architectural level, e.g. for single buffer elements.
To summarize, the automatically generated reports provided by our framework give in-depth insight into the NoC's PPA. In particular, the buffer usage and VC usage reports extend the related work, as this feature is new. The reporting is tailored towards heterogeneous 3D integration, because we average the reports per layer. Thereby, routers can be optimized in each layer, and thus per technology node, to meet both the technological and the application's requirements.
7 RESULTS
7.1 Simulation performance
The simulation time for different injection rates is shown in Fig. 10. We simulate a 4×4 NoC with 32 flits per packet, 4-flit deep buffers, 4 VCs and dimension-ordered routing (XY routing) with Booksim 2.0 (depicted as green rectangles), Noxim (orange circles) and our simulator inside Ratatoskr (blue crosses). Uniform random traffic is injected into the network. The injection rate is increased from 0.015 flits/cycle to 0.08 flits/cycle in steps of 0.005 flits/cycle. We run 10 simulations using Ubuntu 18.04 on a single core of an Intel i7-6700 at 4 GHz. Fig. 10 reports the median and the standard deviation of the run time. We did not run any other programs alongside (besides system services) to reduce side effects. Neither Noxim nor Ratatoskr changes its performance with larger injection rates, while Booksim 2.0 does. For low injection rates, Booksim 2.0 is faster than the other two competitors; this relation is reversed for injection rates higher than 0.035 flits/cycle. Ratatoskr is consistently slower than Noxim by approximately four to eight seconds.
The simulation time for different network sizes is shown in Fig. 11. We simulate a NoC of varying size with the same properties as in the last example. Uniform random traffic is injected into the network with an injection rate of 0.03 flits/cycle. The network size is increased from 4×4 to 10×10
[Footnote 2: In this example, we use 130 nm, 65 nm and 32 nm technology nodes.]
Fig. 9. Exemplary reports generated by Ratatoskr after a simulation with uniform random traffic. [Figure panels: (a) network performance; (b) VC usage histogram for layer 0 at an injection rate of 0.07, showing the count over the number of used VCs for the local, east, west, north, south and up ports; (c) buffer usage.]
Fig. 10. Relation between simulation performance and injection rate. [Figure: median simulation time in [s] over injection rate in [flits/cycle] for Booksim 2.0, Noxim and Ratatoskr, with annotated factors of 2.0× and 2.3×.]
Fig. 11. Relation between simulation performance and network size. [Figure: median simulation time in [s] over NoC size in n×n mesh, n = 4 to 10, for Booksim 2.0, Noxim and Ratatoskr.]
in steps of 1. Again, we run 10 simulations on the aforementioned machine. Fig. 11 reports the
median and the standard deviation of the run time. The performance relation between the programs
remains the same for all network sizes: Booksim is slower than Ratatoskr, which is slower than
Noxim.
Fig. 12. Average flit and packet latency for different injection rates. [Figure: flit latency and packet latency in [ns] over injection rate in [flits/cycle], from 0.01 to 0.08.]
7.2 PPA of router hardware
We ship an automatically configured and synthesizable hardware implementation along with each simulation run. Here, we briefly showcase the PPA figures of the high-throughput router architecture:
The network performance is shown in Fig. 12: The average flit latency and packet latency for different injection rates from 1% to 8% in [flits/cycle] are shown using a 4×4×4 NoC with dimension-ordered routing, 4 VCs, 4-flit deep buffers and 1 GHz clock speed. We simulated 100,000 clock cycles injecting a uniform random traffic pattern. Both the average and the standard deviation are reported in the figure.
To report area and power, we synthesize the same router for 45 nm technology at 250 MHz frequency. We compare three experimental setups, as shown in Tab. 2: First, we use a router with a fully connected crossbar as baseline. This router does not account for the possible and impossible turns of the routing algorithm. Next, we use conventional XYZ dimension-ordered routing, which removes turns from the crossbar. Finally, we use the Z+(XY)Z- routing algorithm from [19]. This routing algorithm shows very high performance for heterogeneous 3D SoCs and has fewer possible turns than conventional XYZ. We report the numbers for power in [µW] and area in [µm²] for both the complete router and its crossbar. One individual inner router has a total cell area of 37899 to 39168 µm² and a total power of 4.57e+03 to 5.40e+03 µW depending on the routing algorithm. Table 2 also shows the area reductions possible from using information about the routing algorithms to reduce the size of the router crossbar.
Routing algorithm | Unit | Power [µW] | ∆ Power | Area [µm²] | ∆ Area
Fully connected | Router | 5.40e+03 | (baseline) | 39168 | (baseline)
Fully connected | Crossbar | 183.588 | (baseline) | 1288 | (baseline)
XYZ | Router | 4.49e+03 | -17% | 37942 | -3%
XYZ | Crossbar | 64.005 | -65% | 894 | -31%
Z+(XY)Z- | Router | 4.57e+03 | -15% | 37899 | -3%
Z+(XY)Z- | Crossbar | 71.162 | -61% | 880 | -32%
Table 2. Router cost reduction by removing impossible turns of routing algorithm
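The relative reductions in Table 2 follow directly from the absolute numbers. The short Python sketch below recomputes them from the tabulated values; the dictionary layout and function name are chosen for illustration only.

```python
# Absolute values from Table 2: power in µW, area in µm².
BASELINE = {"router_power": 5.40e3, "crossbar_power": 183.588,
            "router_area": 39168, "crossbar_area": 1288}
REDUCED = {
    "XYZ": {"router_power": 4.49e3, "crossbar_power": 64.005,
            "router_area": 37942, "crossbar_area": 894},
    "Z+(XY)Z-": {"router_power": 4.57e3, "crossbar_power": 71.162,
                 "router_area": 37899, "crossbar_area": 880},
}


def relative_change_percent(value, baseline):
    """Change relative to the fully connected baseline, rounded to whole percent."""
    return round(100 * (value - baseline) / baseline)


deltas = {
    routing: {key: relative_change_percent(value, BASELINE[key])
              for key, value in values.items()}
    for routing, values in REDUCED.items()
}
```

The recomputed values match the ∆ columns of the table, e.g. a crossbar area reduction of 31% for XYZ and 32% for Z+(XY)Z- routing.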
7.3 Power modeling capabilities
To demonstrate the accuracy of our power models, we compare the estimated link power for a NoC without VCs and with four VCs against a bit-level accurate simulation. The results are based on the case study provided in [23, Sec. 8 and Fig. 7], in which a 3D vision SoC is simulated that consists of one layer in mixed-signal technology and one in memory technology. Six analog-digital converters in the mixed-signal layer send their 512×512-pixel image data to a single memory in the adjacent layer via a NoC. The injection rate of image traffic is set to 20% per sensor. As shown in Tab. 3, the used models are always within 2.4% of the bit-accurate simulations, while conventional models
Energy per transmitted flit [pJ]
Model | without VCs | ∆ | with four VCs | ∆
Baseline: bit-accurate simulations | 2.39 | | 4.18 |
Conventional model (without switching & coupling) | 2.40 | <1% | 2.40 | 42.5%
Used model [23] | 2.40 | <1% | 4.28 | 2.4%
Table 3. Accuracy of used power models, numbers based on case study from Ref. [23].
that do not account for pattern dependency and the impact of virtual channels yield an error of up to 42.5%. For a NoC without VCs, both the conventional and Ratatoskr's power models yield a low error of <1%. Neither Booksim nor Noxim provides this power analysis feature; the accuracy for links is higher than that of ORION 3.0's power estimation (which is not embedded into simulations).
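The error figures can be recomputed from the rounded energies in Table 3, as the following Python sketch shows. Since the tabulated energies are rounded to two decimals, the recomputed values can differ slightly from the reported ones: the conventional model with four VCs yields about 42.6% here instead of the reported 42.5%, presumably because the paper's figure is based on unrounded energies.

```python
# Energy per transmitted flit in pJ, taken from Table 3.
BASELINE = {"without VCs": 2.39, "with four VCs": 4.18}
MODELS = {
    "conventional": {"without VCs": 2.40, "with four VCs": 2.40},
    "used": {"without VCs": 2.40, "with four VCs": 4.28},
}


def relative_error_percent(estimate, reference):
    """Absolute deviation from the bit-accurate reference in percent."""
    return 100 * abs(estimate - reference) / reference


errors = {
    model: {setup: relative_error_percent(energy, BASELINE[setup])
            for setup, energy in energies.items()}
    for model, energies in MODELS.items()
}
```

The sketch confirms the qualitative result: without VCs both models stay below 1% error, while with four VCs only the used model remains accurate.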
8 DISCUSSION
The simulation performance is evaluated for varying injection rates and network sizes. First,
the simulation time for different injection rates is shown in Fig. 10. Booksim 2.0 has a very high
performance for small injection rates (by disabling routers without traffic). However, its simulation
speed decreases linearly with higher injection rates. For injection rates higher than 0.02 flits per
cycle, Noxim is faster than Booksim. For injection rates higher than 0.035 flits per cycle, Ratatoskr’s
simulator is faster than Booksim. In fact, for the highest injection rate evaluated (0.08 flits/cycle),
Booksim 2.0 is 2.3× slower than Ratatoskr. Both Noxim and Ratatoskr have a constant simulation
time independent of the injection rate, but Ratatoskr is 2× slower than Noxim. This is not a
result of a slower router or application model. The impact of the application model is very
small, as we already evaluated in [22, p. 6], where it accounted for 1/30 of the overall simulation time.
Rather, the slower performance is a direct result of the added functionality of the power model for links
and the detailed buffer usage statistics, since large data structures are written for every link in every
clock cycle. This can easily be shown using profiling (using gcc -pg). Thus, our simulator is slower
than the state of the art but offers more features; this is an acceptable compromise for the power
model accuracy (see below).
Second, concerning the simulation time for different network sizes, shown in Figure 11, all three
simulators slow down linearly with the number of routers. Furthermore, the performance difference
is constant. This is expected since every router is simulated in each clock cycle. To summarize, our
simulator has the same performance as the state of the art if one does not account for the extended
features; the new features reduce performance but allow for a better quality of results, as discussed
below.
The PPA of the router is evaluated by synthesizing the high-throughput router for a 45 nm node.
The results are shown in Table 2. We report the reduction in router power and area achieved by removing
the impossible turns of the Z+(XY)Z- routing algorithm [19] and the conventional XYZ routing algorithm.
The power of the crossbar is reduced by 61% and 65%, respectively. This lowers the total router power
by 15% and 17%, respectively. The area of the crossbar is reduced by approx. 31%–32%; this reduces the
total router area by approximately 3%. The power and area improvements do not affect the
router’s performance, as the removed turns in the crossbar are never taken by the routing algorithm.
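To illustrate why a routing algorithm makes some crossbar connections impossible, the following minimal sketch enumerates the input-to-output connections that XYZ dimension-ordered routing can ever use. The port names, the 7-port router and the exclusion of U-turns and of a local-to-local connection are assumptions made for this illustration. They are not taken from Ratatoskr's RTL, and the resulting count is not meant to reproduce the synthesis figures in Table 2.

```python
# Hypothetical port names of a 7-port router in a 3D mesh.
PORTS = ["local", "east", "west", "north", "south", "up", "down"]

# Dimension index of each non-local port: X = 0, Y = 1, Z = 2.
DIMENSION = {"east": 0, "west": 0, "north": 1, "south": 1, "up": 2, "down": 2}


def xyz_allows(in_port, out_port):
    """Return True if XYZ routing can ever connect in_port to out_port."""
    if in_port == out_port:
        return False  # assumed: no U-turns and no local-to-local connection
    if "local" in (in_port, out_port):
        return True  # injection and ejection are always possible
    # A packet may go straight or turn into a higher dimension, never back.
    return DIMENSION[out_port] >= DIMENSION[in_port]


full = [(i, o) for i in PORTS for o in PORTS]
kept = [(i, o) for (i, o) in full if xyz_allows(i, o)]
removed = [c for c in full if c not in kept]

print(f"{len(kept)} of {len(full)} crossbar connections can be used")
```

Under these assumptions, a packet arriving on a Y port never leaves through an X port, so the corresponding multiplexer inputs can be omitted without changing the routing behavior.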
Our power models are more accurate than conventional models that do not consider
pattern-dependent switching, coupling and virtual channels, as shown in Tab. 3: The error
of our models is 2.4%, while conventional models yield up to 42.5% error for the given case study.
These results are a consequence of the usage of data-flow matrices in the simulator.
The price of this low modeling error is reduced simulation performance in comparison to
conventional models: the data-flow matrices contribute to the 2× reduced simulator performance
in comparison to Noxim. However, we strongly advocate this feature. We are convinced that
the reduced simulator performance is a price well paid for the low modeling error. The only other
viable option to get such good results is bit-accurate simulation. However, it has much
worse performance than our data-flow matrices, since a trace file of the data transmissions along each
link would need to be written. While our data-flow matrices have a constant size and a memory
complexity of O(n^2), with n colors, a bit-accurate trace file has a non-constant size that
increases linearly with the simulation time t. Since the simulation time is usually much larger than
the number of different data streams (i.e., colors) in the application graph (t ≫ n), the performance
gain and memory reduction are significant.
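The memory argument can be made concrete with a small sketch. The internal layout of Ratatoskr's data-flow matrices is not described in this section; the sketch assumes that entry (i, j) of a per-link matrix counts how often a flit of color j follows a flit of color i. The class and function names are hypothetical.

```python
class LinkDataFlowMatrix:
    """Per-link n x n matrix of color-to-color transition counts (assumed layout)."""

    def __init__(self, num_colors):
        self.counts = [[0] * num_colors for _ in range(num_colors)]
        self.previous = None  # color of the previously observed flit

    def observe(self, color):
        """Record one transmitted flit of the given color."""
        if self.previous is not None:
            self.counts[self.previous][color] += 1
        self.previous = color

    def num_entries(self):
        return len(self.counts) ** 2


def simulate(colors_per_cycle, num_colors):
    """Feed the same flit sequence into a matrix and into a full trace."""
    matrix = LinkDataFlowMatrix(num_colors)
    trace = []  # stand-in for a bit-accurate trace file: one record per cycle
    for color in colors_per_cycle:
        matrix.observe(color)
        trace.append(color)
    return matrix, trace


n, t = 3, 10_000
matrix, trace = simulate((cycle % n for cycle in range(t)), n)
print(f"matrix entries: {matrix.num_entries()}, trace records: {len(trace)}")
```

The matrix stays at n^2 entries regardless of how long the simulation runs, whereas the trace grows by one record per simulated cycle.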
For the sake of completeness, we also briefly discuss the router architecture’s properties. As one
can see in Fig. 12, the network saturates after 7% injection rate, which is expected for the chosen,
light-weight architecture. Since the focus of this submission is not a novel router architecture but a
simulation tool that ships a hardware implementation on top of the actual core simulator, we do
not compare against other router implementations.
To summarize, the Ratatoskr framework generates more accurate results than state-of-the-art
competitors at the cost of slightly reduced simulation performance. Since accurate power results
are key in heterogeneous 3D SoCs due to layers in mixed-signal technology with a high base power
consumption, Ratatoskr tackles one of the most important issues. Furthermore, we ship a hardware
implementation, which is automatically generated for each simulation run. Since the whole tool
uses a single point of entry and easy-to-use configuration interfaces, it is very user-friendly and
allows for rapid prototyping. In addition, more detailed configuration interfaces are provided
if non-standard parameters are to be set and optimized. Thus, the proposed design and simulation
tool allows NoCs for heterogeneous 3D SoCs to be built efficiently.
9 CONCLUSION
In this work, we introduce Ratatoskr, an open-source framework for in-depth PPA analysis in 3D
NoCs. We also support heterogeneous 3D integration as it has become one of the key innovations
to build more efficient systems. The framework Ratatoskr is implemented in C++, SystemC, Python
and VHDL. It offers power estimation of routers and links on a cycle-accurate level. The dynamic
power of links estimated by our models is within 2.4% of bit-level simulations while
maintaining cycle-accurate simulation speed. The performance of the NoC can be measured on CA
level for long simulations using traffic injected from TL-modeled applications or synthetic patterns,
on RTL for shorter simulations using synthetic patterns, and on gate-level for router timing. The
hardware implementation of the routers can be synthesized for standard cell technologies and
includes a power and area saving feature that removes unused turns in the crossbar based on
information about the routing algorithm. This saves up to 32% total router power and 3% router
area compared to a conventional router without these features. The whole framework revolves
around a single configuration file that allows the most important design parameters to be set easily,
but more detailed and more complex configuration options are also available. The framework
generates user reports to assess designs. With this wide range of features, Ratatoskr is the first
comprehensive framework for PPA analysis in NoCs that also covers heterogeneous 3D integration.
It will help to tackle important issues for state-of-the-art chips due to the increasing relevance
of heterogeneity and the prevalent challenges found in on-chip interconnection networks. The
source code and usage examples of Ratatoskr are available at https://github.com/jmjos/ratatoskr.
ACKNOWLEDGMENTS
This work is funded by the German Research Foundation (DFG) projects PI 447/8 and GA 763/7.
REFERENCES
[1] [n.d.]. Intel Previews New Hybrid CPU Architecture with Foveros 3D Packaging.
https://newsroom.intel.com/video-archive/video-intel-previews-new-hybrid-cpu-architecture-with-foveros-3d-packaging/. Accessed: 2019-05-17.
[2] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K. Jha. 2009. GARNET: A detailed on-chip network model
inside a full-system simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International
Symposium on. 33–42.
[3] anan-cn. 2014. Open-Source Network-on-Chip Router RTL.
https://github.com/anan-cn/Open-Source-Network-on-Chip-Router-RTL. Accessed: 2019-03-15.
[4] L. Bamberg and A. García-Ortiz. 2017. High-Level Energy Estimation for Submicrometric TSV Arrays. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems 25, 10 (2017), 2856–2866. https://doi.org/10.1109/TVLSI.2017.2713601
[5] Lennart Bamberg, Jan Moritz Joseph, Thilo Pionteck, and Alberto García-Ortiz. 2019. Crosstalk optimization for
through-silicon vias by exploiting temporal signal misalignment. Integration 67 (2019), 60–72.
https://doi.org/10.1016/j.vlsi.2019.04.009
[6] L. Bamberg, J. M. Joseph, R. Schmidt, T. Pionteck, and A. García-Ortiz. 2018. Coding-aware Link Energy Estimation
for 2D and 3D Networks-on-Chip with Virtual Channels. International Symposium on Power and Timing Modeling,
Optimization and Simulation (2018), 222–228.
[7] Daniel U. Becker. 2012. Efficient Microarchitecture for Network-on-Chip Routers. Stanford University.
[8] Daniel U. Becker and William J. Dally. 2009. Allocator Implementations for Network-on-chip Routers. In Proceedings
of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 1–12.
https://doi.org/10.1145/1654059.1654112
[9] V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti. 2016. Cycle-Accurate Network on Chip Simulation with
Noxim. ACM Transactions on Modeling and Computer Simulation 27, 1 (2016), 1–25. https://doi.org/10.1145/2953878
[10] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. 2018. Eyeriss v2: A Flexible and High-Performance Accelerator for
Emerging Deep Neural Networks. CoRR abs/1807.07928 (2018). arXiv:1807.07928 http://arxiv.org/abs/1807.07928
[11] X. Dong and Y. Xie. 2009. System-level cost analysis and design exploration for three-dimensional integrated circuits
(3D ICs). Asia and South Pacific Design Automation Conference (2009). https://doi.org/10.1109/ASPDAC.2009.4796486
[12] Jack Dongarra. 2016. Report on the Sunway TaihuLight System. Technical Report. University of Tennessee, Oak Ridge
National Laboratory. http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf
[13] T. Drewes, J. M. Joseph, and T. Pionteck. 2017. An FPGA-based prototyping framework for Networks-on-Chip. In
International Conference on ReConFigurable and FPGAs. IEEE, 1–7. https://doi.org/10.1109/RECONFIG.2017.8279775
[14] F. Fatollahi-Fard, D. Donofrio, G. Michelogiannakis, K. Shalf, J. adn Wen, J. Bachan, and D. Burke. 2019. OpenSoCFabric.
http://www.opensocfabric.org/home.php. Accessed: 2019-03-15.
[15] Fabrizio Fazzino, Maurizio Palesi, and David Patti. 2008. Noxim: Network-on-chip simulator.
URL: http://sourceforge.net/projects/noxim (2008).
[16] P. E. Garrou, M. Koyanagi, and P. Ramm. 2009. 3D process technology: Robust circuit and physical design for sub-65 nm
technology nodes (1st ed.). Handbook of 3D integration, Vol. 3. Wiley, Hoboken, NJ.
[17] Nan Jiang, James Balfour, Daniel U. Becker, Brian Towles, William J. Dally, George Michelogiannakis, and John Kim.
2013. A detailed and flexible cycle-accurate Network-on-Chip simulator. In International Symposium on Performance
Analysis of Systems and Software. IEEE, 86–96. https://doi.org/10.1109/ISPASS.2013.6557149
[18] N. Jiang, G. Michelogiannakis, D. Becker, B. Towles, and W. Dally. [n.d.]. Booksim interconnection network simulator.
([n. d.]). nocs.stanford.edu
[19] J. M. Joseph, L. Bamberg, D. Ermel, B. R. Perjikolaei, A. Drewes, A. García-Ortiz, and T. Pionteck. 2019. NoCs in
Heterogeneous 3D SoCs: Co-Design of Routing Strategies and Microarchitectures. IEEE Access 7 (2019), 135145–135163.
https://doi.org/10.1109/ACCESS.2019.2942129
[20] J. M. Joseph, L. Bamberg, G. Krell, I. Hajjar, A. García-Ortiz, and T. Pionteck. 2018. Specification of Simulation
Models for NoCs in Heterogeneous 3D SoCs. International Symposium on Reconfigurable Communication-centric
Systems-on-Chip (2018), 1–8.
[21] J. M. Joseph, C. Blochwitz, A. García-Ortiz, and T. Pionteck. 2017. Area and power savings via asymmetric organization
of buffers in 3D-NoCs for heterogeneous 3D-SoCs. Microprocessors and Microsystems 48 (2017), 36–47.
https://doi.org/10.1016/j.micpro.2016.09.011
[22] J. M. Joseph and T. Pionteck. 2014. A cycle-accurate Network-on-Chip simulator with support for abstract task graph
modeling. In International Symposium on System-on-Chip. IEEE, 1–6. https://doi.org/10.1109/ISSOC.2014.6972440
[23] J. M. Joseph (co-shared with L. Bamberg), I. Hajjar, R. Schmidt, T. Pionteck, and A. García-Ortiz. 2019. Simulation
Environment for Link Energy Estimation in Networks-on-Chip with Virtual Channels. INTEGRATION, the VLSI journal
(2019).
[24] A. B. Kahng, Bin Li, L. S. Peh, and K. Samadi. 2009. ORION 2.0: A fast and accurate NoC power and area model
for early-stage design space exploration. In Design, Automation Test in Europe Conference Exhibition, 2009. DATE ’09.
423–428. https://doi.org/10.1109/DATE.2009.5090700
[25] Andrew B Kahng, Bill Lin, and Siddhartha Nath. 2015. ORION3.0: a comprehensive NoC router estimation tool. IEEE
Embedded Systems Letters 7, 2 (2015), 41–45.
[26] G. Katti, M. Stucchi, K. de Meyer, and W. Dehaene. 2010. Electrical modeling and characterization of through silicon
via for three-dimensional ICs. IEEE Transactions on Electron Devices 57, 1 (2010), 256–262.
https://doi.org/10.1109/TED.2009.2034508
[27] K. Kim, S. Lee, J. Y. Kim, M. Kim, and H. J. Yoo. 2009. A 125 GOPS 583 mW Network-on-Chip Based Parallel
Processor With Bio-Inspired Visual Attention Engine. IEEE Journal of Solid-State Circuits 44, 1 (2009), 136–147.
https://doi.org/10.1109/JSSC.2008.2007157
[28] T. Krishna and H. Kwon. 2017. OpenSMART. http://synergy.ece.gatech.edu/tools/opensmart/. Accessed: 2019-03-15.
[29] H. Kwon and T. Krishna. 2017. OpenSMART: Single-cycle multi-hop NoC generator in BSV and Chisel. In 2017 IEEE
International Symposium on Performance Analysis of Systems and Software (ISPASS). 195–204.
https://doi.org/10.1109/ISPASS.2017.7975291
[30] K. Lepak, G. Talbot, S. White, N. Beck, and S. Naffziger. 2017. The next generation AMD enterprise server product
architecture. In Hotchips 29.
[31] I. L. Markov. 2014. Limits on fundamental limits to computation. Nature (2014).
[32] S. Penolazzi and A. Jantsch. 2006. A High Level Power Model for the Nostrum NoC. In 9th EUROMICRO Conference
on Digital System Design: Architectures, Methods and Tools, Venki Muthukumar (Ed.). IEEE Computer Society, Los
Alamitos, Calif., 673–676. https://doi.org/10.1109/DSD.2006.9
[33] X. Yu, L. Li, Y. Zhang, H. Pan, and S. He. 2013. Performance and power consumption analysis of memory efficient 3D
network-on-chip architecture. International Conference on Control and Automation (2013).
https://doi.org/10.1109/ICCA.2013.6565107
[34] Á. Zarándy. 2011. Focal-plane sensor-processor chips. Springer.
