SMARTs: A Tool to Simulate and Analyze the Performance of Real-time Multi-core Systems  by Sharma, Mridula et al.
 Procedia Computer Science  34 ( 2014 )  544 – 551 
1877-0509 © 2014 Elsevier B.V. This is an open access article under the CC BY-NC-ND license 
(http://creativecommons.org/licenses/by-nc-nd/3.0/).
Selection and peer-review under responsibility of Conference Program Chairs
doi: 10.1016/j.procs.2014.07.067 
2014 International Workshop on the Design and Performance of Network on Chip
(DPNoC 2014)
SMARTs: A Tool to Simulate and Analyze the Performance of
Real-Time Multi-core Systems
Mridula Sharma (a,∗), Haytham Elmiligi (b), Fayez Gebali (a)
(a) University of Victoria, Victoria, BC, Canada
(b) Computing Science Department, Thompson Rivers University, Kamloops, BC, Canada
Abstract
In this paper, we introduce a new tool for real-time multi-core systems called SMARTs: Simulating Multi-core
Activities for Real Time systems. The proposed tool encapsulates four phases of multi-core system design for real-time
applications, from parsing code and generating dependency graphs to task scheduling and mapping on multi-core
systems. The tool also makes it possible to implement different scheduling and mapping algorithms for
multi-core processor systems and to evaluate the performance when various design parameters are changed. As a proof of
concept, we present two case studies that explain the features of the proposed tool and how it
can be used to shorten development time.
© 2014 Elsevier Ltd. All rights reserved
Peer-review under responsibility of the Program Chairs of FNC-2014
Keywords: Systems-on-Chips, multicore, task scheduling, task mapping, SMARTs, Sequence Allocator, Dependence
Generator, Multi-core Mapper.
1. Introduction
The current trend in the silicon industry depends heavily on multi-core chips, as they provide
higher speed and better performance than single-core ones [1]. A multi-core implementation
can be homogeneous or heterogeneous. In homogeneous multi-core systems, all cores are similar,
whereas in heterogeneous systems, cores are different. In most heterogeneous systems, one core is
used as a general-purpose core and the others are synergistic processor cores (coprocessors).
Designers of multi-core systems are always looking for fast processing, low power consumption, and
optimum utilization of resources. Since the design of multi-core systems is complex, guaranteeing optimum
performance is challenging [2]. Performance evaluation itself is not an easy task. Various techniques
have been proposed to evaluate the performance of multi-core systems. The problem becomes even more
complex when we target real-time systems (RTSs), because formal verification of timing properties is
critical. Furthermore, the shared cache and message transfer among cores make performance
evaluation of real-time multi-core systems (RTMSs) more challenging [1].
∗Corresponding author
Email addresses: naina@uvic.ca (Mridula Sharma), haytham@ieee.org (Haytham Elmiligi), fayez@uvic.ca
(Fayez Gebali)
In this paper, we introduce a new tool for RTMSs called SMARTs: Simulating Multi-core
Activities for Real Time systems. SMARTs aims at simulating and analyzing the
performance of RTMSs. The tool takes source code that was developed for a single-core application and
gives the designer the ability to simulate the execution of this code on different multi-core platforms. It also
helps designers explore different solutions and compare the system performance when changing
design parameters such as the number of cores, queue size, scheduling algorithm, and mapping algorithm.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 briefly
describes the tool, and Section 4 explains how it works. Section 5 presents two case
studies, and Section 6 concludes the paper.
2. Literature Review
Much work has been done to propose solutions for analyzing the performance of
RTMSs. Deng and Purvis [3] discussed a generic model using tandem queuing, under the assumption that
parallelization is only possible for applications that can work as a pipeline. In this model, an application
can be split into independent procedures, and each procedure can be served by one or more cores in
parallel. The cores are allocated based on the processing time needed for each procedure. Deng and Purvis [3]
derived a lemma stating that, to minimize the time in the system, the number of servers assigned to each
procedure must be in proportion to the square root of its processing time. Two test analyses are
presented, using Snort parallelization and image retrieval using Poise [4] [5].
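The square-root rule can be illustrated with a short sketch. The stage times and the server budget below are made-up values, and `allocate_servers` is a hypothetical helper written for this illustration, not code from [3]. It returns fractional shares; rounding to whole cores is left open.

```python
import math

def allocate_servers(stage_times, total_servers):
    """Split total_servers across pipeline stages in proportion to
    the square root of each stage's processing time."""
    roots = [math.sqrt(t) for t in stage_times]
    scale = total_servers / sum(roots)
    return [r * scale for r in roots]

# Hypothetical pipeline: three stages taking 1, 4 and 9 time units.
shares = allocate_servers([1.0, 4.0, 9.0], total_servers=12)
print(shares)  # [2.0, 4.0, 6.0]
```

Note that the slowest stage (9 units) receives three times the servers of the fastest one (1 unit), not nine times, which is the point of the square-root proportionality.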
Saifullah and Li [6] presented a model that decomposes parallel tasks into a set of
sequential tasks and assigns appropriate release times and deadlines to them. Parallel synchronous task sets
were generated randomly and each task was assigned a valid period. Tasks were added to the cores as long as the
total utilization was not exceeded. The tasks were scheduled using global Earliest Deadline First (EDF) and
Partitioned Deadline Monotonic (PDM) scheduling. Sudipta and Chong [1] considered a shared cache and
a shared bus for the multiple cores. They defined a unified analysis framework that models components
such as the pipeline, shared cache, and bus, using a TDMA-based round-robin arbitration policy to assign
a fixed-length bus slot to each core. They used the Least Recently Used (LRU) policy for cache
replacement.
Chen et al. [7] used the Heterogeneous Dual-core Scheduling (HDS) algorithm to break down
each task into sub-tasks, with the assumption that processor sub-tasks can be considered a set of sporadic
tasks that are scheduled by EDF. Designed specifically for heterogeneous multi-core systems, this real-time task
scheduling algorithm works with a non-preemptive coprocessor; preemption points are inserted into
coprocessor sub-tasks using program slicing so that the coprocessor becomes semi-preemptive [7]. Experimental
work was done to compare HDS to the EDF, Rate Monotonic (RM), Priority Ceiling Protocol (PCP), and
Stack Resource Policy (SRP) algorithms [7]. The task set was generated by TGFF [8].
Zhaobin and Wenyu [9] presented an analysis based on pipelining
and proposed a reverse interleaved pipelining algorithm (RIPA) scheduling strategy that decreases the total
I/O execution time by balancing the workload on homogeneous processors. A self-adaptive scheduling
method was used to select the most appropriate core to process an I/O task in heterogeneous multi-core
systems. They considered independent I/O tasks that do not interact and that arrive at regular intervals.
Self-adaptive scheduling works in a random I/O task environment.
Directed acyclic graph (DAG) schedulers are required for multi-core processors. Karim and David [10]
proposed a new approach for tasks whose children have different deadlines. Their unified DAG
monitoring solution, known as DAG Flow Manager (DFM), is low-complexity, fully independent, and does
not impose any restrictions on the DAG. It allows simple connected schedulers to have optimal control of
the core assignments. The solution was tested on an H.264 decoder with different DAG configurations.
Special hardware architectures such as Multi-core Execution of Hard Real-time Applications Supporting
Analyzability (MERASA) have also been designed for guaranteed analyzability and timing predictability of
multi-core systems [11]. The main objective of the architecture is to make the analysis of each task
independent of the co-scheduled tasks, providing safe and tight WCET estimates by isolating task
execution from inter-task interference in both hardware and software. In this project, thread scheduling
is done by hardware that isolates and prioritizes the hard real-time and non-hard real-time threads
running on the same core. An Analyzable real-time Memory Controller (AMC) is designed to minimize the
effect of inter-task interference in the memory. Another dynamic heterogeneous multi-core architecture,
known as Bahurupi (existing in many forms), was proposed by Mihai and Tulika at the National University
of Singapore [12]. The model creates a number of threads to execute a number of tasks. Here,
cores correspond to threads and basic blocks to tasks. A core fetches a basic block to execute and, once
execution is complete, fetches the next available basic block. The main task is to resolve register
and memory dependencies among the basic blocks, so an important requirement in the Bahurupi model is
to detect inter-dependencies among the basic blocks.
Miao Ju and Hun Jung proposed core/thread combinations to improve the performance of multi-core
systems [13]. They developed a fast packet latency estimation algorithm that works at the thread level,
overlooking instruction-level and micro-architectural details. They worked on communication
processors consisting of a set of cores, where each core may run more than one thread. Threads are scheduled
based on a given thread scheduling discipline. Packets are partitioned and allocated to different cores
by the dispatcher. A code path is defined as the sequence of events with event inter-arrival time or
event segment length. Table 1 compares the literature discussed above.
3. Proposed Design Flow
To analyze and evaluate the performance of RTMSs, designers have to explore various design
parameters at four different levels of abstraction. Our proposed tool provides a comprehensive solution that
addresses the design requirements at all four levels and generates the overall CPU utilization at the end.
The tool can take any high-level program written for a single core as input, and the designer can
set the number of cores as needed. The tool then automatically generates the task sets, allocates tasks
to queues, performs task mapping, and finally generates the net CPU utilization result.
Figure 1 shows a simplified abstraction of the basic operations of the tool during the four phases.
The four phases are explained below.
Fig. 1. A simplified abstraction of the basic operations in real-time systems.
Table 1. Comparison of different real-time task management approaches proposed in the literature.
Reference | Task management | Verification | Depend. analyzer | Sequence allocator | Scheduler | Mapper
Deng and Purvis [3] | A generic model using tandem queuing | Using Snort and Poise | ✓ | X | X | X
Chen et al. [7] | HDS algorithm | HDS compared to EDF, RM, PCP and SRP | X | X | ✓ | X
Saifullah and Li [6] | EDF and PDM scheduling | Simulation results | X | X | ✓ | X
Sudipta and Chong [1] | TDMA-based round-robin arbitration policy | Implementation using Chronos | X | X | X | ✓
Zhaobin and Wenyu [9] | RIPA and self-adaptive scheduling | Simulation on HP ProLiant DL380 | X | X | ✓ | X
Karim and David [10] | DAG Flow Manager (DFM) | H.264 decoders | ✓ | ✓ | X | X
MERASA Project [11] | Analyzable real-time Memory Controller (AMC) | Use of OTAWA and RapiTime WCET tools | X | X | ✓ | ✓
Bahurupi architecture [12] | A polymorphic heterogeneous multi-core architecture | SimpleScalar simulator | ✓ | ✓ | X | X
Miao Ju and Hun Jung [13] | Thread scheduling discipline dimension | Simulation tool and model | ✓ | X | X | X
3.1. Phase-I: Dependence Generator
In this phase, the tool extracts the dependencies among the tasks and uses a regular iterative algorithm
(RIA) to obtain a directed acyclic graph (DAG), which is in turn converted to the task dependency matrix.
At present, the task dependency matrix, together with the execution time of each task, is the input to the tool [14].
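As a minimal sketch of the DAG-to-matrix step, the snippet below builds a dependency matrix from a list of edges. The convention that entry (i, j) equals 1 when task i depends on task j is an assumption read off the case-study matrices in Section 5; the function name and the four-task graph are hypothetical.

```python
def dependency_matrix(num_tasks, edges):
    """Build matrix D where D[i][j] == 1 means task i depends on task j.

    edges is a list of (producer, consumer) pairs of a DAG.
    """
    matrix = [[0] * num_tasks for _ in range(num_tasks)]
    for producer, consumer in edges:
        matrix[consumer][producer] = 1
    return matrix

# Hypothetical 4-task DAG: 0 -> 2, 1 -> 2, 2 -> 3.
D = dependency_matrix(4, [(0, 2), (1, 2), (2, 3)])
for row in D:
    print(row)
```

Under this convention, an all-zero row marks a task with no predecessors, which can start immediately.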
3.2. Phase-II: Sequence Generator
In this phase, the tool generates a sequence set for all tasks. Each sequence set consists of a number of
independent tasks that could be executed concurrently. Task sequences are sorted based on execution time
of the tasks within the same sequence.
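One way to picture this phase is as repeated extraction of the tasks whose dependencies are already satisfied. The sketch below is an illustration under assumptions, not the tool's implementation: it assumes entry (i, j) of the matrix is 1 when task i depends on task j, and it sorts each sequence by ascending execution time. The function name and data are hypothetical.

```python
def generate_sequences(matrix, exec_time):
    """Group tasks into sequences of mutually independent tasks.

    A task joins a sequence once every task it depends on has been
    placed in an earlier sequence.
    """
    n = len(matrix)
    done = set()
    remaining = set(range(n))
    sequences = []
    while remaining:
        ready = [t for t in remaining
                 if all(j in done for j in range(n) if matrix[t][j])]
        if not ready:
            raise ValueError("dependency cycle: the graph is not a DAG")
        # Assumed ordering: shortest execution time first, ties by task ID.
        ready.sort(key=lambda t: (exec_time[t], t))
        sequences.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return sequences

# Hypothetical 4-task example: task 2 needs 0 and 1, task 3 needs 2.
D = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0]]
exec_time = [3, 1, 2, 4]
sequences = generate_sequences(D, exec_time)
print(sequences)  # [[1, 0], [2], [3]]
```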
3.3. Phase-III: Task Scheduler
Once the Sequence Generator has finished allocating sequences, the Task Scheduler starts scheduling
the tasks in each sequence for execution based on a specific scheduling algorithm, such as EDF or RM.
The Task Scheduler defines the order of task execution by assigning a priority level to each task and transfers the
highest-priority tasks to the ready queue.
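The priority assignment can be sketched as follows, assuming the rule used later in the case studies (shorter execution time means higher priority). The helper name, the task IDs, and the execution times are hypothetical.

```python
from collections import deque

def rank_by_execution_time(sequence, exec_time):
    """Return (priority, task) pairs for one sequence; priority 0 is highest.

    Assumed rule: the shorter the execution time, the higher the priority.
    """
    ordered = sorted(sequence, key=lambda t: (exec_time[t], t))
    return list(enumerate(ordered))

# Hypothetical sequence of four independent tasks with made-up times.
exec_time = {2: 5, 3: 2, 4: 7, 5: 2}
ranked = rank_by_execution_time([2, 3, 4, 5], exec_time)
ready_queue = deque(task for _, task in ranked)
print(list(ready_queue))  # [3, 5, 2, 4]
```

Swapping the sort key is all it would take to try a different policy, for example ordering by deadline for an EDF-style queue.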
3.4. Phase-IV: Task Mapper
Finally, the task mapper uses a mapping algorithm to assign tasks from the ready queue to actual
microprocessors, based on the availability of tasks in the ready queue and of processors at each tick.
The mapping algorithm aims to maximize the multi-core system efficiency by optimizing processor
utilization.
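A tick-based mapper of this kind might look like the sketch below. It is a simplified illustration, not the mapping algorithm used by SMARTs: it assumes integer execution times of at least one tick, lets any idle core take the next queued task, and finishes one sequence before starting the next. All names and values are hypothetical.

```python
def map_tasks(sequences, exec_time, num_cores):
    """Simulate tick-by-tick mapping of queued tasks onto cores.

    Returns one row per core; each entry is 1 if the core was running
    during that tick and 0 if it was idle.
    """
    timeline = [[] for _ in range(num_cores)]
    for sequence in sequences:
        queue = list(sequence)
        remaining = [0] * num_cores  # ticks left on each core's current task
        while queue or any(remaining):
            # Idle cores pick up the next task from the ready queue.
            for core in range(num_cores):
                if remaining[core] == 0 and queue:
                    remaining[core] = exec_time[queue.pop(0)]
            # Advance the clock by one tick and record busy/idle state.
            for core in range(num_cores):
                busy = 1 if remaining[core] > 0 else 0
                timeline[core].append(busy)
                remaining[core] -= busy
    return timeline

# Hypothetical input: three sequences, execution times indexed by task ID.
timeline = map_tasks([[1, 0], [2], [3]], [3, 1, 2, 4], num_cores=2)
for row in timeline:
    print(row)
```

For this input the simulation runs for nine ticks, with the second core idle once the first sequence is finished.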
Using SMARTs, designers have full control over the design flow. They can change the design
configuration and test different scenarios to find the optimum design for a target application.
4. SMARTs: Simulating Multi-core Activities for Real Time systems
SMARTs is an object-oriented tool built around a node object and its associated methods. Within the tool,
each task is instantiated as a node. The node declaration provides an ID to each node, generates the source
set and sink set, and assigns a sequence and execution time period to each task. The nodes take their values from
the task dependency matrix, and task sequences are generated by the find_the_schedule() function during
the second phase. The task sequences contain tasks that can run in parallel. The function assign_priority()
generates the task queue based on the execution time period of each task. Finally, the function task_mapping()
is used to map tasks to different cores for execution. A fill_cores() function is defined to create the utilization
matrix.
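The node object described above could be sketched roughly as follows. The field names, the `build_nodes` helper, and the matrix convention (entry (i, j) is 1 when task i depends on task j) are assumptions made for illustration; the actual class layout of SMARTs is not given in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One task in the dependency graph (illustrative field names)."""
    node_id: int
    exec_time: int
    sources: set = field(default_factory=set)  # tasks this node depends on
    sinks: set = field(default_factory=set)    # tasks that depend on this node
    sequence: int = -1                         # set later by the sequence generator

def build_nodes(matrix, exec_time):
    """Create one Node per task and fill source/sink sets from the matrix."""
    nodes = [Node(i, exec_time[i]) for i in range(len(matrix))]
    for i, row in enumerate(matrix):
        for j, flag in enumerate(row):
            if flag:
                nodes[i].sources.add(j)
                nodes[j].sinks.add(i)
    return nodes

# Hypothetical 4-task matrix: task 2 needs 0 and 1, task 3 needs 2.
D = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0]]
nodes = build_nodes(D, exec_time=[3, 1, 2, 4])
print(nodes[2])
```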
As a proof of concept, SMARTs is tested with two case studies based on programs written in C,
one of which is adopted from the benchmark C code [15].
5. Case Studies
5.1. Case Study I
The first case study is a simple C program for calculating the maximum, minimum, sum, and average
of two arrays.
5.1.1. Phase-I: Dependence Generator
The task dependency matrix (M) of the code is shown below.
M =
⎡
⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
0 0 1 1 1 1 0 0 0
0 0 1 1 1 1 0 0 0
0 0 1 1 1 1 0 0 0
⎤
⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦
5.1.2. Phase-II: Sequence Generator
In this phase, the tool works on the dependency matrix and generates task sequences. Figure 2 shows
the task sequence generated by the tool.
Fig. 2. Task Sequences
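To make the structure of M concrete, the sketch below computes a dependency level for each of the nine tasks. It assumes that tasks are numbered from 0, that entry (i, j) is 1 when task i depends on task j, and that tasks are listed in topological order (the matrix is strictly lower triangular). This is an illustrative calculation; whether the tool groups tasks exactly this way cannot be confirmed from the text.

```python
M = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0],
]

def dependency_levels(matrix):
    """Level of task i = 1 + highest level among the tasks it depends on.

    Tasks with no dependencies get level 0. Relies on the tasks being
    numbered in topological order.
    """
    level = []
    for i, row in enumerate(matrix):
        deps = [level[j] for j in range(i) if row[j]]
        level.append(1 + max(deps) if deps else 0)
    return level

print(dependency_levels(M))  # [0, 0, 1, 1, 1, 1, 2, 2, 2]
```

Tasks that share a level have no dependencies on one another, so under these assumptions M yields three groups of two, four, and three tasks.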
5.1.3. Phase-III: Task Scheduler
As defined in the sequence set, the tasks are put into the ready queue after a priority is allocated to each
task. At present, we use a static algorithm, Rate Monotonic (RM), in which tasks with shorter execution
times are assigned higher priority. This algorithm is used for scheduling periodic tasks. In future work, more
scheduling algorithms will be added to the tool library so that designers can choose different algorithms and
evaluate the system's performance accordingly. Figure 3 shows the output of Phase-III.
Fig. 3. Task Queue
5.1.4. Phase-IV: Task Mapper
In this phase, tasks are finally mapped to different cores for processing. Different mapping
algorithms could be used for this step. In our case study, each task is allocated to a core and, once the
task execution is complete, other tasks of the same sequence are allocated to the same core until all the tasks
of the sequence set are executed. Figure 4 demonstrates the task mapping and the overall time estimation
for processing all tasks on the multi-core system. Figure ?? shows the CPU utilization chart generated by
SMARTs for the given tasks. 0's represent times when the core is idle and 1's represent times when the core
is running.
Fig. 4. Task Mapping
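Given a 0/1 utilization chart of this form, the overall CPU utilization can be taken as the fraction of core-ticks spent running. The chart below is a made-up two-core example, not the output shown in the figures, and the function name is hypothetical.

```python
def cpu_utilization(chart):
    """Percentage of core-ticks in which a core was running (1) rather than idle (0)."""
    busy = sum(sum(row) for row in chart)
    total = sum(len(row) for row in chart)
    return 100.0 * busy / total

# Hypothetical chart: one row per core, one column per tick.
chart = [
    [1, 0, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
]
print(f"{cpu_utilization(chart):.2f}%")  # 55.56%
```

Here 10 of the 18 core-ticks are busy, so the utilization is about 55.56%.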
5.2. Case Study II
A benchmark C code from [15] is used for the second case study. This code computes the sum, mean, variance,
standard deviation, and correlation coefficient of two arrays. Similar results are generated for this
case study as well. We only present the result generated by the last phase.
5.2.1. Phase-I: Dependence Generator
The task dependency matrix (N) of this code is as follows:
N =
⎡
⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣
0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0
0 0 0 0 1 1 0 0 0 0
0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 1 0
⎤
⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦
5.2.2. Phase-II: Sequence Generator
In this phase, six sequence sets are generated for the benchmark C code. Sequence sets 0, 1, 2, and 3
contain two tasks each, whereas sequence sets 4 and 5 contain only one task each.
5.2.3. Phase-III: Task Scheduler
After a priority is allocated to each task through the RM algorithm, the tasks are put into the ready
queue. The resulting ready queue holds the tasks prioritized by the execution time of each task.
5.2.4. Phase-IV: Task Mapper
The final task mapping is performed here. Through the partitioned mapping algorithm, all the tasks
are mapped and executed. Figure 5 demonstrates the task mapping and the overall time estimation for
processing all tasks on the multi-core system.
Fig. 5. Task Mapping
To show the significance of our tool, we changed only one design parameter, the number of cores, and compared
the CPU utilization to see which design would be the best. The CPU utilization for the given tasks was 77.083%
when we used two cores, and it dropped to 51.38% when three cores were used. It degraded further,
to 38.54%, when we used four cores. Going from two to three cores is a 33.3% relative reduction in utilization.
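The relative changes can be checked directly from the reported utilization figures. The snippet below uses only the three percentages quoted in the text; the helper name is hypothetical.

```python
# Reported CPU utilization (%) by number of cores, taken from the text.
reported = {2: 77.083, 3: 51.38, 4: 38.54}

def relative_drop(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

drop_2_to_3 = relative_drop(reported[2], reported[3])
drop_2_to_4 = relative_drop(reported[2], reported[4])
print(round(drop_2_to_3, 1), round(drop_2_to_4, 1))  # 33.3 50.0
```

The figures appear consistent with the same workload being spread over more cores without shortening the total run time, since 77.083 × 2/3 ≈ 51.39 and 77.083 × 2/4 ≈ 38.54.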
6. Conclusion
In this paper, we presented a new software tool that can be used to analyze and evaluate
the performance of real-time multi-core systems. The proposed tool, SMARTs, not only maps a single-core
program onto a multi-core hardware platform, but also gives designers the capability to try different
design parameters and compare the system performance at early design phases. For the first time, we
introduce a comprehensive tool that allows designers to explore the design space at four different levels.
The tool can parse high-level language code, generate a dependency matrix, allocate tasks into different
sequence sets, and map tasks that are ready to be executed onto a multi-core hardware platform. To validate
our proposed tool, we presented two case studies. In the second case study, we were able to show a 33.3%
change in the utilization by changing just one design parameter, which demonstrates its significance.
References
[1] S. Chattopadhyay, C. Kee, A. Roychoudhury, T. Kelter, P. Marwedel, H. Falk, A unified WCET analysis framework for multi-core
platforms, in: 2012 IEEE 18th Real-Time and Embedded Technology and Applications Symposium (RTAS), 2012, pp. 99–108.
doi:10.1109/RTAS.2012.26.
[2] M. P. R. M. Sharma, A. Kumble, S. H, Performance analysis of multicore systems (2009).
URL http://software.intel.com/en-us/articles/performance-analysis-of-multicore-systems-4
[3] J. Deng, M. Purvis, Queueing analysis for multi-core performance improvement: Two case studies, in: Australasian Telecommu-
nication Networks and Applications Conference, 2007, pp. 390–395. doi:10.1109/ATNAC.2007.4665311.
[4] M. Roesch, Snort - lightweight intrusion detection for networks, in: Proceedings of the 13th USENIX Conference on System
Administration, LISA ’99, USENIX Association, Berkeley, CA, USA, 1999, pp. 229–238.
URL http://dl.acm.org/citation.cfm?id=1039834.1039864
[5] D. Deng, H. Wolf, Poise achieving content-based picture organisation for image search engines, in: R. Khosla, R. Howlett,
L. Jain (Eds.), Knowledge-Based Intelligent Information and Engineering Systems, Vol. 3682 of Lecture Notes in Computer
Science, Springer Berlin Heidelberg, 2005, pp. 1–7.
[6] A. Saifullah, J. Li, K. Agrawal, C. Lu, C. Gill, Multi-core real-time scheduling for generalized parallel task models, Real-Time
Systems 49 (4) (2013) 404–435. doi:10.1007/s11241-012-9166-9.
[7] Y.-S. Chen, H. C. Liao, T.-H. Tsai, Online real-time task scheduling in heterogeneous multicore system-on-a-chip, IEEE Trans-
actions on Parallel and Distributed Systems 24 (1) (2013) 118–130.
[8] R. Dick, D. Rhodes, W. Wolf, TGFF: task graphs for free, in: Proceedings of the Sixth International Workshop on
Hardware/Software Codesign (CODES/CASHE '98), 1998, pp. 97–101.
[9] Z. Liu, W. Qu, H. Li, M. Ruan, W. Zhou, I/O scheduling and performance analysis on multi-core platforms, Concurrency and
Computation: Practice and Experience 21 (10) (2009) 1405–1417. doi:10.1002/cpe.1421.
URL http://dx.doi.org/10.1002/cpe.1421
[10] K. Kanoun, D. Atienza, N. Mastronarde, M. van der Schaar, A unified online directed acyclic graph flow manager for multicore
schedulers, in: 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 2014, pp. 714–719.
[11] T. Ungerer, F. Cazorla, P. Sainrat, G. Bernat, Z. Petrov, C. Rochange, E. Quiñones, M. Gerdes, M. Paolieri, J. Wolf, H. Casse,
S. Uhrig, I. Guliashvili, M. Houston, F. Kluge, S. Metzlaff, J. Mische, MERASA: Multicore execution of hard real-time applications
supporting analyzability, IEEE Micro 30 (5) (2010) 66–75. doi:10.1109/MM.2010.78.
[12] M. Pricopi, T. Mitra, Bahurupi: A polymorphic heterogeneous multi-core architecture, ACM Trans. Archit. Code Optim. 8 (4)
(2012) 22:1–22:21.
[13] M. Ju, H. Jung, H. Che, A performance analysis methodology for multicore, multithreaded processors, IEEE Transactions on
Computers 63 (2) (2014) 276–289. doi:10.1109/TC.2012.223.
[14] F. Gebali, Algorithms and Parallel Computers, John Wiley, New York, 2011.
[15] Mälardalen Real-Time Research Centre (MRTC), Benchmarks (2013).
URL http://www.mrtc.mdh.se/projects/wcet/benchmarks.html
