In this paper, we introduce a new tool for real-time multi-core systems called SMARTs: Simulating Multi-core Activities for Real Time systems. The proposed tool encapsulates four phases of multi-core system design for real-time applications, from parsing code and generating dependency graphs to scheduling tasks and mapping them onto multi-core systems. The tool also makes it possible to implement different scheduling and mapping algorithms for multi-core systems and to evaluate the performance when various design parameters are changed. As a proof of concept, we present two case studies that illustrate the features of the proposed tool and how it could be used to shorten development time.
Introduction
The current trend in the silicon industry depends heavily on multi-core chips, as they provide higher speed and better performance than single-core ones [1]. A multi-core implementation can be homogeneous or heterogeneous. In homogeneous multi-core systems, all cores are similar, whereas in heterogeneous systems, the cores differ. In most heterogeneous systems, one core serves as a general-purpose processor and the others are synergistic processor cores (coprocessors).
Designers of multi-core systems are always looking for fast processing, low power consumption, and optimum utilization of resources. Since the design of multi-core systems is complex, guaranteeing optimum performance is challenging [2]. Performance evaluation itself is not an easy task, and various techniques have been proposed to evaluate the performance of multi-core systems. The problem becomes even more complex when we target real-time systems (RTSs), because formal verification of timing properties is critical. Furthermore, a shared cache or message transfer among cores makes performance evaluation of real-time multi-core systems (RTMSs) more challenging [1].
In this paper, we introduce a new tool for RTMSs called SMARTs: Simulating Multi-core Activities for Real Time systems. SMARTs is a promising tool that aims at simulating and analyzing the performance of RTMSs. The tool takes source code that was developed for a single-core application and lets the designer simulate the execution of this code on different multi-core platforms. It also helps designers explore different solutions and compare the system performance when changing design parameters such as the number of cores, queue size, scheduling algorithm, and mapping algorithm.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives a brief description of the tool, and Section 4 explains how the tool works. Section 5 presents two case studies, and Section 6 concludes the paper.
Literature Review
A considerable amount of work in the literature proposes solutions for analyzing the performance of RTMSs. Deng and Purvis [3] discussed a generic model using tandem queuing, under the presumption that parallelization is only possible for applications that can work as a pipeline. In this model, an application can be split into independent procedures, and each procedure can be served by one or more cores in parallel. Cores are allocated based on the processing time needed by each procedure. Deng and Purvis [3] derived a lemma stating that, to minimize the time in the system, the number of servers assigned to each procedure must be proportional to the square root of its processing time. Two test analyses are presented, using Snort parallelization and image retrieval with Poise [4] [5].
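The square-root rule quoted from [3] can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: the function name `sqrt_shares`, the three processing times, and the budget of 12 cores are hypothetical, and the result is left fractional because the source does not say how shares are rounded to whole cores.

```python
import math

def sqrt_shares(processing_times, total_cores):
    """Split total_cores in proportion to the square roots of the processing times."""
    roots = [math.sqrt(t) for t in processing_times]
    total = sum(roots)
    return [total_cores * r / total for r in roots]

# Hypothetical pipeline with three procedures taking 1, 4, and 9 time units.
shares = sqrt_shares([1, 4, 9], 12)
print(shares)  # [2.0, 4.0, 6.0]
```

The procedure that takes nine times longer receives three times as many cores, not nine times as many, which is the practical consequence of the lemma.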
Jing Lee and Kunal [6] presented a model that decomposes parallel tasks into a set of sequential tasks and assigns appropriate release times and deadlines to them. Parallel synchronous task sets were generated randomly, and each task was assigned a valid period. Tasks were added to the cores as long as the total utilization was not exceeded. The tasks were scheduled using global Earliest Deadline First (EDF) and Partitioned Deadline Monotonic (PDM) scheduling. Sudipta and Chong [1] considered a shared cache and a shared bus for the multiple cores. They defined a unified analysis framework that models components such as the pipeline, shared cache, and bus, using a TDMA-based round-robin arbitration policy to assign a fixed-length bus slot to each core. They used the Least Recently Used (LRU) policy for cache replacement.
Chen et al. [7] used the Heterogeneous Dual-core Scheduling (HDS) algorithm to break down each task into sub-tasks, with the assumption that processor sub-tasks can be considered a set of sporadic tasks scheduled by EDF. Designed specifically for heterogeneous multi-core systems, this real-time task scheduling algorithm works with a non-preemptive coprocessor; preemption points are inserted into coprocessor sub-tasks using program slicing so that the coprocessor becomes semi-preemptive [7]. Experiments compared HDS to the EDF, Rate Monotonic (RM), Priority Ceiling Protocol (PCP), and Stack Resource Policy (SRP) algorithms [7]. The task set was generated by TGFF [8].
Zhaobin and Wenyu [9] presented an analysis based on pipelining and proposed a reverse interleaved pipelining algorithm (RIPA) scheduling strategy that decreases the total I/O execution time by balancing the workload across homogeneous processors. A self-adaptive scheduling method was used to select the most appropriate core to process an I/O task in heterogeneous multi-core systems. They considered independent I/O tasks that do not interact and that arrive at regular intervals. Self-adaptive scheduling works in environments with random I/O tasks.
Multi-core processors require Directed Acyclic Graph (DAG) schedulers. Karim and David [10] proposed a new approach for tasks whose children have different deadlines. Their unified DAG monitoring solution, known as the DAG Flow Manager (DFM), is low-complexity and fully independent, and it does not impose any restrictions on the DAG. It allows simple connected schedulers to have optimal control of the core assignments. The solution was tested on an H.264 decoder with different DAG configurations.
Special hardware architectures such as Multi-core Execution of Hard Real-time Applications Supporting Analyzability (MERASA) are also designed for guaranteed analyzability and timing predictability of multi-core systems [11]. The main objective of the architecture is to make the analysis of each task independent of the co-scheduled tasks, providing safe and tight worst-case execution time (WCET) estimates by isolating each task's execution from inter-task interference in both hardware and software. In this project, thread scheduling is done by the hardware, which isolates and prioritizes the hard real-time and non-hard real-time threads running on the same core. An Analyzable real-time Memory Controller (AMC) is designed to minimize the effect of inter-task interference in memory. Another dynamic heterogeneous multi-core architecture, known as Bahurupi (meaning "existing in many forms"), was proposed by Mihai and Tulika at the National University of Singapore [12]. The model creates a number of threads to execute a number of tasks; here, cores correspond to threads and basic blocks to tasks. Each core fetches a basic block to execute and, once its execution is complete, fetches the next available basic block. The main requirement of the Bahurupi model is to detect and resolve register and memory dependencies among the basic blocks.
Miao Ju and Hun Jung proposed core/thread combinations to improve the performance of multi-core systems [13]. They developed a fast packet latency estimation algorithm that works at the thread level and abstracts away instruction-level and micro-architectural details. They worked on communication processors consisting of a set of cores, each of which may run more than one thread. Threads are scheduled based on a given thread scheduling discipline. Packets are partitioned and allocated to different cores by the dispatcher. A code path is defined as a sequence of events with event inter-arrival times or event segment lengths. Table 1 compares the works discussed above.
Proposed Design Flow
To analyze and evaluate the performance of RTMSs, designers have to explore various design parameters on four different levels of abstraction. Our proposed tool provides a comprehensive solution that addresses the design requirements on all four levels and reports the overall CPU utilization at the end. The tool can take any high-level program that was written for a single core as input, and the designer can set the number of cores as needed. The tool then automatically generates the task sets, allocates tasks to queues, performs task mapping, and finally generates the net CPU utilization result. Figure 1 shows a simplified abstraction of the basic operations of the tool during the four targeted phases. The four phases are explained below.
Phase-I: Dependence Generator
In this phase, the tool extracts the dependencies among the tasks and uses a regular iterative algorithm (RIA) to obtain a directed acyclic graph (DAG), which is in turn converted to the task dependency matrix. At present, the task dependency matrix, together with the execution time of each task, is the input to the tool [14].
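As a minimal sketch of the data structure this phase produces, the Python snippet below builds a task dependency matrix from a list of DAG edges. The function name `build_dependency_matrix`, the convention that `M[i][j] == 1` means task `j` depends on task `i`, and the four-task example are illustrative assumptions; they are not taken from SMARTs, and the RIA-based extraction step itself is not shown.

```python
def build_dependency_matrix(num_tasks, edges):
    """Return an n x n 0/1 matrix; M[i][j] == 1 means task j depends on task i.

    The orientation of the matrix is an assumption made for this sketch.
    """
    matrix = [[0] * num_tasks for _ in range(num_tasks)]
    for src, dst in edges:
        if src == dst:
            raise ValueError("a task cannot depend on itself")
        matrix[src][dst] = 1
    return matrix

# Hypothetical program: tasks 1 and 2 need task 0; task 3 needs tasks 1 and 2.
example_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
M = build_dependency_matrix(4, example_edges)
for row in M:
    print(row)
```

A row of zeros (task 3 here) marks a task that nothing else waits for, and a column of zeros (task 0) marks a task with no prerequisites.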
Phase-II: Sequence Generator
In this phase, the tool generates a sequence set for all tasks. Each sequence set consists of a number of independent tasks that could be executed concurrently. Task sequences are sorted based on execution time of the tasks within the same sequence.
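One possible way to derive such sequence sets from a dependency matrix is a level-by-level topological sort (Kahn's algorithm processed in layers). The sketch below uses that technique as a stand-in; it is not claimed to be the algorithm SMARTs implements. The name `generate_sequences`, the matrix orientation (`matrix[i][j] == 1` means task `j` depends on task `i`), and the ascending sort by execution time within a sequence are all assumptions.

```python
def generate_sequences(matrix, exec_time):
    """Group tasks into sequences of mutually independent tasks.

    Each sequence holds the tasks whose predecessors all sit in earlier
    sequences. Within a sequence, tasks are sorted by ascending execution
    time (an assumed reading of the sorting rule).
    """
    n = len(matrix)
    # Number of unfinished predecessors per task (column sums).
    pending = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    remaining = set(range(n))
    sequences = []
    while remaining:
        ready = [t for t in remaining if pending[t] == 0]
        if not ready:
            raise ValueError("dependency matrix contains a cycle")
        ready.sort(key=lambda t: (exec_time[t], t))
        sequences.append(ready)
        for t in ready:
            remaining.discard(t)
            for j in range(n):
                if matrix[t][j]:
                    pending[j] -= 1
    return sequences

# Hypothetical four-task example with execution times in ticks.
matrix = [[0, 1, 1, 0],
          [0, 0, 0, 1],
          [0, 0, 0, 1],
          [0, 0, 0, 0]]
sequences = generate_sequences(matrix, exec_time=[2, 3, 1, 2])
print(sequences)  # [[0], [2, 1], [3]]
```

Tasks 1 and 2 land in the same sequence because neither depends on the other, so they could run concurrently on two cores.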
Phase-III: Task Scheduler
Once the Sequence Generator has finished allocating sequences, the task scheduler schedules the tasks in each sequence for execution based on a specific scheduling algorithm, such as EDF or RM. The task scheduler defines the order of task execution by assigning a priority level to each task and transfers the highest-priority tasks to the ready queue.
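The priority assignment and ready queue can be sketched as follows, assuming a rate-monotonic style rule in which a shorter period means a higher priority. The helper names `rank_by_period` and `build_ready_queue`, the use of a binary heap, and the example periods are hypothetical choices for this illustration rather than details of the tool.

```python
import heapq

def rank_by_period(sequence, period):
    """Map each task to a priority rank; rank 0 is the highest priority.

    Shorter period -> smaller rank (rate-monotonic style, assumed here).
    Ties are broken by task ID so the result is deterministic.
    """
    ordered = sorted(sequence, key=lambda t: (period[t], t))
    return {task: rank for rank, task in enumerate(ordered)}

def build_ready_queue(sequence, period):
    """Return a heap of (rank, task) pairs; popping yields the top priority."""
    ranks = rank_by_period(sequence, period)
    heap = [(rank, task) for task, rank in ranks.items()]
    heapq.heapify(heap)
    return heap

# Hypothetical sequence of three independent tasks with their periods.
queue = build_ready_queue([1, 2, 5], {1: 30, 2: 10, 5: 20})
order = [heapq.heappop(queue)[1] for _ in range(3)]
print(order)  # [2, 5, 1]
```

Swapping in another policy such as EDF would only change the sort key, for example to an absolute deadline instead of the period.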
Phase-IV: Task Mapper
Finally, the task mapper uses a mapping algorithm to assign tasks from the ready queue to the processor cores, based on the availability of tasks in the ready queue and of the cores at each time tick. The mapping algorithm aims to maximize the efficiency of the multi-core system by optimizing processor utilization.
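A tick-based mapper of this kind can be sketched in a few lines. The code below is a simplified illustration under several assumptions that do not come from the source: every idle core takes the next task from the ready queue at each tick, a new sequence starts only after the previous one has finished, execution times are positive integers, and the names `map_tasks` and `utilization` are hypothetical.

```python
def map_tasks(sequences, exec_time, num_cores):
    """Simulate tick by tick; return one 0/1 timeline per core (1 = busy)."""
    timelines = [[] for _ in range(num_cores)]
    for sequence in sequences:
        queue = list(sequence)        # ready queue, already in priority order
        remaining = [0] * num_cores   # ticks left for the task on each core
        while queue or any(remaining):
            for core in range(num_cores):
                # An idle core picks up the next ready task, if there is one.
                if remaining[core] == 0 and queue:
                    remaining[core] = exec_time[queue.pop(0)]
                busy = remaining[core] > 0
                timelines[core].append(1 if busy else 0)
                if busy:
                    remaining[core] -= 1
    return timelines

def utilization(timelines):
    """Percentage of core-ticks spent running a task."""
    total_slots = sum(len(t) for t in timelines)
    return 100.0 * sum(sum(t) for t in timelines) / total_slots

# Hypothetical input: three sequences, execution times in ticks, two cores.
timelines = map_tasks([[0], [2, 1], [3]], exec_time=[2, 3, 1, 2], num_cores=2)
for core, line in enumerate(timelines):
    print(core, line)
print(round(utilization(timelines), 2))  # 57.14
```

Rerunning the same call with a different `num_cores` gives a quick way to see how the idle slots grow when cores are added to a workload with limited parallelism.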
Using SMARTs, designers will have full control over the design flow. They can change the design configuration and test different scenarios to find the optimum design for a target application.
SMARTs: Simulating Multi-core Activities for Real Time systems
SMARTs is an object-oriented tool built around a node object and its associated methods. Within the tool, each task is instantiated as a node. The node declaration gives each node an ID, generates its source set and sink set, and assigns a sequence and an execution time period to each task. The nodes take their values from the task dependency matrix, and the find_the_schedule() function generates the task sequences during the second phase. Each task sequence contains tasks that can run in parallel. The function assign_priority() generates the task queue based on the execution time period of each task. Finally, the function task_mapping() maps tasks to different cores for final execution. A fill_cores() function is defined to create the utilization matrix.
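To make the node structure concrete, the following sketch shows one way such a node could be represented and populated from a dependency matrix. The class layout, the field names, the helper `nodes_from_matrix`, and the matrix orientation (`matrix[i][j] == 1` means task `j` depends on task `i`) are assumptions for illustration; the source does not give the actual declarations.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One task in the dependency graph (hypothetical layout)."""
    node_id: int
    exec_time: int
    sources: set = field(default_factory=set)  # tasks this node waits for
    sinks: set = field(default_factory=set)    # tasks that wait for this node
    sequence: int = -1                         # set later by the sequence generator

def nodes_from_matrix(matrix, exec_time):
    """Create one Node per task and fill its source and sink sets."""
    nodes = [Node(i, exec_time[i]) for i in range(len(matrix))]
    for i, row in enumerate(matrix):
        for j, dependent in enumerate(row):
            if dependent:
                nodes[i].sinks.add(j)
                nodes[j].sources.add(i)
    return nodes

# Hypothetical four-task example.
nodes = nodes_from_matrix(
    [[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]],
    exec_time=[2, 3, 1, 2],
)
print(nodes[3])
```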
As a proof of concept, SMARTs is tested with two case studies based on programs written in the 'C' language, one of which is adopted from a benchmark 'C' code [15].
Case Studies

Case Study I
The first case study is a simple 'C' program for calculating the maximum, minimum, sum, and average of two arrays.
Phase-I: Dependence Generator
The task dependency matrix (M) of the code is shown below. 
Phase-II: Sequence Generator
In this phase, the tool works on the dependency matrix and generates task sequences. Figure 2 shows the task sequence generated by the tool. 
Phase-III: Task Scheduler
As defined in the sequence set, the tasks are put into the ready queue after a priority is allocated to each task. At present, we use a static algorithm, Rate Monotonic (RM), in which tasks with shorter execution times are assigned higher priority. This algorithm is used for scheduling periodic tasks. In future work, more scheduling algorithms will be added to the tool library so that designers can choose different algorithms and evaluate the system's performance accordingly. Figure 3 shows the output of Phase-III.
Phase-IV: Task Mapper
In this phase, tasks are allocated (mapped) to different cores for final processing. Different mapping algorithms could be used for this step. In our case study, each task is allocated to a core, and once its execution is complete, the other tasks of the same sequence are allocated to the same core until all tasks of the sequence set have been executed. Figure 4 demonstrates the task mapping and the overall time estimation for processing all tasks on the multi-core system. Figure ? ? shows the CPU utilization chart generated by SMARTs for the given tasks, in which 0's represent times when the core is idle and 1's represent times when the core is running.
Case Study II
A benchmark 'C' code from [15] is used for the second case study. This code computes the sum, mean, variance, standard deviation, and correlation coefficient of two arrays. Results similar to those of the first case study are generated for this case study as well, so we present only the result generated by the last phase.
Phase-I: Dependence Generator
The task dependency matrix (N) of this code is as follows: 
Phase-II: Sequence Generator
In this phase, six sequence sets are generated for the benchmark 'C' code. Sequence sets 0, 1, 2, and 3 contain two tasks each, whereas sequence sets 4 and 5 contain only one task each.
Phase-III: Task Scheduler
In this phase, a priority is allocated to each task through the RM algorithm, and the ready queue is generated with the tasks prioritized based on their execution times.
Phase-IV: Task Mapper
The final task mapping is performed here. All tasks are mapped and executed using the partitioned mapping algorithm. Figure 5 demonstrates the task mapping and the overall time estimation for processing all tasks on the multi-core system. To show the significance of our tool, we changed only one design parameter, the number of cores, and compared the resulting CPU utilization to see which design would be best. The CPU utilization for the given tasks was 77.083% with two cores; it dropped to 51.38% with three cores and degraded further to 38.54% with four cores. Going from two to three cores is thus a 33.3% relative reduction in utilization.
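The reported figures can be checked with a short calculation. The sketch below only uses the three utilization values stated above; the helper name `relative_change` is hypothetical, and reading the 33.3% figure as the relative change from two to three cores is an interpretation, since the text does not spell out which pair of values it refers to.

```python
def relative_change(old, new):
    """Relative change from old to new, in percent (negative = decrease)."""
    return 100.0 * (new - old) / old

# CPU utilization (%) reported for case study II, keyed by number of cores.
reported = {2: 77.083, 3: 51.38, 4: 38.54}

two_to_three = relative_change(reported[2], reported[3])
two_to_four = relative_change(reported[2], reported[4])
print(round(two_to_three, 1))  # -33.3
print(round(two_to_four, 1))   # -50.0

# Utilization times core count stays nearly constant across the three runs.
for cores, util in reported.items():
    print(cores, round(cores * util, 1))
```

The nearly constant product of core count and utilization is consistent with the same amount of work finishing in the same overall time in all three configurations, so the extra cores would mostly sit idle for this workload.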
Conclusion
In this paper, we presented a promising new software tool that could be used to analyze and evaluate the performance of real-time multi-core systems. The proposed tool, SMARTs, not only maps a single-core program onto a multi-core hardware platform, but also lets designers try different design parameters and compare the system performance at early design phases. We introduce, for the first time, a comprehensive tool that allows designers to explore the design space at four different levels. The tool can parse high-level language code, generate the dependency matrix, allocate tasks to different sequence sets, and map tasks that are ready to be executed onto a multi-core hardware platform. To validate the proposed tool, we presented two case studies. In the second case study, we showed a 33.3% change in utilization by changing just one design parameter, which demonstrates the tool's significance.
