Abstract: This paper studies the schedulability and predictability of a custom-designed CPU architecture named Multi Pipeline Register Architecture (nMPRA). The nMPRA CPU implementation uses replication and remapping techniques for the program counter, general-purpose registers and pipeline registers, providing predictability and hardware-based isolation for hard real-time threads. We describe real-time scheduling tests on the nMPRA processor architecture, including a fine-grained multithreading configuration. The paper highlights several solutions for improving the performance of CPU architectures and reducing the overhead introduced by operating systems.
Introduction
Some of the currently available CPU implementations are not suitable for real-time systems (RTS) with hard real-time requirements. Such systems are critical in terms of hard real-time execution and of spatial and temporal isolation. In this context, even in distributed systems, the timing predictability of tasks is a defining characteristic.
It is very important for a system dedicated to the control of industrial processes to be associated with a mathematical model that describes the practical application as realistically as possible. Particular attention is given to the method of calculating the processor scheduler limit with minimum temporal jitter. Because the results produced by the tasks are based on data taken from the sensors, a typical control system that interacts with the controlled process must return those results to the process, by means of execution components, within the deadline [1].
Jitter is an RTS-specific measure of the deviation in the periodicity of the component tasks; it threatens the stability and determinism of the application. To limit it, field-programmable gate array (FPGA) devices, which offer a high logic-gate capacity, can serve as hardware support for embedded real-time operating systems, as presented by Shahbazi et al. in [2].
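The jitter measure introduced above can be made concrete with a minimal Python sketch. It assumes that jitter is taken as the largest absolute deviation of the observed inter-activation intervals from the nominal period; the function name and the sample timestamps are hypothetical and are not taken from the nMPRA measurements.

```python
def period_jitter(activations, period):
    """Largest absolute deviation of the observed inter-activation
    intervals from the nominal period (same time unit as the inputs)."""
    if len(activations) < 2:
        raise ValueError("need at least two activation times")
    # Interval between each pair of consecutive activations.
    intervals = [b - a for a, b in zip(activations, activations[1:])]
    return max(abs(interval - period) for interval in intervals)


# Hypothetical activation times (in microseconds) of a task with a 100 us period.
samples = [0, 100, 203, 299, 400]
print(period_jitter(samples, 100))  # prints 4
```

Other definitions (for example, deviation from the ideal release instants) are equally common; the sketch only illustrates the kind of quantity the hardware scheduler aims to minimize.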
While multi-core processors outperform single-core ones, a realistic worst-case execution time (WCET) analysis is more difficult to obtain for them. The reason is that the computed execution times differ significantly from the measured WCET estimates, due to the dynamic interaction between the tasks executed on a multi-core processor [3]. Nevertheless, in the domain of safety-related real-time embedded systems, multi-core architectures strive to outperform single-core architectures [4], which is why research on these systems is increasing.
As a result, it is crucial that the systems most widely used in the aeronautical and automotive fields provide timing-predictable behavior, with spatial and temporal isolation for hard real-time tasks. To improve the predictability of such systems and to avoid excessive use of resources, synchronization and communication mechanisms are needed that ensure a tight WCET margin for the tasks executed on single-core processors.
The purpose of this paper is to continue and extend the nMPRA project presented by Gaitan et al. in [5]. We present the experimental tests executed on a five-stage pipeline architecture named nMPRA-MT (Multi Pipeline Register Architecture-Fine-grained Multithreading) [6].
At the hardware level, nMPRA-MT guarantees interrupt response with minimum temporal jitter and minimum delay. Furthermore, the proposed CPU architecture provides predictability and hardware-based isolation for hard real-time threads. We experimentally evaluate the design tradeoffs that arise when seeking to isolate tasks of different criticalities while keeping overheads commensurate with those of a standard RTOS. This paper is structured as follows: after the brief introduction in Section 1, an overview of the nMPRA and nMPRA-MT architectures is presented in Section 2. The experimental results obtained during the tests are presented in Section 3. Section 4 then gives an overview of the related work, and Section 5 concludes the paper.
Overview of the nMPRA and nMPRA-MT Architecture

nMPRA Architecture
The nMPRA architecture was designed and implemented in VHDL. It uses replication and remapping techniques for the program counter, general-purpose registers and pipeline registers, providing predictability and hardware-based isolation for hard real-time threads.
The nMPRA pipeline can hold up to five instructions in flight, so that the execution rate approaches one instruction per clock cycle. Fig. 1 presents the nMPRA architecture, which is built around a five-stage pipeline designed to improve computational performance while maintaining the advantages of pipelining.
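The statement that the execution rate approaches one instruction per clock cycle follows from the standard model of an ideal pipeline, sketched below in Python. The sketch assumes no stalls or flushes, and the function names are illustrative; it is not a model of the actual nMPRA data path.

```python
def pipeline_cycles(num_instructions, stages=5):
    """Clock cycles needed by an ideal pipeline with no stalls or flushes:
    the first instruction takes `stages` cycles, and every following
    instruction completes one cycle after its predecessor."""
    if num_instructions <= 0:
        return 0
    return stages + num_instructions - 1


def cycles_per_instruction(num_instructions, stages=5):
    """Average cycles per instruction for the ideal pipeline above."""
    return pipeline_cycles(num_instructions, stages) / num_instructions


for n in (1, 10, 1000):
    print(n, pipeline_cycles(n), cycles_per_instruction(n))
```

As the number of instructions grows, the fill time of the five stages is amortized and the average cost per instruction tends towards one cycle.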
nMPRA-MT Architecture
The nMPRA-MT architecture was implemented in Verilog (a hardware description language with a syntax similar to that of C). The project is a continuation of the nMPRA architecture and the nHSE hardware scheduler. By including in this new architecture the fundamental idea of the PRET and FlexPRET concepts [7], [8], we also make full use of its temporally repeatable behavior. For this purpose, high-priority tasks are classified as hard threads (HT) and low-priority tasks as soft threads (ST).
International Journal of Computer and Electrical Engineering
The goal of this project was to provide predictability of hard real-time tasks, based on a fast switching operation between a blocked task and another one, scheduled by nHSE.
All the resources of the processor pipeline are shared by every thread, except for HT0. HT0 is the only thread active after power-on or reset, and it has access to the nHSE configuration registers. Because a periodic and precise computation of each thread state is required, the nHSE works on the falling edge of the clock and is therefore always in the active state. To accomplish a fast thread context switch, each HT or ST thread has its own PC, a distinct bank in the register file and separate pipeline registers.
Because the nMPRA-MT architecture uses resource remapping techniques for the pipeline registers and the CPU working registers, the nHSE interleaves different threads in the pipeline without losing clock cycles to context-switching operations.
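The idea of switching context by remapping instead of saving and restoring state can be illustrated with a small behavioral model in Python. The class names, the register count of 32 and the pipeline register labels other than MEM/WB are assumptions made for illustration; the actual design is written in Verilog and is not reproduced here.

```python
class ThreadContext:
    """Private state replicated for each thread (assumed layout)."""

    def __init__(self, entry_pc, num_regs=32):
        self.pc = entry_pc                # per-thread program counter
        self.regs = [0] * num_regs        # per-thread register file bank
        # Per-thread copies of the pipeline registers (labels assumed).
        self.pipeline_regs = {"IF/ID": None, "ID/EX": None,
                              "EX/MEM": None, "MEM/WB": None}


class ReplicatedCore:
    """Core in which a context switch only changes the active selector."""

    def __init__(self, entry_pcs):
        self.contexts = [ThreadContext(pc) for pc in entry_pcs]
        self.active = 0                   # thread 0 (HT0) is active after reset

    def switch_to(self, thread_id):
        # Nothing is copied or saved: only the selector changes.
        if not 0 <= thread_id < len(self.contexts):
            raise IndexError("unknown thread id")
        self.active = thread_id

    @property
    def current(self):
        return self.contexts[self.active]


core = ReplicatedCore([0x000, 0x100])
core.current.regs[1] = 7      # HT0 writes one of its registers
core.switch_to(1)             # switch to thread 1
print(core.current.regs[1])   # prints 0: thread 1 has its own bank
core.switch_to(0)
print(core.current.regs[1])   # prints 7: the state of HT0 was never touched
```

Because every thread keeps its own bank, the cost of a switch in this model is independent of the amount of state, which is the property the text attributes to remapping.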
To preserve spatial isolation between threads, the nMPRA-MT architecture allows HTs to read anywhere in memory but to write only in certain areas. These areas are specified in the nHSE, which can be configured only by the high-priority thread. In terms of timing predictability, nMPRA-MT uses a separate memory for HT threads and a common memory space for ST threads. Research on the memory controller is ongoing, with the aim of obtaining efficient memory management and reducing the WCET.
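The read-anywhere, write-restricted policy described above can be sketched as a simple permission check in Python. The class name, the use of half-open address ranges and the choice of thread id 0 for the configuring thread are assumptions for illustration; the source does not specify how the nHSE encodes the writable areas.

```python
class WriteGuard:
    """Behavioral model of per-thread write regions (assumed encoding)."""

    CONFIG_THREAD = 0  # assumed id of the high-priority thread HT0

    def __init__(self):
        self.regions = {}  # thread id -> list of (start, end) half-open ranges

    def set_region(self, requester, thread_id, start, end):
        # Only the high-priority thread may configure the writable areas.
        if requester != self.CONFIG_THREAD:
            raise PermissionError("only the high-priority thread may configure")
        self.regions.setdefault(thread_id, []).append((start, end))

    def can_read(self, thread_id, address):
        # Reads are allowed anywhere in memory.
        return True

    def can_write(self, thread_id, address):
        # Writes are allowed only inside a region granted to this thread.
        return any(start <= address < end
                   for start, end in self.regions.get(thread_id, []))


guard = WriteGuard()
guard.set_region(requester=0, thread_id=1, start=0x1000, end=0x2000)
print(guard.can_write(1, 0x1800))  # prints True
print(guard.can_write(1, 0x2800))  # prints False
print(guard.can_read(1, 0x2800))   # prints True
```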
Pipeline and Thread Management
To provide both temporal and hardware-based isolation, the highest-criticality tasks are assigned to HTs. The nMPRA-MT architecture is equipped with a preemptive scheduler that is implemented in hardware and is part of the CPU itself.
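A preemptive fixed-priority selection of the kind mentioned above can be sketched in a few lines of Python. The sketch assumes that a lower thread index means a higher priority (so that HT0 is the most critical thread) and that the selection is re-evaluated every scheduling cycle; the actual nHSE policy may differ.

```python
def select_thread(ready):
    """Return the id of the highest-priority ready thread, or None.

    `ready` is a list of booleans indexed by thread id; index 0 is
    assumed to be the highest priority.
    """
    for thread_id, is_ready in enumerate(ready):
        if is_ready:
            return thread_id
    return None


ready = [False, True, True]
print(select_thread(ready))  # prints 1

# An event makes thread 0 ready; at the next evaluation it preempts thread 1.
ready[0] = True
print(select_thread(ready))  # prints 0
```

Re-running the selection on every scheduler clock edge is what makes the policy preemptive in this model: a newly ready high-priority thread wins at the next evaluation.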
To reduce pipeline costs, the five-stage data path was modified. As shown in Fig. 2, the multiplexer that selects the data to be written into the register file through the MEM/WB pipeline registers is placed after the data memory. With this solution, the MEM/WB pipeline registers are smaller and store only one value to be written in the next stage. The dynamic nHSE is based on the remapping algorithm proposed by the nMPRA project in [5]; it is capable of performing fast task-switching operations and of guaranteeing high performance in the execution of the pipeline.
Experimental Results
In this section, we test the presented concept and report the experimental results obtained with the FPGA prototype. The project was implemented on a Virtex-6 FPGA ML605 Evaluation Kit from Xilinx, and the processor code was developed in standard Verilog. The nMPRA-MT processor was implemented for working frequencies of 10 MHz, 50 MHz and 75 MHz.
In Fig. 3, channel 1 of the oscilloscope capture shows the clock (nMPRA_clock) used for the pipeline registers and data memories. Channel 2 shows the clock waveform used for task switching in the nHSE block (nHSE_clock). This clock signal is 240 degrees out of phase with the nMPRA_clock signal, and both signals have a 33% duty cycle. Channel 3 shows an asynchronous external event, obtained in this case by pushing a button on the ML605 board; channel 4 shows the response of the software application, which sets a pin.
Fig. 3. Jitter of the highest priority thread HT0 in relation to an external event.
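To give a sense of the time scales involved in the clocking scheme described above, the following Python sketch converts a frequency, a phase shift and a duty cycle into nanoseconds. It assumes that the 33% duty cycle is exactly one third of the period; the function name and the dictionary keys are illustrative.

```python
from fractions import Fraction


def clock_timing_ns(freq_mhz, phase_deg=240, duty=Fraction(1, 3)):
    """Period, phase offset and high time of a clock, all in nanoseconds.

    Exact fractions are used to avoid rounding; `duty` is assumed to be 1/3.
    """
    period = Fraction(1000, freq_mhz)  # 1000 ns per microsecond
    return {
        "period": period,
        "phase_offset": period * Fraction(phase_deg, 360),
        "high_time": period * duty,
    }


for freq in (10, 50, 75):  # working frequencies reported for the prototype
    timing = clock_timing_ns(freq)
    print(freq, {name: float(value) for name, value in timing.items()})
```

At 50 MHz, for example, the period is 20 ns, so under these assumptions a 240 degree shift places the nHSE_clock edge 40/3 ns (about 13.3 ns) after the corresponding nMPRA_clock edge.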
At a frequency of 50 MHz, without synchronization and communication mechanisms, and when the HT0 thread does not execute sw instructions, the response of the scheduler to an asynchronous external event may take around 27 ns, depending on the time at which the event occurs.
The processor was tested in three implementations, with 4, 8 and 16 tasks, where the maximum nesting depth of functions is 6 levels. The approximate implementation cost, expressed as a percentage of the FPGA device used for validation (Xilinx Virtex-6 xc6vlx240t-1ff1156), may increase with a larger number of threads and with more complex functionality in the nHSE, the Hazard Detection Unit, the Forward Unit and the synchronization and communication mechanisms. Implementing this architecture for a large number of tasks would require synthesizing logic whose propagation time would be excessively high, and the working frequency would therefore drop significantly.
In pipelined processors implemented in hardware, all the logic modules operate concurrently. This means that the CPU clock must drive the appropriate CPU structures at the required time. To provide predictability for each HT execution thread, a constant scheduling frequency is required. At first glance, the replication of resources and the remapping techniques used by nMPRA processors may appear to be an inefficient use of resources. However, it is important to note that the nMPRA-MT architecture could provide higher throughput per area and power savings compared to other single-core or multi-core systems for certain task sets.
Moreover, for a mixed-criticality task set in which predictability and hardware-based isolation guarantees for HTs are the main constraints, the nMPRA-MT processor is better suited than many traditional processors.
Because resources are replicated and one instruction from the same thread is executed every two clock cycles, the pipeline is never stalled and the scheduling procedures never have to save the context of the current task.
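The issue pattern in which a thread executes one instruction every two clock cycles can be illustrated with a small Python sketch. It assumes two threads that alternate strictly in the issue slot, which is one possible reading of the text and not a confirmed description of the nHSE policy; the function names and thread labels are hypothetical.

```python
def interleave(thread_ids, num_cycles):
    """Thread issued in each clock cycle under strict round-robin interleaving."""
    if not thread_ids:
        raise ValueError("at least one thread is required")
    return [thread_ids[cycle % len(thread_ids)] for cycle in range(num_cycles)]


def issue_gaps(schedule, thread_id):
    """Distances, in cycles, between consecutive issue slots of one thread."""
    slots = [cycle for cycle, tid in enumerate(schedule) if tid == thread_id]
    return [b - a for a, b in zip(slots, slots[1:])]


schedule = interleave(["HT0", "HT1"], 6)
print(schedule)                     # ['HT0', 'HT1', 'HT0', 'HT1', 'HT0', 'HT1']
print(issue_gaps(schedule, "HT0"))  # [2, 2]
```

With a constant gap between consecutive instructions of the same thread, the spacing seen by each thread does not depend on what the other thread executes, which is the basis of the predictability argument made above.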
Related Work
The Merasa project [9] was developed to obtain a processor architecture that can be used successfully in hard real-time embedded systems. The priorities of the execution threads are fixed and the chosen scheduling policy is round robin. Each core has two scratchpad memories, one used for instructions and the other for data (D-ISP and DSP); data integrity is ensured by allocating an individual subset of cache banks to each task. Concerning the architecture, each core can have only one HRT execution thread and an arbitrary number of NHRT execution threads. Given that embedded systems have limited resources available, present-day architectures must offer an optimal cost for the implementation of an average number of HRT and NHRT execution threads, including their synchronization and communication mechanisms. If the HRT thread is suspended, pending an external time event or a resource shared with another HRT or NHRT task, its dedicated pipeline remains unused, which negatively influences the performance of the entire system. Therefore, although the spatial isolation of HRT threads is guaranteed, the predictability of the Merasa architecture depends on the hazard situations that occur in the classical pipelines and on the synchronization and communication mechanisms between the HRT and NHRT threads of the entire multi-core system.
In [10], Andalam proposes the ARPRET architecture, which obtains predictability by designing a custom soft core coupled with a hardware accelerator called the Predictable Functional Unit. The timing behavior of models and programs thus becomes most important because, in order to guarantee that a hard real-time system behaves according to the model, their characteristics must be preserved during compilation.
The aim of the Komodo project, as presented by Kreuzinger et al. in [11], is to use the Komodo Java-based multithreaded microcontroller for handling multiple real-time threads. The architecture uses multiple stack register sets, program counters and instruction windows, and a signal unit to manage a set of threads triggered by interrupts. To ensure real-time support and to manage multiple threads of different priorities with the proposed four-stage pipeline architecture, the picoJava instruction set is enhanced. Because the Komodo priority manager supports fast context switching, another thread can use the unallocated cycles when a branch or memory access causes a pipeline stall.
Although other threads can be executed to increase throughput, the cited paper does not describe any synchronization or communication mechanism.
Conclusion and Future Work
The current paper extends the basic project presented by Gaitan et al. in [5], proposing new improvements for the nMPRA and the nHSE. The aim of this project was to obtain a predictable architecture dedicated to small real-time applications.
The processor architecture is subject to new improvements such as:
