Abstract-The MPRA (Multi Pipeline Register Architecture) was modified and converted into the n-task MPRA (nMPRA) by replicating the pipeline registers. While the original MPRA provided hardware scheduling, interrupts and events caused excessive delays. The author proposes original solutions for the treatment of interrupts and events, which represent the author's contribution to improving nMPRA. After the theoretical presentation of these solutions in the author's previous articles, this paper presents the implementation of the schemes, the test results and the improved schemes. The MPRA, MPRA4 and MPRA8 implementations on FPGA (Field Programmable Gate Array) were used to evaluate performance. A detailed analysis, partially presented in this paper, shows further advantages: no extra software is required, the hardware implementation is simple, interrupts and events are handled uniformly, and task synchronization and communication are entirely hardware-based. Power consumption remains low even when the resources are replicated eight times, and the memory and logic resource consumption is reasonable, growing by about four times for MPRA4 (compared to MPRA) and by about eight times for MPRA8.
I. INTRODUCTION
Automation technologies, intelligent building monitoring and industrial process control are changing significantly, driven by the pressure to reduce operational costs and the need to integrate major advances in software and telecommunications. In addition to information and communication technologies, embedded systems and networks are two further engines of technological progress.
Embedded systems already play a central role in our daily lives, although their presence is often hidden. More than 98% of all processors produced worldwide are used for adjusting, monitoring and controlling devices found in everyday life. For example, they are found in everything from ABS and ESP systems in vehicles and smartphones for communication and information services to household appliances and industrial production installations [1].
Current embedded systems are designed to validate the technologies that will turn innovative processes and applications into reality.
Terms such as Real-Time Systems (RTS), Real-Time Operating Systems (RTOS) and microcontrollers are closely related to embedded systems and to their future version, Cyber-Physical Systems. The need for short reaction times to external stimuli in fast processes has led to in-depth research on processor architectures and on RTOS [2]. Most researchers in the field have concluded that certain parts of the RTOS (or the whole RTOS) must be implemented in hardware to increase the parallel processing of information and thus reduce the response times of embedded systems [3] [4].
The nMPRA architecture (Multi Pipeline Registers Architecture for n tasks) is innovative, with very low response times (of the order of nanoseconds) to external stimuli [1] [2] [3] [4]. Improving these response times was also the goal of the author; some results are presented in the author's earlier papers.
Haven't most questions about real-time systems and architectures already been answered and debated? Open problems still exist in the area of real-time computing and, more importantly, new developments and technologies have enabled new possibilities and created new directions of research. Indeed, there are more questions to be asked and answers to be found.
To what extent can the performance of hard real-time systems be improved through hardware schedulers, and what can be done to enhance their architecture and design? The main goal of the research, partially presented in this paper, was to analyse and resolve the hardware implementation of resource management as well as interrupt and event handling in a hard real-time system by enhancing its hardware scheduler, eliminating non-determinism and overhead, and achieving the shortest context-switching latency.
This paper continues the author's research on microcontrollers for RTS, reported in previously published articles [8] [9] [10].
The solutions presented in this article are only improvements to the existing architecture. In [2] [3] [4], the results of the team's study are further developed by implementing nMPRA on Virtex 7.
First, the nMPRA architecture is summarized (it has been detailed in the previous articles) and analysed for 4 and 8 tasks. This is followed by the analysis of interrupt and event prioritization. The paper continues with examples of the results obtained from implementing the MPRA, MPRA4, MPRA8 and the pipeline registers on Cyclone V, together with some graphs. Finally, the conclusions drawn from the results are presented.
II. THE MULTI PIPELINE REGISTERS ARCHITECTURE FOR N TASKS
The most important problem of a Real-Time System (RTS) is the scheduling of resources: the processor, the memory, the I/O ports and, when the systems are distributed, the communication networks. Increasingly complex systems rely mainly on CPU control. The specific role of an RTS is to ensure predictable and deterministic control of a process. Raw speed is not a defining feature of an RTS; it is a more abstract notion. The speed of response to events is a concept specific to RTS, but it is distinct from raw speed. Hardware RTOS support can be easily implemented on FPGA (Field Programmable Gate Array) devices [11] [12]. The strong evolution of FPGAs has influenced both the design methodology and the requirements for development tools [13] [14].
An embedded real-time system, like any technical system, must be reliable and secure. In this context, hardware schedulers play an important role: they relieve the processor of task scheduling by taking it over [15]. nMPRA has an integrated hardware scheduler [16] [17]. The MIPS32 (Microprocessor without Interlocked Pipeline Stages) pipeline architecture was chosen as the basis for the research because it is available in various forms. The MIPS processor uses a reduced instruction set and therefore requires minimal human effort. The pipeline improves the instruction throughput. The intention is to increase the processor speed and to simplify the hardware for cost reasons.
The MIPS32 processor is essentially a single-cycle CPU extended with advanced features, namely four pipeline registers, a hazard detection unit and a forwarding unit. The advantage of the four pipeline registers is that the processor can execute multiple instructions at the same time, providing instruction-level parallelism [18].
The framework of the MIPS32 pipeline architecture [19] was used for the MPRA4 and MPRA8 designs. Fig. 1 shows the MPRA8 architecture. The higher hardware resource utilization (and cost) can be seen as a disadvantage; however, other solutions do not offer the same cost-performance ratio or the same reduced interrupt and event latency, which is critical for real-time systems with very tight time constraints [17].
If this processor is implemented using FPGA technology, the system can be upgraded with new features requested by users [21]. The nMPRA is described in detail in [1] [2] [3] [4] and partially summarized by the author in [8] [9] [10], so the architecture is not discussed further here; some practical results are presented instead. The architecture has been improved by the author through an innovative method for the unified treatment of interrupts and events. The partial implementation of the proposed architecture on FPGA, written in Verilog, allowed the theoretical conclusions drawn by the author in [8] [9] [10] and by the team in [1] [2] [3] [4] to be tested. A comparison between different real-time architectures (Fig. 2) shows that replicating the pipeline registers in the nMPRA increases the task switching speed (nMPRA: 1-3 processor cycles; hthreads: 525-990 processor cycles; ARPA-MT: 72 processor cycles; Kuacharoen: 125 processor cycles [1]).
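To make the cycle counts quoted above easier to compare, the following minimal Python sketch converts them into ratios and into time at a given clock. The cycle ranges are those reported in [1]; the 100 MHz clock in the usage note and the function names are illustrative assumptions, not values from the paper, and cycle counts of architectures running at different clock frequencies are only indicatively comparable.

```python
# Task-switch cost in processor cycles (min, max), as quoted from [1].
SWITCH_CYCLES = {
    "nMPRA": (1, 3),
    "hthreads": (525, 990),
    "ARPA-MT": (72, 72),
    "Kuacharoen": (125, 125),
}


def switch_time_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count into nanoseconds for an assumed clock frequency."""
    return cycles * 1000.0 / clock_mhz


def conservative_ratio(name: str) -> float:
    """Best case of the other architecture divided by the worst case of nMPRA.

    This is the least favourable comparison for nMPRA, in cycles only.
    """
    worst_nmpra = SWITCH_CYCLES["nMPRA"][1]
    best_other = SWITCH_CYCLES[name][0]
    return best_other / worst_nmpra
```

For example, `conservative_ratio("ARPA-MT")` gives 24.0, and at a hypothetical 100 MHz clock `switch_time_ns(3, 100)` gives 30.0 ns for the worst-case nMPRA switch.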
III. THE PRIORITIZATION OF THE INTERRUPTS AND THE EVENTS
The software solution proposed in [5], although it allows the interrupt priorities to be changed conveniently, cannot be accepted because of the large delays introduced by the test blocks and the interrupt service routines. In addition, the delay depends on the position of the interrupt in the test block. The event is detected in a single loop.
Unlike the software solution (Fig. 3a), the hardware solution (Fig. 3b) offers the same response time for any interrupt. Each interrupt takes the priority of the task to which it is attached [8]. When multiple interrupts occur, a priority encoder selects the one with the highest priority (Fig. 4), so the decision block introduces the same delay in every case [8]. The interrupt prioritization scheme was extended to events, becoming a generalized solution for any new type of event that can be attached to the nMPRA and handling the situation in which multiple events are active [9]. In [9] [10], a hardware solution for event prioritization is proposed. The proposed solution provides unified management of interrupts and events. Treating events in hardware reduces the time required to identify the source of the event and to launch the service routine.
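The difference between the two approaches can be illustrated with a small behavioural sketch in Python. It is not the author's Verilog implementation: it assumes that a lower line index means a higher priority, that each software test costs one cycle and that the encoder settles in one cycle; all names and cycle values are hypothetical.

```python
def software_test_block(pending_mask: int, lines: int, cycles_per_test: int = 1):
    """Sequential polling: test each interrupt flag in turn, highest priority first.

    Returns (selected_line, cycles_spent). The cost grows with the position of
    the pending interrupt in the test block.
    """
    for line in range(lines):
        if pending_mask & (1 << line):
            return line, (line + 1) * cycles_per_test
    return None, lines * cycles_per_test


def priority_encoder(pending_mask: int, encoder_delay: int = 1):
    """Model of a combinational priority encoder.

    Returns (selected_line, delay). The delay is the same whichever line wins.
    """
    if pending_mask == 0:
        return None, encoder_delay
    lowest_set_bit = pending_mask & -pending_mask  # isolate the lowest set bit
    return lowest_set_bit.bit_length() - 1, encoder_delay
```

With lines 5 and 7 pending (`0b10100000`) on an 8-line block, the polling model selects line 5 after 6 test cycles, while the encoder model selects the same line with its fixed delay.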
The scheme is used to prioritize all categories of events (Fig. 5). The global event prioritization scheme is needed first of all when several events become active at the same time; in that case, the event with the highest priority must be selected efficiently [10].
What happens if multiple events are attached to the same task: will one, several or none of them pass?
According to the prioritization scheme presented, a single event will finally pass (Fig. 6). If there are multiple active events in the selected category, a second selection within that category is needed to determine which event is treated first. This second selection depends on the category of events.
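The two-stage selection described above can be sketched as follows. This is only an illustration under stated assumptions: the category labels reuse the event names that appear in this section plus a hypothetical `IntEv` label for interrupts, the category order is assumed, and the same "lowest index wins" rule is applied inside every category, whereas in the paper the second selection depends on the category.

```python
# Assumed category order, highest priority first (hypothetical, not from the paper).
CATEGORY_PRIORITY = ("TEv", "WDEv", "D1Ev", "D2Ev", "IntEv")


def lowest_set_index(mask: int) -> int:
    """Index of the lowest set bit of a non-zero mask."""
    return (mask & -mask).bit_length() - 1


def select_event(active: dict):
    """Pick exactly one event from a map {category: bit mask of active sources}.

    Stage 1 chooses the highest-priority category that has any active source;
    stage 2 chooses one source inside that category.
    """
    for category in CATEGORY_PRIORITY:
        mask = active.get(category, 0)
        if mask:
            return category, lowest_set_index(mask)
    return None  # no event is active
```

For instance, with two interrupt sources and one D1Ev source active, `select_event({"IntEv": 0b0110, "D1Ev": 0b1})` lets only `("D1Ev", 0)` pass.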
Events such as those presented (TEv i , WDEv i , D1Ev i and D2Ev i ) and those treated globally by software have an associated trap register that points to their service routine (Fig. 7). The service routine addresses are loaded into a vector register by sCPU 0 at start-up, after reset. After the priority selection, the service routine address associated with the selected event is obtained (the event service routine is executed by sCPU i ). The NextEv i signal, which is deactivated during event treatment, indicates that an event is currently being treated and that no other event can be served until the handler ends. The prioritization is completed by the Program Counter (PC i ) of the task, whose architecture was modified to save the return address from an event handler internally (Fig. 8), as presented in [10]. Inside PC i there is a register called BackUp_PC that saves the current address of PC i when an event occurs, as signaled by sCPU_Ev i . PC i then automatically loads, from the appropriate trap register, the address of the active event with the highest priority. The return from the event handler is triggered by executing the instruction retesr (return from the event service routine), which activates ret_esr and tells PC i to load the return address from the BackUp_PC register, so that normal program execution continues. When the return address is saved in the BackUp_PC register, the NextEv i signal is deactivated; after returning from the current event treatment, NextEv i is reactivated, indicating that another event may be processed [10]. The partial scheme for global event prioritization is presented in detail in [9] [10]; this article therefore only exemplifies a few results obtained after writing the Verilog code for Cyclone V.
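The interplay between PC i , BackUp_PC, the trap registers and NextEv i can be summarized by a small behavioural model in Python. It is a sketch of the mechanism as described in the text, not the Verilog code used on Cyclone V; the class name, the method names, the trap addresses and the 4-byte instruction step are assumptions made for illustration.

```python
class TaskProgramCounter:
    """Behavioural model of one task's PC with event entry and return."""

    def __init__(self, trap_registers: dict, start: int = 0):
        self.pc = start
        self.backup_pc = None             # models the BackUp_PC register
        self.next_ev = True               # models NextEv: True means an event may be accepted
        self.trap = dict(trap_registers)  # event name -> service routine address

    def step(self) -> None:
        """Advance to the next instruction (4-byte instructions assumed)."""
        self.pc += 4

    def raise_event(self, event: str) -> bool:
        """Enter the service routine of `event` if no other event is being treated."""
        if not self.next_ev:
            return False                  # blocked until the current handler ends
        self.backup_pc = self.pc          # save the current address
        self.next_ev = False              # deactivate NextEv during the treatment
        self.pc = self.trap[event]        # load the handler address from the trap register
        return True

    def retesr(self) -> None:
        """Return from the event service routine and re-enable event acceptance."""
        if self.next_ev:
            raise RuntimeError("retesr executed outside an event service routine")
        self.pc = self.backup_pc
        self.backup_pc = None
        self.next_ev = True
```

In this model a second event raised while a handler is running is simply refused; how pending events are latched and replayed afterwards is outside the scope of the sketch.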
IV. ANALYSIS ON CYCLONE V
Cyclone V FPGAs from Altera (Fig. 9) offer low cost, low power consumption and performance levels that make the device family well suited to high-volume applications. The total power consumption is 40% lower than that of the previous generation, and the logic integration capabilities are efficient. Cyclone V FPGAs provide the power, cost and performance levels required for high-volume applications, including motor control units, converter and video capture cards and portable devices [22]. The prioritization scheme and the MPRA architecture for 4 and 8 tasks were implemented and analysed on Cyclone V. This included the adaptation and Verilog implementation of the basic elements of the CPU (control unit, data memory and instruction memory, hazard detection unit, forwarding units, ALU, multiplexers), and the Verilog code for the structures with the replicated and multiplexed nMPRA-specific resources: the PC register, the pipeline registers, the Register File and any other memory element. The MPRA, MPRA4 and MPRA8 architectures were compiled, analysed and simulated, and the structures with multiplexed resources (the PC register, the pipeline registers and the Register File) were simulated and tested individually. The power consumption and the resources used by each were analysed and then compared, to demonstrate that the real-time system requirement of the lowest possible power consumption, which avoids oversized platforms, is met (e.g. Fig. 10-11).
For analysis and synthesis, the MPRA, MPRA4 and MPRA8 architectures, followed by the pipeline registers, the PC register and the Register File of each of the three architectures (IFID, IFID4, IFID8, IDEX, IDEX4, IDEX8, EXMEM, EXMEM4, EXMEM8, MEMWB, MEMWB4, MEMWB8, PC, PC4, PC8, RegisterFile, RegisterFile4 and RegisterFile8), were all implemented on a Cyclone V 5CGTFD9E5F35C7 device (e.g. Fig. 10-11).
The charts obtained from the analysis, a few of which are shown as examples (Fig. 10-11), reveal the intended outcome: reduced power consumption for MPRA8, much smaller memory requirements than other RTOS architectures and a reasonable consumption of logic resources.
Resource consumption naturally increases, largely because of the multiplexers, which are very costly. The factors of four and eight for the logic are approximate because state elements other than those listed (PC, Register File, pipeline registers) must also be replicated; for example, there are such memory elements in coprocessor 0 and in the multiplication/division unit.
All storage elements were replicated and multiplexed, for example the reg declarations, which are memory elements, as well as those of coprocessor 0 and of the multiplication and division unit. Because of the registers, the multiplexers, the encoder and the trap registers in the prioritization scheme, the FPGA energy consumption increases considerably.
According to the Cyclone V study, if trap cells are used to prioritize events (Fig. 7), resource consumption increases by about 5%.
If only the encoder is used (Fig. 6) , with no trap cells, consumption increases by only 2.69%.
Resource and energy consumption remain acceptable. With the improvements made to manage interrupts and events efficiently, even though the number of processor cycles increases, the platform is not oversized, the task switching speed remains high compared to other microcontrollers, and the cost can be considered low relative to the requirements of real-time systems.
V. CONCLUSION
The practical results presented lead to the conclusion that FPGAs, as a hardware alternative, are compact and have low power consumption. Although nMPRA is an architecture with multiplexed resources, the memory required to implement the processor varies between 10 and 35 kB, depending on the number of tasks and the depth of nested function calls. This is reasonable considering that in current microcontrollers the general-purpose RAM capacity ranges between 256 KB and 2.6 MB.
In terms of power, the conclusion is that MPRA8 is an efficient microcontroller architecture: the power consumption is reduced and the memory requirements are reasonable, so the platforms are not oversized. Around 16 tasks are more than enough for most industrial and automotive applications.
If higher speed is desired, resource consumption increases. Even though the solution proposed in this paper increases the number of processor cycles (by more than 10 + (1 to 3) processor cycles for the software solution and by more than 2 + (1 to 3) processor cycles for the hardware solution), the response time is still very good compared to other processors.
Although nMPRA is an architecture with shared resources, it is more cost-efficient than other commercial architectures when the memory requirements and the power consumption are compared.
The author considers it necessary to continue analysing the behaviour as a function of the working frequency and to look for new solutions that achieve the lowest possible power consumption. In the future, the implementation will be tested on Xilinx devices and compared with the Altera results.
The performance aspects targeted by this research are the task switching speed, the behaviour of interrupts, the response time to external events and the communication run time for inter-process synchronization.
The solutions presented in this article are only improvements to the existing architecture. As a conclusion from the performed tests, the replicated microcontroller can be implemented on an FPGA, but it requires resources comparable to those available on the Altera Cyclone V or, better, the Xilinx Virtex 7. The disadvantage of the FPGA lies in its power consumption; in the future, we will therefore study the possibility of implementation on other platforms and analyse the hardware scheduler as an integrated coprocessor.
