Abstract. In multitasking, priority-driven systems, resource access-control protocols such as the Priority Ceiling Protocol (PCP) reduce the undesirable effects of resource contention. In general, software implementations of these protocols entail costly computations that can degrade system performance to unacceptable levels. In this paper, we present the design of a hardware accelerator that executes the PCP functionality for controlling access to multiple-unit resources, and we show that the proposed implementation reduces execution time by a factor of up to 30.
Introduction
In a multitasking uniprocessor environment, improper resource sharing among tasks could lead to significant performance penalties as well as severe adverse effects. Priority Ceiling Protocol (PCP) [9] is a resource management protocol that prevents deadlocks and minimizes priority inversion in such an environment. Deadlocks and priority inversion are serious problems that can have catastrophic effects in safety-critical real-time systems. Our experience has been that software implementations of these resource management policies account for a significant portion of performance degradation in such systems. To alleviate this degradation, we have designed and implemented a hardware accelerator that executes the functionality of the PCP for multiple-unit resources. Both software and hardware implementations of the protocol have been integrated with the µC/OS-II [7] operating system running on the AVR ATmega103L [2] microcontroller implemented on a Xilinx Virtex XCV300 [12] FPGA board.
We first provide a brief description of previous work accomplished in the field of accelerators and give a theoretical background explaining PCP for multiple-unit resource access. We then discuss the methodology adopted as well as the software and hardware implementations developed. Finally, we present experimental results comparing the performance of both designs.
Theoretical Background
A deadlock can occur when a task is waiting on a resource that it can never acquire. This situation often arises when more than one task in a system must acquire more than one resource at one specific time. For example, one of the simplest deadlock situations occurs when a task T1 owns resource R1 and needs resource R2 to progress and, at the same moment, task T2 owns R2 and needs R1 to progress. In order to prevent deadlocks, a resource management scheme such as PCP must keep a record of the state of each resource. Using this information, it can determine if the allocation of a particular resource to a given task would cause a deadlock situation.
Priority inversion occurs when a lower priority task blocks a higher priority task, and can be triggered by the sequencing of the resource allocations. Consider the trivial condition where a low priority task T1 acquires a resource that is also used by a high priority task, T2. If T2 blocks because it cannot acquire this resource, priority inversion occurs when T1 runs. The problem becomes more acute if additional tasks with intermediate priorities are executing in the system, since these would preempt T1 and in the process further delay the execution of the higher priority task T2.
Next, we briefly describe the working principles of the PCP for controlling access to multiple-unit resources. For details, we refer the reader to [8, 9]. PCP implements deadlock avoidance by assigning a priority ceiling (PC) to every resource in the system. The PC of resource R is defined as the priority of the highest priority task that uses R. In an environment where there are multiple instances of the same resource, the PC becomes a function of not only the resource type but also of the remaining number of instances of that resource. Given a resource R that has N units, the PC when there are n ≤ N free units of R is equal to the priority of the highest priority task that uses at least n instances of R.
For example, given the resource allocation graph of Fig. 1, where tasks T1 to T4 are indexed in decreasing order of their priorities, we obtain the Priority Ceiling Array of Fig. 1, which displays the PC as a function of resource and units left. For instance, if there are 2 units of R1 left, the PC will be Π(R1, 2) = 3. PCP states that a task can only acquire a resource if its priority is greater than the PC of every resource instance currently held by other tasks.
PCP protects against priority inversion by executing Priority Inheritance, which seeks to correlate the time a task is kept waiting on a resource to the relative importance given to it by its assigned priority. If a task T1 is blocked waiting on units of a resource, the task owning those units, T2 for instance, will acquire T1's priority if the priority of T2 is less than the priority of T1.
To implement PCP, we define the system priority ceiling, SysPC, which is the highest priority ceiling of the currently obtained resource units:
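In symbols, one way to write this definition is sketched below. The set H of resources that currently have at least one unit held and the free-unit count n_R are notation assumed here, and the maximum is taken in priority order rather than numerical order.

```latex
\mathrm{SysPC} \;=\; \max_{R \,\in\, \mathcal{H}} \; \Pi(R,\, n_R)
```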
Also, the system task, SysTask, is defined as the task that owns the resource with a Priority Ceiling equal to SysPC.
The rule used to determine whether a task T1 with priority π1 can obtain resource units is:
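A form of this rule that is consistent with the acquire flow described under Methodology (the task is the SysTask, or it has a higher priority than SysPC) can be written as follows. It is offered as a sketch under that reading, with > denoting "has higher priority than".

```latex
\pi_1 > \mathrm{SysPC} \;\;\lor\;\; T_1 = \mathrm{SysTask} \tag{1}
```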
This scheme allows the PCP algorithm to be implemented in either hardware or software. A list is kept of the resource units presently acquired and their corresponding owning task. From this list, the SysPC and SysTask variables can be easily computed and used to evaluate (1) when a task seeks to obtain resource instances. Priority Inheritance will be carried out when a task fails to acquire a resource and effectively becomes blocked by the SysTask. In this case, if the priority of the blocked task is higher than the priority of the SysTask, the tasks will exchange priorities.
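The computation described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper: the names `Holding`, `system_ceiling`, `may_acquire` and the sentinel `NO_CEILING` are hypothetical, and the sketch assumes that a smaller number denotes a higher priority, as in µC/OS-II.

```python
from dataclasses import dataclass

# Assumed convention (reverse numerical order): smaller number = higher priority.
NO_CEILING = 255  # hypothetical sentinel, lower than every task priority


@dataclass
class Holding:
    task: int     # ID of the owning task (also its assigned priority)
    sem: int      # semaphore ID
    units: int    # units currently held by the task
    ceiling: int  # PC of the semaphore for its remaining free units


def system_ceiling(holdings):
    """Return (SysPC, SysTask), or (NO_CEILING, None) when nothing is held."""
    if not holdings:
        return NO_CEILING, None
    top = min(holdings, key=lambda h: h.ceiling)  # highest PC = smallest number
    return top.ceiling, top.task


def may_acquire(task_id, task_prio, holdings):
    """Rule (1) as read from the text: the requester is the SysTask,
    or its priority is strictly higher than SysPC."""
    sys_pc, sys_task = system_ceiling(holdings)
    return task_id == sys_task or task_prio < sys_pc


held = [Holding(task=4, sem=1, units=1, ceiling=3),
        Holding(task=5, sem=2, units=2, ceiling=2)]
print(system_ceiling(held))       # SysPC and SysTask for this list
print(may_acquire(3, 3, held))    # priority 3 is not above the ceiling
```

The linear scan in `system_ceiling` is the cost that the choice of data structure has to address.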
When a task fails to obtain requested resource units, either because it does not meet (1) or because there are not enough free instances, the task is blocked waiting on the resource and must be placed in the wait list. The scheduler can then move a task from the wait list to the ready list when the blocking resource instances become available. In order to implement PCP, tasks and resources, in this case semaphores, are uniquely identified. These identification fields are entered into nodes of the list when a task acquires semaphore units, and used to remove or modify a node when it releases units. Each node also contains the priority ceiling and the number of semaphore instances held.
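The node bookkeeping described above might look like the following sketch. It assumes, as one possible organisation, that the list is kept sorted so the node with the highest PC comes first; `Node`, `insert` and `release` are hypothetical names, and smaller numbers are assumed to be higher priorities.

```python
from dataclasses import dataclass


@dataclass
class Node:
    task: int     # owning task ID
    sem: int      # semaphore ID
    units: int    # units held by this task
    ceiling: int  # priority ceiling (smaller number = higher, assumed)


def insert(nodes, node):
    """Walk the list to the insertion point, as a linked list would: O(n)."""
    i = 0
    while i < len(nodes) and nodes[i].ceiling <= node.ceiling:
        i += 1
    nodes.insert(i, node)


def release(nodes, task, sem, units, new_ceiling):
    """Give back units; drop the node if none remain, else re-sort it."""
    for i, n in enumerate(nodes):
        if n.task == task and n.sem == sem:
            node = nodes.pop(i)
            node.units -= units
            if node.units > 0:
                node.ceiling = new_ceiling
                insert(nodes, node)
            return
    raise KeyError((task, sem))


nodes = []
insert(nodes, Node(task=4, sem=1, units=2, ceiling=3))
insert(nodes, Node(task=5, sem=2, units=1, ceiling=2))
print(nodes[0].task, nodes[0].ceiling)  # head of the list gives SysTask and SysPC
```

With this layout the head node always yields SysTask and SysPC directly, at the price of a linear walk on every insertion or partial release.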
A hardware implementation of the Immediate Ceiling Priority Protocol (ICPP) has been introduced in [1]. ICPP takes a more straightforward approach than the Original Ceiling Priority Protocol (OCPP), that is, the PCP considered here: it raises the priority of a process to the priority ceiling of the resource it has just locked. ICPP is easier to implement than OCPP because blocking relationships need not be monitored. Although ICPP is simpler to implement and reduces the number of context switches, it can increase the occurrence of priority inversion. By immediately raising the priority of a task to the priority ceiling of the resource it has just acquired, higher priority tasks that do not use resources could be needlessly blocked by a task that has just acquired resource units with a high PC. This difference distinguishes our approach from that of [1].
Methodology
In our implementation, PCP has been decomposed into functional blocks: the semaphore acquire and release functions, as well as new task management features to support PCP, which have all been implemented in both hardware and software. The target platform, running the µC/OS-II operating system, supports up to 64 tasks of unique, 8-bit priorities that are sequenced in reverse numerical order [7] . To facilitate both the hardware and software implementations, a task is given an ID that is also its assigned priority. In the rest of the text, a task ID is synonymous with the task's assigned priority.
Outline of Acquire Function
Figure 2a shows the flowchart representing the sequence of actions that take place when a task wishes to acquire semaphore instances, starting with the function call OSSemPend(), which encapsulates the whole operation. If there are enough semaphore instances left and the task is the SysTask or it has a higher priority than SysPC and the highest priority blocked task, the task acquires the semaphore units and can progress.
On the other hand, if it fails this condition, the task becomes blocked waiting on that particular resource. At this point, if the task has a higher priority than the SysTask, priority inheritance is carried out: the two tasks exchange priorities through a call to OSTaskSwapPrio(). At the same time, these swapped priorities are pushed onto the system priority stack, which holds the priorities of every task that has exchanged priorities. Finally, since the task is now blocked, its assigned priority is removed from
Outline of Release Function
When releasing semaphore instances, the basic steps include modifying the semaphore list and reversing priority inheritance. Figure 2b shows the flowchart representing the sequence of actions that occur during this operation. Function OSSemPost() is called, the semaphore count is incremented, and the task/semaphore node is updated with a new count and PC field, or removed completely if the task owns no more instances of the semaphore. The "reversing" of Priority Inheritance is accomplished by exploiting the fact that only a task that is currently the SysTask can inherit a higher priority and yield an inherited priority. Therefore, if a task is no longer the SysTask after releasing semaphore instances (in other words, the SysTask has changed), it must "give up" any inherited priorities. In this situation, we know a task has inherited a priority if its assigned priority is lower than the priority on top of the system priority stack. Therefore, the stack is popped and function OSTaskPrioSwap() is called to swap the priorities of the task with the assigned priority equal to the popped priority and the current task. This is seen on the left branch of the chart of Figure 2b . The stack popping and priority swapping is repeated for any other priorities inherited by the current task. The other situation occurs if a task remains the SysTask after releasing semaphore instances, but at a lower SysPC value. In this case, the task gives up any inherited priorities that are of higher value than the SysPC value.
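The reversal of priority inheritance can be sketched as follows. This is one reading of the description above, not the code evaluated in this paper: it assumes the system priority stack stores the assigned priority of each task that gave its priority away, that `current` maps a task's assigned priority to the priority it currently runs at, and that smaller numbers are higher priorities. All names are hypothetical.

```python
def swap(current, a, b):
    """Exchange the running priorities of tasks a and b."""
    current[a], current[b] = current[b], current[a]


def inherit(current, stack, blocked, sys_task):
    """On a failed acquire: the SysTask takes the blocked task's priority
    if that priority is higher, and the exchange is recorded on the stack."""
    if current[blocked] < current[sys_task]:
        swap(current, blocked, sys_task)
        stack.append(blocked)


def give_up(current, stack, task, still_sys_task, sys_pc):
    """On release: pop and swap back inherited priorities.
    If the task remains the SysTask, keep those not above the new SysPC."""
    while stack and stack[-1] < task:
        if still_sys_task and not (stack[-1] < sys_pc):
            break
        donor = stack.pop()
        swap(current, donor, task)


current = {1: 1, 2: 2, 3: 3}   # every task starts at its assigned priority
stack = []
inherit(current, stack, blocked=2, sys_task=3)
inherit(current, stack, blocked=1, sys_task=3)
give_up(current, stack, task=3, still_sys_task=False, sys_pc=255)
print(current, stack)          # priorities restored, stack empty
```

Because exchanges are undone in the reverse order in which they were made, nested inheritance unwinds cleanly in this model.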
Outline of Scheduler Execution
The task scheduler is modified to support PCP because it must now work with a ready list and the newly added wait list. The highest priority tasks of both the ready and wait lists are obtained and compared. If the latter's priority is higher and it can obtain the semaphore instances it was blocked on, this task now becomes the running task. The task is therefore taken out of the wait list, put into the ready list and executed by context switching to it. If the highest priority ready task is of higher priority, or if the highest priority wait task cannot obtain the blocking semaphore instances, the task from the ready list will run.
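The scheduling decision just described can be sketched as follows. The function `pick_next` and the callback `can_acquire` are hypothetical, the two lists are modelled as sets of priorities, and smaller numbers are assumed to be higher priorities.

```python
def pick_next(ready, waiting, can_acquire):
    """Return the priority of the task to run next.

    ready, waiting: sets of task priorities.
    can_acquire(prio): True if that waiting task can now obtain the
    semaphore units it is blocked on.
    """
    top_ready = min(ready) if ready else None
    top_wait = min(waiting) if waiting else None
    wait_wins = (top_wait is not None
                 and (top_ready is None or top_wait < top_ready)
                 and can_acquire(top_wait))
    if wait_wins:
        waiting.remove(top_wait)   # leave the wait list
        ready.add(top_wait)        # join the ready list and run
        return top_wait
    return top_ready


ready, waiting = {3, 5}, {2}
print(pick_next(ready, waiting, lambda prio: True))
```

Only the highest priority waiting task is examined, which matches the single comparison in the description.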
Software Implementation
The most important implementation decision of the software PCP is the choice of the data structure to hold the semaphore/owning task list. There are two criteria to consider: the cost of inserting and removing an entry, and the cost of determining the entry with the highest PC, which holds the SysTask and SysPC values. We chose an ordered circular linked list where the first entry has the highest PC. It is then easy to evaluate (1) each time a task wishes to acquire resource units. The disadvantage of this implementation is the time it takes to search the list when adding or removing entries, a cost of O(n), where n is the number of task/resource pairs that currently exist in the system. Using an array or hash type data structure would have alleviated this drawback, at the cost of having to rescan the data structure to find the new SysTask entry every time the entry with the highest PC is removed. For example, one option would have been to uniquely identify each entry with a combination of semaphore ID and owning task ID, in effect giving us a two-dimensional matrix of size M × N, where M is the number of tasks and N the number of semaphores in the system, which requires O(N × M) time to find the entry with the highest PC. For a system with many tasks and semaphores, this search would become quite expensive and would far outweigh the cost of the linked list implementation. The performance drawbacks of the software implementation arise from the fact that priority ceiling values are not unique: several task/semaphore pairs can have the same PC value, making it costly to determine the pair or pairs with the highest PC. Task scheduling is an equivalent problem that is avoided by an RTOS that supports tasks of unique priorities: the scheduler is able to easily store the list of ready tasks and quickly determine the highest priority task ready to run.
For example, the µC/OS-II scheduler stores the ready list in an 8-bit variable and uses a bit-map technique to determine the highest priority of the ready list with just two non-looping, high-level language instructions [7]. Introducing the software implementation of PCP into an RTOS that ensures low overhead by forgoing the use of costly data structures might degrade performance to unacceptable levels. The proposed solution is to implement PCP in hardware and use parallelism to avoid the performance drawbacks of the software implementation.
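The bit-map lookup mentioned above can be illustrated with the following sketch. It is not the µC/OS-II source: the class `ReadyList` and table `UNMAP` are hypothetical names, and the layout (an 8-bit group variable plus an eight-row table covering 64 priorities) is an assumption of the sketch.

```python
# Lookup table: position of the lowest set bit of an 8-bit value (0 for 0).
UNMAP = [0] + [(v & -v).bit_length() - 1 for v in range(1, 256)]


class ReadyList:
    def __init__(self):
        self.group = 0        # bit y set when row y holds a ready task
        self.table = [0] * 8  # bit x of row y set when priority 8*y + x is ready

    def set_ready(self, prio):
        self.group |= 1 << (prio >> 3)
        self.table[prio >> 3] |= 1 << (prio & 7)

    def clear_ready(self, prio):
        y = prio >> 3
        self.table[y] &= ~(1 << (prio & 7))
        if self.table[y] == 0:
            self.group &= ~(1 << y)

    def highest(self):
        """Two table lookups, no loop; smallest number = highest priority.
        Returns 0 for an empty list, so callers should keep one task ready."""
        y = UNMAP[self.group]
        x = UNMAP[self.table[y]]
        return (y << 3) + x


rl = ReadyList()
rl.set_ready(37)
rl.set_ready(12)
print(rl.highest())
```

The lookup works in constant time only because each priority maps to exactly one bit; priority ceilings are not unique, so the same trick does not carry over to the task/semaphore list.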
Hardware Implementation
The software implementation of the PCP has been ported to hardware by developing an accelerator that is a separate entity from the CPU. Since the AVR ATmega103L microcontroller uses port-based interfacing, the accelerator is addressed as a peripheral and communicates with the CPU via I/O ports. At the heart of the accelerator is a register file that holds the list of semaphore/owning task ID pairs, which the software implementation stores in a linked list. The system works as follows: when a task acquires or releases semaphore units, the accelerator updates the register file. If a task calls the Acquire function and is not granted the semaphore instances, its ID is put into the wait list that is stored in the accelerator. When a task releases semaphore instances, the accelerator directs the priority inheritance procedure, telling the OS which priorities to swap. The accelerator will also determine if the highest priority waiting task can acquire the blocking semaphore instances when the scheduler is called. Finally, the accelerator must locally store the semaphore counts and the Priority Ceiling Array.
Accelerator Datapath
As previously stated, the main part of the datapath is the register file module that holds the semaphore/task ID pairs, comparing them in parallel fashion. ... The outputs of the comparators are also used as the enable input to 32 3-bit tri-state buffers, which make up the Count line. Thus, this signal will yield the count of the entry whose task and semaphore ID match the Data line. Figure 4 shows the top level view of the accelerator datapath. On the left are the three registers which hold the data from the CPU. On the bottom left of the diagram are the Wait List, Sem Array Regfile and Cnt Array Regfile. With its N 1-bit registers, the Wait List records which tasks are blocked (where N ≤ 64 represents the number of tasks). The encoder translates the N bits of the registers to an 8-bit signal that holds the assigned priority of the highest priority waiting task. The Sem Array Regfile and Cnt Array Regfile contain the blocking semaphore ID and number of units needed, respectively, by the task whose ID indexes them. Two other register files, this time indexed by semaphore ID, are the Current Cnt Regfile and Max Cnt Regfile, which hold the number of units currently used and the maximum units of the selected semaphore, respectively. These numbers, as well as the count of units needed or released, are fed through two stages of adders and subtractors in order to compute the amount of semaphore units remaining. This number, along with the semaphore ID, indexes the PC Array Regfile to give the PC of the semaphore/task pair which can then be fed to the Data line of the Regfile module.
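A behavioural model of parts of this datapath is sketched below. It is only an illustration in software of what the hardware does in parallel: the function names are hypothetical, the order of the two arithmetic stages is an assumption, and the Wait List encoder is assumed to treat the lowest index as the highest priority.

```python
def regfile_count(entries, task_id, sem_id):
    """Model of the comparators driving the Count line. In hardware every
    entry is compared at once; the loop here only stands in for that.
    entries: list of (task_id, sem_id, count) tuples."""
    for entry_task, entry_sem, count in entries:
        if entry_task == task_id and entry_sem == sem_id:
            return count
    return 0


def remaining_units(max_cnt, current_cnt, delta, acquiring):
    """Two stages: adjust the in-use count, then subtract it from the maximum."""
    in_use = current_cnt + delta if acquiring else current_cnt - delta
    return max_cnt - in_use


def lookup_pc(pc_array, sem_id, free_units):
    """Index the PC Array by semaphore ID and number of free units."""
    return pc_array[sem_id][free_units]


def wait_list_encoder(wait_bits):
    """Priority encoder over the 1-bit wait registers."""
    for prio, blocked in enumerate(wait_bits):
        if blocked:
            return prio
    return None


free = remaining_units(max_cnt=6, current_cnt=2, delta=3, acquiring=True)
print(free, wait_list_encoder([0, 0, 1, 0, 1]))
```

In software each of these steps costs a loop or a memory access; the accelerator evaluates them combinationally, which is why its execution time does not depend on the number of entries.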
The SysTask and SysPC output of the Regfile Module is supplied to the combinational logic block that is responsible for determining if semaphore units can be granted when the accelerator is executing the Acquire function. This block also takes in the assigned priority of the highest priority waiting task computed by the encoder of the Wait List and the task ID input from the CPU. With this information, the logic sets its output bit if (1) evaluates to true. The final component is the priority inheritance module responsible for implementing the system priority stack. 
Hardware Implementation Metrics
The accelerator design was implemented on the Virtex XCV300 FPGA using the Xilinx Design Manager software, which reported the hardware implementation metrics. For a design with 32 Regfile Module entries, the equivalent gate count is 34 871 and the maximum frequency is over 9 MHz. Since the AVR ATmega103L microcontroller has a maximum frequency of 6 MHz, the accelerator meets the timing required to operate alongside the CPU.
Results
The acceleration gains offered by the hardware implementation of the Priority Ceiling Protocol were quantified by running the PCP functionality in simulation on the soft CPU executing the AVR ATmega103L instructions. The functions performed were the Acquire and Release of semaphore instances and scheduler execution. The system consisted of 5 tasks that share 6 semaphores with a maximum of 6 units. Figure 5 shows the task execution.
Acquire Function Results. Table 1 shows the results obtained when running the Acquire semaphore function under different scenarios. The best case execution time for the software implementation occurs when the task/semaphore pair list is empty, seen as point 1 on Fig. 5. The worst case happens when it is full, with 29 task/semaphore pairs that must be traversed, at point 2 on Fig. 5. The accelerator, on the other hand, executes the Acquire function in constant time. We have also measured the execution time for the case when a task fails to acquire semaphore instances and priority inheritance is carried out, occurring at point 3 on Fig. 5. The results show that the acceleration is greatest when a task successfully acquires semaphore units while the list is full: in this situation, the software implementation takes over 30 times more CPU cycles to execute than the hardware-accelerated implementation. The gain is smallest when a task fails to acquire semaphore units; in this case priority inheritance requires considerable processing by the CPU that the accelerator cannot alleviate.
Release Function Results. The best case execution time of the software implementation takes place at point 4 on Fig. 5 , when a task gives up all the units of a semaphore that it owns. The worst case scenario occurs when a task gives up only a fraction of the instances of a particular semaphore, occurring on point 5 of Fig. 5 , since the list must be traversed to find the new location for the task/semaphore pair. The execution time with the hardware accelerator is still constant regardless of the state of the system. Table 1 shows the results of the Release operation. The hardware accelerated implementation executes more than 12 times faster than the software implementation in this situation.
Scheduler Function Results. The accelerator provides computational assistance when the CPU is running the scheduler function. Again, execution time depends on the state of the system: if the wait task remains blocked after the scheduler has finished, both implementations will take less time to execute because the task will not have acquired the semaphore units it is blocked on, and the task/semaphore list will not have to be updated, occurring at point 6 in Fig. 5 . Conversely, if the waiting task becomes the running task, as seen at point 7 in Fig. 5 , the execution time of the scheduler will be longer because semaphore units will have been acquired and the list will have been updated. Table 1 shows that the accelerator speeds up the scheduler up to 7 times.
Conclusion
In this paper, we have proposed a hardware accelerator for an access-control protocol for multiple-unit resources in a uniprocessor environment. Specifically, software and hardware implementations of the Priority Ceiling Protocol for multiple-unit resources were developed and the performance of the two implementations was compared. As expected, the hardware accelerator showed substantial gains over the software implementation: speedups of up to about 30 times for the Acquire function, more than 12 times for the Release function, and up to 7 times for the scheduler. By using a high degree of parallelism to carry out otherwise time-consuming computations, the hardware implementation executes PCP in predictable amounts of time. Future work may involve adapting the accelerator to a multi-processor system, allowing it to provide even more support to the underlying OS.
