We present a new approach to implementing real-time transactions on memory-resident data on shared-memory multiprocessors. This approach allows hard deadlines to be supported without undue overhead.
Introduction
Many applications exist in which hard real-time transactions must be supported; examples include embedded control applications, defense systems, and avionics systems. With the recent advent of workstation-class multiprocessor systems, multiprocessor-based implementations of these systems are of growing importance. Unfortunately, the problem of implementing hard real-time transactions on multiprocessors has received relatively little attention. As explained in the next section, with most (if not all) previously proposed schemes for implementing such transactions, the kinds of transactions that can be supported are often severely limited, and worst-case performance (either actual or estimated) is often very poor, adversely impacting schedulability.
Work supported, in part, by NSF grants CCR 9216421 and CCR 9510156, and by a Young Investigator Award from the U.S. Army Research Office, grant number DAAH04-95-1-0323. The first author was also supported by an Alfred P. Sloan Research Fellowship.
In this paper, we present a new scheme for implementing hard real-time transactions on multiprocessors that does not place undue restrictions on the kinds of transactions that can be supported. This scheme gives rise to worst-case performance bounds that can be expected to be much lower than those arising from alternative schemes in many applications. In our approach, transactions are implemented by invoking wait-free library routines. Concurrency control is embedded within these routines, so no special support for data management is required of the kernel or from underlying server processes. These routines reduce the overhead involved in executing transactions by exploiting the way transactions are interleaved on priority-based real-time systems. As a result, they perform much better than previous implementations of wait-free transactions proposed for asynchronous systems [1].
Since transactions in our implementation are executed in a wait-free manner, they are not susceptible to deadlock or priority inversion, and schedulability can be checked using scheduling conditions for independent tasks. In applying these conditions, the difference between the execution cost of a wait-free transaction and that of a sequential implementation of that transaction is essentially an overhead term. This overhead term bears a resemblance to blocking factors associated with priority inversions that arise in lock-based schemes. However, these overhead terms arise for different reasons, and in many applications, overhead terms when using our approach can be expected to be much smaller than blocking factors when using lock-based schemes.
The avoidance of locks in our implementations has other important consequences. For example, there is no need for kernel-level information (e.g., priority ceilings) used for dealing with priority inversions. As a result, mode changes are simplified because kernel-level information does not have to be recalculated. Also, a measure of fault-tolerance is provided: a failed transaction cannot hold locks on any object and therefore cannot prevent object accesses by other transactions.
The remainder of the paper is organized as follows. In Section 2, we review previous work on implementing hard real-time transactions on multiprocessors. In Section 3, we present an overview of some of the key mechanisms used in our work. Our transaction implementation is then described in detail in Section 4. In our approach, transactions are performed using a novel synchronization primitive, which we call conditional compare-and-swap. We explain how to implement this primitive from simpler primitives in Section 5. We end with a brief discussion of some preliminary performance results and other concluding remarks in Section 6.
Previous Work
Previous work on implementing hard real-time transactions on multiprocessors has assumed data to be memory-resident and has focused on the use of lock-based concurrency control schemes. Data is assumed to be memory-resident because the unpredictability of page faults and the relatively long I/O latencies for disk accesses make supporting hard deadlines impractical. In this paper, we consider only memory-resident real-time databases.
Lock-based concurrency control schemes have been the focus of previous work because, with optimistic schemes, transactions may abort and restart due to conflicts, and repeated restarts can cause a transaction to miss its deadline. On uniprocessors, the number of transaction restarts over an interval of time can be bounded (assuming memory-resident data) [3], and thus hard-deadline transactions can be supported. However, bounding the effects of repeated restarts due to conflicts across processors in a multiprocessor system does not seem practical.
When using lock-based concurrency control schemes on multiprocessors, mechanisms are needed to bound the effects of priority inversion. The best-known solution to this problem in uniprocessors is the priority ceiling protocol (PCP) [15, 16]. To ensure predictability in multiprocessor systems, Rajkumar et al. proposed the distributed PCP (DPCP) and the multiprocessor PCP (MPCP), both of which extend the PCP [12, 13, 14].
Both approaches assume a model in which a task can perform one or more transactions. In the DPCP, global resources are guarded by synchronization processors. A task accesses a global resource by making an RPC-like call to a global critical section (GCS) server, which executes on its behalf on the synchronization processor.
The DPCP can be used in either multiprocessors or distributed systems. In contrast, the MPCP is intended for use only in shared-memory multiprocessors. In the MPCP, a task executes a GCS on its own processor.
Synchronization is achieved by using read-modify-write instructions to obtain global semaphores, which reside in shared memory.
With either the MPCP or the DPCP, tasks are susceptible to very large blocking factors, and correspondingly low processor utilizations. In addition, these schemes require a priori knowledge of all transactions' resource access requirements, which makes it difficult, if not impossible, to support dynamically-generated transactions and mode changes. Large blocking factors are partially the result of the fact that high-priority tasks on one processor can be repeatedly blocked by low-priority tasks on other processors. Blocking factors are also increased by task suspensions; a task suspends itself in the DPCP, for example, whenever it accesses a GCS server. In effect, such suspensions break a task into a sequence of subtasks, and each subtask may experience a priority inversion. In contrast, an instance of a task can experience at most one priority inversion when using the PCP on a uniprocessor. To the best of our knowledge, neither the DPCP nor the MPCP has been implemented in real systems. This is certainly due to the fact that these schemes give rise to such large blocking overheads.
Researchers at the University of Illinois have proposed an "end-to-end" approach for sharing resources in multiprocessors, in which tasks are converted into a sequence of periodic subtasks using heuristics [7, 17]. Each subtask can perform either local computation or a single-processor transaction. Heuristics are used to deduce appropriate release times and deadlines for the subtasks of a task based on the timing requirements of that task.
Subtasks are scheduled on a per-processor basis, using uniprocessor scheduling schemes. Resources are accessed through the use of the PCP or a similar scheme. The end-to-end approach is based on the observation that, since subtasks are executed sequentially, there is no need to execute accesses to remote global resources at a higher priority level than local tasks. Like the DPCP, the end-to-end approach requires a binding of resources to processors. This complicates the job of assigning the components (tasks and resources) of an application to processors, and restricts the kinds of transactions that can be supported (e.g., a transaction cannot access resources mapped to two different processors). In addition, the heuristics that are applied to determine subtask release times and deadlines can be overly pessimistic for certain task sets, leading to poor schedulability. Finally, if a PCP-like scheme is used to access resources on a processor in the end-to-end approach as proposed, then supporting dynamically-generated transactions and mode changes is problematic.
Lortz and Shin have implemented a hard real-time database system called MDARTS (Multiprocessor Database Architecture for Real-Time Systems), in which both RPC transactions and concurrent shared-memory-based transactions are supported [11]. This was the first (and perhaps only) actual implementation of a hard real-time multiprocessor database server with reasonable performance. However, concurrency control in MDARTS is somewhat limiting. In MDARTS, critical sections are implemented by disabling preemptions (to deal with conflicts within a processor) and by using queue-based spin locks (to deal with conflicts across processors). As a result, only transactions of very short duration can be effectively supported.
Overview of our Approach
In this section, we present an overview of our approach to implementing transactions. In the remainder of the paper, we assume that transactions are invoked by periodic tasks. For simplicity, we assume that all tasks are subject to hard deadlines. Each task consists of one or more phases. Each phase is either a computation phase (which doesn't access shared data) or a transaction phase. All data is assumed to be memory-resident.
Before explaining the key techniques used in our transaction implementation, we first explain the notion of a "wait-free" shared object in some detail. In a wait-free object implementation, operations must be implemented using bounded, sequential code fragments, with no blocking synchronization constructs. This is often accomplished by combining a "retry loop" structure with a "helping" scheme. If a retry loop structure is used without a helping scheme, then the resulting object implementation is called "lock-free". Figure 1 depicts a lock-free enqueue operation that is implemented in this way. An item is enqueued in this implementation by using a two-word compare-and-swap (CAS2) instruction (see footnote 1) to atomically update a tail pointer and either the "next" pointer of the last item in the queue or a head pointer, depending on whether the queue is empty. This loop is executed repeatedly until the CAS2 instruction succeeds. An important property of lock-free implementations such as this is that operations may repeatedly interfere with each other. An interference results in the queue example when a successful CAS2 by one task results in a failed CAS2 by another task.
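The retry-loop structure described above can be illustrated with a minimal sketch. This is not the code of Figure 1; it is a hypothetical Python model in which CAS2 is simulated (a lock stands in for the atomicity that the hardware instruction would provide), and the names Ref, cas2, Node, Queue, enqueue, and items are assumptions introduced only for illustration.

```python
import threading

_atomic = threading.Lock()  # stands in for the hardware atomicity of CAS2


class Ref:
    """One shared word holding a reference (or None)."""
    def __init__(self, value=None):
        self.value = value


def cas2(a1, a2, old1, old2, new1, new2):
    """Simulated CAS2: update both words only if both comparisons succeed."""
    with _atomic:
        if a1.value is old1 and a2.value is old2:
            a1.value, a2.value = new1, new2
            return True
        return False


class Node:
    def __init__(self, item):
        self.item = item
        self.next = Ref()


class Queue:
    def __init__(self):
        self.head = Ref()
        self.tail = Ref()


def enqueue(q, item):
    """Lock-free retry loop; returns how many times the CAS2 failed."""
    node = Node(item)
    retries = 0
    while True:
        last = q.tail.value
        if last is None:
            # Empty queue: update the tail pointer and the head pointer.
            ok = cas2(q.tail, q.head, None, None, node, node)
        else:
            # Non-empty queue: update the tail pointer and last.next.
            ok = cas2(q.tail, last.next, last, None, node, node)
        if ok:
            return retries
        retries += 1  # interference: another task's CAS2 succeeded first


def items(q):
    """Traverse the queue from head to tail (for inspection only)."""
    out, node = [], q.head.value
    while node is not None:
        out.append(node.item)
        node = node.next.value
    return out
```

Note that nothing bounds the retry count returned by enqueue: a task can lose the race indefinitely, which is why this structure is only lock-free and not wait-free.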
As mentioned above, wait-free objects are often implemented by adding a helping scheme to a retry loop structure. Before beginning an operation, a task first "announces" its intentions by storing information about its operation in a shared "announce variable". While in its retry loop, a task attempts to "help" other tasks with announced operations by performing their operations in addition to its own. If a task experiences repeated interferences, then its operation is eventually completed by another task. Care must be taken to ensure that each operation is executed at most once, and that a helped task can retrieve its return values from memory.
Footnote 1: The first two parameters of CAS2 specify addresses of two shared variables, the next two parameters are values to which these variables are compared, and the last two parameters are new values to assign to the variables if both comparisons succeed.
Figure 2: Task T3 detects no previously announced task, so it announces its transaction. Before T3 can complete its transaction, it is preempted by task T2. Task T2 begins to help T3 to complete its transaction, but before it finishes, it is preempted by task T1. Task T1 detects that T3's transaction has been announced but is not finished, so it too helps T3. It then announces its own transaction, executes it, and relinquishes the processor to task T2. Task T2 detects that T3's transaction is complete, so it announces its own transaction, executes it, and relinquishes the processor to T3. Task T3 detects that its transaction has been completed, so it returns. Note that the overall execution is similar to what one might find in a lock-based system, with helping taking the place of blocking.
With most helping implementations that have been proposed [1, 2, 9], performance is at least linear in the number of tasks sharing an object.
In a recent paper [4], we showed that the cost of helping can be greatly reduced on priority-based real-time uniprocessor systems by using a technique called incremental helping. The general idea of incremental helping is illustrated in Figure 2. Before beginning a transaction, a task must first announce its intentions by updating a shared announce variable. Before a task is allowed to do this, however, it must first help any previously announced transaction (on its processor) to complete execution. This scheme requires only one announce variable per processor. In contrast, previous constructions for asynchronous systems require one announce variable per task [1, 2, 9]. In addition, with incremental helping, each task helps at most one other task, while in helping schemes for asynchronous systems, each task helps all other tasks in the worst case.
With incremental helping, the worst-case time to perform a transaction is only 2T, where T is the execution time of one transaction in the absence of preemption. Thus, this scheme scales well with transaction size and the number of tasks in the system (in fact, its performance is independent of the latter). Of the 2T worst-case execution cost, a factor of T represents the overhead associated with helping. This overhead term is similar to blocking factors that arise in scheduling conditions when using priority inheritance schemes. However, with incremental helping, kernel overhead is avoided. As with the priority inheritance protocol, information about which tasks access which objects is not required, either for implementing the helping scheme or for applying schedulability tests to tasks. (PCP-like schemes do require such information.) This is because conflicts are recorded dynamically through the use of the announce variable, and because worst-case time bounds do not depend on access information. Because object access requirements do not have to be predeclared, incremental helping can be used to handle hard-deadline transactions that are generated dynamically.
The notion of incremental helping can be extended for application on multiprocessors. The resulting scheme, which we call cyclic helping, is illustrated in Figure 3. With cyclic helping, the processors are thought of as if they were part of a logical ring. Tasks are helped through the use of a "help counter", which cycles around the ring. To advance the help counter from processor R to the next processor on the ring, a task must first help the currently announced task on processor R. In order to perform a transaction, a task does the following: (1) it first repeatedly advances the help counter until any pending announced (lower-priority) transaction on its own processor has been completed; (2) it then announces its own transaction; (3) it then repeatedly advances the help counter until its own transaction has been completed.
With cyclic helping implemented on a P-processor system, the time to perform a transaction is proportional to 2PT, where T is the time required for the longest transaction. Thus, this scheme also scales well with transaction size and the number of tasks in the system. For task sets with even a modest number of objects shared across processors, 2PT would typically be much less than corresponding blocking factors that arise when using the MPCP or the DPCP. In addition, the MPCP and the DPCP require object access requirements to be predeclared, while cyclic helping does not.
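The three steps of cyclic helping and the resulting 2P bound on the number of helped transactions can be illustrated with a small sequential model. This is only a sketch under strong simplifying assumptions: there is no preemption or true concurrency, helping is modelled as simply calling the announced transaction, and the class and method names (Ring, advance, helping_round, perform) are hypothetical.

```python
class Ring:
    """Sequential model of cyclic helping on num_procs processors."""

    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.cnt = 0                        # help counter; cnt % num_procs is the current processor
        self.announce = [None] * num_procs  # one announce variable per processor
        self.helps = 0                      # number of transactions executed so far

    def advance(self):
        """Help the task announced on the current processor, then move the counter on."""
        r = self.cnt % self.num_procs
        txn = self.announce[r]
        if txn is not None:
            txn()                           # perform the announced transaction
            self.announce[r] = None
            self.helps += 1
        self.cnt += 1

    def helping_round(self, proc):
        """Advance the help counter until nothing is pending on processor proc."""
        while self.announce[proc] is not None:
            self.advance()

    def perform(self, proc, txn):
        """Perform txn on behalf of a task running on processor proc."""
        before = self.helps
        self.helping_round(proc)    # step 1: complete any pending transaction on proc
        self.announce[proc] = txn   # step 2: announce our own transaction
        self.helping_round(proc)    # step 3: advance until our own transaction is done
        return self.helps - before  # transactions executed; at most 2 * num_procs
```

In this model each round advances the counter at most P times before it reaches the caller's processor, so at most 2P transactions of length at most T are executed, which matches the 2PT bound stated above.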
The cyclic helping scheme just described is very similar to the circulation of a token in a token ring system, where to perform a transaction, a task first requests the token. One may wonder whether the same technique could be applied in a lock-based system. When implementing objects on a real-time multiprocessor, the main problem to be faced is that of ensuring that worst-case object access times are reasonably short in the face of untimely task preemptions. In a lock-based implementation, the progress of the token could be stalled if it is "held" by a preempted task. If the preempted task has only partially executed its critical section, then it must resume execution before the token can be forwarded. This requires a mechanism by which a task requesting the token on one processor may force a preempted task on another processor to resume execution. We know of no way to efficiently implement such a mechanism. With our cyclic helping scheme, the requesting task itself ensures that the "token" makes progress around the ring.
Figure 3 (caption fragment): [...] which a transaction will be helped. To perform a transaction, the ring must be traversed twice in the worst case (as depicted).
As the discussion above shows, there is an interesting connection between wait-free and lock-based object-sharing schemes from a schedulability standpoint. In a sense, wait-free synchronization is a "pessimistic" notion of nonblocking user-level synchronization, and lock-free synchronization is its "optimistic" counterpart.
Transaction Implementation
In this section, we present a detailed description of our transaction implementation. The implementation is based on a lock-free transaction implementation for real-time uniprocessors presented previously in [3]. This previous implementation has been modified for application on shared-memory multiprocessors by using a cyclic helping scheme.
Our implementation is defined by four procedures, Exec, Help, Read, and Write, which are given in Figure 5. Variable declarations that are used in these procedures are given in Figure 4. The procedures given in Figure 5 support the "illusion" of a contiguous shared array MEM of memory words. In reality, data is not stored in [...]
[Figure 5: procedures Exec, Help, Read, and Write. Only a fragment of the pseudocode survives: "if lid < N then", line 6 "while true do", line 7 "ver := V", line 8 "if Status ..."]
In Figure 4, Bank is a B-word shared array. Each element of Bank contains a pointer to a block of size S.
The B blocks pointed to by Bank constitute the current version of the MEM array. We assume that an upper bound C is known on the number of blocks modified by any transaction. Because a task's transaction copies a block before modifying it, C "copy" blocks are required per task. However, the roles of these blocks are not fixed. If Tp's copy blocks are installed as part of the MEM array, then Tp reclaims the replaced blocks as copy blocks (lines 26-30). Thus, some of Tp's copy blocks become part of the current array, and vice versa.
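The block-indirection idea can be sketched as follows. This is a sequential illustration only, not the Read and Write procedures of Figure 5: all synchronization (CAS, CCAS, consistency checks) is omitted, and the names BlockedMemory, Transaction, read, write, and install are assumptions made for the example. The parameters num_blocks, block_size, and max_blocks play the roles of B, S, and C.

```python
class BlockedMemory:
    """MEM presented as num_blocks blocks of block_size words each."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        # bank[i] refers to the block currently holding words i*S .. i*S + S - 1
        self.bank = [[0] * block_size for _ in range(num_blocks)]


class Transaction:
    """Copy-on-write view of a BlockedMemory (no concurrency control)."""

    def __init__(self, mem, max_blocks):
        self.mem = mem
        self.max_blocks = max_blocks  # bound C on the blocks one transaction modifies
        self.copies = {}              # block index -> private modified copy

    def read(self, addr):
        b, off = divmod(addr, self.mem.block_size)
        block = self.copies.get(b, self.mem.bank[b])
        return block[off]

    def write(self, addr, value):
        b, off = divmod(addr, self.mem.block_size)
        if b not in self.copies:
            if len(self.copies) >= self.max_blocks:
                raise RuntimeError("transaction modifies more than C blocks")
            self.copies[b] = list(self.mem.bank[b])  # copy the block before modifying it
        self.copies[b][off] = value

    def install(self):
        """Swap the modified copies into the bank; return the replaced blocks."""
        replaced = []
        for b, block in self.copies.items():
            replaced.append(self.mem.bank[b])
            self.mem.bank[b] = block
        self.copies = {}
        return replaced  # these can serve as copy blocks for the next transaction
```

The blocks returned by install correspond to the replaced blocks that a task reclaims as copy blocks, so the pool of C copy blocks per task is never exhausted.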
As mentioned above, user-supplied transaction code accesses the MEM array in a sequential manner using the Read and Write procedures. After performing a consistency check that we will describe later (lines 43-44), [...]
We now explain the Exec and Help procedures. Synchronization is achieved in these procedures by using two primitives: compare-and-swap (CAS) and conditional compare-and-swap (CCAS). CAS is widely available, but CCAS is not. CCAS is a restriction of the better-known two-word compare-and-swap (CAS2) instruction in which one word is a compare-only version number; its semantics is formally defined in Figure 8(a). The version number is incremented after each transaction and is assumed to not cycle during any single transaction. This ensures that a "late" CCAS operation executed by a preempted task has no effect. Unfortunately, CAS2 is directly provided on only a few existing processors (e.g., Motorola 68030 and 68040 processors). [...]
The shared variable V is a compare-only version number that is passed to CCAS. It consists of a counter field cnt and a boolean field needhelp. V.cnt is assumed to not cycle during any transaction. Our transaction implementation is based on the idea of cyclic helping described in Section 3. With this helping scheme, processors are considered in turn, as if they formed a logical ring. A "help counter" is used to indicate the current processor under consideration. The value of the help counter is given by V.cnt mod P, where P is the total number of processors in the system. When the help counter is advanced to point to processor R, V.needhelp is set to true if and only if there is a task on processor R that needs to be helped. A task is allowed to help a task on processor R only if it detects that (V.cnt mod P = R) and V.needhelp both hold. Thus, the decision whether or not to help a task on processor R is fixed when the help counter is advanced to point to R.
Since this decision is made atomically when the help counter is advanced, there can be no disagreement among tasks as to whether a task on processor R should be helped.
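The role of the two fields of V can be made concrete with a short sketch. The names Version, help_target, may_help, and advanced are hypothetical, and the pending list is an assumption standing in for whatever per-processor announce state the real implementation inspects; in the actual scheme the new value of V would be installed atomically rather than computed by a plain function.

```python
from collections import namedtuple

# V packs a counter field and a boolean field, as described in the text.
Version = namedtuple("Version", ["cnt", "needhelp"])


def help_target(v, num_procs):
    """Processor currently pointed to by the help counter (V.cnt mod P)."""
    return v.cnt % num_procs


def may_help(v, r, num_procs):
    """A task may help on processor r only if the counter points at r and needhelp is set."""
    return v.cnt % num_procs == r and v.needhelp


def advanced(v, num_procs, pending):
    """Value of V after advancing; needhelp is fixed at the moment of the advance.

    pending[r] is True if processor r has an announced, unfinished transaction.
    """
    nxt = (v.cnt + 1) % num_procs
    return Version(v.cnt + 1, bool(pending[nxt]))
```

Because cnt and needhelp travel together in one value, every task that reads the same V reaches the same decision about whether to help.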
As mentioned previously, Exec is invoked by a task Tp to perform a transaction. After some initialization (lines 1-2), two rounds of cyclic helping are performed (lines 4-14). During the first round, Tp repeatedly advances the help counter until any pending announced (lower-priority) transaction on its processor has been completed. It then announces its own transaction (line 14) and performs a second round of cyclic helping in order to complete its own transaction. The loop at lines 6-13 performs one round of cyclic helping. The test at line 8 causes the loop to terminate once the currently announced transaction on Tp's processor has been completed and the help counter has been advanced. If the help counter points to a processor that has a task that needs help, then the Help procedure is invoked at line 9. Lines 10-13 advance the help counter to the next processor on the logical ring. Line 15 sets the announce variable on Tp's processor to indicate that no task currently requires helping.
The Help procedure is called to help a transaction of some task Tq that is executing on the processor that is pointed to by the help counter. It can be shown that the Help procedure is invoked at most P times during each round of cyclic helping (2P times in total). As mentioned previously, each transaction is executed in two [...]
Before concluding this description, one subtlety that we have glossed over must be mentioned. [...] possible implementations.
In Figure 8(b), the X parameter is tagged with a small counter field. X is assumed here to point to a shared variable that is updated during a round of cyclic or priority helping by means of CCAS operations, and only by such operations. The double angle brackets around lines 3 and 4 indicate that these lines are executed without preemption. This could be ensured in practice by either disabling interrupts or by having the operating system roll back a task to line 3 if it is preempted at line 4. The counter X->cnt is assumed to be large enough so that it does not cycle during a single round of helping or between the execution of lines 3 and 4 by any task. Note that, because lines 3 and 4 are executed without preemption, only a small number of bits for X->cnt are required (e.g., on an 8-processor machine, three or four bits would probably suffice). This is obviously important, since the counter value is being packed within a word.
It is clearly in accordance with the semantics of CCAS to return from either line 2 or line 3 or to return false from line 4, so consider the case in which the CAS at line 4 returns true. Note that line 4 is reached only after passing the test at line 3. In our transaction implementation, passing this test signifies that other tasks have not yet advanced the help counter to another processor. Because X->cnt cannot cycle during a single round of helping or between lines 3 and 4, it follows that the CAS at line 4 succeeds only when it should.
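The following sketch shows one reading of the CCAS semantics and of a counter-tagged implementation from CAS. It is not a transcription of Figure 8: the step numbering in the comments is inferred from the description above, the model is sequential (so the requirement that the last two steps run without preemption is trivially met), and the names Word, ccas_spec, ccas_tagged, and TAG_BITS are assumptions introduced for illustration.

```python
TAG_BITS = 4  # assumed width of the small counter packed alongside the value


class Word:
    """A shared word with a (simulated) single-word compare-and-swap."""

    def __init__(self, value):
        self.value = value

    def cas(self, old, new):
        if self.value == old:
            self.value = new
            return True
        return False


def ccas_spec(V, ver, X, old, new):
    """Intended semantics: update X from old to new only if V still equals ver."""
    if V.value != ver:
        return False
    if X.value != old:
        return False
    X.value = new
    return True


def ccas_tagged(V, ver, X, old, new):
    """CCAS built from CAS; X holds a (val, cnt) pair, cnt being the small tag."""
    x = X.value                  # step 1: snapshot value and tag together
    val, cnt = x
    if val != old:               # step 2: value comparison failed
        return False
    # Steps 3 and 4 are assumed to execute without preemption.
    if V.value != ver:           # step 3: version changed, so this operation is "late"
        return False
    new_tag = (cnt + 1) % (1 << TAG_BITS)
    return X.cas(x, (new, new_tag))  # step 4: fails if X changed since the snapshot
```

The tag is what lets the final CAS distinguish the snapshot taken in step 1 from a later state of X that happens to hold the same value.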
Our transaction implementation has the property that a variable X that is modified by means of CCAS operations is assigned a sequence of distinct values during a single round of helping. Thus, it is really not necessary to define X->cnt to be large enough so that it does not cycle in one round. For applications with this property, it is possible to go a step further and completely eliminate the cnt field. This is accomplished by inserting a delay statement after any code that attempts to increment V, where the length of the delay is the worst-case execution time of lines 3 and 4 in Figure 8(b). With this change, if V = ver holds when line 3 is executed, then X cannot be modified between lines 3 and 4 by any task engaged in a round of helping where V > ver.
(In our transaction implementation, the delay statement is not even necessary: enough code is executed between any increment of V and subsequent CCAS that modifies X to ensure that at least that much time has passed.) The result is the CCAS implementation of Figure 8(c). A very desirable property of this implementation is that it does not require certain bits of X to be reserved for control information.
Concluding Remarks
The explanation of our transaction implementation would seem to imply that one centralized help counter is always required when implementing a real-time database system. This is not the case. In many applications, transactions can be grouped into classes such that transactions in different classes cannot possibly conflict with each other. Only one help counter per class is required in this case. Transactions in different classes may execute with complete concurrency. Moreover, the help counter for a class really only needs to cycle around the set of processors with tasks that execute transactions in that class; this set may include fewer than the total number of processors.
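One way such classes could be derived, assuming the set of objects each transaction may access is known for this purpose, is to take connected components of the "shares an object" relation. The sketch below uses a small union-find; the function name conflict_classes and the dictionary-based input format are assumptions for illustration and are not part of the implementation described in this paper.

```python
def conflict_classes(access_sets):
    """Group transactions so that different groups share no object.

    access_sets maps a transaction name to the set of objects it may access.
    Returns a sorted list of sorted lists of transaction names.
    """
    parent = {t: t for t in access_sets}

    def find(t):
        # Find the class representative, compressing the path as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    first_user = {}  # object -> first transaction seen that accesses it
    for t, objs in access_sets.items():
        for obj in objs:
            if obj in first_user:
                parent[find(t)] = find(first_user[obj])  # merge the two classes
            else:
                first_user[obj] = t

    classes = {}
    for t in access_sets:
        classes.setdefault(find(t), []).append(t)
    return sorted(sorted(members) for members in classes.values())
```

Each resulting class would get its own help counter, and that counter would only need to cycle over the processors that run transactions of the class.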
The research outlined above leaves many opportunities for further research. Of foremost importance are experimental studies that compare our wait-free transaction implementation with more conventional lock-based implementations. Such studies require a real-time multiprocessor testbed. We are currently developing such a testbed at UNC, but it is not yet operational. Simulation studies are an alternative to experimentation.
However, simulation studies will be definitive only if they are conducted using meaningful parameter values.
The parameters that affect schedulability when using wait-free and lock-based transaction implementations are very different. Wait-free schemes are sensitive to such factors as copying overhead and the cost of performing synchronization instructions. Lock-based schemes are sensitive to such factors as context switching costs and kernel overhead. It is very difficult to select meaningful values for such parameters without examining actual implementations, which again requires an operational testbed.
Despite our current limitations in conducting definitive performance studies, we have conducted a preliminary study to evaluate the effectiveness of cyclic helping. In this study, a wait-free linked-list implementation based on cyclic helping presented by us elsewhere [4] was compared with a lock-free list implementation presented recently by Greenwald and Cheriton [8]. These experiments were performed on a five-processor SGI-R10000 machine.
The priority-based preemption model was simulated at the user level. The wait-free list implementation that was tested is similar to the transaction implementation presented in this paper, with optimizations that allow list operations to be performed with no copying overhead. (Such optimizations appear to be possible for many single-object transactions.)
Our algorithm uses CCAS and Greenwald and Cheriton's uses CAS2, while the SGI-R10000 provides only CAS. However, we were able to efficiently implement CCAS and CAS2 using the approach given in Figure 8(c) in Section 5. (In the lock-free list implementation, CAS2 is used in a manner that is similar to how CCAS is used.)
In our experiments, the total time for the tasks to perform a total of 50,000 insertion/deletion operations on a sorted list was measured. Figure 9 depicts the relative performance (in tens of milliseconds) of the two implementations for different multiprogramming levels. These figures indicate that, as the multiprogramming level is increased, our algorithm outperforms Greenwald and Cheriton's. Our algorithm also outperforms theirs as the length of the list is increased, although graphs showing this have been omitted due to space limitations.
Another lock-free list implementation, which uses only CAS, was recently proposed by Valois [18]. Although we did not test against Valois' algorithm, Greenwald and Cheriton report that their algorithm is faster than his algorithm by a factor of about three under low contention, and about ten under high contention [8]. Greenwald and Cheriton also reported that their algorithm outperformed a fast spin lock algorithm with backoff. We believe it is highly doubtful that OS-based multiprocessor locking protocols like the MPCP and the DPCP could exhibit performance close to that of either our algorithm or Greenwald and Cheriton's.
While our algorithm does not dramatically outperform Greenwald and Cheriton's, one should keep in mind that our algorithm is wait-free, whereas their algorithm and Valois' algorithm are only lock-free. This has important implications for hard real-time systems. In such systems, tasks must be guaranteed to meet their deadlines, and such guarantees require that tight worst-case execution times for object accesses be known. With lock-free algorithms, determining such bounds is not easy, because accurately bounding the cost of interferences that can occur across processors is difficult. With wait-free algorithms, worst-case execution times are easily computed. The only other means of implementing wait-free linked lists that we know of is to use universal constructions [2, 9]. Although we did not test against such constructions, they entail very high helping overhead.
As a result, they would likely perform much worse than our list implementation.
