Database (DB) and transaction processing systems utilize their own locking of data items instead of using locking services provided by operating systems (OSs). There are two major reasons for this: one concerns the size of lockable items, the other efficiency. (i) OSs provide efficient locking of pages by utilizing the Memory Management Unit. The problem is that the system page size is much too large for locking in transaction or DB systems. (ii) Invocation of OS locking services by applications results in context switching, and in many systems in heavyweight context switching, thus greatly reducing efficiency. The focus of this paper is to investigate the use of the Multi-View Memory (MVM) model and its supporting architecture in providing efficient locking services for transaction processing or DB systems. The model provides for enforcement of access-control protocols through finite-state machine (FSM) specification on units of data that can vary in size from one region of memory to another. The locking protocol is specified by an FSM definition and thus facilitates locking of variable-sized data items by threads executing transactions. Threads executing transactions do not explicitly request locks on data items; they simply access the data items while locking is performed automatically and in many instances without software intervention. This is facilitated by hardware assistance in that the FSM definitions and lock unit state information are stored in caches. Only when a thread is suspended are the state changes communicated to the software lock manager. Delays for lock acquisitions through the MVM model architecture are determined and compared with delays due to lock acquisition by a conventional lock manager.
INTRODUCTION
It has been recognized that application-specific access-control protocols are desirable to improve system performance [1]. Memory management primitives have been used extensively to provide efficient customized access-control protocols for many applications [2]. Memory management facilities have also been used to reduce communication between the kernel and user-level managers when securing access to resources/objects located in memory [3]. The SPIN microkernel [4] is extensible and provides application-specific services. In [5], the kernel interface is provided at a level close to the hardware in order to provide application programs control and access to resources and thus gain efficiencies.
There are numerous examples of situations that require access-control protocols to be applied to data structures consisting of a sequence of contiguous access units in a region of virtual memory.
For instance, database (DB) applications could utilize locking of records as a method of concurrency control [6, 7]. Another example is recoverable virtual memory [8]. Access to memory must be controlled so that recovery information is properly recorded. Other examples include coherence-based protocols in multiprocessor systems (e.g. [9, 10]) and access-control protocols for external pagers in Mach 3.0 [11]. All of these examples exhibit the same property in that the access to the units of a region of memory is governed by a particular protocol which can be defined in terms of finite-state machines (FSMs).
In many instances hardware support is provided for a particular protocol. Examples include hardware mechanisms to support coherent access in a multiprocessor environment [12] and locking mechanisms for transaction and DB systems [6, 13]. In all instances, however, the hardware mechanism is inflexible in that it is for a particular protocol in a specific environment. It has been recognized also that different access control protocols require different sizes of access units; a page, although convenient and efficient in many situations, is not a suitable access unit size for all protocols [7, 14, 15]. The protection schemes found in the Hewlett-Packard PA-RISC [16] and the PLB protection organization [17] separate protection information from address translation. Separation of the address translation and protection appears also in Opal [18].
THE COMPUTER JOURNAL, Vol. 41, No. 2, 1998 MULTI-VIEW MEMORY SUPPORT TO OPERATING SYSTEMS
The Multi-View Memory (MVM) model and a supporting architecture [19] were proposed with the following objectives that address the issues and problems discussed above:
• Synchronization of tasks or threads when they are accessing variable-sized protection units.
• Flexibility in providing various access-control protocols to different applications.
• Amenability to hardware support.
How the model and the architecture attain the above objectives is briefly described in the following section. The focus of this paper is to investigate how they can be applied to the problem of concurrency control in DB and transaction processing systems, from now on simply referred to as DB systems. In particular, the paper investigates the use of the MVM model to provide efficient locking services, while overcoming the major obstacles that have prevented the use of OS locking by DB and transaction systems thus far. There are two major reasons why DB systems use their own customized lockers as opposed to using locking provided by OSs:
• One is that OSs provide efficient locking of pages through virtual memory primitives. The problem is that the system page size is much too large for locking by individual transactions. Locking large pages containing frequently accessed data, such as directories, is unacceptable for most DB systems. OS protection mechanisms for objects are unsuitable for use by DB systems to lock individual records.
• The second reason is that using OS locking results in heavyweight context switching in many systems, thus greatly reducing efficiency.
The second section briefly describes the MVM model. How the model is used to provide a strict two-phase locking (2-PL) protocol is described in the third section. The fourth section describes the architecture to support the MVM model. The fifth section evaluates the model and the architecture in terms of delays incurred when providing locking. The delays are compared to the delays incurred by a conventional locker in order to determine the suitability of MVM locking for use by DB systems. The last two sections review literature in terms of how it relates to this work and provide summary and conclusions, respectively.
MULTI-VIEW MEMORY MODEL
In the following discussion it is assumed that virtual address spaces are flat and that threads and tasks have the usual meanings in the context of process management: one or more threads execute in an environment of a single task which has one virtual address space (shared by all of the task's threads). The sparse address space consists of memory regions that have different access control views. The concept of different views for a region is shown in Figure 1 . Access-control units, or access units for short, have the same size within a single view, but the size can vary across views. Examples of different types of access units include locking, coherency and recovery units. Applications can thus be serviced with various sizes of units across views, although the size of units is fixed within a view.
Access control for a view utilizes the concept of an access matrix in which each subject has a row while each object has a column. Each matrix entry A[S,O] is an FSM which defines the access rights of the subject S to an object O depending on the current state of access and the desired operation/access. The FSMs are identical in terms of the defined states and transitions; they differ only in their current state of access. Therefore, there is only one definition of the FSM states and transitions and a matrix entry A[S,O] represents the current state of access of S to O. Access is in terms of memory read and write operations. Objects are access units of a view and subjects define the access rights for threads, or sets of threads which actually issue the memory operations. Each view of a memory region has its own state which is independent of the states of the other views defined on the same or another region. Views do not need to be independent of each other. In fact, views that have state transitions depending on states of other views are useful. An example would be the case of a memory region that has one view for locking and another one for recovery. The view for recovery has state transitions only when a write access is made to the memory region. These transitions can be triggered efficiently by state transitions to gain exclusive access in the view for locking. In this paper we deal with the case when only one view is defined on a memory region. Thus, to simplify the presentation, we assume that the views are independent. Accessibility here implies accessibility 'on the fly' in that access rights are acquired by a thread automatically: a thread does not request access rights explicitly. The thread simply accesses the memory region and the access rights are assigned automatically by changing states of access units. With appropriate hardware support, the access rights may be acquired without software intervention at all.
P. BODORIK AND D. JUTLA
For a memory operation to succeed, access must be permitted for each view defined on the referenced memory location. Once access is permitted for each view, the state changes independently for each view as defined by the view's FSM state transition definition.
Since access control is specified for subjects while memory operations are issued by threads, a thread must be bound to (associated with) a subject. For an operation to be permitted on an access unit, a thread must be bound to a subject which has appropriate access rights to that unit.
Consider a single memory location. It may be contained in a number of access units, one access unit per view defined on the region of memory containing that location. A memory operation causes the access control system to be invoked for each view, such that different policies are applied to the different views. In Figure 1 , for instance, the memory location X is contained in the access unit #3 of the Concurrency View, access unit #5 of the Coherence View and access unit #2 of the Recovery View. For a read operation to be permitted on the memory location X, the thread which issued the operation must be bound to a subject that has appropriate access rights to access units containing the address X.
The access-matrix method is an elegant and convenient paradigm for protection purposes. As in [17] , however, instead of the term subject ID, the computer architecture term protection domain is used. A protection domain is uniquely identified by a protection domain ID (PDID). A thread executes in one protection domain only at any one time, or to paraphrase, a thread has only one PDID at any one time, but the PDID may change throughout the thread's execution. If two threads of an application task do not have the same PDID then they do not have the same access rights to access units of views. Any two threads with the same PDID do have the same access rights to views. Thus synchronization of threads within the same task may be achieved without their explicit signalling.
The FSM definition is modified in the MVM model by storing/recording a PDID together with state information for each access unit. For a state representing exclusive access for one protection domain, its PDID is recorded in the state. If the definition of a state indicates so, a thread is permitted to access a unit having an exclusive state only if the PDID it is associated with matches the PDID recorded in the state. Each access unit thus has state information consisting of the current state of access to the unit and a PDID. A state definition of a standard FSM indicates for each input the FSM's output and a state transition. Here the inputs are read and write accesses, and outputs are either to proceed with the access or to raise an access-control fault exception. In addition to this standard state definition, for each input it is also indicated whether the PDID stored with the state is relevant, i.e., whether it is to be used when determining whether the thread can proceed or a fault is raised and the thread is suspended.
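The extended state definition can be pictured as a rule table keyed by (state, operation), where each rule carries a flag saying whether the recorded PDID matters. The following is a minimal sketch under stated assumptions: the names `Rule`, `RULES`, and `evaluate`, and the two-state owner-only protocol, are hypothetical and serve only to illustrate the idea; they are not definitions from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    next_state: str      # state entered when the access proceeds
    pdid_relevant: bool  # must the thread's PDID match the recorded PDID?


# Hypothetical two-state protocol: the first accessor becomes the owner.
RULES = {
    ("FREE", "read"): Rule("OWNED", False),
    ("FREE", "write"): Rule("OWNED", False),
    ("OWNED", "read"): Rule("OWNED", True),
    ("OWNED", "write"): Rule("OWNED", True),
}


def evaluate(rules, state, recorded_pdid, op, pdid):
    """Return (output, new_state, new_recorded_pdid) for one memory access."""
    rule = rules.get((state, op))
    if rule is None:
        # No rule for this input: treated here as an access-control fault.
        return "fault", state, recorded_pdid
    if rule.pdid_relevant and pdid != recorded_pdid:
        # The PDID is relevant and does not match: the thread would be suspended.
        return "fault", state, recorded_pdid
    # Access proceeds; the accessing thread's PDID is recorded with the state.
    return "proceed", rule.next_state, pdid
```

For instance, `evaluate(RULES, "OWNED", 7, "read", 8)` yields a fault and leaves the unit's state information unchanged, because the rule marks the PDID as relevant and the two PDIDs differ.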
Access control is enforced by the kernel through the MVM model in cooperation with a view manager, or a manager for short. If a view definition specifies so, the kernel stores internally all state changes effected by the thread's memory accesses to units of the view. When the thread is suspended, the state changes it caused are communicated by the kernel to the manager. The manager may also instruct the kernel to change states of specific access units; these are referred to as forced state changes.
The following section provides an example that should clarify the MVM model concepts.
MVM MODEL SUPPORT FOR LOCKING
This section briefly describes how MVM can provide support to the OS in provision of locking for DB systems. It is assumed that the systems use data tables, implemented as flat files stored in the virtual memory. Transactions are executed by threads, perhaps of different tasks. Access to the records of a data table by transactions must be synchronized using a strict two-phase locking protocol, or locking protocol for short. Prior to reading a record, a transaction must obtain a read lock on it, while before writing to a record, a transaction must obtain a write lock on that record. Holding a write lock subsumes holding a read lock. A transaction holding a read lock on a record may obtain a write lock on that record if no other transaction holds a read lock on it already. Finally, locks are held by a transaction until it completes its work (hence strict two-phase locking) and locking may lead to deadlocks. In DB systems, deadlock detection and recovery is used to handle deadlocks, but this is considered to be outside the scope of this paper.
The locking protocol is enforced on a data table using the MVM model in the following way. First an FSM is defined with the states and transitions shown in Table 1 . Following this, a view is defined on the memory region containing the data table. The view creation specifies that the FSM shown in Table 1 is to define the access-control protocol and, although this is not necessary in terms of the concurrency control correctness, the size of access units is to coincide with the size of the records; thus reading/writing an access unit is equivalent to reading/writing a record. The initial state of access units is defined as Unlocked and the kernel is instructed on a context switch to notify the view manager of the state transitions effected by the thread being suspended.
When a thread is executing on behalf of a transaction its PDID serves as a transaction ID. When a thread attempts to access a record/access unit, whether the access is permitted depends on the current state of access to that unit, the memory access operation (read/write) and the thread's PDID. The states, as defined by the FSM, include Unlocked, Single Reader (SR), Multiple Readers (MR) and Exclusive. When a thread attempts to read an access unit (record) that is in an Unlocked state, the read is permitted and there is a state transition from Unlocked to Single Reader. The thread's PDID is recorded as part of the state information. If a thread attempts to write to an Unlocked unit, a similar action is taken, but the state changes to Exclusive instead of Single Reader. When a thread tries to read a unit which is in a Single Reader state, the read is permitted and if the thread's PDID does not match the PDID recorded with the state, the state changes to Multiple Readers.
On a thread switch, the kernel notifies the view manager of the state transitions while the view manager records these in its own data structures. The manager is thus collecting information about which units/records were locked by the thread. If a thread tries to access a unit already locked by another thread in an incompatible mode, an access-control fault exception is raised and the thread is suspended. The kernel informs the manager of the reasons causing the thread's suspension.
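The transitions described in this section can be sketched as a single function. This is a reconstruction from the prose only, not a copy of Table 1: the function name `access`, the string encodings of states and operations, and the choice to record no PDID in the Multiple Readers state are assumptions made for illustration.

```python
def access(state, owner, op, pdid):
    """Apply one read/write access to a lock unit.

    state is one of "Unlocked", "SR", "MR", "Exclusive"; owner is the PDID
    recorded with the state (None when no single PDID applies).
    Returns (outcome, new_state, new_owner), with outcome "proceed" or "fault".
    """
    if state == "Unlocked":
        # First access locks the unit: read gives SR, write gives Exclusive.
        return "proceed", ("SR" if op == "read" else "Exclusive"), pdid
    if state == "SR":
        if op == "read":
            if pdid == owner:
                return "proceed", "SR", owner
            # A second transaction reads: the unit now has multiple readers.
            return "proceed", "MR", None
        if pdid == owner:
            # The sole reader upgrades its read lock to a write lock.
            return "proceed", "Exclusive", pdid
        return "fault", state, owner
    if state == "MR":
        if op == "read":
            return "proceed", "MR", owner
        # Writing while several transactions hold read locks is incompatible.
        return "fault", state, owner
    if state == "Exclusive":
        if pdid == owner:
            return "proceed", "Exclusive", owner
        return "fault", state, owner
    raise ValueError(f"unknown state: {state!r}")
```

A fault corresponds to the thread being suspended until the view manager forces the unit into a compatible state.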
When a thread finally completes a transaction it informs the view manager of this. The view manager consults its data structures in order to unlock access units/records locked by the completed transaction. It also checks whether there are any threads suspended due to waiting for locks on units unlocked by the completed transaction. The manager informs the kernel to force the access units that have been locked by the now completed transaction into the Unlocked state. The manager also informs the kernel to resume any threads waiting for the units that were unlocked.
Example
Creation of FSMs, views, mapping of views to memory regions, and interaction between the kernel and the manager are performed using primitives that depend on the specific kernel. Here they are only described in general terms. See [20] for these primitives under the Mach 3.0 and Windows NT kernels. The following example describes access of two threads, each one with a different PDID and hence executing a different transaction, to a data table on which a view is defined to enforce the locking protocol.
The manager first creates the FSM defining the locking protocol (shown in Table 1 ) and then the view itself. When the view is created, the manager instructs the kernel that:
• the FSM defined for locking (shown in Table 1) is to be used with the view;
• the initial state for each access unit is Unlocked;
• the kernel is to inform the manager of any state transition.
The manager also specifies the size of access units; for simplicity we assume that the size of access units matches the size of records. The view will be referred to as the concurrency control view.
We assume that the data table exists as a permanent object with semantics and data provided by an object manager. When threads of an application task need to access the data table, the task maps a memory region to that data table object. The kernel and the object manager then communicate and the kernel is informed that the concurrency control view is to be mapped to the memory region and also of the identity of the view manager. Note that the object manager may or may not serve the role of the view manager. Any unqualified reference to a manager implies a reference to the view manager. The Mach kernel, for instance, has the vm_map function to map a memory region to the task's address space. The function has a specific argument that identifies the task to serve as the manager of the object to which the memory region is mapped. The manager supplies the semantics of the object and also serves as an external pager supplying data as needed. This object manager is also an excellent candidate to perform the role of the view manager [20]. When the vm_map function is used to map a memory region to an address space in order to serve as a cache for the data table, the kernel and the manager communicate. The kernel is informed that the concurrency control view is applied on that memory region.
In the following, two threads will access a memory object, the data table, while each one executes a different transaction. The two threads could be within the same task that mapped a memory region to the object/data table. The threads could also belong to two distinct tasks, as long as each task has a memory region mapped to the data table.
Before accessing the object, a thread, say thread 1, must be bound to a protection domain. The thread requests a PDID, which serves as a transaction ID, from the manager. The manager will assign a PDID, say PDID 1, and bind it to the thread by calling the kernel with parameters containing PDID 1 and the identity of the thread. Thread 2 is similarly bound to PDID 2.
Assume that thread 1 now reads from the first record of the data table, that is access unit #1 of the view. Assuming that the unit's state is Unlocked, there is a transition to the state SR (single reader state with the reader's PDID being recorded in the state) and this state transition is buffered in a kernel's internal buffer. The thread is allowed to proceed.
Assume now that thread 1 reads from access unit #2 with a similar result in that the new state for unit #2 is SR (with PDID 1 recorded in the state) and that the thread is allowed to proceed. Later on thread 1 is suspended, perhaps due to its time-slice expiry. Upon the context switch, the kernel informs the manager of the state transitions for units #1 and #2, while the manager records these in its data structures.
Assume that thread 2 now executes and it reads from unit #1. There is a transition to state MR (Multiple Readers), the new state is buffered by the kernel, and the thread is permitted to proceed. When the thread wants to write to the same unit, unit #1, the thread is suspended because of an access control fault: the unit is in the MR state when the thread attempted to write to it and thus the result, as specified by the FSM, is an access control fault. The thread is suspended and the manager is notified by the kernel of the suspension, its cause (attempted operation, view and access unit number), and also of the state changes effected by the thread.
Eventually, thread 1 finishes its execution of the transaction and informs the manager of the transaction's completion. The manager unlocks the units locked by thread 1 by instructing the kernel to force unit #2 to the Unlocked state and unit #1 to SR (Single Reader state with PDID 2). The manager also informs the kernel to resume thread 2. When thread 2 eventually resumes its execution, it re-issues the write operation that caused the write access fault. The write can now proceed while effecting a state transition for unit #1 from SR to Exclusive with PDID 2 recorded in the state. When the thread is suspended, the state transition is communicated to the manager. Eventually the thread instructs the manager of its completion and the manager instructs the kernel to force the state of access for units #1 and #2 to Unlocked.
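The release step that the manager performs in the example above can be sketched as follows. The manager's data structures are not described here, so `holders` (unit number to the set of PDIDs holding a lock), `waiters` (unit number to suspended threads), and the function `complete` are hypothetical; the rule that one remaining reader yields SR and several yield MR is inferred from the example.

```python
def complete(holders, waiters, pdid):
    """Release every unit held by the completed transaction `pdid`.

    Returns (forced, resume): the forced state changes to request from the
    kernel, as {unit: (state, recorded_pdid)}, and the threads to resume.
    """
    forced = {}
    resume = []
    for unit, pdids in holders.items():
        if pdid not in pdids:
            continue
        pdids.discard(pdid)
        if not pdids:
            forced[unit] = ("Unlocked", None)
        elif len(pdids) == 1:
            # Exactly one reader remains: record its PDID with the state.
            forced[unit] = ("SR", next(iter(pdids)))
        else:
            forced[unit] = ("MR", None)
        # Threads suspended on this unit may retry their faulting access.
        resume.extend(waiters.pop(unit, []))
    return forced, resume


# Situation just before thread 1 completes in the example.
holders = {1: {1, 2}, 2: {1}}
waiters = {1: ["thread 2"]}
forced, resume = complete(holders, waiters, 1)
```

With these inputs the sketch forces unit #1 to SR with PDID 2 and unit #2 to Unlocked, and marks thread 2 for resumption, matching the narrative.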
There are many kernel-specific details, such as removing mapping of threads to PDIDs, that have been omitted. For further details see [20] .
What is important to note in the above example is that a thread does not issue any explicit synchronization instructions; it simply accesses records in data tables while synchronization (locking) is left to the kernel, that is the MVM mechanism. With appropriate hardware support, locking can be achieved by the MVM without software intervention as can be seen in the subsequent section.
Also, the kernel informs the view manager of state changes effected by the thread's execution only on that thread's context switch (or when the kernel's internal buffer used to store the state changes is full). This facilitates the thread's uninterrupted execution.
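The deferred notification can be illustrated with a small buffer that hands its contents to the manager only on a context switch or when it fills up. This is a sketch under assumptions: the class `StateBuffer`, its `capacity` parameter, and the `notify` callback (standing in for the kernel-to-manager message) are hypothetical names, not kernel primitives.

```python
class StateBuffer:
    """Buffers a thread's state transitions until they must be reported."""

    def __init__(self, capacity, notify):
        self.capacity = capacity
        self.notify = notify  # called with a list of buffered transitions
        self.pending = []

    def record(self, unit, new_state, pdid):
        """Buffer one transition; report early only if the buffer is full."""
        self.pending.append((unit, new_state, pdid))
        if len(self.pending) >= self.capacity:
            self.flush()

    def on_context_switch(self):
        """Report whatever the suspended thread has accumulated."""
        self.flush()

    def flush(self):
        if self.pending:
            self.notify(list(self.pending))
            self.pending.clear()


received = []
buf = StateBuffer(capacity=8, notify=received.append)
buf.record(1, "SR", 1)
buf.record(2, "SR", 1)
buf.on_context_switch()
```

Until `on_context_switch` runs, `received` stays empty, so the two reads complete without any interaction with the manager.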
Finally, the size of access units, although fixed within a view, varies across views and is customizable by applications.
PROTECTION ARCHITECTURE AND PCU
The protection architecture that utilizes a Protection Control Unit (PCU) for enforcement of access control protocols is described next. This is followed by a brief discussion of PCU design options. The section concludes with a description of a cache-based PCU design option that provides hardware support for enforcement of access control protocols. Recall that it is assumed that a virtual address space is flat and that one or more threads belonging to a task can execute in its virtual address space. Another assumption is that the system includes two kinds of caches, an instruction cache and a data cache. Only the data cache is discussed here.
Protection architecture
The architecture, displayed in Figure 2, shows a virtually addressed, physically tagged, set-associative data cache. Although a virtually addressed cache is shown, the architecture is also suitable for a physically addressed cache, as will be discussed later.
Because a data cache that is separate from the instruction cache is assumed here, a memory access/operation is either read or write, not execute. The data cache maintains data that is most recently used so that the CPU can fetch the operand for every instruction directly from the data cache instead of from the much slower memory. Data that is not in cache is fetched from memory on a data cache miss. The basic access units in the data cache are organized in lines. In addition to the data itself, each line also includes the clean/modified bit, the valid/invalid bit, and bits needed by the cache replacement algorithm (if any). In the MVM model protection architecture, some other information is included in the lines, namely the W bit indicating if only a read operation or both the read and write operations are permitted on the data in the line. Permission is subject to protection domain validation. Each thread operates in a protection domain, identified by the protection domain ID (PDID). The thread's PDID must match the PDID stored in the data line for the memory operation to be permitted. If the data line referenced by the CPU (thread) is in cache and is valid, then the read access can proceed provided that the PDIDs match. If the access is a write operation and the W bit indicates that the write operation is permitted, the write access can proceed. If the access is a write and the W bit indicates that the write access is not permitted, then a write access fault exception is raised. If the data line referenced by the CPU is not in the cache, then a cache miss occurs. In case of a write access fault or a cache miss, the protection subsystem is used to determine if the access can proceed. The Protection Control Unit (PCU) examines the current state of the access unit containing the address issued by the memory read/write operation and determines whether the operation can proceed. 
If the write access can proceed, then the read/write bit in the line is set to indicate that the write access is permitted, so a subsequent write access will not invoke the protection subsystem again. Furthermore, the thread's PDID supplied by the CPU is also recorded in the line. Finally the new state of access to the access unit is recorded by the PCU. If the protection subsystem determines that the access cannot proceed, an access control fault exception is raised and the thread is suspended. It should be noted that the protection subsystem is not on the critical path of instruction execution in that it is not invoked on every memory access. It is invoked only on a write access fault or a cache miss. Furthermore, the minimum size of a protection unit is constrained by the size of a data cache line. If a protection domain has certain access rights to a byte in a cache line, then it has the same access rights to all the bytes in that cache line. This implies that the architecture imposes a constraint on the minimum size of access units in that an access unit cannot be smaller than a data cache line.
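The fast-path decision made at the data cache can be sketched as below. The names `CacheLine`, `check_line`, and `grant` are hypothetical. Two points are assumptions rather than statements from the text: a PDID mismatch is routed to the PCU in the same way as a miss, and a read grant leaves the W bit clear so that a later write consults the PCU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheLine:
    valid: bool
    pdid: int
    writable: bool  # the W bit


def check_line(line: Optional[CacheLine], op: str, pdid: int) -> str:
    """Return "hit" if the access proceeds from the cache, else "pcu"."""
    if line is None or not line.valid:
        return "pcu"  # cache miss
    if line.pdid != pdid:
        return "pcu"  # assumption: a PDID mismatch is handled like a miss
    if op == "write" and not line.writable:
        return "pcu"  # write access fault
    return "hit"


def grant(line: CacheLine, op: str, pdid: int) -> None:
    """Update the line after the PCU has permitted the access."""
    line.valid = True
    line.pdid = pdid
    line.writable = (op == "write")


line = CacheLine(valid=True, pdid=1, writable=False)
first_write = check_line(line, "write", 1)   # W bit clear: PCU is consulted
grant(line, "write", 1)                      # PCU permits the write
second_write = check_line(line, "write", 1)  # served by the cache alone
```

The second write does not reach the protection subsystem, which is why the PCU stays off the critical path for repeated accesses to the same line.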
With the exception of the PCU, the protection architecture does not impose any new constraints on architectures currently in use. Some of the current systems do have a W bit in a data cache line to indicate if the read-only or read-and-write access is permitted to the line. Virtually addressed and virtually tagged data caches include a virtual address space ID (VAID) in the line to resolve the problem of homonyms.
In the MVM model, the PDID replaces the VAID. While a VAID is associated with a task, a PDID is associated with a thread, a group of threads, or possibly the task owning the thread, in which case all threads have the same access rights to all units of all views. A VAID changes on a task (heavyweight) context switch while a PDID may or may not change on a thread (lightweight) context switch.
For a physically addressed cache, the architectural changes would be minimal as far as the Protection Subsystem is concerned. The real cost would be the addition of the PDID in each data cache line. This cost can be mitigated to some degree by not storing the PDID directly but rather a pointer to a table of PDIDs. This is a modification of the method proposed in [21] to reduce the amount of storage for virtual page numbers. Instead of storing the virtual page number itself in tags of Translation Lookaside Buffers (TLBs), data caches, and in page tables, a pointer (index) to a table of virtual page numbers is stored.
PCU design options
Three possible design options for the PCU were investigated in [22, 23] : software-only option, hardware controller option, and cache-based option. In the software-only option, all PCU activities are implemented in software. In the second option, all activities are performed by a hardware controller accessing information stored in memory. In the last option, all activities are performed by a hardware controller accessing information stored in caches.
It is the last option that has been adopted for two reasons. Examination of the PCU activities/actions reveals that they are relatively simple, consisting mainly of accessing several tables containing information to be described below. That the PCU is table based is natural because the MVM model is based on the concept of FSMs; thus the rules and states are stored in tables. By loading the FSM tables with different rules and transitions, different protocols are implemented. If the tables are stored in fast caches/memory, on-chip with the PCU, hardware support is gained for various protocols. Thus, flexible hardware support for various access control protocols is attainable.
The second reason for choosing the cache-based option is that the TLB has been predicted to become a bottleneck [24]. The cache-based option reduces the pollution of the TLB with access control information in comparison to the other two PCU design options [23].
Cache-based design of the PCU
The PCU is assumed to be situated between the CPU and the main system bus [23]. The first-level, on-CPU-chip data cache is assumed to be virtually indexed. The cache can be either virtually or physically tagged; a physically tagged cache is assumed in the subsequent presentation, both to simplify it and, primarily, because the subsequent evaluation assumes one. Any reference to the term data cache used without qualification refers to this L1 cache.
A second-level cache, which is not shown in Figure 2, is assumed to exist between the data cache and memory. Dedicated data paths exist between the CPU and the second-level cache and between the CPU and the PCU. Recall that the PCU is invoked only on a data-cache miss or write-access-control fault. It is assumed for the following description that the access unit sizes and view sizes are powers of two. In addition to the constraint on the minimum size of access units, which cannot be less than the size of the data-cache line, the maximum access-unit size cannot be larger than the system page size. To simplify presentation, it is assumed that there is at most one view defined on any one memory region, but different views can be defined on different memory regions of a virtual-address space.
The PCU consists of three specialized caches: a Protection Lookaside Buffer (PLB), a ViewDefinition cache, and an FSM cache (see Figure 3) . The ViewDefinition cache stores information about the views, the FSM cache stores information about the FSM rules and transitions, while the PLB cache stores the current state of access to the units of views. Because the number of definitions of views and FSMs is relatively small, it is assumed that the two caches are small enough to contain definitions of all views and FSM rules with transitions, respectively, so that a hit rate of one is assumed for each one. Obviously, the PLB cache cannot be assumed to be large enough to store the state information about all units of all views and thus its hit rate is less than one.
A specialized ViewDefinition cache stores information pertaining to the definition of views. Entries within a ViewDefinition cache define views imposed on virtual-memory ranges. A view defines the size of its access units through the Mask attribute, as well as an associated protocol via the FSMID (see Figure 3). The virtual address that signifies the start of the view is stored in the attribute ViewStart and the last address in the view is denoted by ViewEnd. Each ViewDefinition cache entry contains specialized circuitry that compares the given virtual address with the ViewStart and ViewEnd stored in the entry; on a match, the corresponding entry's information, namely the ViewID, FSMID, and Mask, is retrieved for further processing.
The PLB is a specialized, n-way set-associative cache accessed by a key formed from the PDID supplied by the thread and the information provided by the ViewDefinition table entry. The Mask retrieved from the ViewDefinition table entry is used to mask-out the appropriate number of bits from the physical address. This physical address is then used, together with the PDID, as a key to access the PLB cache; thus support of variable-sized access protection units is achieved. Obviously, the physical tags stored in the PLB entries must have been derived using the same masking mechanism.
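The view lookup and PLB key formation described above can be sketched in software as follows. This is only an illustrative model, not the PCU hardware: the class and function names, the 256-byte unit size and the example addresses are assumptions introduced here, and the sequential search stands in for the parallel range comparison performed by the ViewDefinition cache entries.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ViewDefinition:
    """One ViewDefinition cache entry (field names follow the text)."""
    view_id: int
    fsm_id: int
    view_start: int  # first virtual address of the view (ViewStart)
    view_end: int    # last virtual address of the view (ViewEnd)
    mask: int        # clears the offset bits inside one access unit


def unit_mask(unit_size: int, addr_bits: int = 32) -> int:
    """Mask for a power-of-two access-unit size (assumed helper)."""
    if unit_size <= 0 or unit_size & (unit_size - 1):
        raise ValueError("access-unit size must be a power of two")
    return ((1 << addr_bits) - 1) & ~(unit_size - 1)


def find_view(views: List[ViewDefinition], vaddr: int) -> Optional[ViewDefinition]:
    """Return the view whose [ViewStart, ViewEnd] range contains vaddr."""
    for view in views:
        if view.view_start <= vaddr <= view.view_end:
            return view
    return None


def plb_key(pdid: int, paddr: int, view: ViewDefinition) -> Tuple[int, int]:
    """PLB key: the PDID plus the physical address with unit offset masked out."""
    return (pdid, paddr & view.mask)


# Hypothetical view with 256-byte access units.
views = [ViewDefinition(view_id=0, fsm_id=1, view_start=0x10000,
                        view_end=0x1FFFF, mask=unit_mask(256))]
view = find_view(views, 0x10040)
key_a = plb_key(7, 0x12345, view)
key_b = plb_key(7, 0x123FF, view)
```

Because both physical addresses fall in the same 256-byte unit, `key_a` and `key_b` are equal, which is how a single PLB entry can cover an access unit larger than one cache line.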
Each PLB entry defines the current state of access to a unit identified by the physical address (and the Mask). After the state of access is retrieved, it is passed to the FSM controller which uses it to retrieve an appropriate FSM rule and state transition entry from the FSM cache. The key to access the FSM cache is a concatenation of the FSMID, the state from the PLB, and the memory operation (read/write). The retrieved FSM rule is first examined to determine whether the PDIDs, one stored with the rule and one supplied by the CPU, are relevant, i.e., whether they are to be compared and the result of this match is to play a role in determining the final outcome. The rule, the current state of access and the issued read/write operation are then used to determine the outcome. If the desired cache line access is permitted, the access to the cache line may proceed, otherwise an access control fault arises which must be handled in software. If access proceeds and a transition in the state of access is required, as indicated by the retrieved FSM rule, the new state is recorded in the PLB and also in the state buffer. The thread's state transitions stored in the state buffer are passed to the view manager on the thread's context switch.
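A minimal sketch of the FSM lookup may help make the flow concrete. The rule table below is a hypothetical reader/writer locking protocol written for illustration; it is not a transcription of Table 1, and the state names, the `Rule` layout and the choice to pass the current holder's PDID as an argument are all assumptions.

```python
from typing import Dict, NamedTuple, Optional, Tuple

# Assumed state names for an illustrative locking protocol.
UNLOCKED, SINGLE_READER, MULTI_READER, WRITER = "U", "SR", "MR", "W"


class Rule(NamedTuple):
    compare_pdid: bool          # is the PDID comparison relevant for this rule?
    on_match: Optional[str]     # next state if PDIDs match or are irrelevant
    on_mismatch: Optional[str]  # next state if PDIDs differ; None means fault


FSM_ID = 1
# Key: (FSMID, current state, memory operation), as in the text.
RULES: Dict[Tuple[int, str, str], Rule] = {
    (FSM_ID, UNLOCKED, "R"): Rule(False, SINGLE_READER, None),
    (FSM_ID, UNLOCKED, "W"): Rule(False, WRITER, None),
    (FSM_ID, SINGLE_READER, "R"): Rule(True, SINGLE_READER, MULTI_READER),
    (FSM_ID, SINGLE_READER, "W"): Rule(True, WRITER, None),  # sole reader upgrades
    (FSM_ID, MULTI_READER, "R"): Rule(False, MULTI_READER, None),
    (FSM_ID, MULTI_READER, "W"): Rule(False, None, None),    # always a fault
    (FSM_ID, WRITER, "R"): Rule(True, WRITER, None),
    (FSM_ID, WRITER, "W"): Rule(True, WRITER, None),
}


def step(fsm_id: int, state: str, op: str,
         holder_pdid: Optional[int], pdid: int) -> Optional[str]:
    """Return the new state of access, or None for an access-control fault."""
    rule = RULES[(fsm_id, state, op)]
    matched = (not rule.compare_pdid) or holder_pdid == pdid
    return rule.on_match if matched else rule.on_mismatch


first_read = step(FSM_ID, UNLOCKED, "R", None, 7)
upgrade = step(FSM_ID, SINGLE_READER, "W", 7, 7)
blocked = step(FSM_ID, SINGLE_READER, "W", 7, 8)
```

In this sketch a `None` result corresponds to the access-control fault that must be handled in software, while any other result is the state that would be recorded in the PLB and the state buffer.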
In the case where the desired state information is not within the PLB, then there is a PLB miss. The desired state information must be retrieved from the multi-level memory and the PLB access must be repeated. Where state information is to be located in the memory is determined using information stored in the ViewDefinition cache. Each entry contains information about where to find the state information in virtual memory. Each PLB entry also contains the usual information associated with the management of a cache.
EVALUATION
The objective of this evaluation is to determine delays incurred when obtaining locks using the MVM model method and compare them to the delays due to lock acquisition using a conventional lock manager.
A database application, along the guidelines of the Transaction Processing Performance Council's TPC-C benchmark [25] , was chosen for evaluation. The application program's execution was traced and the trace was input to a simulator in order to obtain estimates of the PCU execution delays. A conventional locker was also implemented and its execution delay was determined so that it could be compared to locking delays when using the MVM model.
The study targets the measurement of the cost of lock acquisition without conflicting access, which is the common case. The main performance metric is the delay, in machine cycles, for lock acquisition on a read/write access. Execution time would have been a better metric, but there is no actual implementation of the MVM model.
The DB application is briefly described first, followed by a description of the simulated environment, including hardware parameters and cost assignments. Implementation of a standard lock manager and its costing is described next. Finally, execution delays for locking using the MVM model are compared to locking delays of a conventional locker.
Modelled TPC-C application
The TPC-C [25] benchmark represents a generic wholesale supplier workload and specifies different types of transactions that operate on nine shared data tables: Warehouse, District, Customer, Item, Stock, Order, New-Order, Orderline and History data tables. There are five different types of transactions [25, 26] : new order, payment, order status, delivery and stock level.
The type 1 transaction is shown below to illustrate the transactions. For remaining transaction types the reader is referred to any one of [23, 25, 26] . The transactions have been coded as a C program and applied on a custom-coded DB software supporting implementation of relations as flat files, from now on referred to as data tables, with selections and joins implemented using B-trees for efficiency [27] . Execution of transactions operating on the DB was traced using the Tracing Tool (QPT2) [28] . Whenever a new transaction was issued, its type was selected by a uniformly distributed random draw mapped onto the transaction-type probabilities specified by the benchmark. Once a transaction type was determined, the transaction was executed, and the tuples it accessed were selected using skewed random distributions. Thus, although there were many executions of transactions of a particular type, each execution accessed different tuples in the data tables.
Transactions operate on data tables pre-initialized to the sizes shown in Table 2 . The last four data tables are not preinitialized since their entries are created as transactions are processed.
Tracing the execution of transactions as applied on the DB produced a trace in the form of reads and writes to virtual addresses, a trace then used as input to the simulator. The format of the address trace is as follows. The R stands for a read access and the W for a write access to the 32-bit virtual address represented in hexadecimal notation.
R: efffc40
W: aaa22048
R: effdd330

In fact, four different traces were produced, each one having a specific mix of serial execution of 24 transactions of the five types. The variations in transaction mixes allowed for control of the ratio of read and write accesses within an application and for different locality of access to the tuples. Here we report only some of the results for Transaction mix 1 created from a serial execution of 24 transactions in the following order by type, where the order was randomly generated as per TPC-C specifications: 2, 3, 4, 5, 1, 3, 3, 4, 4, 3, 4, 3, 3, 1, 5, 4, 1, 1, 3, 3, 3, 1, 4.

The trace was then post-processed in order to remove the references to virtual memory produced by code executed to drive the production of traces and to identify references to the data tables [27] ; that is, any memory references due to code not belonging to the transactions themselves were removed. Examples include memory references produced by instructions to generate random numbers and to control the sequencing of transactions. The post-processing also inserted begin/end transaction markers into the trace and, for each read/write reference to the data tables, an identification of the table/view and the tuple/access unit to which the reference refers.
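As an illustration of the raw trace format, the following sketch parses read/write records of the form shown in the sample and computes the fraction of reads. The function name and the returned tuple layout are assumptions made for this example; the markers and table/tuple identifiers added by the post-processing step are not modelled.

```python
from typing import List, Tuple


def parse_trace(text: str) -> List[Tuple[str, int]]:
    """Parse whitespace-separated 'R: <hex>' / 'W: <hex>' records."""
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("trace must consist of operation/address pairs")
    refs = []
    for op, addr in zip(tokens[0::2], tokens[1::2]):
        kind = op.rstrip(":")
        if kind not in ("R", "W"):
            raise ValueError(f"unknown operation: {op!r}")
        refs.append((kind, int(addr, 16)))  # addresses are hexadecimal
    return refs


sample = "R: efffc40 W: aaa22048 R: effdd330"
refs = parse_trace(sample)
read_fraction = sum(1 for kind, _ in refs if kind == "R") / len(refs)
```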
Notes
A trace-driven simulation, rather than an analytical model, was used to obtain values for the performance metrics of the model. Actual program traces facilitate the capture of the characteristics of an application better than analytical models. Many conflicting results have been published in the literature through the use of analytical models to describe, in particular, database application behaviour [29, 30, 31] . These results were conflicting because real application behaviour cannot be exactly modelled by statistical means, particularly due to the numerous assumptions that are inherent to analytical modelling.
It is acknowledged that the sizes of the data tables and the traces in this study are small. The size of the tables was limited by the trace generation tool we were using. The number of tuple insertions was limited by available disk space on the machine on which the simulations were run. One of the input trace files generated from the application working on the data tables is about 120 Mbyte in size. This initial file was then used as input to a utility program that performed the post-processing and generated yet another trace file while processing the first. With advances in computer technology, these sizes already appear small. Nevertheless, at the time of the experiments, we worked with the disk and memory sizes available to us.
Another drawback of the trace used as the simulator's input is that the version of the tracing tool used was unable to trace operating system code and thus we did not measure the impact of context switching overhead on the TLB and data caches. Even if we had an OS tracing tool, there is no OS that currently supports the MVM model. The measurements of the TLB miss rates are undoubtedly affected by the lack of OS code tracing. Because we do not have a trace of OS code on context switches, the effect of these switches was minimized by the serial execution of transactions: one transaction is run to completion before another starts. That is, the TLB miss rates, and thus the overall performance measurements, are more optimistic in this study than actual values.
Architecture simulator
For the evaluation, a simulator of a hierarchical memory system was designed as described in [23, 32] . The trace discussed above serves as input and the simulator models the content of caches and physical/virtual memory. Statistics are kept on the activity incurred at all levels of the memory hierarchy. Statistics kept on caches include the number of hits and misses due to read or write accesses and the delays due to caching activities. Statistics on memory are primarily targeted at the number of memory accesses and the number of faults to secondary storage.
Accesses to caches are triggered on read and write references issued by an application program as recorded in the trace. Cache accesses may also occur due to explicit operations such as cache flushes and invalidations. Table 3 summarizes the hardware cost in cycles. All caches are simulated with a copy-back write policy, LRU replacement policy and write allocation. It is assumed that the CPU cycle time is equivalent to the data cache cycle [33] . The data cache is virtually indexed, physically tagged. The line sizes for the L1 and L2 data caches are set at 128 bytes. Read hits on the L1 cache take one CPU cycle (L1R). Write hits (L1W) take two: one to access the tags, and one to access the R/W bit. Write-back with no fetch done on a write miss is the write policy used with the data cache. There is an automatic fetch on a read miss. On an L1 data cache read miss, there is an overlap of the memory-to-cache data transfer and the operations of the Protection Control Unit. However, memory accesses are serialized for all caches with respect to misses. A four-line write buffer between the cache and the memory is assumed. It is also assumed that the memory interference from DMA devices is negligible. The loading of the L2 cache from memory consumes 50 cycles. The L1 data cache is set at 8 kbyte within all the experiments while the L2 data cache size was 128 kbyte (512 sets, 2-way associative).
The line sizes for the PLB, TLB, ViewDefinition and FSM caches are sufficient at 8 bytes. The TLB and the FSM caches have capacities of 256 bytes, or 32 entries, each. The PLB cache has a capacity of 8 kbyte, or 1024 entries. The ViewDefinition cache has a capacity of 128 bytes, or 16 entries. The PLB, ViewDefinition cache, L2 data cache and FSM caches have cycle delays assigned to them assuming that they are situated off-chip.
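The stated capacities and entry counts can be cross-checked with a few lines of arithmetic. The helper name below is an assumption; the sizes are those given in the text.

```python
KBYTE = 1024


def entries(capacity_bytes: int, line_bytes: int) -> int:
    """Number of lines (entries) in a cache of the given capacity."""
    if capacity_bytes % line_bytes:
        raise ValueError("capacity must be a multiple of the line size")
    return capacity_bytes // line_bytes


CONTROL_LINE = 8  # bytes per PLB, TLB, ViewDefinition and FSM cache line
DATA_LINE = 128   # bytes per L1/L2 data cache line

tlb_entries = entries(256, CONTROL_LINE)
fsm_entries = entries(256, CONTROL_LINE)
plb_entries = entries(8 * KBYTE, CONTROL_LINE)
view_entries = entries(128, CONTROL_LINE)

l1_lines = entries(8 * KBYTE, DATA_LINE)
l2_sets = entries(128 * KBYTE, DATA_LINE) // 2  # 2-way set associative
```

With these figures the 16-entry ViewDefinition cache comfortably holds the nine views used in the study, which is consistent with the assumption that it never misses.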
For details on assignments of cycles for specific activities of the PCU see [23] .
Recall that the FSM and ViewDefinition caches are assumed to be large enough to avoid misses. This is certainly reasonable for this study because there is only one FSM protocol, and only nine views, one per data table. Each view is mapped to the same locking FSM protocol shown in Table 1 .
Software lock manager
A 'standard' lock manager, as outlined in [34, 35, 36] , was implemented. The lock information is found through named access of a fixed-size hash table of locks. A lock is implemented by a Lock Control Block (LCB) which contains information such as its name, its current mode (R or W), and links [34] . These are illustrated in Figure 4 .
The LCB is at the head of two doubly linked lists of Lock Request Blocks (LRBs). One list represents granted requests (must at least contain a list of identifiers of transactions which have access to the lock unit), and the other implements the pending requests (the suspend queue where blocked transactions are held). The LRB contains the transaction identity, the requested locking modes, the access unit name and an indication as to which of the two lists it is presently on. The latter facilitates efficiency in the transaction commit or abort procedures. Each LRB is associated with exactly one transaction (requester).
Obtaining a lock involves accessing the lock table (shown in Figure 5 ) using the resource's name as the hash key. If there is an LCB entry in the hash chain, the LRB chain for Granted Requests is scanned to determine whether the requester already has an LRB. The latter occurs for lock upgrade requests. When there is no LCB entry, a new entry is initialized and appended to the hash chain. If there is no LRB for the requester, the lock manager allocates a new LRB, chains it to the requester's lockset, and chains it to the appropriate LCB chain (granted or pending) according to whether a conflict is determined or not.
Conflict detection is performed upon a request to set a lock. First the current state of the access unit is obtained from the LCB entry in the routine which sets the lock. Then the LRB chain of the Granted Requests is scanned (if it exists) to look for either the LRB entry of the requester or the presence of multiple readers, or the identity of the transaction holding a write lock. The latter of course is tested only if the access unit was originally locked for write. Conflicts are detected when there are multiple readers and a write lock is requested, or a single reader other than the transaction requesting the write lock is on the Granted Requests list, or a write lock is held by a transaction other than the current requester.
A transaction's lock set is maintained on a linked list whose nodes point to LRB entries on both the Granted Requests and Pending queues. When a transaction commits or aborts, the lock set list is traversed to delete the corresponding entries on the Granted Requests and/or Pending lists. 
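The lock-setting and conflict-detection logic described above can be sketched as follows. This is a simplified model, not the implemented lock manager: a Python dict stands in for the fixed-size hash table, plain lists stand in for the doubly linked LRB chains, lock release is omitted, and all names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LCB:
    """Lock Control Block: name, current mode and the two request lists."""
    name: str
    mode: str = "R"
    granted: List[int] = field(default_factory=list)             # transaction ids
    pending: List[Tuple[int, str]] = field(default_factory=list)  # (txn, mode)


def conflicts(lcb: LCB, txn: int, mode: str) -> bool:
    """Apply the conflict rules to a request by txn for the given mode."""
    others = [t for t in lcb.granted if t != txn]
    if lcb.mode == "W":
        return bool(others)             # write lock held by another transaction
    return mode == "W" and bool(others)  # write request while others read


def set_lock(table: Dict[str, LCB], name: str, txn: int, mode: str) -> bool:
    """Grant the lock (True) or queue the request on the pending list (False)."""
    lcb = table.setdefault(name, LCB(name))  # create the LCB on first use
    if conflicts(lcb, txn, mode):
        lcb.pending.append((txn, mode))
        return False
    if txn not in lcb.granted:
        lcb.granted.append(txn)
    if mode == "W":
        lcb.mode = "W"                       # covers lock upgrades as well
    return True


table: Dict[str, LCB] = {}
r1 = set_lock(table, "stock:17", 1, "R")
r2 = set_lock(table, "stock:17", 2, "R")
w1 = set_lock(table, "stock:17", 1, "W")  # blocked: another reader exists
```

Even in this reduced form, every request involves a table lookup and a scan of the granted list, which is the per-request work that the MVM approach moves into the PCU.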
Comparison of software locking with MVM locking
The average delay, in cycles, expended by the PCU to acquire a read lock is 49 cycles while to acquire a write lock is 45 cycles. Because out of the total accesses to data tables 75% were read accesses, while 25% were write accesses, the average delay to acquire a read/write lock is 48 cycles.
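The combined figure is simply the access-mix-weighted mean of the two delays, as this short check shows (the function name is an assumption):

```python
def weighted_delay(read_fraction: float, read_cycles: float,
                   write_cycles: float) -> float:
    """Average lock-acquisition delay weighted by the read/write mix."""
    return read_fraction * read_cycles + (1.0 - read_fraction) * write_cycles


avg_cycles = weighted_delay(0.75, 49, 45)  # 0.75 * 49 + 0.25 * 45
```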
[23] provides a detailed analysis of average delays and also delays of the various PCU and architectural components for four transaction mixes. It also provides sensitivity analysis of delays to variations in key parameters such as cache sizes. Measurements of instruction counts for the software lock manager are obtained using the QPT2 tool. QPT2's output includes a Unix-style flat statistical profile for the implemented lock manager program (see [23] ). For each procedure within the manager, the number of instructions, the percentage of time spent executing the procedure in ms, the number of calls to the procedure, the number of instructions per call and the routine's name are provided. Only totals are presented in Table 4 . All input programs are compiled with gcc 2.6.3 with and without the optimization flag set.
The number of cycles for the setLock() function is obtained by using the following equation [37] : CPU clock cycles $= \sum_{i=1}^{h} \mathrm{CPI}_i \times \mathrm{IC}_i$, where $\mathrm{CPI}_i$ represents the average number of clock cycles for instruction $i$ and $\mathrm{IC}_i$ represents the number of times instruction $i$ is executed in a program. If we assume that the SPARC-20's CPI is 1 and that each instruction is identical for simplification purposes, then the number of CPU clock cycles $= 1 \times 41 = 41$ for the setLock() function. This estimate is low because the memory hierarchy costs are not included yet.
The memory hierarchy costs for the entire lock manager are obtained from the cache simulator. Table 5 shows a costing of the memory hierarchy. This is the minimum memory hierarchy cost since no other applications are competing for TLB and data cache space.
As shown in Table 5 , there are 91,037 cycles expended due to delay in the memory hierarchy (TLB, L1, L2 and Main Memory). With 699 calls (see [23] ) to setLock() within the transaction mix, on average 91,037/699 ≈ 130.2, taken here as 131, additional cycles need to be added as an estimate of the overhead to satisfy each lock request. Therefore, roughly (41+131) = 172 cycles will be incurred by the conventional lock manager as opposed to the 48 incurred by the Multi-View Memory model's implementation of locking. The MVM model thus provides better performance for lock acquisition, at an average of 48 cycles as opposed to 172 cycles for the conventional lock manager. This is indeed true, but only for lock acquisitions. There are two important points that the evaluation did not capture.
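The software lock manager estimate combines the instruction cost with the per-call share of the memory-hierarchy delay. The sketch below reproduces that arithmetic under the stated simplification of a single instruction class with CPI = 1; the function name is an assumption. The memory-hierarchy share works out to about 130.2 cycles per call, which the text reports as 131.

```python
def software_lock_cycles(instr_count: int, cpi: float,
                         mem_cycles: int, calls: int) -> float:
    """Instruction cycles plus the per-call memory-hierarchy delay."""
    return instr_count * cpi + mem_cycles / calls


mem_per_call = 91_037 / 699  # memory-hierarchy cycles per setLock() call
total = software_lock_cycles(instr_count=41, cpi=1.0,
                             mem_cycles=91_037, calls=699)
mvm_cycles = 48              # average PCU delay from the comparison above
```

Whether the per-call share is taken as 130 or 131 cycles, the software estimate remains more than three times the MVM figure for an uncontended lock acquisition.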
First, the cost of context switching is not included in the above calculations. Context switches, lightweight if the lock manager is in the same address space, heavyweight otherwise, are eliminated for setting locks using MVM locking on currently unlocked access units, on acquisition of read locks to units that are in Single Reader or Multiple Readers states, or on acquisition of write locks. This is because the corresponding state transitions are handled in the PCU unit. In the software lock handler case, however, each lock acquisition is effected by an explicit transfer of control to the locker initiated by the application software. This indicates that the performance advantage of the MVM locking should be higher.
Secondly, the evaluation does not capture additional delays required by the MVM model to transfer information about the locks acquired by a thread/transaction from the kernel to the lock manager and the storage of this information by the manager in its local data structures. This transfer of information, however, occurs only on a thread context switch. If the thread acquires a number of locks (caused a number of state transitions), information about the acquired locks is transferred to the manager using one communication primitive. Hence this delay is amortized over a number of lock acquisitions. In the software locker scheme, there is no need for transfer of information about lock acquisitions to the locker. An application thread makes an explicit request for a lock and the locker itself records appropriate information about the lock in its data structures when it grants the lock.
RELATED WORK
It has been shown [4, 6, 7] that DB and transaction system lock managers outperform locking services provided by an OS because the OS provides efficient locking only on a page-level basis using virtual memory primitives, whereas the DB lock managers can provide control at finer granularities. Thus the OS is not generally used by the DB systems for locking services.
Chang and Mergen [13] report on one of the first complete attempts to incorporate transaction management functions in a Memory Management Unit (MMU) within IBM's 801 storage architecture. A hardware locking mechanism monitors the read and write references of individual transactions and provides for locking of 128-byte units of data for writes and of pages or 128-byte units for reads. It thus provides finer granularity of locking than Stonebraker's [6] scheme. Furthermore, the scheme also provides for lock acquisition in hardware without software intervention. As in the MVM scheme, information about units automatically locked by a transaction must be collected periodically. The IBM Power architecture does include record-level hardware protection, but this was excluded from the PowerPC architecture.
The protection schemes found in Hewlett Packard's PA-RISC [16] and the PLB protection organization [17] are also closely related to our work in that they show a separation of protection information from address translation. Both the PA-RISC and PLB protection schemes are designed for 64-bit single-address-space operating systems in which the protected access units are pages. Our architecture is structured to provide for variable-sized access units within multiple address spaces, but it can be extended easily to the 64-bit single-address-space environment. It is further distinguished by its support of control strategies through FSM-specified protocols on the access units.
Examples of applications which can efficiently utilize the virtual memory page faulting protection mechanisms appear in [2] where it was also concluded that different applications may perform best using different page sizes. In [38] a multisize paging architecture with elastic page allocation was proposed. In [39] , two page sizes within an address space were examined.
SUMMARY AND CONCLUSIONS
This paper showed that the MVM model can be used to provide efficient locking services by OSs to DB and transaction processing applications. First of all, the MVM model provides for access control, and thus locking, to access units of views. The size of access units is fixed within a view but it may be different in different views. This alleviates one of the major problems, namely that OSs provide efficient locking services only at the system page size, relying on the MMU's hardware support for manipulating page protection. Furthermore, the MVM provides for locking without explicit lock requests, allowing applications simply to read and write access units. Finally, locks may be acquired without software intervention in certain cases. For instance, a read lock is automatically granted to an unlocked unit.
It was shown that MVM locking provides an efficient alternative to a conventional software lock manager. Yet, when the efficiency of the MVM locking scheme is compared to the hardware-supported locking of 128-byte and page-sized data units provided by the 801, the MVM is inferior. The two schemes are similar in that they enable locking of units of data which are smaller than a page and also in that they provide for automatic lock acquisitions, the two important features required for an efficient locking service provided by OSs. What is different about the MVM approach from the 801 hardware locking support, which is faster than the MVM architecture support but was not commercially adopted? The answer lies in the flexibility. The size of access units is not fixed, such as to 128-byte or page-size units, but is customizable to applications. More importantly, the MVM model provides support for many access control protocols; locking is only one application. Because in the MVM model approach the access control protocols are specified by FSMs with their states and transitions stored in tables, any protocols that can be represented by FSMs can be supported. Various protocols supported by the MVM model can be found in [23] . Finally, because the FSM and state tables can be supported by caches, hardware support for various access control protocols is attained.
