With the advent of ubiquitous multi-core architectures, a major challenge is to simplify parallel programming. One way to tame one of the main sources of programming complexity, namely synchronization, is transactional memory (TM). However, we argue that TM does not go far enough, since the programmer still needs nonlocal reasoning to decide where to place transactions in the code. A significant improvement to the art is Data-Centric Synchronization (DCS), where the programmer uses local reasoning to assign synchronization constraints to data. Based on these, the system automatically infers critical sections and inserts synchronization operations.
Introduction
As chip multiprocessors become widespread, there is growing pressure to substantially broaden their parallel application base. Unfortunately, the vast majority of current application programmers find parallel programming too complex. To effectively utilize the upcoming hardware, we need major breakthroughs that simplify parallel programming.
Developing a parallel application consists of four steps [15] : decomposing the problem, assigning the work to threads, orchestrating the threads, and mapping them to the machine. Orchestration is arguably the most challenging step, as it involves synchronizing the threads. It is in this area that innovations to simplify parallel programming are most urgently sought.
One such innovation is Transactional Memory (TM) [1, 7, 10, 16, 18]. In TM, the programmer specifies sequences of operations that should be executed atomically. TM simplifies parallel programming in two ways. First, the programmer does not need to worry about the intricacies of managing locks. Second, he does not need to fine-tune critical sections as much, since concurrency is limited only by dependences, not by critical section length.
We claim, however, that TM is still complicated: it requires the programmer to reason non-locally. Specifically, when the programmer inserts a transaction annotation, he also needs to think about what other parts of the program may be accessing this same or related shared data, and potentially insert transaction annotations there as well. Intuitively, like inserting lock and unlock operations, inserting transaction annotations involves taking a code-centric approach.

*This work was supported in part by the National Science Foundation under grants EIA-0072102, EIA-0103610, CHE-0121357, and CCR-0325603; DARPA under grant NBCH30390004; DOE under grant B347886; and gifts from IBM and Intel. Luis Ceze was supported by an IBM PhD Fellowship.

†IBM T.J. Watson Research Center, praun@us.ibm.com
To improve programmability further, we need a data-centric approach [20] . With Data-Centric Synchronization (DCS), the programmer associates synchronization constraints with the program's data structures. Such constraints indicate which sets of data structures should remain consistent with each other and, therefore, be accessed in the same critical section. From these constraints, the system automatically infers the critical sections and inserts thread synchronization operations in the code. DCS simplifies parallel programming because the programmer reasons locally, focusing only on what structures should be consistent with each other.
Existing DCS proposals [20] take user-provided, data-centric synchronization constraints and decide where to insert critical sections using software-only support. In particular, the compiler needs to analyze all the accesses in the code. This is unrealistic in most C/C++ environments, where pointer aliasing is common and, most importantly, dynamic linking denies the compiler access to the whole program.
To make DCS practical, this paper proposes the first design for Hardware DCS (H-DCS). Our proposal, called Colorama, relies on two hardware primitives: one that monitors all memory accesses to decide when to start a critical section, and one that flexibly triggers the exit of a critical section. Colorama is independent of the underlying synchronization mechanism. In this paper, we present a transaction-based implementation and also discuss the issues that appear in a lock-based implementation.
We describe Colorama's architecture, a simple implementation that extends a Mondrian Memory Protection (MMP) [22] system, its programming model and API, and its capacity to help debug conventional codes. We show that Colorama needs few hardware resources and has small overhead. It supports general-purpose, pointer-based languages such as C/C++ and, in our opinion, can substantially simplify the task of writing new parallel programs.
In the following, Section 2 introduces DCS; Sections 3, 4, 5 and 6 present Colorama's architecture, implementation, programming environment, and debugging issues respectively; Sections 7 and 8 evaluate Colorama; and Section 9 discusses related work.
Data-Centric Synchronization (DCS)

Basic Idea
In Data-Centric Synchronization (DCS) [20], the programmer associates synchronization constraints with data structures, typically when they are declared or allocated. These constraints specify which data structures are in the same "data consistency domain" and, therefore, should be kept consistent with each other. This means that when one structure is being modified, all the other structures in the same domain need to be protected from access by other threads. To support this model, when a thread accesses a structure of a domain, the thread automatically enters a critical section for that domain. No other thread can now access structures of that domain. When the thread finishes working on structures of that domain, the thread automatically exits the critical section.

1-4244-0805-9/07/$25.00 ©2007 IEEE
DCS is in contrast to conventional Code-Centric Synchronization (CCS), where synchronization constraints are associated with code. In CCS, the programmer marks what code is inside which critical section.
We argue that DCS has a significant advantage over CCS in programmability. CCS requires the programmer to reason nonlocally [20] : every time he inserts a transaction begin/end or a lock acquire/release annotation in the code, he also needs to think about what other locations in the program may be accessing this same or related data structures, and potentially insert synchronization annotations there as well. Instead, with DCS, the programmer reasons locally, focusing only on what data structures should be consistent with each other. The system automatically infers the critical sections.
The shortcoming of DCS stems from limited program knowledge. The system has to automatically infer when the code enters and exits a critical section, so that it can insert the appropriate synchronization operations around the section.
Identifying entry points to critical sections largely involves identifying accesses to data structures belonging to a domain. Identifying exit points is harder. It is typically impossible for the system to know when a thread has stopped working on structures of a given domain and, therefore, the critical section for that domain should terminate. Consequently, DCS schemes have an Exit Policy, which is a simple, clear algorithm for terminating a critical section. The exit policy used by the system is communicated to the programmer. This is because, to write correct code, the programmer needs to know the exit policy used, and write code in agreement with it. We believe that having a simple exit policy is an acceptable burden given the improvement in programmability provided by DCS.
Software DCS (S-DCS)
DCS has only been implemented in software, under limited environments. The main example of what we strictly consider Software DCS (S-DCS) is Vaziri et al.'s Atomic Sets [20] . This system includes a compiler and language extensions to Java. The programmer, when declaring Java classes, can group several fields into an Atomic Set. The elements of an Atomic Set are supposed to be manipulated atomically inside critical sections that are automatically created by the compiler.
The entry points of critical sections of an Atomic Set are inferred by the compiler by statically analyzing the code and identifying likely accesses to data belonging to the Set. Since Java is relatively analyzable due to type safety and the lack of pointer arithmetic, if the compiler has access to the whole program, then it can conservatively identify when data from Atomic Sets are accessed [20] .
The exit policy used by Vaziri et al. is to insert the exit point of a critical section right before the return of the Java method that contains the corresponding entry point. This policy builds on the intuition that a method is a natural unit of work: a method is typically exited when the work is completed. Therefore, a single method includes both the entry and the exit points of a critical section.
Proposal for Hardware DCS (H-DCS): Colorama
S-DCS is unsuitable for popular languages such as C/C++, which allow pointer arithmetic and aliasing. Since the compiler cannot fully analyze the code due to lack of pointer information, it can only generate conservative critical section approximations of very limited use. Alternatively, if it inserts instructions to check the address of every pointer access dynamically, it induces intolerable overhead.
More fundamentally, in environments with dynamic linking, deployment of S-DCS is impractical because the compiler may lack access to the whole program.
Therefore, this paper proposes a novel architecture to support DCS in hardware. The resulting Hardware DCS (H-DCS) scheme is called Colorama. It supports any type of access pattern, has low overhead, and is usable in any language.
Colorama has two primitives, corresponding to the need to identify critical section entry and exit points. The first one is hardware to monitor all addresses issued by the processor with very low overhead. If a thread accesses a structure belonging to a consistency domain from outside of a critical section for that domain, Colorama starts a critical section.
The second primitive is hardware to support the exit of a critical section. This primitive is very flexible and is driven by the compiler, so that different exit policies can be supported. At all times, however, it has to be clear to the programmer what exit policy will be used by the compiler as it generates the executable. In this first paper, we simply use the exit policy used by Vaziri et al. [20]. We use it because it is very intuitive. For example, Wang and Stoller [21] use the heuristic that methods execute atomically to identify potential atomicity violations in Java programs.
Note that the support for Colorama does not replicate (and is largely independent of) the support that the machine provides for synchronization. In this paper, we propose a Colorama implementation that relies on transactions as the underlying synchronization mechanism. We also discuss the issues that appear in an implementation based on locks.
Examples of Colorama Programming
In Colorama, a data consistency domain is called a Color, while a memory region with structures belonging to a consistency domain is referred to as Colored. In this section, we show three motivating examples.
Linked List. Consider a linked list that is manipulated by functions that insert a node, delete a node, and traverse the list (Figure 1).

The per-thread structure is the Thread Color Status. It contains the set of ColorIDs currently owned by the thread. These are the colors whose critical sections are currently being executed by the thread. They are listed in the Owned Colors Array.
The Thread Color Status also provides an efficient hardware primitive for the software to implement the exit policy. The primitive is built around the two Color Bitmap Registers: the read/write Color Acquire Bitmap (CAB) register and the write-only Color Release Bitmap (CRB) register (Figure 4 ). These registers have as many bits as entries in the Owned Colors Array (e.g., 64). Every time that a ColorID is inserted in location i of the Owned Colors Array, the corresponding bit in the CAB register is automatically set in hardware. In addition, when the software sets bit i of the CRB register, the hardware triggers a critical section exit for the ColorID in the corresponding entry of the Owned Colors Array.
Chosen Critical Section Exit Policy
As indicated in Section 2.3, in this paper we choose the exit policy used by Vaziri et al. [20]: trigger the exit of a color's critical section when the thread returns from the subroutine where the critical section was entered. We choose it because it is simple and intuitive: a subroutine is a natural unit of work; when the subroutine returns, the thread is likely to have finished the operation it was doing and, therefore, stopped working on that color's structures. Some evidence that programmers already follow this convention informally is presented later (Section 8.1). Note, however, that in DCS, writing correct code requires that the programmer be aware of the exit policy supported by the system and follow it.
This policy is implemented with the compiler-inserted instructions shown in Figure 5(d). At every subroutine entry, the compiler saves the CAB register in the stack and then clears it. This does not affect the Owned Colors Array (Figure 4). As the subroutine executes, if a new color becomes owned, the corresponding bit in the CAB register gets automatically set. Before the subroutine returns, the compiler copies the CAB to the CRB register, thereby triggering the exit of all the critical sections entered in this subroutine. Then, it restores the CAB register from the stack, leaving it in the state it had before the subroutine was called. This algorithm works with any nesting.
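The save, clear, copy, and restore sequence described above can be modeled in software. The following Python sketch is only an illustration of the bookkeeping, not the hardware design: the class `ThreadColorStatus`, the decorator `colorama_subroutine`, and the `exited` log are hypothetical names, and clearing CAB bits on a CRB write is an assumption of this model that the text does not specify.

```python
class ThreadColorStatus:
    """Toy model of the per-thread Colorama state (all names are illustrative)."""

    def __init__(self, slots=64):
        self.owned = [None] * slots  # Owned Colors Array
        self.cab = 0                 # Color Acquire Bitmap register
        self.exited = []             # log of critical-section exits, for inspection

    def acquire(self, color_id):
        """Model the hardware starting a critical section for color_id."""
        if color_id in self.owned:
            return                   # already inside this color's critical section
        slot = self.owned.index(None)
        self.owned[slot] = color_id
        self.cab |= 1 << slot        # the matching CAB bit is set automatically

    def write_crb(self, bitmap):
        """Model a write to the CRB: every set bit exits that slot's critical section."""
        for slot, color_id in enumerate(self.owned):
            if (bitmap >> slot) & 1 and color_id is not None:
                self.exited.append(color_id)
                self.owned[slot] = None
        self.cab &= ~bitmap          # assumption of this model, not stated in the text


def colorama_subroutine(func):
    """Stand-in for the compiler-inserted entry and return instructions."""
    def wrapper(tcs, *args):
        saved_cab = tcs.cab          # entry: save the CAB on the stack ...
        tcs.cab = 0                  # ... and clear it
        try:
            return func(tcs, *args)
        finally:
            tcs.write_crb(tcs.cab)   # return: copy CAB to CRB, exiting local colors
            tcs.cab = saved_cab      # restore the caller's CAB
    return wrapper


@colorama_subroutine
def inner(tcs):
    tcs.acquire("blue")


@colorama_subroutine
def outer(tcs):
    tcs.acquire("red")
    inner(tcs)                       # "blue" is exited when inner returns
    return list(tcs.owned[:2])       # "red" is still owned here
```

In this model, a color first touched in `inner` is exited at `inner`'s return, while a color already owned by the caller is left alone, which is the nesting behavior the paragraph describes.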
Detailed Colorama Operation
Based on the previous discussion, we now 
Pointers as Subroutine Arguments
Sometimes, a critical section performs multiple operations on a structure, and invokes one subroutine per operation, passing as argument to each subroutine a pointer to the structure. This is common when handling complex structures such as hash tables (Figure 6).

To use this primitive for our purposes, we extend the Colorama compiler to identify subroutine calls with arguments that are pointers. For every such argument, the compiler inserts a colorcheck instruction with that argument, right before the call. In the example, the argument is htPtr. The resulting code is shown in Figure 6(c).
This change accomplishes what we need. At run time, colorcheck checks the contents of htPtr before readHash() and triggers the start of the critical section.
Why Use Multiple Colors
If the system supports nested transactions, having multiple colors provides an intuitive way to build transaction nests [17] : every time a new color is accessed inside a transaction, a new nesting level is created.
Irrespective of whether or not the system supports nested transactions, having multiple colors is also useful in three ways. First, it can help debug the code. Specifically, every time a processor attempts to commit a transaction, as it broadcasts the addresses that it wrote, we propose that it also broadcast the colors that the transaction owned. If a second processor that is executing a different-color transaction detects a collision with the committing one, the programmer is warned that a bug is likely, since different-color transactions should not have collisions.
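The commit-time check can be illustrated with a small Python sketch. It assumes, purely for illustration, that each transaction is summarized by a set of addresses and a set of owned colors; the function name `classify_commit` and the returned labels are hypothetical and not part of Colorama.

```python
def classify_commit(commit_writes, commit_colors, other_accesses, other_colors):
    """Classify how a committing transaction relates to another in-flight one.

    commit_writes  -- addresses written by the committing transaction
    commit_colors  -- colors owned by the committing transaction
    other_accesses -- addresses read or written by the other transaction
    other_colors   -- colors owned by the other transaction
    """
    if not set(commit_writes) & set(other_accesses):
        return "no collision"
    if set(commit_colors) & set(other_colors):
        return "same-color collision"        # an ordinary conflict
    return "different-color collision"       # would trigger the warning


# Two transactions of different colors touch address 0x40.
verdict = classify_commit({0x40, 0x44}, {"red"}, {0x40}, {"blue"})
```

Here `verdict` is `"different-color collision"`, the case in which the programmer would be warned of a likely bug.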
The second use is to help optimize the cross-thread dependence disambiguation that takes place at thread commit. If we are certain that the code has no bugs, we may decide to reduce overheads by not checking for collisions between concurrent transactions of different colors. This may save inter-processor traffic.
The final advantage of supporting multiple colors is that it enables the programmer to embed more information in the program on how shared data are used.
If the system uses locks, instead, supporting multiple colors directly translates into enabling more concurrency (Section 4.3).
Implementation of Colorama
Colorama Structures
The Colorama structures are the Palette and the Thread Color Status (Figure 4). The Palette is a distributed structure implemented part in hardware and part in software. It is accessed with a pattern similar to that of structures that contain address protection information, i.e., which address can be read or written by which thread. Indeed, protection information is also shared by all threads and is accessed at every memory request. Consequently, both types of information can share the same implementation. One difference is that the Palette contains per-word information, while current virtual memory systems associate protection information with pages. Consequently, to accommodate the Palette, we would need to redesign current TLB structures. In practice, there is already an efficient design that manages per-word protection information, namely the Mondrian Memory Protection (MMP) system [22]. Therefore, we implement the Palette as extra bits to be stored in the MMP structures.
The implementation of an MMP system is shown as the white structures of Figure 7(a). The Multilevel Permissions

The shaded fields in Figure 7(a) constitute the Palette. They simply add the ColorID bits to the three MMP structures. Figure 7(b) shows a PLB entry in detail. A PLB entry may correspond to a cache line. The Palette adds a ColorID (e.g., 12 bits) to every word contained in the PLB entry, e.g., 16 x 12 bits for a 16-word line. A load or store automatically checks the ColorID of the address accessed, which is typically in a sidecar register or in the PLB.

register, and the write-only CRB register (Figure 4)

deadlock has occurred. Then, the handler informs the user of where the deadlock happened.
We consider this support to be a debugging aid. We expect that, as programmers become familiar with Colorama's programming model and whatever exit policy is used, they will write code that executes fast and reliably.
Note that deadlocks do not exist in a transaction-based implementation of Colorama. Transactions are known to be susceptible to livelocks, but livelocks are easily avoided.
Programming with Colorama
The goal of Colorama is to simplify parallel programming. One of the ways in which Transactional Memory (TM) simplifies the programmer's job is by not requiring so much fine-tuning of the critical sections: concurrency is limited by dependences, not critical section length. With Colorama, the programmer's job is further simplified beyond TM because he does not even need to mark critical sections, as the system automatically infers them. The result is highly programmable and maintainable code. In this section, we examine several programming issues in Colorama.
Correctness
At a minimum, Colorama guarantees that all executions of critical sections of the same color by different threads are serializable. Consequently, if the programmer colors all the shared data structures that should be accessed in an exclusive manner, Colorama produces a data-race free program. All conflicting accesses will be separated by transaction boundaries or lock operations.
The extent and granularity of coloring typically matter relatively little in a transaction-based implementation of Colorama, since concurrency is only limited by data dependences (although long transactions that overflow the cache are slow). However, they matter substantially more in a lock-based implementation. In this case, if the programmer colors structures for which the accesses do not need to be constrained (e.g., thread-private variables), the resulting superfluous critical sections or longer-than-necessary ones may limit concurrency and lower performance. Conversely, a programmer can enable more concurrency if variables that do not have mutual consistency constraints are assigned different colors. This may improve performance.
If the programmer fails to color a structure that should be accessed in an exclusive manner, the program may have data races. Likewise, if he assigns different colors to structures that have mutual consistency constraints, or if he does not respect the exit policy of the system (in our case, by continuing to manipulate an exclusive structure past the corresponding subroutine return), the program may function incorrectly.
Code Compatibility Issues
A program written for Colorama may be linked with libraries that do not use Colorama's Application Binary Interface (ABI); for example, they may use explicit transactions or locks. In this case, no special action needs to be taken. The legacy library will use transactions or locks to protect its own data structures, not program data. For library-accessed program data, Colorama will continue to trigger critical section entries on access and (if the library executes program code through a callback) critical section exits on subroutine returns.
In certain exceptional cases, applications may require the absence of Colorama's default exit policy. For example, consider an infinite loop where a consumer thread reads data from a shared buffer that is filled by a producer thread. If programmed with transactions, every access to the buffer would be a transaction. In Colorama, if the shared buffer is colored, the whole infinite loop would become a single critical section. To avoid this case, the programmer (or compiler) has to explicitly release the buffer's color at every iteration. As another example, to implement a wait on condition variables, the programmer (or compiler) will want to be able to temporarily release a color and then re-acquire it.
These operations are available through a Colorama library as follows. First, consider releasing the color associated with an address. The library first uses a Colorama instruction called getcolorid (Section 4.1). Such instruction simply returns the ColorID of the address. Then, the library searches the Owned Colors Array (Figure 4 ) to find the array offset where that ColorID is stored. If found, the library writes to the CRB register a set bit at the same offset, which triggers the release of ColorID. Note also that we can release all colors by writing all ones to the CRB register.
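The release procedure above can be sketched as a small software model. This Python sketch is illustrative only: the class `ColoramaModel`, the 4-byte word size, and the dictionary-based Palette are assumptions made for the example, and the methods merely mimic the roles of getcolorid, colorcheck, and a CRB write.

```python
WORD = 4  # assumed word size in bytes, chosen for illustration


class ColoramaModel:
    """Toy model of the Palette plus one thread's Owned Colors Array."""

    def __init__(self, slots=64):
        self.palette = {}              # word-aligned address -> ColorID
        self.owned = [None] * slots    # Owned Colors Array

    def color(self, base, nbytes, color_id):
        """Color every word that overlaps [base, base + nbytes)."""
        for addr in range(base - base % WORD, base + nbytes, WORD):
            self.palette[addr] = color_id

    def getcolorid(self, addr):
        """Return the ColorID of addr, or None if the word is uncolored."""
        return self.palette.get(addr - addr % WORD)

    def colorcheck(self, addr):
        """Start a critical section for addr's color if it is not already owned."""
        cid = self.getcolorid(addr)
        if cid is not None and cid not in self.owned:
            self.owned[self.owned.index(None)] = cid

    def write_crb(self, bitmap):
        """Exit the critical section of every slot whose CRB bit is set."""
        for slot in range(len(self.owned)):
            if (bitmap >> slot) & 1:
                self.owned[slot] = None

    def release_color(self, addr):
        """Library routine: look up the color, find its slot, set that CRB bit."""
        cid = self.getcolorid(addr)
        if cid is None or cid not in self.owned:
            return False
        self.write_crb(1 << self.owned.index(cid))
        return True


m = ColoramaModel()
m.color(0x1000, 16, 7)                 # color a 16-byte buffer with ColorID 7
m.colorcheck(0x1004)                   # first access starts the critical section
released = m.release_color(0x1008)     # explicit release through the library
m.colorcheck(0x1008)                   # re-acquire using the saved address
```

In this model, releasing all colors corresponds to `m.write_crb((1 << 64) - 1)`, that is, writing all ones to the CRB.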
Releasing a color temporarily involves releasing the color as before and saving the address. Re-acquiring a color involves using the colorcheck instruction on the saved address.
Colorama's Complete API
Colorama's complete API is shown in Table 1. It contains five instructions, three system calls, and four library calls. The instructions are colorcheck, getcolorid, and moves to/from CAB or CRB. The system calls color or decolor addresses. The reason why these operations are system calls is that they update the PLB, which also contains protection information (Section 4.1). These system calls are typically issued when data structures are allocated or deallocated; they are rarely issued otherwise. Possibly, the two coloring system calls could be inserted directly by the compiler, based on language syntax extensions that specify colors when data structures are declared. Moreover, the decolor system call could be inside free(). Finally, the rationale for the four library calls in Table 1 was presented in Section 5.2. Typically, only experienced programmers would use the library calls.
Example: Prevention of an Atomicity Violation
Finally, to showcase the advantages of Colorama's programming simplicity, we show one example where Colorama helps prevent a subtle synchronization defect. Figure 9 shows Java method append, which appends one string to another. It calls methods length to get the length of a string and getChars to copy the string. The figure also shows a call to append string sb to string sa.
Method append is annotated as synchronized, which means that it executes under mutual exclusion with other synchronized methods invoked on sa. Methods length and getChars are also synchronized. However, when they are called from within append in the example, they are synchronized with other methods invoked on sb. As a result, although the individual interactions of length and getChars on sb are atomic, the sequence of interactions is not: string sb can be altered by another thread between the length and getChars calls, resulting in a stale value of len at the point of calling getChars.
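The stale-length problem can be reproduced deterministically without threads. The Python sketch below is not the Java code of Figure 9; it is a hypothetical single-threaded replay in which the `interfere` callback stands in for another thread that runs between the two calls. The class `SyncBuffer` and its `truncate` method are invented for the example.

```python
class SyncBuffer:
    """Minimal stand-in for a string buffer whose methods are individually atomic."""

    def __init__(self, text):
        self.chars = list(text)

    def length(self):
        return len(self.chars)

    def get_chars(self, count):
        if count > len(self.chars):
            raise IndexError("requested %d chars, only %d present"
                             % (count, len(self.chars)))
        return self.chars[:count]

    def truncate(self, count):
        del self.chars[count:]


def append(sa, sb, interfere=None):
    """Append sb to sa; `interfere` models another thread running mid-sequence."""
    n = sb.length()                    # first atomic interaction with sb
    if interfere is not None:
        interfere(sb)                  # sb changes between the two calls
    sa.chars.extend(sb.get_chars(n))   # second interaction uses the stale n


sa = SyncBuffer("ab")
append(sa, SyncBuffer("cdef"))         # no interference: sa becomes "abcdef"

try:
    append(SyncBuffer("ab"), SyncBuffer("cdef"),
           interfere=lambda buf: buf.truncate(1))
    outcome = "ok"
except IndexError:
    outcome = "stale length detected"
```

Each call on `sb` is correct in isolation; only the pair is broken, which is why a critical section spanning both calls is needed.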
In Colorama, defects such as this one are prevented. If string sb is colored, as soon as it is first accessed inside append, a critical section starts. With the exit policy used, the critical section extends to the end of the method -therefore encompassing the calls to length 
Code Debugging Issues
While we argue that programming in Colorama is simpler and less error-prone than in the conventional CCS approach, it is still possible to have bugs. In this section, we examine how to debug Colorama code. In addition, we also consider a related question, namely leveraging the Colorama hardware to debug conventional CCS code.
Debugging Colorama Code
We classify Colorama bugs into three classes: (i) failing to color a structure that should be colored; (ii) coloring two structures from the same consistency domain with two different colors; and (iii) violating the exit policy. The bugs in class (i) can lead to data races, which can be detected with conventional data-race detection tools. They can also lead to collisions between critical sections of different colors, which are easily detected by Colorama (Section 3.5).
The bugs in classes (ii) and (iii) cause atomicity violations. They can be debugged with conventional tools that use heuristics to detect atomicity violations [6, 21] .
The bugs in class (iii) are unique to DCS. For the exit policy used in this paper, they occur when the programmer assumes that a critical section extends past its corresponding subroutine return. The exit policy, of course, triggers a critical section exit at that particular return. Fortunately, we can use simple heuristics to identify possible instances of these bugs. The procedure is to record the colors of the critical sections that exit at a given subroutine return i. Then, we check if the thread accesses any of these colors again before the next N dynamic subroutine returns, where N can be 1. If it does, the programmer is warned, as he may have expected that the color's critical section had extended beyond the return i. Note that this procedure only relies on single-thread information, not on information dependent on the access interleaving of multiple threads. As a result, the bug manifests deterministically.
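The heuristic can be sketched over a recorded single-thread trace. The trace format below, with `("access", color)` and `("return", exited_colors)` events, and the function name `find_exit_policy_suspects` are assumptions made for this Python illustration.

```python
def find_exit_policy_suspects(trace, n=1):
    """Flag colors that are re-accessed within n subroutine returns of their exit.

    trace -- list of ("access", color) or ("return", [colors exited here]) events
    Returns a list of (color, index of the return event that exited it).
    """
    watch = {}      # color -> (index of exiting return, returns left to watch)
    warnings = []
    for index, (kind, payload) in enumerate(trace):
        if kind == "access":
            if payload in watch:
                exit_index, _ = watch.pop(payload)
                warnings.append((payload, exit_index))
        elif kind == "return":
            # Age the colors already being watched; drop those whose window closed.
            watch = {color: (i, left - 1)
                     for color, (i, left) in watch.items() if left - 1 > 0}
            for color in payload:
                watch[color] = (index, n)
    return warnings


trace = [
    ("access", "red"),
    ("return", ["red"]),   # red's critical section exits at this return
    ("access", "red"),     # touched again right away: suspicious
    ("return", ["red"]),
    ("return", []),
    ("access", "red"),     # one more return has passed; flagged only if n >= 2
]
suspects = find_exit_policy_suspects(trace, n=1)
```

With `n=1` the sketch reports `[("red", 1)]`. Because the input is a single thread's trace, the result does not depend on how threads interleave.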
Debugging CCS Code with Colorama Hardware
A programmer who writes conventional CCS code on a machine with Colorama hardware can benefit from additionally annotating the data structures with colors as in DCS. Such annotations, if they drive the Colorama hardware without actually starting critical sections, can help debug the CCS code. As an illustration, assume that the programmer has written the CCS code with transactions. In this case, the Colorama hardware can detect when the following rules are violated, which is a strong indication of a bug.
1. Colored data should only be accessed inside transactions. Accesses from outside are typically bugs.
2. As indicated in Section 3.5, transactions of different colors should not collide. The Colorama hardware records the colors accessed by each transaction. A collision between two transactions of different colors likely suggests that the programmer was unaware of some data sharing.
3. A non-nested transaction should typically access only one color. If a transaction accesses multiple colors, there may be an opportunity for transaction nesting that could be flagged to the programmer. More than a bug, this is possibly a missed optimization opportunity.
4. A subroutine should not typically contain two transactions of the same color. As pointed out in [21] , functions that manipulate shared data in parallel programs are often intended to be atomic. Therefore, having two transactions of the same color in the same subroutine rather than one may be a bug.
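As an illustration, rules 1 and 3 can be checked on a single-thread trace with a few lines of Python. The event format, the dictionary standing in for the Palette, and the function name `check_ccs_rules` are assumptions of this sketch; rules 2 and 4 would additionally need cross-thread and call-stack information.

```python
def check_ccs_rules(trace, palette):
    """Check rules 1 and 3 on one thread's trace.

    trace   -- list of ("begin",), ("end",), or ("access", address) events
    palette -- dict mapping address -> ColorID (uncolored addresses are absent)
    """
    in_transaction = False
    colors_seen = set()
    findings = []
    for event in trace:
        if event[0] == "begin":
            in_transaction = True
            colors_seen = set()
        elif event[0] == "end":
            if len(colors_seen) > 1:
                # Rule 3: a non-nested transaction touched several colors.
                findings.append(("multi-color transaction", sorted(colors_seen)))
            in_transaction = False
        else:
            color = palette.get(event[1])
            if color is None:
                continue               # uncolored data is not checked
            if in_transaction:
                colors_seen.add(color)
            else:
                # Rule 1: colored data accessed outside any transaction.
                findings.append(("colored access outside transaction", event[1]))
    return findings


palette = {0x10: 1, 0x20: 2}
trace = [("access", 0x10),
         ("begin",), ("access", 0x10), ("access", 0x20), ("end",),
         ("access", 0x30)]
findings = check_ccs_rules(trace, palette)
```

The first finding corresponds to a likely bug, while the second is closer to the missed nesting opportunity mentioned in rule 3.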
Experimental Setup
Since there are no programs written for Colorama, our evaluation consists of analyzing existing lock-based applications and estimating Colorama's potential and overheads. We analyze a variety of large, open-source, realistic multithreaded applications written in C or C++. Among them are the AOL web server, the Firefox web browser, the MySQL database server, and others. Table 2 lists the applications along with their number of dynamic instructions, critical sections (static and dynamic) and peak memory footprint, as they run natively on a Xeon-based multiprocessor with 8 hardware contexts. We developed a Pin-based [14] tool that profiles our applications running natively with multiple threads. The tool tracks synchronization operations and collects information such as lock acquire and release sites, lock addresses, and critical section executions and sizes. It also collects other events such as instruction counts and memory allocations and deallocations. The tool is also connected to a simulator that models a Multilevel Permissions Table for MMP [22] with Palette extensions (Figure 7(a) ).
Synchronization operations are typically calls to multithreading libraries such as Pthreads. Many times, however, applications synchronize with indirections to pthread functions or with actual application code. An example is Tcl_MutexLock and Tcl_MutexUnlock, part of the Tcl library used by aolserver. Our profiler can handle such cases as well.
Evaluation
We evaluate the suitability and impact of our chosen Colorama exit policy, and then examine Colorama's structure sizes and overheads.
Suitability of Colorama's Exit Policy
This section presents experimental evidence showing that the exit policy that we choose for Colorama in this paper is already an informal convention largely followed by programmers of CCS code. Consequently, requiring its compliance for correct DCS code would likely be a light burden.
For this experiment, we determine, for each critical section executed by the applications, whether the lock acquire and release are in the same subroutine. If they are, the section is matched; otherwise, it is unmatched. Figure 10 shows the percentage of dynamic (D) and static (S) critical sections that are matched or unmatched. Recall from Table 2 that individual applications have 10K-3303K dynamic critical sections and 6-485 static ones. From the figure, we see that matched critical sections account for practically all the dynamic sections, and for 95% of the static ones. This supports our choice of exit policy. It shows that programmers already tend to initiate and conclude a critical section in the same subroutine.
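The matched/unmatched classification can be sketched as follows. This Python fragment is not the Pin-based profiler; it assumes a hypothetical trace of `(kind, lock, frame)` tuples in which `frame` identifies the dynamic subroutine activation that performed the operation, and it ignores recursive locks.

```python
def classify_sections(events):
    """Count dynamic critical sections whose acquire and release share a subroutine.

    events -- list of (kind, lock, frame) with kind in {"acquire", "release"};
              `frame` is any identifier of the executing subroutine activation
    """
    acquired_in = {}                       # lock -> frame that acquired it
    counts = {"matched": 0, "unmatched": 0}
    for kind, lock, frame in events:
        if kind == "acquire":
            acquired_in[lock] = frame
        elif kind == "release":
            acquire_frame = acquired_in.pop(lock, None)
            if acquire_frame == frame:
                counts["matched"] += 1
            else:
                counts["unmatched"] += 1   # released elsewhere, or never seen acquired
    return counts


events = [
    ("acquire", "list_lock", "insert#1"),
    ("release", "list_lock", "insert#1"),   # same activation: matched
    ("acquire", "ctx_lock", "iterate#1"),
    ("release", "ctx_lock", "dispatch#1"),  # released in a callee: unmatched
]
counts = classify_sections(events)
```

For this trace, `counts` is `{"matched": 1, "unmatched": 1}`.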
The few unmatched cases are either special cases or are in code that is very fine-tuned for concurrency, especially in libraries. Figure 11 shows a representative unmatched critical section from GTK. In the figure, subroutine g_main_dispatch assumes that it holds lock context. Inside the subroutine, before the invocation of callback function dispatch, the code releases the lock; after the invocation, the code acquires the lock back. This structure would not be compatible with our exit policy. In this particular case, however, Colorama can handle this code without any changes because it is library code (Section 5.2).
Impact of Colorama's Exit Policy
The exit policy that we have chosen has two potential implications: the critical section size increases and independent critical sections may get combined in a nest. These issues typically have little or no impact in our proposed transaction-based implementation of Colorama. However, in a lock-based implementation, the first issue could increase lock contention and the second one could, under certain conditions, cause deadlock (Section 4.3).
To assess the first issue, we measure the average dynamic size of each critical section in its lock version (from acquire to release) and in what would be its Colorama version (from acquire to subroutine return). The resulting cumulative distribution is shown in Figures 12(a) and (b) , respectively.
While the dynamic sizes of critical sections do increase, the average increase is not excessive. In some applications, there are a few critical sections that increase in size substantially. For example, this occurs for the sound thread in tuxracer. The thread acquires and releases a lock at the beginning of the game, and then runs for the duration of the game without returning from the subroutine. However, we believe that, since the Colorama programmer is required to know the system's exit policy, he will write the code to avoid lengthy critical sections.
To assess the case of independent critical sections being combined into a nest of critical sections, we measure how often multiple, independent critical sections have their entry points inside the same subroutine. These are the ones that would be combined into a nest. Figure 13 shows the percentage of dynamic (D) and static (S) critical sections that, because of Colorama's exit policy, would end up combining with an independent second critical section, by nesting it inside. Such instances are called Combined. From the figure, we see that on average only about 1% of the dynamic critical sections and 4% of the static ones end up nesting a second critical section. A detailed analysis of these (few) cases shows that the resulting order of any pair of nested locks is always the same, which eliminates the possibility of deadlock among them. Consequently, we conjecture that deadlock will be rare.
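The lock-order check can be illustrated with a small sketch. It takes the (outer, inner) lock pairs produced by combined critical sections and reports whether the resulting order graph contains a cycle. The pairwise consistency described above is the two-lock case of this check. The function name and input format are assumptions of this sketch, not the analysis tool used for the measurements.

```python
def has_lock_order_cycle(nested_pairs):
    """Return True if the nesting order of locks contains a cycle.

    nested_pairs: iterable of (outer_lock, inner_lock) tuples, one per
    observed nesting (hypothetical input format).
    """
    graph = {}
    for outer, inner in nested_pairs:
        graph.setdefault(outer, set()).add(inner)
        graph.setdefault(inner, set())

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {lock: UNSEEN for lock in graph}

    def visit(lock):
        # Depth-first search; reaching an IN_PROGRESS node closes a cycle.
        state[lock] = IN_PROGRESS
        for nxt in graph[lock]:
            if state[nxt] == IN_PROGRESS:
                return True
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[lock] = DONE
        return False

    return any(state[lock] == UNSEEN and visit(lock) for lock in graph)


consistent = has_lock_order_cycle([("A", "B"), ("A", "C"), ("B", "C")])
conflicting = has_lock_order_cycle([("A", "B"), ("B", "A")])
```

With a consistent order (`A` before `B` before `C`) no cycle exists, so no deadlock can arise from these nestings. The second input nests `A` and `B` in both orders, which is the pattern that could deadlock in a lock-based implementation.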
Colorama Structure Sizes
To estimate the sizes of the Colorama structures in Figure 4, we perform several measurements on the applications. We conservatively assume that every time an application allocates or deallocates memory, it adds or deletes, respectively, a colored region. Consequently, the number of "live" allocated regions plus the number of static data objects in the binary gives the total number of colored regions at a time. This number is shown in Column 2 of [...] Figure 4 (or the number of bits in the CAB and CRB registers), we need to measure the maximum number of locks held by a thread at a time. To be conservative, we measure the maximum number of locks held at a time by all threads combined. This number is shown in Column 5, and ranges from 4 to 39.
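The conservative region count can be sketched as a running tally over an allocation trace: every allocation adds one colored region, every deallocation removes one, and the static data objects are added to the peak. The event encoding and the name `peak_colored_regions` are assumptions made for illustration.

```python
def peak_colored_regions(events, static_objects):
    """Upper bound on the number of colored regions live at any time.

    events: sequence of "alloc" / "free" strings in program order
            (hypothetical trace format).
    static_objects: number of static data objects in the binary.
    """
    live = 0
    peak = 0
    for event in events:
        live += 1 if event == "alloc" else -1
        peak = max(peak, live)
    # Static objects are assumed to be colored for the whole execution.
    return peak + static_objects


regions = peak_colored_regions(
    ["alloc", "alloc", "free", "alloc", "alloc", "free"], static_objects=3
)
```

In this example at most three heap regions are live at once, so the bound is six colored regions once the three static objects are included.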
The last row of [...] Figure 5 as argument, the compiler adds one colorcheck instruction. Overall, we could assume that, on average, Colorama adds about seven instructions per subroutine invocation. As a reference, Column 6 of Table 3 shows that, on average, about 1.6% of the dynamic instructions are subroutine calls.
In reality, the resulting overhead is likely to be very small. First, the added instructions are mostly register moves and loads/stores that hit in the cache, since they access the stack; they can easily fill the many unused execution slots in superscalars. Moreover, the compiler does not need to add these additional instructions for the subroutines that it can prove do not access colored data. Finally, applications often execute library code, which is not subject to this overhead.
A second source of overhead is the execution of the user-level handlers to enter and exit critical sections. However, the contribution of these instructions is very small, given the low frequency of critical section entry and exit. Such frequency is given by two times the numbers in Column 5 of Table 2 over the numbers in Column 3 of the same table.
Finally, Colorama also executes coloring system calls. We conservatively assume that every time the application allocates or deallocates memory, it issues one such call to add or delete a colored region, respectively. Column 7 of Table 3 shows the frequency of these system calls. For four applications, they are issued on average only once every 129K-288M instructions. In this case, the overhead is negligible. In three other applications, they are issued once every 2K-4K instructions. In these applications, the frequent memory allocation/deallocation is already very costly in itself. We can eliminate most of the additional cost of coloring by having the memory allocator keep pools of colored memory. As a result, there is no need to issue a system call at each of these operations.
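The pooling idea can be sketched as follows. The allocator keeps freed regions colored and hands them out again for requests of the same color and size, so the coloring call is needed only when a pool is empty. The class `ColoredPoolAllocator`, the callback `color_region` (a stand-in for the coloring system call), and the pooling key are all hypothetical and serve only to count how many coloring calls pooling avoids.

```python
class ColoredPoolAllocator:
    """Toy allocator that caches already-colored regions per (color, size)."""

    def __init__(self, color_region):
        # color_region(addr, size, color) stands in for the coloring
        # system call; it is an assumption of this sketch.
        self._color_region = color_region
        self._pools = {}            # (color, size) -> free colored addresses
        self._next_addr = 0x1000    # bump pointer for fresh regions

    def alloc(self, color, size):
        pool = self._pools.setdefault((color, size), [])
        if pool:
            return pool.pop()       # reuse: region is already colored
        addr = self._next_addr
        self._next_addr += size
        self._color_region(addr, size, color)   # only fresh regions pay
        return addr

    def free(self, addr, color, size):
        # Keep the region colored and return it to its pool.
        self._pools.setdefault((color, size), []).append(addr)


coloring_calls = []
allocator = ColoredPoolAllocator(
    lambda addr, size, color: coloring_calls.append((addr, size, color))
)
for _ in range(1000):
    region = allocator.alloc("red", 64)
    allocator.free(region, "red", 64)
```

After 1000 allocate/free pairs of the same color and size, `coloring_calls` holds a single entry, because every allocation after the first is served from the pool.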
Additional Memory Space. The large majority of Colorama's memory overhead is due to the Palette. To compute the Palette's overhead, we model in detail the MMP's Multilevel Permissions Table of Figure 7(a) in our simulator. We use the Mini-SST format of the entries, as suggested in [22]. We measure two memory space overheads: the one for the base MMP with permissions information (white part of the Permissions Table in Figure 7(a)), and the one for the Palette state only (shaded part in Figure 7(a)). Figure 14 shows these two memory space overheads as a fraction of the application footprint. For a given bar, the two overheads and the application footprint are all peak values over the whole application execution. For additional information, the figure models ColorID fields that range from 8 to 32 bits.
The figure shows that the Palette adds only a very modest overhead over that of the base MMP. On average, for the range of ColorIDs used, the Palette only adds 1-2.5% more space to the footprint of the application.
Related Work
Section 2.2 described the work that we strictly consider S-DCS, and how it differs from Colorama. To that discussion, we add that Atomic Sets [20] are what we call colors, and that Vaziri et al. also allow the programmer to explicitly associate external methods with an Atomic Set, which arguably breaks the pure data-centric approach.
Other systems that support a less flexible form of DCS are languages [2, 3, 8] with concurrency control based on Monitors [11]. In such languages, it is possible to specify a shared data structure and the set of procedures that are allowed to access it. The compiler will then add the necessary synchronization operations to make these [...]
Several works have associated data objects with synchronization information for a variety of purposes. For example, in Entry Consistency (EC) [4], the association is done to enforce memory consistency in a distributed shared-memory system. The programmer explicitly associates shared locations with locks. When a processor enters a critical section by acquiring a lock, the associated shared locations are made consistent. An important difference with Colorama is that in EC, the programmer explicitly marks the critical sections in the code. This makes EC code-centric, with some data-centric annotations.
Having to explicitly list the shared data associated with a critical section is a burden to the programmer. As a result, Scope Consistency [13] improves on EC by having the software system automatically infer the shared data accessed in the scope of each critical section. Still, the programmer has to mark the critical sections.
Like Colorama, Xu et al. [23] try to infer critical sections, although the approach and environment are very different. They examine a post-mortem trace of memory references after a bug has been detected, and propose heuristics to infer the code that should be in critical sections. They use this information to estimate whether synchronization was missing. The Colorama hardware cannot directly use their heuristics to decide when to enter or exit a critical section, because their scheme requires access to future references and to references from other processors. Moreover, their heuristics can have false positives and false negatives. However, their scheme could be usable in other DCS designs.
Other related works include: (i) programmer-specified association between code and data for static or dynamic validation of parallel programs (e.g., [19]); (ii) programmer-specified "transactional" variables in composable memory transactions [9] that provide stronger atomicity guarantees; and (iii) the lock bits associated with memory regions in the IBM 801 [5], used to support transactions on memory-mapped I/O.
Conclusions and Future Work
To reduce the complexity of parallel programming, this paper has proposed Colorama, the first design of Hardware DCS (H-DCS).
Colorama relies on two nimble hardware primitives to make DCS practical: one that monitors all memory accesses and one that can flexibly trigger the exit of a critical section based on a mechanism programmed in software. We have described Colorama's operation with transactions as the underlying synchronization mechanism. Moreover, we have presented Colorama's simple implementation based on MMP, its programming model and API, and its capacity to help debug conventional CCS codes. Finally, we have discussed the issues that appear in a lock-based implementation.
The evaluation assessed the policy chosen in this paper to exit a critical section at the return from the subroutine where the critical section was entered. We showed that this exit policy is already an informal convention largely followed by programmers of CCS code. Consequently, requiring its compliance for correct DCS code will likely be a light burden at most. We also showed that the policy increases critical section sizes modestly on average, and rarely combines critical sections; both issues are relevant mainly to a lock-based implementation. The evaluation also showed that, by building on top of an MMP system, Colorama requires only modest hardware resources and induces small overheads.
Overall, Colorama effectively supports general-purpose, pointer-based languages such as C/C++ and, in our opinion, can substantially simplify writing new parallel programs beyond transactions. Our future work involves: (i) developing and evaluating new exit policies, (ii) writing and evaluating large Colorama programs, and (iii) combining S-DCS and H-DCS into a hybrid system.
