Abstract: Mainstream chip multiprocessors already include a significant number of cores that make straightforward snooping-based cache coherence less appropriate. Further increase in core count will almost certainly require more sophisticated tracking of data sharing to minimize unnecessary messages and cache snooping. Directory-based coherence has been the standard solution for large-scale shared-memory multiprocessors and is a clear candidate for on-chip coherence maintenance. A vanilla directory design, however, suffers from inefficient use of storage to keep coherence metadata. The result is a high storage overhead for larger scales. Reducing this overhead saves resources that can be redeployed for other purposes. In this paper, we exploit familiar characteristics of coherence metadata, but from novel angles, and propose two practical techniques to increase the expressiveness of directory entries, particularly for chip-multiprocessors. First, it is well known that the vast majority of cache lines have a small number of sharers. We exploit a related fact with a subtle but important difference: that a significant portion of directory entries only need to track one node. We can thus use a hybrid representation of the sharers list for the directory. Second, contiguous memory regions often share the same coherence characteristics and can be tracked by a single entry. We propose an adaptive multi-granular mechanism that does not rely on any profiling, compiler, or operating system support to identify such regions. Moreover, it allows co-existence of line and region entries in the same locations, thus making regions more applicable. We show that both techniques improve the expressiveness of directory entries, and, when combined, can reduce directory storage by more than an order of magnitude with negligible loss of precision.
INTRODUCTION
Technology scaling has steadily increased the number of cores in a mainstream chip-multiprocessor. Special-purpose large-scale chip-multiprocessors (CMPs) are also appearing in the marketplace [1], [2], [3]. The shared-memory programming interface is still a crucial element in productively exploiting the performance potential of these chip-multiprocessors. Consequently, cache coherence will continue to be a key requirement of chip-multiprocessors. The increasing core count makes pure snooping protocols less appropriate. A directory-based approach will be increasingly seen as a serious candidate for an on-chip coherence solution.
While directory-based coherence design has been studied extensively in the context of conventional multiprocessor design, chip-multiprocessors present important differences that call for new solutions or new twists. For example, conventional multiprocessors are built from commercial, off-the-shelf processor components that are fabricated with a focus towards personal systems. The directory logic is implemented outside the processor chip, whereas in a chip multiprocessor, the directory can closely interact with other on-chip logic. Also, in conventional multiprocessors, the cost of directories is only incurred when building a multiprocessor. In contrast, directory cost is incurred for every chip multiprocessor. Cost saving is thus more important.
In the most basic incarnation, a directory entry is allocated for every memory line [4], and each entry uses a full bit vector to track the list of sharers of that line. The overhead of a full bit vector is clearly significant in a system with many cores. For instance, in a 64-core system with 32-byte cache lines, the overhead is 25 percent. Directory storage can be viewed as a two-dimensional array, with the height being the number of entries and the width being the number of sharers to track. Reducing storage requires reducing one or both dimensions. There are plenty of characteristics of access and sharing patterns that allow us to reduce the two dimensions: only a small portion of all memory lines are cached at any time; most cache lines have a small number of sharers, etc. [5], [6], [7], [8], [9], [10]. Additionally, we can adjust parameters to reduce overhead, such as using larger cache lines or coarse-grain sharing vectors. Regardless of the exact mechanism, reducing the storage size comes at a loss of precision in coherence tracking, which leads to extra messages, invalidations, misses, and ultimately performance loss. However, the resources saved can be redeployed elsewhere to make up for the performance loss. In particular, other meta information such as access pattern, frequency, and affinity can help the processor optimize data placement, storage allocation and so on [11], [12]. Chip multiprocessors present brand new opportunities to provide holistic on-chip data management solutions. Providing expressive, area-efficient directory systems is a starting step.
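The 25 percent figure follows from dividing the width of the full bit vector (one bit per core) by the size of a cache line in bits. A minimal sketch of that arithmetic is given below; the function name is illustrative only, and tag and state bits are ignored.

```python
def full_vector_overhead(num_cores: int, line_bytes: int) -> float:
    """Directory bits per data bit when every line carries a full bit vector."""
    vector_bits = num_cores          # one presence bit per core
    line_bits = line_bytes * 8       # data bits in one cache line
    return vector_bits / line_bits


print(full_vector_overhead(64, 32))    # 64 / 256
print(full_vector_overhead(256, 32))   # 256 / 256
```

With the line size held at 32 bytes, the same calculation for 256 cores gives a vector as large as the line itself, which is the scaling problem the rest of the paper addresses.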
One issue with the conventional directory mechanism is that it is a mechanical, access-pattern-agnostic approach to tracking coherence. Sharing patterns are tracked cache line by cache line even though there may be much more expressive means. For instance, private data have no other sharer than the owner. The coherence information of a whole region can be described by a single metadata entry. Similarly, code segments and other (mostly) read-only data need not be tracked line by line either. Note that in practice, exploiting private data and code segments is not trivial, as these access patterns are not architected. Relying on help from external agents (e.g., compiler, programmer) brings its own set of issues. In this paper, we do not attempt to identify these patterns directly, but exploit the consequence of these patterns. First, (partly) because of the prevalence of private data, many sharers lists can be represented by single pointers. Therefore we use a hybrid representation of the sharers list within the directory: a single-pointers array plus a vector array. Second, because of regions of private or read-only data, we can use a single entry to capture the state of many cache lines simultaneously. Additionally, we allow exceptions within the region. Previous work with regions [13], [14] uses fixed-size, private region entries. In our paper, we pay attention to both private and shared regions and there are two region sizes. This flexibility makes such regions more common than otherwise. Both types of savings can be achieved with simple, practical architectural support.
The rest of this paper is organized as follows. Section 2 outlines the background and related work on directory optimizations. Section 3 presents our design in detail. Section 4 gives the evaluation result of our schemes and Section 5 concludes.
BACKGROUND AND RELATED WORK
Typically, all the cached copies of a memory line are invalidated on eviction of the associated directory entry. Increasing the size of a directory cache can reduce the frequency of evictions and cache miss rate but aggravate memory overhead and power consumption. As the number of processors increases, more directory entries are required to track a growing number of cache lines. And the size of full bit vector grows linearly with the number of processors. As a result, the aggregate area of directory cache grows as the square of the number of processors, making them very expensive in large-scale systems [15] , [16] .
A handful of schemes have been proposed to reduce the size of a directory entry. Most of them compress the vector into a compact structure based on limited pointers [5], [8], [17], [18], [19]. When the number of sharers exceeds the number of pointers, the pointers are converted to a coarse vector [8] or a hierarchical vector [20], [21], [22], or are stored in memory [5]. Scalable coherence directory (SCD) [19] combines limited pointers and hierarchical directory. The dynamic pointer allocation scheme [10] associates pointers dynamically based on the number of sharers. SPACE [23] and SPATL [24] encode the vector and store the code in the directory entry. The segment directory scheme [7] chops the full bit vector into smaller pieces and stores only the non-zero pieces, each with an identifying pointer, so that the full bit vector can be reconstructed. Multilayer clustering [25], tristate [17], gray-tristate [26], and home [26] store processor numbers in the sharing state. Tagless directory [27] uses Bloom filters to encode the tags in each private cache. Data structures like chains [6], [28] and trees [29], [30] are used in directory entry compression. None of these proposals takes advantage of the difference between directory entries. Hybrid set [31] uses a combination of vector and pointer within a cache set. Our proposal uses separate vector and pointer arrays for shared and private data to improve the utilization of vector entries.
In a shared memory system, the temporarily private and/or read-only memory lines can avoid coherence tracking [32]; we refer to these lines as temporarily untracked. With the aid of the translation lookaside buffer (TLB) and the operating system [32], temporarily untracked lines are recognized in advance and sharers tracking can be avoided, thus saving directory entries. This can also be achieved with the help of specific hardware [33], [34], [35]. These region-based schemes are most effective when memory layout is such that data with the same access behavior (e.g., private, or read-only) are grouped together, separated from other types, and placed in aligned regions. In reality, without conscious layout optimization, different types of data may well be commingled and render these region-based schemes less effective. Indeed, it is found that temporarily untracked lines make up about half of the space inside coherent pages [32]. In our analysis (with a broader set of applications), more than 60 percent of the lines in coherent pages are temporarily untracked. Allowing exceptions is an important factor in making region-based tracking effective. Spatiotemporal coherence tracking (SCT) [13] and multigrain directory (MGD) [14] use a fixed-size, private region entry to track the private lines of the same owner in a region, while the other lines are tracked with line granularity. Multi-granular tracking [31] takes advantage of region tracking but treats the region as shared. Our proposal adjusts the directory entry type and size dynamically to fit the access pattern of the region.
HYBRID ARRAY (HA) AND ADAPTIVE MULTI-GRANULAR TRACKING
Our goal is to reduce the memory overhead of directory cache for effectiveness and scalability. The total size of directory cache is the product of directory entry size and the number of entries. We target both in this paper. Note that since we are only reorganizing how sharing vectors are stored, the coherence protocol processing, including race conditions and their handling, is intact, with a small exception (Section 3.3.3). Below, we first describe the underlying baseline directory cache. We then discuss hybrid array and adaptive multi-granular tracking that target reduction in the size of directory entry and the number of entries, respectively.
Baseline Directory Cache
We assume a tiled chip-multiprocessor baseline where each tile contains a core, private caches, and a slice of the globally shared cache illustrated in Fig. 1 . Data are mapped to their home L2 slice in a round-robin fashion based on their (physical) address. In the most straightforward implementation, the directory is directly attached to the data cache, where the sharers list is essentially part of the cache line's metadata. Alternatively, the directory is decoupled from the data cache. In this case, the number of directory entries need not match the number of data cache lines. Indeed, a directory cache [8] , [9] is one such example where the number of directory entries is significantly smaller than the number of cache lines. Even considering that the directory cache has to include a separate tag field, the result is almost always a net savings in total storage. A natural side effect of such decoupling is that inclusivity is only necessary between the directory entries and L1 cache lines. In other words, when an L2 cache line is evicted, there is no need to invalidate the corresponding L1 cache lines. Only when the directory entry is evicted are corresponding L1 cache lines invalidated.
Hybrid Array
In terms of directory entry size, a full bit vector provides complete freedom to track all possible sharing patterns. But the size is large and grows linearly with the number of processors. Earlier studies have shown that a large number of lines are shared by only a few processors at any given time [36]. A handful of compact structures are proposed to replace the full bit vector based on this knowledge. Compact structures cannot track as many sharers as vectors and may cause pointer overflow [37]. Schemes are proposed to address the issue of pointer overflow at a cost of extra traffic and performance degradation [8], [17]. In all these designs, a single entry type is used, which necessarily requires the entry to be versatile enough to cover the common cases. In contrast, in an earlier work, we have exploited sharing pattern from a different angle: a significant fraction of the cache lines have only one sharer/owner at a time [31]. If we consider the storage of directory entries in a set as a whole, we can use "hybrid sets" with a few entries in a set capable of tracking multiple sharers and the rest tracking just single sharers [31]. By pooling resources together, a single entry no longer needs to be versatile and self-sufficient. The cost is that when a single-sharer-tracking entry proves to be insufficient, pointer overflow happens and intra-set swapping is performed. In our particular implementation, pointers are used for single-sharer entries (which are called pointer entries or PEs) and full vectors are used for multi-sharer ones (vector entries or VEs) as illustrated in Fig. 2. Assuming a $P$-way CMP, with an $A$-way associative directory cache that has $V$ vector entries per set, the average size of a directory entry is thus $(V \times P + (A - V) \times \log_2 P)/A + \mathit{TagSize}$.
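To make the average-entry-size expression concrete, the sketch below evaluates it for a few configurations. The function name and the choice to round the pointer width up to a whole bit for non-power-of-two core counts are assumptions made for illustration; the text only gives the formula itself.

```python
import math


def hybrid_set_entry_bits(P: int, A: int, V: int, tag_bits: int = 0) -> float:
    """Average bits per directory entry in a hybrid set.

    P: number of cores, A: associativity, V: vector entries per set.
    """
    pointer_bits = math.ceil(math.log2(P))   # one pointer names one of P nodes
    return (V * P + (A - V) * pointer_bits) / A + tag_bits


# 16-way CMP, eight-way sets, two vector entries per set, tag excluded
print(hybrid_set_entry_bits(P=16, A=8, V=2))   # (32 + 24) / 8
# Setting V = A recovers the full-bit-vector width
print(hybrid_set_entry_bits(P=16, A=8, V=8))
```

Excluding the tag, two vector entries in an eight-way set of a 16-way CMP average 7 bits per entry, against 16 bits for a full vector in every entry.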
While hybrid sets can already greatly reduce storage demand, there is one limitation. The number of vector entries per set must be an integer no less than 1, whereas the number actually needed can be less than 1 on average and varies greatly from set to set. Indeed, as we can see from Fig. 3, more than 90 percent of the sets have no need for vector entries at any particular time. At the same time, a small minority of about 5 percent of the sets require two or more vector entries. To limit directory-induced invalidations that hurt performance and energy efficiency, we found that using two vector entries per set is a prudent choice [31], but the utilization is not high.
One simple way to improve the utilization is to further pool the vector entries from each set into a separate vector-entry array akin to a victim cache. We call this array-level hybrid representation "hybrid array" (shown in Fig. 2). The system works as follows. Each cache line is represented by an entry in the main array of the directory cache with pointer entries. For single-sharer lines, the pointer entry stores the owner node of the data as before. For multi-sharer lines, the pointer points to an entry in the vector entry array, which contains the vectors representing the full sharers list. Of course, a bit (the "V" bit shown in Fig. 2) is added to the pointer entry to differentiate the two types of pointers.
With this organization, the number of vector entries can be set to the appropriate number without the constraints of being a multiple of the number of sets as in the hybrid set design discussed earlier. Moreover, since the vector entries are accessed via the pointer, not by indexing with the address, the vector array is essentially fully associative when evicting an entry. The side effect is that for multi-sharer lines, the pointer and vector arrays are accessed sequentially. It is worth noting, however, that the extra access is only on the critical path for an invalidation request. In our studies, only about 3.6 percent of directory transactions require sequential access to both arrays and they are all faithfully modeled in our simulations.
Fig. 1. System overview of a tiled chip-multiprocessor.
Fig. 2. Storage for conventional directory, hybrid set, and hybrid array.
Fig. 3. The ratio of number of multi-sharer entries for an eight-way associative directory cache in a 16-way CMP. With the exception of barnes, the ratio of three or more multi-sharer entries in a set is negligible.
Since a pointer entry can only track one sharer, when another processor joins the sharers list, a pointer entry can no longer track both sharers. We handle such an "overflow" by allocating a vector entry for the requesting pointer entry. We first check the free list, which tracks all free entries in the vector array. Vector entries can become free in a number of ways: when a line or its directory entry is evicted; when an invalidation makes the line owned by a single node; or when the number of sharers reduces to one after eviction (in a system with non-silent drop). In all these cases, upon detecting that a vector sharers list is no longer needed, the identification of the vector entry is entered into a free-list (maintained as a stack).
If the free-list is exhausted, a victim is selected for eviction; we use a pseudo least recently used (LRU) algorithm for selecting the victim. Since the victim loses the capability to track individual sharers with a vector (but still has the pointer entry to keep limited information), we "round" the sharers list either down to a randomly selected current sharer (which we call down conversion), or up to all nodes (up conversion as shown in Fig. 4). The direction of conversion is chosen based on comparing the number of sharers to a threshold to minimize the loss of precision. In the former case, other sharers are invalidated. In the latter case, a single broadcast bit ("B" in Fig. 2) is set in the pointer entry to indicate that all nodes potentially share the cache line. Each vector entry contains a reverse pointer back to its pointer entry.
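The overflow path described above (free-list lookup, victim selection, and up or down conversion) can be sketched as a small software model. This is only an illustration under stated assumptions: the class and field names are hypothetical, true LRU stands in for the pseudo-LRU policy, and the sketch assumes that a sharer count above the threshold triggers up conversion while a count at or below it triggers down conversion, which the text does not spell out.

```python
import random


class HybridArray:
    """Toy model: one pointer entry per line plus a small shared vector array."""

    def __init__(self, num_vectors, threshold, seed=0):
        self.threshold = threshold            # assumed up/down conversion cutoff
        self.entries = {}                     # line -> ("ptr", node) | ("vec", index) | ("bcast", None)
        self.vectors = [None] * num_vectors   # index -> (set of sharers, reverse pointer)
        self.free_list = list(range(num_vectors))   # used as a stack
        self.lru = []                         # allocated vector indices, oldest first
        self.invalidated = []                 # (line, node) pairs, for inspection
        self.rng = random.Random(seed)

    def add_sharer(self, line, node):
        kind, value = self.entries.get(line, (None, None))
        if kind is None:                      # first access: a single pointer suffices
            self.entries[line] = ("ptr", node)
        elif kind == "vec":                   # already tracked by a vector entry
            self.vectors[value][0].add(node)
            self.lru.remove(value)
            self.lru.append(value)
        elif kind == "ptr" and value != node:     # overflow: a second sharer joins
            index = self._allocate()
            self.vectors[index] = ({value, node}, line)
            self.entries[line] = ("vec", index)
            self.lru.append(index)
        # kind == "bcast": every node is already assumed to be a sharer

    def _allocate(self):
        if self.free_list:
            return self.free_list.pop()
        index = self.lru.pop(0)               # true LRU stands in for pseudo-LRU
        sharers, victim_line = self.vectors[index]
        if len(sharers) > self.threshold:     # up conversion: set the "B" bit
            self.entries[victim_line] = ("bcast", None)
        else:                                 # down conversion: keep one sharer
            keep = self.rng.choice(sorted(sharers))
            self.invalidated += [(victim_line, n) for n in sorted(sharers) if n != keep]
            self.entries[victim_line] = ("ptr", keep)
        return index


d = HybridArray(num_vectors=1, threshold=2)
d.add_sharer(0x10, 0)
d.add_sharer(0x10, 1)    # line 0x10 takes the only vector entry
d.add_sharer(0x20, 2)
d.add_sharer(0x20, 3)    # line 0x20 overflows and evicts the vector of 0x10
print(d.entries, d.invalidated)
```

The model omits returning vector entries to the free-list when the sharer count drops back to one; that path would push the index back onto `free_list` and restore a plain pointer.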
When a vector entry allocation occurs, the directory cache is occupied for more than one cycle, potentially increasing queuing delays for other transactions. Fortunately, such allocations are very rare: only about one per one thousand transactions involves an allocation (experimental setup will be discussed later in Section 4, and the occupancy is faithfully modeled).
Finally, the operations of hybrid array are summed up in Table 1 .
Adaptive Multi-Granular Tracking
In addition to reducing the size of each directory entry, reducing the number of directory entries is another way to shrink the overall footprint of the directory cache. One approach is to exclude regions that contain only private data. These regions can be considered to be temporarily untracked regions. The challenge of such an approach is the determination of private data, as it is not semantically guaranteed by the architecture. Thus if one line in the region is detected to be non-private and therefore requires coherence tracking, the entire region ceases to be treated as temporarily untracked.
We propose multi-granular tracking which takes a different approach to reducing the number of directory entries. The general idea is that coherence tracking granularity should be decoupled from that of cache storage and adapt to the sharing behavior of the program: if an entire region of memory has the same sharing pattern, there is no need to repeat that information for each individual constituent line. Ideally, each directory entry tracks a natural region with the same (or close enough) sharing pattern regardless of the size. But a practical region implementation is probably limited to aligned regions with a power-of-two size.
Region entries are employed to capture a region of data with similar access patterns such as private and read-only. Whenever the region is read, the processor is added to the sharers list of the region. Regular line-tracking entries are used to track the modified cache lines (which we call exceptional lines) as special cases inside an otherwise homogeneous region. By allowing exceptions, we enable region entries to become more effective. Spatiotemporal coherence tracking [13] also uses region entries to track multiple lines. It sets the first processor accessing a region as the owner. When a line is accessed by other processors, the line is treated as an exceptional line and tracked with a line entry. Both techniques use region entries as an optimization that describes common patterns, but multi-granular tracking [31] treats the region as shared whereas spatiotemporal coherence tracking [13] treats the region as private. These two techniques are referred to as the shared region scheme and the private region scheme later in this paper.
Adaptive Region Type
Intuitively, having a fixed region type is suboptimal. This can be seen in the illustrative examples shown in Fig. 5 . In the case shown on the left, the most efficient way to represent the sharing is a private region (owned by node a) with an exception (line 2). If a region can only be shared, then lines 0, 1, and 3 will have to be treated as special cases, resulting in no savings of directory entries. In contrast, the region on the right side is more efficiently represented by a shared region. Intuitively, the type of a region entry should be allowed to change dynamically according to the access pattern of the region. We propose adaptive region scheme to achieve this purpose.
Adaptive Region Size
Both shared region scheme [31] and private region scheme [13] use fixed-size, aligned region entries for simplicity. The design decision becomes the size. Clearly, a larger size creates a more compact tracking when the region is homogeneous, but can lead to more space waste when the actual size of a region with homogeneous sharing pattern is smaller. This can be seen in the example shown in Fig. 6. In this example, the chosen region size (eight lines) is bigger than the actual size of the two regions of homogeneous sharing patterns. As a result, the number of entries needed is more than that when the region size is four lines. Ideally a region size would adapt to the actual size for better effectiveness. Supporting a large flexibility in region sizes, however, is complex in terms of hardware implementation. We propose a compromise that uses two-level aligned region entries to help track homogeneous regions with different sizes. We call these two levels (regular) region and super-region. Such a two-level scheme adds little hardware complexity, but provides more flexibility in tracking regions of different sizes. The following experiment illustrates this point more quantitatively.
In this experiment, we take a snapshot of the line-tracking directory periodically (every 5,000 cycles). For each snapshot, we count how many line entries could be replaced by region entries if region entries were available. Note that the total number of entries calculated this way is only an approximation of the necessary capacity in a real design but illustrates several points nonetheless. Fig. 7 shows the number of directory entries normalized to the line-tracking directory in different cases.
From left to right, the bars represent, respectively, a scheme with a fixed region type (private) and region size (1 KB), a scheme with adaptive region type and a fixed region size (1 KB), and an adaptive region type with two different region sizes (2 KB/1 KB). We can clearly see the reduction of the number of line entries needed as we increase the flexibility of region entries. We can also see that with two levels of granularity, we already reduce the number of line entries to a rather insignificant level.
Basic Operation
In our proposed design, new regions will be created with a default type and size. The default region type is private because, on balance, private data is more frequent. The default size is the regular region, the middle-level granularity. As operation continues, the type and size may be adapted later, upon an eviction of a directory entry (Section 3.3.5). We first describe the basic operation below.
Upon the first access to a line without an existing directory entry (line or region), we start with a private region entry. The implicit, optimistic assumption is that the whole region is owned by the processor. Given an existing region entry, any subsequent access from the owner to other lines within the same region will be tracked by the region entry. The region entry's state is changed to "modified" if any cache line of the region is modified by the owner.
If a private region entry receives a read or write request from a node other than the owner, a line entry will be allocated to track the line under request. The owner of the region will be sent a message to inform it of the change to the line. (The owner may or may not have that particular cache line.) Clearly, when a region is tracked by both line and region entries, line entries take precedence. For a region entry in shared state, any additional nodes reading from any cache line (not already served by a line entry) in that region will be added to the region's sharers list. When a write request arrives at the directory, we treat this line as a special case and start a line entry to track it. All sharers of the region will be sent an invalidation message just for the line.
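The basic operation described above can be sketched as a small model that routes each request to either a region entry or an exceptional line entry. The class, its fields, and the region size are hypothetical names chosen for illustration. The text does not specify what a new line entry records after the owner of a private region is notified, so the sketch simply records the requester; shared regions are created only by adaptation, so the usage below installs one by hand.

```python
class RegionDirectory:
    """Toy model of region entries with line-entry exceptions."""

    def __init__(self, lines_per_region=16):
        self.lines_per_region = lines_per_region   # assumed region size, in lines
        self.regions = {}    # region id -> {"type", "sharers", "modified"}
        self.lines = {}      # line address -> set of sharers (exception entries)
        self.messages = []   # (kind, destination node, line) tuples, for inspection

    def access(self, line, node, is_write):
        region_id = line // self.lines_per_region
        if line in self.lines:                     # line entries take precedence
            if is_write:
                self.messages += [("inv", n, line) for n in sorted(self.lines[line] - {node})]
                self.lines[line] = {node}
            else:
                self.lines[line].add(node)
            return "line"
        region = self.regions.get(region_id)
        if region is None:                         # first touch: optimistic private region
            self.regions[region_id] = {"type": "private", "sharers": {node}, "modified": is_write}
            return "region"
        if region["type"] == "private":
            if region["sharers"] == {node}:        # owner access stays in the region entry
                region["modified"] = region["modified"] or is_write
                return "region"
            owner = next(iter(region["sharers"]))
            self.messages.append(("notify", owner, line))
        elif not is_write:                         # shared region, read: join the sharers list
            region["sharers"].add(node)
            return "region"
        else:                                      # shared region, write: invalidate this line only
            self.messages += [("inv", n, line) for n in sorted(region["sharers"] - {node})]
        self.lines[line] = {node}                  # the exception gets its own line entry
        return "line"


d = RegionDirectory(lines_per_region=4)
d.access(0, node=0, is_write=False)   # creates a private region owned by node 0
d.access(1, node=0, is_write=True)    # same owner: region entry becomes modified
d.access(2, node=1, is_write=False)   # other node: exception line entry, owner notified
d.regions[5] = {"type": "shared", "sharers": {0, 1}, "modified": False}
d.access(21, node=2, is_write=True)   # write in a shared region: per-line invalidations
print(d.regions, d.lines, d.messages)
```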
These operations of adaptive region scheme are summarized in Table 2 . Here are a few implementation issues worth noting.
When an L1 cache miss is serviced, we may find a region entry of the line in shared state and at the same time no data in the L2 cache since we do not maintain L2-L1 inclusivity. In this case, we do not have high confidence that any node on the sharers list actually has the line: intuitively, the L2 cache has a much larger capacity and is thus unlikely to evict a line still in some L1 cache. (We found through simulations that in those cases there is only a small chance, 2.4 percent on average, that some L1 cache actually has the data on-chip.) For efficiency and simplicity, we go off-chip for the data. In the case where the cache line is indeed on-chip, it is possible that the node with the cache line issues an upgrade request, which arrives at the directory before the off-chip access is complete. This race is a result of maintaining non-inclusive caches and can happen whether or not we choose to search the data on-chip when the L2 does not have it. The race is easy to detect and handle. Because of the pending off-chip access, the line is in transient state. Upon receiving the upgrade request, the directory can detect that the line in question is in transient state. For fairness, the directory should NACK the upgrade request and include in the NACK a request to supply the data, though it can also NACK the original read request for later retry and handle the upgrade first. The performance impact of the choice should be negligible as we have only seen one such race every 0.3 billion cycles. In contrast, when serving an L1 miss and the line is tracked by a line entry, we do check the sharers first before going off-chip. In that case, there is a much higher chance that some sharers have the data. Of course, if the line is tracked by a private region entry, the directory will forward the request to the owner node first.
When a node notifies the directory for evicting a cache line (in a non-silent drop style implementation), the node is only taken off the sharers list if the line is tracked by a line entry. If it is only tracked by a region entry, no change to the sharers list will be made. Upon the eviction of a region entry from the directory cache, we need to inform all the sharers to invalidate all lines that are only covered by the region (or super-region) entry, i.e., lines that are not tracked by an entry of finer granularity. These lines will be identified by the directory and the information is attached in the invalidation message. The L1 cache controller will act accordingly.
Implementation Considerations
To support multi-granular tracking, two grain-size bits are used to distinguish between super-region, region, and line entries. The natural indexes for these entries are different, calling for separate accesses to different sets in the directory cache. We opt for a different indexing mechanism that maps all entries of the same super-region into the same set. This approach will map consecutive cache lines' directory entries into the same set and split the tag for line entries and region entries into two segments, one on each side of the indexing bits as shown in Fig. 8 .
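One way to picture this indexing is to take the set index from the address bits just above the super-region offset, so that every line, region, and super-region entry of one super-region lands in the same set. The sketch below assumes a particular bit layout that is not given in the text (Fig. 8 is not reproduced here): 16 lines per region, two regions per super-region, and 16 sets. All constants and function names are hypothetical.

```python
LINE_BITS = 4      # log2(lines per region), assumed: 16 lines per region
REGION_BITS = 1    # log2(regions per super-region), assumed: 2 regions
INDEX_BITS = 4     # log2(number of sets), assumed: 16 sets


def directory_index(line_addr: int) -> int:
    """Set index taken from the bits just above the super-region offset."""
    return (line_addr >> (LINE_BITS + REGION_BITS)) & ((1 << INDEX_BITS) - 1)


def directory_tag(line_addr: int, grain: str) -> tuple:
    """Two-segment tag: bits above the index, plus the low bits the grain needs."""
    upper = line_addr >> (LINE_BITS + REGION_BITS + INDEX_BITS)
    offset_bits = LINE_BITS + REGION_BITS
    kept = {"super": 0, "region": REGION_BITS, "line": offset_bits}[grain]
    lower = (line_addr & ((1 << offset_bits) - 1)) >> (offset_bits - kept)
    return (upper, lower)


# Lines 0..31 form one super-region and all share set 0; line 32 starts set 1.
print({directory_index(a) for a in range(32)}, directory_index(32))
print(directory_tag(17, "line"), directory_tag(17, "region"), directory_tag(17, "super"))
```

Here `line_addr` is the physical address with the byte offset within the line already removed. A line entry keeps all low bits in the second tag segment, a region entry keeps only the bit selecting the region, and a super-region entry needs no second segment.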
Mapping consecutive lines into the same set may cause increased conflict in the directory cache. Intuitively, given reasonable associativity for the directory cache, the magnitude of the problem should be small. Besides, region entries naturally eliminate many individual line entries that would otherwise exist, further reducing conflict pressure. Indeed, in our simulations we found that mapping the directory entries of the same region to the same set is slightly better than using their natural indexes. Of course, such indexing will become less appealing for bigger super-regions (see Section 4.3.1).
When multiple entries match the line address of a request, the entry with the smallest grain size takes precedence and will be used. Other entries are not counted as a hit in the LRU replacement circuitry. This approach has a natural effect of favoring entries with the "right" granularity. For example, a region that has a diverse set of sharing patterns among its lines will have many individual line entries and the region entry will quickly fall into disuse and get recycled. In a sense, the system is thus automatically choosing the right type of entries to use depending on the circumstance.
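The precedence rule above can be sketched as a lookup over the entries of one set. The entry representation (a dictionary with `grain` and `base` fields), the `lru_touched` flag, and the region sizes are assumptions made for illustration, not the hardware organization.

```python
GRAIN_ORDER = {"line": 0, "region": 1, "super": 2}   # smaller value = finer grain


def lookup(set_entries, line_addr, lines_per_region=16, regions_per_super=2):
    """Return the matching entry with the smallest grain size, or None."""
    span = {
        "line": 1,
        "region": lines_per_region,
        "super": lines_per_region * regions_per_super,
    }

    def covers(entry):
        size = span[entry["grain"]]
        return entry["base"] // size == line_addr // size

    matches = [e for e in set_entries if covers(e)]
    if not matches:
        return None
    hit = min(matches, key=lambda e: GRAIN_ORDER[e["grain"]])
    hit["lru_touched"] = True    # only the chosen entry counts as an LRU hit
    return hit


entries = [
    {"grain": "super", "base": 0},
    {"grain": "region", "base": 16},
    {"grain": "line", "base": 17},
]
print(lookup(entries, 17)["grain"], lookup(entries, 18)["grain"], lookup(entries, 3)["grain"])
```

Because coarser entries that also match are never marked as used, a region entry shadowed by many line entries ages out of the set, which is the self-selecting behavior the paragraph describes.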
Our directory design does require additional logic, which in theory can increase access delay. In reality, this is more than compensated by the significant reduction of directory cache size. For instance, we can reduce the baseline directory cache size by 8x without noticeably impacting the directory's capability to capture the sharing pattern (Section 4.3.2). We synthesize the circuit with Design Compiler [38] in 65 nm CMOS technology [39]. (Fig. 9 shows the circuit of tag access for adaptive region scheme.) The baseline (128-set, eight-way) directory cache takes 1.32 ns, whereas the adaptive region design with a smaller storage (16-set, eight-way) takes 1.05 ns, a 21 percent improvement.
Region Size and Type Adaptation
As discussed above, when the region's type and size do not match real behavior, the representation becomes suboptimal and wastes storage in the directory cache. The general idea of our adaptation is simple: occasionally inspect all entries of a region to see if an alternative representation reduces such wastes. It is important to note that depending on directory cache pressure, waste alone may not cause harmful directory cache evictions and the act of adaptation is not without costs. Thus we only perform inspection for potential adaptation when there is a directory cache eviction, signaling need for more directory capacity. Also note that we intentionally map all cache lines belonging to the same region into the same directory cache set. This not only makes it easy to allow exceptions as discussed earlier, but also greatly simplifies the inspection process for adaptation as a single read of the set is sufficient to provide all the necessary information. This information is then handed off to a separate unit for adaptation. The directory remains available to subsequent requests and is unaffected by such background adaptation. Our adaptation policy is as follows.
(1) Filtering. The chief concern motivating region adaptation is that the original selection of region type and size may be suboptimal, leaving the system to use an unnecessary number of individual line entries. Thus, our first job is to see whether many line entries belong to the same region, which would suggest that region entries could track them better. Since many different regions can map to the same set, we randomly select a line entry to determine the In that case, they will be merged into a single super-region entry. When two region entries cover the same location but are of incompatible types, the question becomes which one we should keep as a region entry to track coherence information more succinctly and precisely: in other words, which one will require the fewest exception entries? To figure this out, we send a query message to all nodes involved to collect information about how many lines are covered by the existing region entries. Upon collecting the replies, we decide whether to replace the existing region entry with the new one. The criterion we use is whether the replacement will reduce the number of exception entries by two or more. If so, the original region entry will be discarded and line entries for its constituent lines will be formed. Our modeling of the entire adaptation process and policies has been on the conservative side. For example, because replacement is an uncommon action, we envision (and have modeled) a blocking implementation: the adaptation engine does not accept another adaptation request until the previous one is handled, and the region under inspection is placed in a transient state that postpones any coherence request to it.
A more sophisticated adaptation algorithm is possible but will likely involve a complicated implementation, possibly requiring a programmable engine executing a nontrivial sequence of code, which we deem to be unrealistic to expect.
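The replacement criterion for incompatible region entries can be sketched as follows. Region "types" are reduced here to opaque labels, and the per-line labels stand in for the replies gathered by the query messages; both simplifications, and all names, are assumptions made for illustration.

```python
def count_exceptions(line_types, region_type):
    """Lines whose behavior differs from the region entry need exception entries."""
    return sum(1 for t in line_types if t != region_type)


def should_replace(line_types, current_type, new_type, min_gain=2):
    """Replace only if doing so removes at least `min_gain` exception entries."""
    saved = count_exceptions(line_types, current_type) - count_exceptions(line_types, new_type)
    return saved >= min_gain


# A 16-line region: 5 lines behave as "private", 11 as "shared".
lines = ["private"] * 5 + ["shared"] * 11
replace_mostly_shared = should_replace(lines, current_type="private", new_type="shared")

# An evenly split region gains nothing from switching, so the entry is kept.
balanced = ["private"] * 8 + ["shared"] * 8
replace_balanced = should_replace(balanced, current_type="private", new_type="shared")
```

The threshold of two gives the decision some hysteresis, so a region whose lines are nearly evenly split does not flip back and forth between representations.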
Combinations
The two techniques described can be combined with each other or with other space-saving techniques in a rather straightforward manner. For instance, in the adaptive region scheme the sharers list can be implemented in either pointer or vector format, as in hybrid array. As another example, the vector entries in hybrid array can also be replaced by a few pointers, as in the LimitLESS directory [5]. In some cases, applying multiple techniques that work on the same source of storage inefficiency quickly reaches diminishing returns. The techniques can be contrasted based on their cost-benefit ratio. Section 4 contains some quantitative analyses.
EXPERIMENTAL ANALYSIS
We analyze in detail hybrid representation (Section 4.2) and multi-granular tracking (Section 4.3) in isolation as well as in conjunction (Section 4.4).
Experimental Setup
Our quantitative analyses are performed using an in-house execution-driven simulator for multiprocessors. We have incorporated Wattch [40] and ORION 2.0 [41] for power analysis. The execution-driven simulator models in great detail the cache coherence substrate using a MESI protocol, the processor microarchitecture, the communication substrate, and the energy consumption of 16-way and 64-way CMP systems. The system parameters are summarized in Table 3 .
We perform the evaluation with a suite of parallel applications including the SPLASH-2 [42] benchmark suite, PARSEC [43], a program that solves electromagnetic problems in three dimensions (em3d), a program that iteratively solves partial differential equations (jacobi), a three-dimensional particle simulator (mp3d), a shallow water benchmark from the National Center for Atmospheric Research that solves differential equations on a two-dimensional grid for weather prediction (shallow), and a branch-and-bound implementation of the NP-hard traveling salesman problem (tsp). The applications are summarized in Table 4. The cache sizes are smaller than typical values to compensate for the reduced data sets used by many applications. With these sizes, the average L1 miss rate is 5.6 percent.
The directory cache of baseline is configured to be 128-set and eight-way associative per slice. This configuration is chosen since it performs close to a system with a full directory. Further reducing the size of the directory cache will cause serious performance degradation. At this configuration, assuming a 40-bit physical address and a 16-way CMP, the directory storage overhead (including the extra tag) comes to about 11 bits per L2 cache line, or about 2.0 percent, compared to 3.6 percent for an in-L2 directory. In a 64-way CMP, those overheads would rise to 4.2 percent for directory cache and 12.6 percent for the in-L2 directory.
Hybrid Array
In the following, we use a 16-way CMP as baseline for analysis (Section 4.2.1); then show that the design's impact on execution is not sensitive to configuration parameters; and finally show that the technique saves more space with far less performance impact than related designs (Section 4.2.2). The baseline conventional directory cache is denoted DC(core × sets/core, associativity). In particular, the baseline configuration DC(16 × 128, 8) has 16 cores, each with 1,024 directory entries organized as 128 sets and eight ways per set. The proposed hybrid array is denoted DC(core × sets/core, assoc.) VA(core × entries/core), where VA(core × entries/core) gives the configuration of the vector array.
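As a quick check on the notation, the entry counts implied by a configuration can be computed directly. The helper names below are illustrative and not part of the paper.

```python
from fractions import Fraction


def dc_entries(cores: int, sets_per_core: int, assoc: int):
    """Return (entries per core, entries on the whole chip) for DC(cores x sets, assoc)."""
    per_core = sets_per_core * assoc
    return per_core, cores * per_core


def vector_ratio(sets_per_core: int, assoc: int, vector_entries_per_core: int) -> Fraction:
    """Ratio of vector-array entries to pointer entries within one slice."""
    return Fraction(vector_entries_per_core, sets_per_core * assoc)


baseline = dc_entries(16, 128, 8)   # DC(16 x 128, 8)
ratio_64 = vector_ratio(128, 8, 64)  # DC(16 x 128, 8) VA(16 x 64)
```

For the baseline this gives 1,024 entries per core, and a 64-entry vector array per core corresponds to a vector-to-pointer ratio of 1/16.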
Effects on the Baseline System
We propose hybrid array to reduce the ratio of vector entries. First we simulate with the baseline to observe the ratio of multi-sharer directory entries. We take a snapshot of the directory every 1,000 cycles and compute the average ratio of multi-sharer entries. The experimental result shows that only 1.8 percent of the entries track multiple sharers on average, so there is great potential in reducing the ratio of vector entries in the directory. A smaller ratio of vector entries means fewer bits for each entry.
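The snapshot measurement can be sketched as below. It assumes each snapshot is a list of sharer sets, one per valid directory entry, and that the reported figure is the mean of the per-snapshot ratios; the exact averaging used in the experiments is not spelled out, so this is one plausible reading.

```python
def multi_sharer_ratio(snapshot):
    """Fraction of entries in one snapshot that track more than one sharer."""
    if not snapshot:
        return 0.0
    multi = sum(1 for sharers in snapshot if len(sharers) > 1)
    return multi / len(snapshot)


def average_ratio(snapshots):
    """Mean of the per-snapshot ratios, skipping empty snapshots."""
    ratios = [multi_sharer_ratio(s) for s in snapshots if s]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Two toy snapshots: one of four entries with a single multi-sharer entry,
# and one of two single-sharer entries.
snapshots = [[{0}, {1}, {0, 2}, {3}], [{0}, {1}]]
avg = average_ratio(snapshots)
```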
When the number of vector entries is limited, there is an increased chance of victimizing a vector entry. The result can be a down conversion that invalidates all but one current sharer. Such invalidations have nothing to do with the program's true communication and are purely due to directory imprecision. We call these invalidations directory-induced invalidations (DIIs). Note that without hybrid representation, a directory cache already creates DIIs. In our experiment, we vary the number of vector entries per core from 256 down to 32: DC(16 × 128, 8) VA(16 × S), with S being 256, 128, 64, and 32. In these configurations, the ratio of vector to pointer entries is 1/4, 1/8, 1/16, and 1/32, respectively. Fig. 10 shows the relative number of DIIs for different applications under those configurations. For clarity, we sort the applications based on decreasing values and only show the several applications where these values are large. The rest of the 25 applications have results very close to one; that is, they are essentially insensitive to the number of vector entries in this range.
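A down conversion and the DIIs it produces can be sketched as follows. Which sharer survives is a policy choice not specified here; keeping the lowest-numbered node (or a caller-supplied one) is purely an assumption of this sketch.

```python
def down_convert(sharers, keep=None):
    """Collapse a vector entry into a single pointer.

    Returns (remaining_sharer, invalidated_nodes). Every node in
    invalidated_nodes receives a directory-induced invalidation (DII).
    """
    if not sharers:
        raise ValueError("a vector entry must track at least one sharer")
    owner = keep if keep in sharers else min(sharers)  # assumed survivor policy
    invalidated = sorted(sharers - {owner})
    return owner, invalidated


owner, diis = down_convert({2, 5, 9})
```

Here a three-sharer entry loses its vector, nodes 5 and 9 are invalidated even though the program did not require it, and the entry continues as a pointer to node 2.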
First, DIIs increase as the ratio of vector entries decreases. Using just 32 entries per core for vectors (1/32 of the pointer entries) is a bit too extreme and, for a few applications, can dramatically increase DIIs. With 64 entries (1/16 of pointer entries), the increase in DIIs is only 3.7 percent. From the perspective of DIIs, using 32 entries or fewer may be too aggressive. However, it is worth noting that normalizing the number of DIIs highlights the imprecision introduced by the hybrid array and perhaps greatly exaggerates the effect. This is because, overall, DIIs represent a small portion of the total number of invalidations (15.1 percent in the baseline). So even a several-fold increase may not cause a significant increase in the overall invalidation number. This can be seen from Fig. 11, which shows the number of L1 cache misses normalized to that in the baseline.
Take application lu for example. Using vectors for 1/32 of the entries increases DIIs by more than 9x. But this dramatic increase in DIIs only results in a 63 percent increase in L1 misses (raising the L1 miss rate from 1.9 to 3.1 percent). On average, the cache misses increase by 3 percent when using vectors for 1/32 of the entries. Using DC(16 × 128, 8) VA(16 × 64) (1/16) or more vector entries, the increase in the number of misses (and off-chip traffic) is roughly 0.3 percent on average, and 3.1 percent in the worst case, which is almost negligible.
Recall that victimizing a vector entry that tracks more than one sharer results in either a down or an up conversion. While down conversions increase the number of DIIs, up conversions cause imprecision that increases the number of unnecessary invalidation messages later on. In DC(16 × 128, 8) VA(16 × 64), the increase in all invalidation messages averages about 1.1 percent. Of course, both types of conversions lead to performance loss and energy overhead. Finally, the allocation of vector entries increases the occupancy of the directory cache, potentially delaying other requests. According to the experimental results, the port usage increases by 1.7 percent on average for DC(16 × 128, 8) VA(16 × 64). All these factors impact the overall execution time and energy. These statistics for DC(16 × 128, 8) VA(16 × 64) are shown in Fig. 12.
As the figure shows, when using vectors for 1/16 of the entries, the execution time, number of network packets, and energy consumption of DC(16 × 128, 8) VA(16 × 64) increase by a negligible amount (specifically 0.1, 0.2, and 0.1 percent, respectively). Since we use 1/16 of the entries as vector entries, the asymptote of area savings is 16×. The actual savings in storage depends on a number of parameters. For the 16-way CMP, DC(16 × 128, 8) VA(16 × 64) reduces the total directory storage by about 1.35× (26 percent reduction). In contrast, if the directory cache is shrunk by a quarter (in the number of sets), the performance degradation is a much more pronounced 10 percent.
In a 64-way CMP, keeping each tile the same, the area savings becomes 2.6×. The execution impact is essentially the same as in a 16-way CMP. The increases in execution time, number of packets, and energy consumption are 0.5 percent or less on average and 2.3 percent for the worst case.
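The dependence of the savings on node count can be illustrated with a rough storage model. It is a sketch under stated assumptions, not the paper's accounting: every entry is charged a tag plus a pointer plus its amortized share of the vector array, the 26-bit tag is borrowed from the 64-way baseline entry described in Section 4.2.2 and applied to both system sizes, and state bits and any bookkeeping that links pointer entries to vector entries are ignored.

```python
import math


def bits_full(nodes: int, tag_bits: int) -> float:
    """Bits per entry with a full bit vector."""
    return tag_bits + nodes


def bits_hybrid(nodes: int, tag_bits: int, vector_ratio: float) -> float:
    """Bits per entry with one pointer plus an amortized share of a vector."""
    pointer_bits = math.ceil(math.log2(nodes))
    return tag_bits + pointer_bits + nodes * vector_ratio


def savings(nodes: int, tag_bits: int = 26, vector_ratio: float = 1 / 16) -> float:
    return bits_full(nodes, tag_bits) / bits_hybrid(nodes, tag_bits, vector_ratio)


s16 = savings(16)        # 42 bits versus 31 bits per entry
s64 = savings(64)        # 90 bits versus 36 bits per entry
s_large = savings(2**20)  # vector bits dominate; approaches 1 / vector_ratio
```

Under these assumptions the model gives roughly 1.35× for 16 nodes and 2.5× for 64 nodes, in the neighborhood of the figures reported above, and it approaches 16× as the node count grows because the tag and pointer become negligible next to the vector.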
Comparison with Related Schemes
We compare hybrid array (HA) with other compacting schemes, including our earlier design hybrid set (HS) [31], limited pointer (LP) [17], and coarse vector (CV) [8]. The baseline is DC(64 × 128, 8), where each directory entry is composed of a 26-bit tag and a 64-bit vector. LP uses four pointers to replace the vector. Each set of HS is composed of six pointer entries and two vector entries. In HA, the ratio of vector entries is 1/16. Table 5 shows the relative area savings and performance degradation of these schemes compared to the full bit vector scheme in a 64-way CMP. As the table shows, hybrid array outperforms the other schemes and causes negligible degradation in both network traffic and execution time. With the same storage, hybrid array is able to track the sharers more precisely owing to its dynamic allocation and adaptive conversion schemes. We also evaluate the combination of hybrid array and other schemes. For example, to combine hybrid array with limited pointers, we use a limited-pointer array instead of the vector array. As the table shows, hybrid array is able to exploit some aspects of the information redundancy that other schemes do not and is able to reduce the storage further with little additional performance degradation.
Adaptive Multi-Granular Tracking
In the following, we first analyze the appropriate size for a region in Section 4.3.1, then show the overall effect in Section 4.3.2, and finally compare to the shared region scheme [31], the private region scheme [13], and another alternative design [32] in Section 4.3.3. We use the notation DC_AR(core × sets/core, assoc., size_superregion) to represent a specific configuration of the adaptive region scheme. DC_P is used for the private region scheme. Again, we start from the baseline of DC(16 × 128, 8).
Region Size
Region size is an important parameter in our design. Increasing the size of a region reduces the cost to track a large, homogeneous region but increases the chance a region is no longer homogeneous. While it is possible to explore the design space with brute-force simulations and measure the performance of system with various region sizes, we chose to observe the natural behavior of cache lines in a baseline system with an ideal directory (unlimited size). To this end, we take a snapshot of the directory every 25,000 cycles. For each snapshot, we figure out the best organization regarding how to track the cache lines given different sizes of region and super-region, and record the total number of all directory entries needed. Note that this exercise is only an approximation of what an actual design will achieve. Nevertheless, it can provide a rough guidance on the choice of (super-)region sizes.
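The per-snapshot estimate can be approximated with a simplified cost function. The sketch below considers a single region size (no super-regions) and assumes a region entry covers the most common sharing pattern in a region while every other line needs an exception entry; the real analysis searches for the best organization over both region and super-region sizes, so this is only an illustration of the idea.

```python
from collections import Counter


def entries_needed(snapshot, region_lines):
    """snapshot maps line address -> frozenset of sharers (an assumed format)."""
    by_region = {}
    for addr, sharers in snapshot.items():
        by_region.setdefault(addr // region_lines, []).append(sharers)

    total = 0
    for patterns in by_region.values():
        most_common_count = Counter(patterns).most_common(1)[0][1]
        with_region = 1 + (len(patterns) - most_common_count)  # region + exceptions
        total += min(len(patterns), with_region)  # fall back to pure line entries
    return total


# Lines 0-7 are private to node 0; lines 8-11 each have a different pattern.
snap = {addr: frozenset({0}) for addr in range(8)}
snap.update({8: frozenset({1}), 9: frozenset({2}), 10: frozenset({1, 2}), 11: frozenset({3})})
costs = {size: entries_needed(snap, size) for size in (1, 4, 8)}
```

In this toy snapshot, larger regions compress the homogeneous private lines well but bring no benefit for the four heterogeneous lines, mirroring the trade-off between tracking cost and homogeneity described above.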
We explored a total of 78 combinations with the smallest region size being four cache lines and the largest super-region size being 16K lines (or 1 MB). For each configuration, the result is the average number of directory entries over all snapshots collected. The normalized results for all configurations are shown in Fig. 13. The figure shows that configurations on both extremes are not as effective as those in the middle of the range we explored. The configuration showing the smallest number of directory entries needed is 512-32 (lines), though a large number of configurations nearby have similar results.
One crucial issue this analysis does not account for is the implementation considerations, especially the design decision to map line entries and region entries in the same directory cache set. In such a design, when the region size becomes too big, there will be significant conflict in the directory. Consequently, the overall optimal design points lie in the area with much smaller (super-)region sizes. In our setup, the best configuration has a 32-line super-region and a 16-line region. The performance of close-by configurations is not significantly different.
Effects of Adaptive Region Scheme
Region entries magnify the descriptive power of a directory entry, thereby requiring less storage. Depending on the application, the average magnification factor ranges from 4 to 11, with a suite-wide average of about 8. In other words, a region entry replaces an average of eight line entries, albeit with a slight loss of precision. Specifically, compared to baseline DC(16 × 128, 8), the number of DIIs of DC_AR(16 × 16, 8, 32) is 10 percent lower on average for an 8× reduction in directory size, though with significant variation among applications. In contrast, a mere 2× reduction in directory size (to DC(16 × 64, 8)) results in an average of 10× increase in DIIs and consequently a 41 percent increase in L1 misses. An 8× reduction in directory size (to DC(16 × 16, 8)) would make those increases balloon to 25× and 89 percent, respectively.
We evaluate the performance impact of directory cache size with and without multi-granular tracking. Fig. 14 shows the relative performance of different configurations normalized to an ideal directory cache system. The associativity of directory cache is fixed to be 8 and the number of sets is changed from 4,096 to 128.
As the figure shows, without multi-granular tracking, the degradation in performance is negligible when the size is reduced to 2,048 sets for the whole chip (or DC(16 × 128, 8)), but noticeable even with a small reduction from 2,048 (which is why we set the baseline to this configuration). Indeed, halving the entries to 1,024 would reduce performance by more than 30 percent. In stark contrast, when employing multi-granular tracking, with only 256 sets, the performance is 0.7 percent better than DC(16 × 128, 8), which has 8× more entries, and only 0.7 percent lower than DC(16 × 256, 8). Note that there is non-trivial variation from application to application. However, even in the worst case, the performance impact is 7.5 percent as shown in Fig. 15. We have also evaluated multi-granular tracking in a 64-way CMP. With an average increase of execution time, number of packets, and energy consumption of less than 1.0 percent, the observations are essentially the same as in the 16-way CMP.
Comparison with Other Schemes
In this section, we compare the adaptive region scheme with other multi-granular tracking schemes, including the private region scheme [13] and the shared region scheme [31]. We also compare with a coarse grain coherence tracking scheme [32], which uses page translation information to bypass private pages as temporarily untracked regions. There are a number of differences between the multi-granular tracking scheme and the coarse grain coherence tracking scheme. First, multi-granular tracking is completely transparent to other subsystems and does not require modifications to TLB design. Second, regions can be in any coherence state and are not limited to just private pages. Third, the multi-granular tracking scheme allows exceptions, making regions more applicable. The net effect of these differences is that, given the same storage capacity for entries, our directory is able to capture the coherence states with fewer DIIs and cache misses. As Fig. 16 shows, using a 256-set directory cache with the page-based technique [32], the L1 cache misses increase by 15 percent on average over the baseline DC(16 × 128, 8). In contrast, with our adaptive region scheme, the L1 cache misses decrease by 2.3 percent. Fig. 17 shows the performance of different approaches to leveraging regions. We can see that our adaptive region scheme is significantly better than shared or private region, or using the TLB to bypass pages. The effect is especially obvious with smaller numbers of sets. Page bypassing does, however, have one advantage: bypassed pages occupy no entries in the directory at all. It is conceivable that bypassing can be implemented on top of multi-granular tracking to allow further reduction of directory expenditure. Indeed, when the two are combined, the resulting scheme shows a smaller performance loss than our adaptive region multi-granular tracking alone, and extends the range of acceptable performance to even smaller configurations of the directory cache.
However, the additional benefit is small and may not justify the design intrusion needed to support page bypassing.
Adaptive Region Scheme and Hybrid Array
To see if our two proposals would affect each other, we implement adaptive region scheme and hybrid array together in the 16-way CMP. As cache lines are consolidated into region entries, the ratio of multi-sharer entries changes. Intuitively, many single-sharer entries are private lines. They are particularly amenable to region-based tracking and thus single-sharer entries see more reduction than multi-sharer entries. Furthermore, the sharers list of the region entry is the union of the sharers of the constituent cache lines, increasing the number of multi-sharer entries. The experimental result shows that the ratio of multi-sharer entries increases from 1.8 to 9.2 percent.
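The effect of consolidation on the multi-sharer ratio can be seen in a small example. The helper below simply forms the union described above; the sharer sets are made-up data.

```python
def region_sharers(line_sharers):
    """Sharers list of a region entry: the union over its constituent lines."""
    merged = set()
    for sharers in line_sharers:
        merged |= sharers
    return merged


# Four single-sharer line entries, three private to node 0 and one to node 1.
lines = [{0}, {0}, {1}, {0}]
multi_before = sum(1 for s in lines if len(s) > 1)  # no multi-sharer entries
merged = region_sharers(lines)                      # one region entry
```

Before consolidation none of the four entries needs a vector; afterwards the single remaining entry tracks two nodes, so the fraction of multi-sharer entries rises even though no line gained a sharer.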
To sum up the overall effect, when the directory configuration changes from the knee-of-the-curve baseline directory cache DC(16 × 128, 8) to DC_AR(16 × 16, 8, 32) VA(16 × 8), which has a 10× smaller storage requirement in a 16-way CMP (or 20× smaller in a 64-way CMP), the average performance degradation is 0.5 percent.
CONCLUSIONS
We have proposed an expressive, area-efficient directory cache design in this paper. Two techniques (hybrid array and adaptive multi-granular tracking) are used to reduce the size and the number of directory entries. Hybrid array uses a combination of vector array and pointer array in the directory cache, allowing most entries to rely on a single pointer. Adaptive multi-granular tracking allows co-existence of line and region entries not only for the same locations but also in the same directory cache set. The result is that region entries are more applicable and the system adapts to the right type and size of regions. Moreover, the use of regions does not rely on any profiling, compiler, or operating system support. Finally, the two techniques combine naturally, reaping multiplicative savings.
We have evaluated the two techniques in isolation as well as in conjunction in both 16-way and 64-way CMP systems. This is done with a detailed execution-driven simulator across a wide range of parallel applications. When the two simple techniques are combined, the directory becomes much more expressive and area-efficient: Space expenditure is reduced by more than an order of magnitude (to about 0.2 percent of the L2 cache) while the loss of precision costs a small performance penalty of 0.5 percent. Such space efficiency can allow better scalability of directory without resorting to more drastic design overhauls. Also, some of the space saved can be redeployed to track access patterns and offer a more intelligent support of on-chip communication.
Peng Liu received the BS degree in optical engineering, the MS degree in optical engineering, and the PhD degree in communication and electronic system from Zhejiang University, Hangzhou, China, in 1992, 1996, and 1999, respectively. In 1999, he joined the faculty of the Information Science and Electronic Engineering Department, Zhejiang University. From 2009 to 2010, he was a visiting scholar at the University of Rochester working on high-performance computer architectures. His research interests include computer architecture, multiprocessor system-on-chip architectures, on-chip interconnection networks, parallel computer architectures, and VLSI design. He is a member of the IEEE and the IEEE Computer Society.
Qi Hu received the BS degree in information engineering from Zhejiang University, Hangzhou, China, in 2010. She is currently working toward the PhD degree in information and communication engineering with Zhejiang University. Her current research interests include cache coherence and on-chip interconnects for chip multiprocessor.
Lei Fang
Guofan Jiang received the BS and MS degrees in information science and electronic engineering from Zhejiang University, Hangzhou, China, in 2006 and 2008, respectively. He is currently an advisory engineer in the IBM China System and Technology Lab in Shanghai. He has specialized in design of high-performance ASIC and semicustom chip. His research interests include advanced computer architecture, high-speed serial links, low-power physical design, and advanced verification methodology.
