Srinivas Devadas, MIT
Increasing transistor density has prompted hardware designers to turn to a tiled architecture, which can consist of chip multiprocessors (CMPs) with 16 or more cores. The architecture features arrays of replicated tiles that communicate through an on-chip interconnect. Under a two-level cache hierarchy, each tile contains a processor with its own L1 cache and a slice of the L2 cache. To maximize effective on-chip cache capacity, physically distributed L2 cache slices form a nonuniform cache access (NUCA) architecture, essentially one large, logically shared cache. 1, 2 In this shared L2 organization, the address space is divided among cores such that each address is assigned to a unique home core, where the data corresponding to the address can be cached at the L2 level. Because current CMPs use private L1 caches, data at the L1 level can be replicated across any requesting core. To provide a unified shared-memory abstraction, many-core systems commonly maintain private cache coherence through a coherence protocol and distributed directories. However, implementing and verifying distributed-directory cache-coherence protocols for many-core CMPs is difficult and error-prone. These protocols also result in suboptimal performance when a thread must access large amounts of data distributed across the chip because data must be brought to the core where the thread is running, which incurs delays and energy costs.
To address this problem, we have developed a novel migration-prediction scheme that decides at instruction granularity whether to perform a remote access or a thread migration, and which part of the thread context (that is, which registers) to migrate to further reduce migration cost.
As in a NUCA design, our scheme uses a shared L2 cache. However, NUCA designs rely on private L1 caches and allow replication at the L1 cache, which requires a coherence protocol and, usually, indirection via directories. In sharp contrast, our scheme allows no replication across L1 caches (except for read-only instructions), obviating the need for both the directories and the coherence protocol. To implement our scheme, we have developed a directoryless shared-memory architecture that complements remote accesses with judicious thread migrations and built a silicon prototype: the Execution Migration Machine (EM²), a 110-core CMP in a 45-nm ASIC that supports hardware-level thread migration on a stack-based core architecture. 3 Evaluation results show that our hybrid architecture improves performance and significantly reduces network traffic relative to a remote-access-only architecture. In simulations comparing our design with a state-of-the-art directory-based architecture, our hybrid architecture performed better for a certain class of applications and is competitive overall.
RESEARCH FEATURE
DRAWBACKS OF DISTRIBUTED-DIRECTORY PROTOCOLS
A conventional tiled architecture typically distributes data across multiple shared cache slices to minimize expensive off-chip accesses, especially when multiple threads share large data structures that do not fit in a single cache or when a single thread iteratively accesses multiple structures. In such scenarios, a thread must access data mapped to remote caches, often with high spatiotemporal locality, which results in heavy on-chip network traffic.
For example, a database request might result in a series of phases, each consisting of many accesses to contiguous data stretches. Each request will typically run in a separate thread, pinned to a single core throughout its execution. Because this thread might access data cached in remote shared cache slices, however, the data must be brought to the core where the thread is running. As Figure 1a shows, in a directory-based architecture, the data would be brought to the core's private L1 cache, only to be replaced when the next request phase accesses a different data segment. In Figure 1b, the thread follows the data, replacing data transfers. If the thread execution context (the architectural state) is small relative to the data that would be transferred, moving the thread can significantly reduce on-chip interconnect traffic.
One reason that distributed-directory coherence protocols are so difficult to implement and verify is that the design of even a simple coherence protocol is not trivial. The response to a given request is determined by the state of all system actors, by transient states due to indirections (such as cache-line invalidation), and by transient states due to the nondeterminism inherent in event timing. Because the state space explodes exponentially with the number of distributed directories and cores, it is virtually impossible to use simulation or formal methods to cover all verification scenarios. 4 Unfortunately, verifying small subsystems does not guarantee the entire system's correctness. 5 In modern CMPs, cache-coherence errors are a leading bug source in post-silicon debugging. 6
A straightforward approach to removing directories while maintaining cache coherence is to disallow cache-line replication across on-chip caches (even L1 caches) and to use remote word-level access to load and store remotely cached data. 7 In this scheme, every access to an address cached on a remote core becomes a two-message round trip, but because only one copy is ever cached, coherence is ensured. Even so, this remote-access-only scheme is still susceptible to the data-access patterns shown in Figure 1a, which degrade performance and increase network traffic.
ELEMENTS OF A HYBRID DIRECTORYLESS ARCHITECTURE
As Figure 2a shows, our architecture combines remote cache access with hardware-level thread migration in a hybrid directoryless architecture. Using fine-grained hardware-level thread migration to complement remote accesses exploits data locality more efficiently 8, 9 because accesses to data cached at a remote core can cause the thread to migrate to that core and continue execution there. When several consecutive accesses are made to data at the same core, thread migration allows those accesses to become local, potentially improving performance over a remote-access regimen. Each access to data cached on a remote core can either be a remote access or a migration of the current execution thread. Our migration predictor makes this decision on an instruction-by-instruction basis, recommending migration only when replacing multiple remote accesses would be worth the cost. In addition, only a few registers are used between the time the thread migrates out and returns, so not migrating unused registers can further reduce transfer costs. (Because instructions are read-only, they can be replicated without worrying about coherence. Thus, threads need not perform a remote access or migration to fetch instructions.)
Figure 1. (a) A directory-based scheme requires a round trip for each cache line that is mapped to the remote L2 cache and misses in the private L1 cache, whereas a remote-access-only scheme requires a round trip to the remote L2 cache with each word access. (b) Migrating a thread to the data locus enables local data access, essentially turning the round trips in (a) into a series of migrations followed by long stretches of accesses to locally cached data.
Remote cache access
In standard NUCA remote-access schemes, 2,7 all nonlocal memory accesses cause a request to be transmitted over the interconnect. The access is performed in the remote core, and the data (for loads) or acknowledgement (for writes) is sent back to the requesting core. When a core executes a memory access for an address, it must first find the home core for that address (for example, by consulting a mapping table or masking some address bits). If the home core is the same as the executing core, a core hit occurs and the request is served locally at the executing core. If the two cores differ, a core miss occurs and the remote-access request must be forwarded to the home core, which sends a response back to the executing core when the request's execution is complete. Unlike a private cache organization, in which a coherence protocol exploits spatiotemporal locality by making a copy of the block containing the data in the local cache, this protocol incurs round-trip costs for every remote access to a word.
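The home-core lookup and the core hit/core miss distinction can be sketched as follows. This is a minimal illustration that assumes the "masking some address bits" option with a simple page-striped mapping; the function names, the striping rule, and the constants are hypothetical and are not taken from any actual implementation.

```python
PAGE_BITS = 12   # assumed 4-Kbyte pages
NUM_CORES = 64   # assumed 64-core chip


def home_core(address: int, num_cores: int = NUM_CORES, page_bits: int = PAGE_BITS) -> int:
    """Hypothetical mapping: stripe pages across cores using the page-number bits."""
    return (address >> page_bits) % num_cores


def access_kind(address: int, executing_core: int) -> str:
    """A core hit is served locally; a core miss needs a request/response round trip."""
    return "core hit" if home_core(address) == executing_core else "core miss"


def messages_needed(address: int, executing_core: int) -> int:
    """Remote word access costs two messages (request and response); local access costs none."""
    return 0 if access_kind(address, executing_core) == "core hit" else 2
```

For instance, under this assumed mapping address 0x3000 lives on core 3, so a thread on core 3 sees a core hit while a thread on core 5 pays a two-message round trip for every word it touches there.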
Thread migration
In a NUCA architecture with fine-grained, hardware-level thread migration, the thread comes to the data instead of the other way around. 9 Figure 2c illustrates this idea. When a core running a thread executes a memory access for an address, it must first find the address's home core. If the executing and home cores are the same, the request is served locally at the executing core. If they are not, the hardware interrupts the thread's execution, puts the thread's execution context (its architectural state) into a network packet, and sends the packet to the home core through the on-chip interconnect. At the home core, the packet is loaded and the thread's execution resumes.
Relative to thread migration approaches that require operating system intervention or memory accesses, our migration scheme is faster because it takes place directly over the interconnect. A register mask allows partial-context migration, that is, loading and unloading selected parts of the register file. If another thread is already executing at the destination core, it must be evicted and moved to a core where it can continue running. To reduce the need for evictions, cores duplicate the architectural context, which enables a core to multiplex execution among two concurrent threads. To prevent deadlock, one context is marked as the native context and the other as the guest context. A core's native context can hold only the thread that started execution there (that core is the thread's native core). Evicted threads must return to their native cores to ensure deadlock freedom. 8
Context size directly affects migration overhead. The relevant architectural state that must be migrated in a 64-bit x86 processor amounts to about 3.1 Kbits (sixteen 64-bit general-purpose registers, sixteen 128-bit floating-point registers, and a few special-purpose registers). 10 Serialization latency occurs because the full context must be loaded into (or unloaded from) the network. With a 128-bit flow-control digit (flit) network and a 3.1-Kbit context size, the thread context occupies 26 flits, which incurs a serialization overhead of 26 cycles. Context size depends on the architecture; in the TILEPro64, 11 for example, it is about 2.2 Kbits (sixty-four 32-bit registers and a few special registers).
Another migration overhead source is pipeline-insertion latency. A memory address is computed in the middle of the pipeline, so if a thread ends up migrating to another core and re-executes, the core must refill the pipeline, which we assume takes 10 cycles.
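The context-size arithmetic can be reproduced with a short calculation. The sketch below is illustrative only: it assumes that the "few special-purpose registers" are rip and rflags at 64 bits each plus mxcsr at 32 bits, that the network accepts one flit per cycle, and that hop latency is ignored. All function names are hypothetical.

```python
def context_bits(gprs=16, gpr_bits=64, fprs=16, fpr_bits=128,
                 special_bits=(64, 64, 32)):
    """Size of a full x86-64 thread context in bits.

    special_bits is an assumption: rip (64), rflags (64), and mxcsr (32).
    """
    return gprs * gpr_bits + fprs * fpr_bits + sum(special_bits)


def serialization_flits(bits, flit_bits=128):
    """Number of flits needed to carry `bits` (ceiling division)."""
    return -(-bits // flit_bits)


def migration_overhead_cycles(bits, flit_bits=128, pipeline_refill=10):
    """Serialization cycles (assuming one flit per cycle) plus pipeline refill."""
    return serialization_flits(bits, flit_bits) + pipeline_refill


full = context_bits()
print(full, serialization_flits(full), migration_overhead_cycles(full))
# prints: 3232 26 36
```

Under these assumptions the context comes to 3,232 bits, in line with the roughly 3.1-Kbit figure, and fills 26 flits of 128 bits. Sending fewer registers shrinks the flit count directly, which is the motivation for partial-context migration.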
Thread migration predictor
With a large context size, the per-access cost of thread migration exceeds that of a remote access. On the other hand, migration typically serves multiple contiguous memory accesses to the same core: after the first access, the remaining accesses are local, which considerably offsets the per-access cost. Unlike predictors that support only full-context migration, 12 our predictor also looks for partial-context migration, that is, opportunities to send only part of the register file when a thread migrates. With a deadlock-free migration framework, 8 the native-core register file remains intact even if a thread migrates away because no guest threads use its context. Consequently, only the registers to be read during the trip are carried, and only registers written while away are brought back.
Because the decision to migrate with partial context or use remote access must be made for every memory access, the data structure must be efficient. As Figure 2b shows, our per-core migration predictor is based on a PC-indexed and direct-mapped data structure. The underlying ideas are that sequences of consecutive memory accesses to the same home core and register-use patterns within those sequences are highly correlated with the instruction flow, and that these patterns are fairly consistent and repetitive across program execution.
Our baseline configuration uses a 128-entry predictor in which each entry consists of a 64-bit PC and a 32-bit useful register mask, about 1.5 Kbytes in total. An N-bit mask is required for an architecture with N general registers; each bit indicates whether the corresponding register is sent during migrations.
If the home core for an address is not where the thread is currently running (a core miss), the predictor decides whether a remote access or migration is preferable. Thread migration occurs if the instruction's PC hits in the migration predictor; otherwise, the predictor instructs the thread to perform a remote access. If the thread is migrating from its native core to another core, it transfers only registers with set bits in the register mask. Thus, the predictor's two main tasks are to track the number of consecutive accesses to the same core, which tells it when to migrate, and to track used registers within those accesses, which tells it what to migrate.
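A PC-indexed, direct-mapped predictor of this kind might look like the sketch below. It is a software illustration under stated assumptions, not the hardware design: the class and function names are hypothetical, the index function (PC modulo table size) is assumed, and a conflicting insert simply overwrites the slot.

```python
class MigrationPredictor:
    """Hypothetical direct-mapped table of (PC, useful register mask) entries."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        self.table = [None] * entries  # each slot holds (pc, mask) or None

    def _index(self, pc: int) -> int:
        return pc % self.entries  # assumed index function

    def insert(self, pc: int, mask: int) -> None:
        # Direct-mapped: a new entry replaces whatever occupied the slot.
        self.table[self._index(pc)] = (pc, mask)

    def lookup(self, pc: int):
        """Return the useful register mask on a hit, or None on a miss."""
        slot = self.table[self._index(pc)]
        return slot[1] if slot is not None and slot[0] == pc else None

    def storage_bytes(self, pc_bits: int = 64, mask_bits: int = 32) -> int:
        return self.entries * (pc_bits + mask_bits) // 8


def decide(predictor: MigrationPredictor, pc: int, home: int, current_core: int):
    """Return (action, mask): local access, remote access, or migration."""
    if home == current_core:
        return ("local", None)          # core hit
    mask = predictor.lookup(pc)
    if mask is None:
        return ("remote_access", None)  # default on a predictor miss
    return ("migrate", mask)            # predictor hit: migrate with these registers
```

With the baseline parameters, `MigrationPredictor().storage_bytes()` evaluates to 1,536 bytes, which matches the 1.5-Kbyte figure (128 entries of 96 bits each).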
Detecting when to migrate. If the consecutive access count exceeds a threshold, a PC is inserted into the predictor; otherwise, the instruction is classified as remote access, the default state. Each thread tracks three fields:
› Home, which maintains the home core ID for the most recently requested memory address;
› Depth, which indicates how many times a thread has contiguously accessed the recent home location (the Home field); and
› Start PC, which tracks the PC of the first instruction among memory sequences that accessed the home location in the Home field.
The Depth threshold θ indicates the depth at which the instruction is considered migratory. When a thread T executes a memory instruction for address A whose PC = P and the home core for A is H, the detection mechanism must adhere to certain rules:
› If Home = H (current and previous memory accesses are to the same home core) and Depth < θ, increment Depth by one.
› If Home ≠ H (a new sequence starts with a new home core) and Depth = θ, Start PC is considered a migratory instruction and inserted into the predictor.
› If Home ≠ H and Depth < θ, Start PC is considered a remote-access instruction.
After each decision, the predictor resets the Home, Start PC, and Depth fields to H, P, and 1.
Detecting what to migrate. In addition to tracking the Home, Depth, and Start PC fields, each thread also tracks Used Registers, a 32-bit vector in which each bit indicates whether the corresponding register has been used within a sequence of memory instructions accessing the same home core. Every instruction (both memory and nonmemory) updates the Used Registers field by setting the bit when the corresponding register is being read or written. When the PC is detected as a migratory instruction and inserted into the predictor, Used Registers is inserted with Start PC, as shown in Figure 2b.
Figure 3 shows an example of the detection mechanism when θ = 2. Suppose a thread executes a sequence of instructions I1 through I5. I1, I3, and I5 are memory instructions; I2 and I4 are nonmemory instructions; and rn denotes the nth register. When I1 is first executed, the entry {Home, Depth, Start PC, Used Registers} will hold the value {C, 1, PC1, r1}. Then, when I2, which uses r2 and r3, is executed, the Used Registers bit-vector is updated to set the bits for r2 and r3. I3 accesses the same home core (C), so the Depth field is incremented by one. I4 simply adds r4 to the register bit-vector.
Finally, when a new memory sequence for the home core A starts with I5, the processor adds PC1, the Start PC of I1, to the predictor. I1 is classified as a migratory instruction because C's Depth has reached the threshold.
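The detection rules and the θ = 2 walkthrough can be traced with a small software model. The sketch below is a simplified illustration: the class and method names are hypothetical, bit n of the mask stands for register rn, and resetting Used Registers to the registers of the instruction that starts a new sequence is an assumption that the text does not spell out.

```python
class SequenceDetector:
    """Tracks {Home, Depth, Start PC, Used Registers} for one thread."""

    def __init__(self, theta: int):
        self.theta = theta
        self.home = None      # home core of the current access sequence
        self.depth = 0        # contiguous accesses to self.home (saturates at theta)
        self.start_pc = None  # PC of the first memory instruction in the sequence
        self.used = 0         # bit n set means register rn was read or written

    def note_registers(self, regs) -> None:
        """Called for every instruction, memory or nonmemory."""
        for r in regs:
            self.used |= 1 << r

    def memory_access(self, pc, home, regs=()):
        """Apply the detection rules; return (label, start_pc, used_mask) when a sequence ends."""
        if self.home == home:
            if self.depth < self.theta:
                self.depth += 1
            self.note_registers(regs)
            return None
        decision = None
        if self.home is not None:
            label = "migratory" if self.depth == self.theta else "remote_access"
            decision = (label, self.start_pc, self.used)
        # Reset Home, Start PC, and Depth to H, P, and 1 (Used Registers reset is assumed).
        self.home, self.start_pc, self.depth, self.used = home, pc, 1, 0
        self.note_registers(regs)
        return decision


d = SequenceDetector(theta=2)
d.memory_access(pc=1, home="C", regs=[1])  # I1: entry becomes {C, 1, PC1, r1}
d.note_registers([2, 3])                   # I2: nonmemory, uses r2 and r3
d.memory_access(pc=3, home="C")            # I3: same home core, Depth becomes 2
d.note_registers([4])                      # I4: nonmemory, uses r4
print(d.memory_access(pc=5, home="A"))     # I5: new home core ends the sequence
# prints: ('migratory', 1, 30)
```

The returned mask 30 (binary 11110) has the bits for r1 through r4 set, so a caller would insert PC1 into the predictor together with that mask.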
The partial-context migration policy applies when a thread T executes a memory instruction whose PC hits in the migration predictor and thus needs to migrate. It has three cases:
› If T is migrating from its native core to a nonnative core, it takes the registers specified in the migration predictor's useful register mask, as part 1 of Figure 3b shows.
› If T is migrating from one nonnative core to another, it takes the registers it brought when it first migrated from its native core, as part 2 of Figure 3b shows.
› If T is migrating back to its native core from a nonnative core, it takes only the registers written while T was outside its native core, as part 3 of Figure 3b shows.
[Figure 3b labels: "Migrates with r1, r2, r3"; "Migrates with r1".]
Any special-purpose registers required for execution and with per-thread instances, such as rip, rflags, and mxcsr for a 64-bit x86 architecture, are always transferred.
Register bit masks. To implement these policies, a thread carries around two 32-bit masks. V-mask identifies the registers that the thread can access while outside its native core, which it looked up in the predictor when the thread first migrated from its native core. W-mask, which tracks the registers that have been written while the thread was outside its native core, implements the third partial-context migration policy. Because a register file remains intact in the native context, a thread returning to its native core needs to carry only modified registers.
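One way to picture how the masks select the registers to carry is the sketch below. It is an illustration only: the function names are hypothetical, bit n stands for register rn, and the 64-bit register width used for the payload estimate is an assumption.

```python
def registers_to_carry(from_native: bool, to_native: bool,
                       useful_mask: int, v_mask: int, w_mask: int) -> int:
    """Return the bit mask of registers that travel with the thread."""
    if from_native and not to_native:
        return useful_mask  # leaving home: take what the predictor recommends
    if not from_native and not to_native:
        return v_mask       # between nonnative cores: take what was brought from home
    if not from_native and to_native:
        return w_mask       # returning home: take only registers written while away
    raise ValueError("a move from the native core to itself is not a migration")


def payload_bits(mask: int, reg_bits: int = 64) -> int:
    """Register payload size for a mask, assuming reg_bits per register."""
    return bin(mask).count("1") * reg_bits


useful = 0b1110  # predictor recommends r1, r2, r3
written = 0b0010 # only r1 was written while away
out = registers_to_carry(True, False, useful, v_mask=0, w_mask=0)
back = registers_to_carry(False, True, useful, v_mask=out, w_mask=written)
print(payload_bits(out), payload_bits(back))
# prints: 192 64
```

In this example the thread leaves with three registers and returns with one, compared with a full general-purpose register file on every trip. Per-thread special-purpose registers, which are always transferred, are left out of the estimate.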
During migrations, the V- and W-masks and {Home, Depth, Start PC, Used Registers} must be transferred with the context. With 64 cores and a maximum of θ = 8, a total of 169 bits must be transferred: V-mask (32 bits) + W-mask (32 bits) + Home (6 bits for 64 cores) + Depth (3 bits when θ = 8) + Start PC (64 bits) + Used Registers (32 bits).
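The bit count can be checked with a few lines of arithmetic. The helper name below is hypothetical, and the field widths are derived on the assumption that Home needs enough bits to name any core and that Depth needs enough bits to distinguish θ values.

```python
def predictor_state_bits(num_cores=64, max_theta=8, num_regs=32, pc_bits=64):
    """Per-field widths of the predictor state that travels with a thread."""
    return {
        "V-mask": num_regs,
        "W-mask": num_regs,
        "Home": (num_cores - 1).bit_length(),   # 6 bits for 64 cores
        "Depth": (max_theta - 1).bit_length(),  # 3 bits when theta = 8
        "Start PC": pc_bits,
        "Used Registers": num_regs,
    }


fields = predictor_state_bits()
print(sum(fields.values()))
# prints: 169
```

At 128 bits per flit, this state adds two flits to each migration on top of the registers themselves.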
Unlike its decision about performing a remote access or thread migration, the thread consults the useful register information in the predictor only when the thread is at its native core. The native context is the only place where all the register values are maintained for the thread, and once it leaves the native core, the thread cannot use any registers other than the ones it initially brought from its native core (the registers in the V-mask). If, while outside its native core, a thread encounters an instruction that requires reading or writing a register not brought from its native core (that is, rn is not an element of the V-mask), a register miss occurs, and the thread stops execution and returns to its native core. Part 4 of Figure 3b illustrates this register-miss migration.
To minimize such migrations, our predictor updates the useful register mask by adding the register that caused the miss when the thread migrates back. With this learning mechanism, the useful register mask for a particular PC, say PC1, will eventually converge to a superset of registers that the thread uses until it migrates back to its native core.
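The register-miss check and the learning step can be sketched as follows. This is a simplified model under stated assumptions: a plain dictionary stands in for the predictor, the class and function names are hypothetical, and the PC whose mask is updated is taken to be the one whose predictor hit triggered the out-migration.

```python
class AwayThread:
    """State carried by a thread while it runs outside its native core."""

    def __init__(self, migrating_pc: int, useful_mask: int):
        self.migrating_pc = migrating_pc  # PC that hit in the predictor
        self.v_mask = useful_mask         # registers brought from the native core
        self.w_mask = 0                   # registers written while away

    def use_register(self, reg: int, write: bool = False) -> bool:
        """Return False on a register miss, which forces a return to the native core."""
        bit = 1 << reg
        if not self.v_mask & bit:
            return False
        if write:
            self.w_mask |= bit
        return True


def learn_from_miss(useful_masks: dict, migrating_pc: int, missed_reg: int) -> None:
    """Union the missing register into the mask, so the mask only ever grows."""
    useful_masks[migrating_pc] |= 1 << missed_reg


masks = {0x40: 0b0010}                  # predictor initially recommends only r1
t = AwayThread(0x40, masks[0x40])
t.use_register(1, write=True)           # allowed, and recorded in the W-mask
if not t.use_register(3):               # r3 was not brought: register miss
    learn_from_miss(masks, t.migrating_pc, 3)
print(bin(masks[0x40]))
# prints: 0b1010
```

The next migration triggered by the same PC would carry both r1 and r3, avoiding a repeat of this register miss.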
SIMULATION FRAMEWORK
To model our architecture, we used Graphite 13 to simulate 64 in-order, single-issue cores with two-way fine-grained multithreading in an 8 × 8 mesh (XY routing, 128-bit flits); each core has a 32-Kbyte L1 two-way data cache, a 32-Kbyte L1 read-only four-way instruction cache, and a 128-Kbyte four-way L2 cache (64-byte cache block). We modeled a two-cycle, fixed per-hop latency with extra delays from contention. For data placement, we used the first-touch-after-initialization policy, which allocates each page (4 Kbytes) to the core that first accesses it after parallel processing has started.
Our comparison involved several architectures and a distance-based decision scheme, which we abbreviated as DirCC (the directory-based baseline), NoDirRA (directoryless with remote access only), NoDirPred (directoryless with our migration predictor), and NoDirDist (directoryless with distance-based migration decisions).
To quantify how partial-context migration can reduce traffic, we compared NoDirPred (our basic design) with NoDirPred-Full, the full-context migration variant, which always sends the full thread context during migrations. We also compared NoDirPred to NoDirDist to assess our migration predictor's effectiveness. The idea behind NoDirDist is that the round-trip remote-access overhead should be low over short distances, so threads should migrate only if the distance to the home core exceeds some threshold d. In our evaluation, we set the threshold to 6, the average hop count for an 8 × 8 mesh, and assumed that the full context is transferred during migrations.
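The distance-based rule behind NoDirDist can be illustrated with a short sketch. The row-major numbering of cores within the mesh and the function names are assumptions made for this example; only the rule itself (migrate when the hop count exceeds the threshold) comes from the description above.

```python
MESH_WIDTH = 8  # 8 x 8 mesh


def xy_hops(src: int, dst: int, width: int = MESH_WIDTH) -> int:
    """Hop count under XY routing, assuming row-major core numbering."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def nodirdist_decision(current: int, home: int, threshold: int = 6) -> str:
    """Migrate only when the home core is more than `threshold` hops away."""
    if current == home:
        return "local"
    return "migrate" if xy_hops(current, home) > threshold else "remote_access"
```

For example, from core 0 the opposite corner (core 63) is 14 hops away and would trigger a migration, whereas core 27 is exactly 6 hops away and would be served by remote access. Because the rule ignores how many accesses follow, it can migrate a full context for a single distant access.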
We included a comparison with DirCC to provide a sense of how directoryless designs perform relative to conventional designs. DirCC uses the modified/shared/invalid (MSI) protocol with full-map directories in a private-L1 and shared-L2 configuration, and reactive-NUCA data placement, which supports coarse-grained data migration and replication. 1
Because all schemes use the shared-L2 configuration and our benchmark data sets fit on the chip, differences in off-chip access rates are negligible across all the systems we evaluated; the main difference stems from the performance of on-chip cache accesses. Our comparison used several benchmark applications, including
› prcn+cv, a perceptron learning algorithm with parallel cross-validation;
› dht, a distributed hash table benchmark; and
› a set of Splash-2 benchmarks: 14 fft, lu, ocean, radix, raytrace, and water.
Each benchmark ran to completion, and we measured the parallel completion time. We also accounted for migration overhead for our hybrid architecture. Neither NoDirPred nor NoDirRA allows data replication at the hardware level. Read-only data, however, can be replicated without breaking cache coherence, even without directories and a coherence protocol. Shared read/write data is the primary data type that requires directories to maintain cache coherence, so the performance of directoryless architectures with read-only data replication is a realistic indicator of the cost of removing directory coherence. Consequently, we replicated read-only data for Splash-2 benchmarks by making source-level modifications. 15 Our modifications did not alter the algorithm and incurred code changes of only a few tens of lines for each benchmark. NoDirPred and NoDirRA benefitted from this replication almost equally, so the modifications had no effect on other aspects of their comparison.
PERFORMANCE EVALUATION
Figure 4a shows the performance of DirCC, NoDirRA, NoDirDist, and NoDirPred with θ set to 3. Relative to DirCC, NoDirRA's performance is 49 percent worse on average, whereas NoDirPred's performance is only 13 percent worse. NoDirDist has the lowest performance, which underlines the need to make judicious migration decisions. Because instruction-cache (I-cache) content is not transferred during migrations, NoDirPred shows 7 percent more I-cache misses than NoDirRA on average; I-cache miss rates, however, are still less than 0.1 percent, a negligible effect on performance.
On-chip network traffic
We also compared on-chip network traffic in each system, measured as the number of sent flits times the number of traveled hops. Figure 4b shows that NoDirPred reduces network traffic by 28 percent on average compared to DirCC, and by 50 percent relative to NoDirRA. Our comparison also revealed that the network traffic for NoDirDist is five times that of DirCC traffic (not shown).
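The traffic metric (flits sent times hops traveled) is easy to express in code, and it also shows why context size matters. The sketch below is illustrative only: the function name, the assumption that each remote-access message fits in one flit, and the 4-hop distance are hypothetical inputs rather than measured values.

```python
def network_traffic(messages) -> int:
    """Sum of flits x hops over an iterable of (flits, hops) pairs."""
    return sum(flits * hops for flits, hops in messages)


HOPS = 4  # assumed distance between the executing core and the home core

# One remote access: a request and a response, assumed to be one flit each.
remote_access = network_traffic([(1, HOPS), (1, HOPS)])

# One one-way migration of a full 26-flit context versus a 4-flit partial context.
full_migration = network_traffic([(26, HOPS)])
partial_migration = network_traffic([(4, HOPS)])

print(remote_access, full_migration, partial_migration)
# prints: 8 104 16
```

Under these assumed numbers, a full-context migration costs as much traffic as 13 remote accesses, whereas a 4-flit partial context pays for itself after only two.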
Although the average performance of NoDirPred is lower than that of DirCC, most of our benchmarks were developed with directory coherence in mind. Parallel cross-validation (prcn+cv) is an example of when directory-based coherence performs poorly. The computation requires each thread to traverse through a dataset spread across the cores, resulting in many accesses to remote caches and high network overhead for DirCC. As a result, NoDirPred outperforms DirCC by 34 percent with 42 times less traffic for prcn+cv. These results demonstrate the feasibility of eliminating such overhead by migrating threads to the data.
L1 cache miss rates
To better understand overall performance, we measured L1 cache miss rates for DirCC and NoDirPred. Figure 4c shows the results. Because cache lines are not replicated across L1 caches in NoDirPred, the effective L1 cache capacity increases, always resulting in lower L1 miss rates than in DirCC. Moreover, although all L1 misses under NoDirPred are forwarded to local L2 caches, a large fraction of L1 misses for DirCC result in memory requests to remote L2 caches. These memory requests contribute significantly to performance degradation and increase network traffic in the directory-based architecture.
Core miss rates
On the other hand, directoryless designs can suffer when the core miss rate is high, that is, when the thread must frequently access data cached in remote cores. DirCC's core miss rate is always zero. Figure 4d shows that, on average, 18.4 percent of all memory accesses result in core misses for NoDirRA, and only 6 percent for NoDirPred. The average migration rate is 1 percent, indicating that the predictor works well. Raytrace and water are examples of how NoDirPred suffers both performance loss and network traffic increase because of high core miss rates.
Partial versus full context
Figure 4e shows the results of comparing NoDirPred and NoDirPred-Full. NoDirPred reduces out-migration traffic (migrations to nonnative cores) by 58 percent and back-migration traffic (migrations back to native cores) by 72 percent, relative to NoDirPred-Full. Our predictor (specifically the Useful Register field) is largely responsible for the reduction of out-migration traffic, and the W-mask's ability to track written registers helped reduce back-migration traffic.
Although using partial-context migration occasionally induces unnecessary migrations because of register misses, we observed almost no associated overhead because our predictor learns from each miss and adds the missing register to the useful register mask for the appropriate PC. With this union mechanism, however, the register mask will only grow, which makes our context prediction conservative. Thus, some of the registers that are migrated might not actually be used. Across all benchmarks, approximately 75 percent of migrated registers are actually used on average, as Figure 4f shows. We view this as evidence that our predictor is reasonably efficient.
Varying network parameters
Figure 5 shows the relative performance and network traffic of NoDirPred under different network parameters. Figure 5a shows that our architecture outperforms NoDirRA by 29 percent with a three-cycle per-hop latency. The figure does not show results with a two-cycle per-hop latency, which yielded an improvement of 24 percent. The larger gain at three cycles is due to the round-trip nature of remote accesses, which suffer more from increased per-hop latency. With a 64-bit flit network instead of 128-bit, on the other hand, the network traffic reduction of NoDirPred over NoDirRA decreases from 50 percent to 43 percent. This is because a large fraction of remote-access messages (those that do not carry a data word) fit into 64 bits and do not need additional flits to make up for the halved bandwidth. Performance improvements also drop, but not significantly.
Area and power costs
Our hybrid architecture requires an extra architectural context for the guest thread and a learning migration predictor. It also requires four independent on-chip networks for migrations and evictions as well as remote-access requests and responses. In contrast, a deadlock-free implementation of directory-based coherence requires three networks: one each for coherence requests, replies, and invalidations.
To get an idea of how these costs compare against those of a directory-based ...
[Figure legend: DirCC, NoDirRA, NoDirPred; panels (a) and (b).]
Figure 5b shows the comparison results. Predictably, the SRAM blocks (instruction and data caches as well as the DirCC directory) were responsible for most of the area in all variants. Overall, the area required for the directory in both the 50- and 100-percent MESI versions outweighed the extra thread context and extra router in NoDirPred. We view this as evidence that NoDirPred can reduce area costs.
Verification complexity
Distributed cache-coherence protocols are hard to verify because the state space grows exponentially with the number of cores. Even relatively simple protocols, such as MSI and MESI, introduce many transient states that are not explicit in higher-level protocols, 5 and writing test benches that exercise all the reachable transient states is not practical. Consequently, exploring the state space requires significant modeling simplifications, 16 but even formally verifying a given protocol on a few cores is not enough to provide confidence that it will work on 100 cores.
Directoryless designs such as NoDirRA and NoDirPred have a significant advantage over directory-based designs. A given memory address can be cached only in a single place, so a memory request will depend only on the validity of a given line in a single cache, and indirections or transient states are not required. Unlike caches and directory entries in DirCC, directoryless caches do not keep information about more than the local core; the VALID and DIRTY flags that together determine a cache line's state are local to the tile.
Moreover, thread migration does not introduce additional complications because the data cache is not concerned with the origin of local memory requests (whether from native or migrated threads), but instead uses the same access interface for all requests. All of the logic required for the migration framework (deciding whether to migrate, computing the destination core, serializing and deserializing network packets from and to the execution context, evicting a running thread if necessary, and so on) is also local to the tile.
As a result, we can cleanly separate overall correctness into four categories, each of which can be reasoned about separately:
› remote-access framework,
› thread migration framework,
› cache that serves the memory request, and
› on-chip interconnect.
Furthermore, the entire state space can be exercised in a four-tile system, so it is possible to scale the system to an arbitrary number of cores without incurring an additional verification burden.
Our evaluation has shown that, for certain applications, a directoryless architecture with fine-grained partial-context thread migration is superior to a directory-based coherence protocol. By lowering network traffic levels in shared-memory architectures and reducing the verification complexity of directory-based coherence protocols, our hybrid directoryless architecture provides a new design point in the spectrum of hardware-coherence support. The lack of data replication can limit its performance benefits, but we plan to explore approaches to avoid this limitation, such as implementing thread migration on top of simplified hardware or software coherence protocols.
