In this work we introduce power optimizations relying on partial tag comparison (PTC) in snoop-based chip multiprocessors. Our optimizations rely on the observation that detecting tag mismatches in a snoop-based chip multiprocessor does not require processing the entire tag. In fact, a high percentage of cache mismatches can be detected by using a small but highly informative subset of the tag bits.
Introduction
The continuous downscaling of transistor dimensions, together with the limits on instruction-level parallelism in programs, has popularized the shared-memory chip multiprocessor (CMP) architecture as an effective solution. On a CMP, cores communicate through shared variables and rely on cache coherence protocols. Cache coherence protocols propagate recently updated values to all concerned caches [1]. In addition, cache coherency provides cores with the latest value of the requested shared variables. The delay associated with the coherency mechanisms postpones shared variable updates and read operations. Accordingly, one way to enhance overall CMP performance is to speed up the coherence process.
To reduce coherency delay, commercial small-scale [2] and possibly larger CMPs [3] exploit Snoopy Cache Coherence (SCC) protocols. SCC protocols take an aggressive approach and broadcast memory requests to all cores in the system. Unfortunately, SCC protocols impose high interconnect bandwidth demand and frequent unnecessary remote cache searches [3]. Previous research has introduced different approaches to solve these problems. One possible approach exploits snoop filters to eliminate useless interconnect and memory activities [4]-[7]. Snoop filters come in two classes: source-based and destination-based. In source-based filters [4, 5], each node decides locally, but based on global knowledge, whether or not to broadcast a message. While this approach can eliminate some unnecessary traffic, once a node does broadcast, it cannot stop message delivery or prevent cache lookups in processors that are not concerned with the request. Destination-based filters, however, focus on eliminating unnecessary lookups at destinations [6, 7]. In contrast to source-based filters, these filters rely on local knowledge to determine whether a lookup is necessary. They take advantage of the locality in the snoop request access pattern [6] or of Bloom filters [7] to eliminate unnecessary lookups.
We extend previous work by using partial tag comparison (or simply PTC) in snoop-based chip multiprocessors. We rely on the observation that a considerable share of tag mismatches can be detected by comparing a subset of the tag bits, making a full tag comparison unnecessary. We take advantage of this phenomenon and store, in the source node, a small number of tag bits for the tags recorded in all cores to facilitate early mismatch detection. Prior to sending a snoop request, we compare the corresponding subset of the address tag bits to those stored in the source node and avoid sending the snoop request to nodes that show a mismatch.
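The source-side check described above can be illustrated with a minimal Python sketch. It is not the hardware design evaluated in this paper: the class `SourceSideFilter`, its methods, the set-associative layout, and the choice of `PARTIAL_BITS = 8` are all hypothetical names and parameters introduced only for illustration. The sketch assumes the source node holds an up-to-date copy of the low-order tag bits of every line in every remote cache.

```python
PARTIAL_BITS = 8  # assumed number of low-order tag bits kept at the source


def partial_tag(tag: int, bits: int = PARTIAL_BITS) -> int:
    """Return the low-order `bits` bits of a full tag."""
    return tag & ((1 << bits) - 1)


class SourceSideFilter:
    """Hypothetical source-side store of partial tags for all cores."""

    def __init__(self, num_cores: int, num_sets: int, num_ways: int):
        # snapshot[core][set][way] holds a partial tag, or None if the way is empty.
        self.snapshot = [
            [[None] * num_ways for _ in range(num_sets)]
            for _ in range(num_cores)
        ]

    def record_fill(self, core: int, set_index: int, way: int, tag: int) -> None:
        """Mirror a cache fill in the source-side snapshot."""
        self.snapshot[core][set_index][way] = partial_tag(tag)

    def record_evict(self, core: int, set_index: int, way: int) -> None:
        """Mirror an eviction or invalidation in the snapshot."""
        self.snapshot[core][set_index][way] = None

    def snoop_targets(self, requester: int, set_index: int, tag: int) -> list:
        """Cores whose partial tags match; all other cores are certain misses."""
        wanted = partial_tag(tag)
        return [
            core
            for core, sets in enumerate(self.snapshot)
            if core != requester and wanted in sets[set_index]
        ]


if __name__ == "__main__":
    f = SourceSideFilter(num_cores=4, num_sets=16, num_ways=2)
    f.record_fill(1, 3, 0, 0xAB12)  # core 1 holds the requested block
    f.record_fill(2, 3, 1, 0xCD34)  # different low bits: filtered out
    f.record_fill(3, 3, 0, 0xEF12)  # same low bits, different block
    print(f.snoop_targets(requester=0, set_index=3, tag=0xAB12))
```

In this sketch a partial mismatch proves that the full tag mismatches, so skipping that core is safe as long as the snapshot is current. A partial match may still turn out to be a full-tag miss (core 3 in the usage example), in which case the snoop is sent and resolved by the normal full comparison at the destination.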
It should be noted that there are two classes of coherency cache misses: global and local. A global miss occurs when the requested address misses in every remote cache. In the case of a local miss, one or more cores miss the requested data, but at least one remote cache has a copy of the requested block. Previously suggested source-based filters focus on global misses. As we show in this paper, our proposed source-based PTC mechanism detects both global and local misses, increasing the power reduction opportunities.
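The distinction between the two miss classes can be made concrete with a short Python sketch. The function name `classify_snoop` and its return labels are hypothetical and chosen only for illustration; the input is assumed to contain one boolean per remote cache, indicating whether that cache holds the requested block.

```python
def classify_snoop(remote_has_copy: list) -> str:
    """Classify a snoop by how many remote caches hold the requested block.

    Assumes at least one remote cache; labels are illustrative only.
    """
    hits = sum(1 for has_copy in remote_has_copy if has_copy)
    if hits == 0:
        return "global miss"  # every remote cache misses
    if hits < len(remote_has_copy):
        return "local miss"   # some remote caches miss, at least one hits
    return "no miss"          # every remote cache holds the block


if __name__ == "__main__":
    print(classify_snoop([False, False, False]))
    print(classify_snoop([False, True, False]))
```

Under this classification, a filter that only recognizes global misses saves nothing in the second case, whereas a per-core check can still skip the two remote caches that do not hold the block.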
Using a small number of tag bits makes early cache miss detection possible. This results in performance improvement for some of the applications studied here. Therefore, while previously suggested techniques often save power at the expense of performance, we improve performance and power simultaneously for some applications.
In summary, we make the following contributions:
- We show that it is possible to maintain cache coherency by using only a small number of tag bits. Our study shows that, on average, between 95% and 98% of global and local remote misses can be detected by taking into account only the eight lower bits of the requested tag in different CMP configurations.
- We propose source-based PTC (or S-PTC). S-PTC relies on storing, at the source side, a snapshot of the storage components involved in snooping. S-PTC reduces the interconnect bandwidth requirement (by 78.5% to 81.9%) and tag array dynamic power (by 52%) while improving average performance by up to 3.5%.
The rest of the paper is organized as follows. In Section 2 we discuss background. In Section 3 we present our motivating findings. In Section 4 we discuss S-PTC in more detail. In Section 5 we present our methodology and results. In Section 6 we discuss related work. Finally, in Section 7 we offer concluding remarks.
