Chip Multiprocessors (CMPs) and Simultaneous Multithreading (SMT) processors provide high performance but put more pressure on the memory interface than their single-thread counterparts. The "memory wall" problem is exacerbated by multiple threads sharing a memory interface, and will get worse as more cores are added. Therefore, communication between cores, using shared caches or fast interconnects between private caches, is needed to keep the CPUs busy without burdening the memory interface. Multiple-CMP systems add another dimension to this challenging problem, as the communication mechanism is no longer uniform. To parallelize data-intensive applications for high performance on these systems, one must explore a number of execution behaviors in a complex, architecture-dependent exercise that entails identifying key components of the communication subsystem and understanding their behavior under varying workloads. As part of ongoing research into efficient program execution models for parallel microprocessors, we have developed a tool to evaluate the performance of the storage controllers at different levels of the memory hierarchy under varying workloads and to measure cache coherence overhead. The tool allows exploration of architectural features of real processors that affect the performance of several parallel execution approaches. Here, we demonstrate its use by evaluating two of our parallel programming models that employ architecture-specific optimizations and comparing them to a conventional model for several applications on parallel microprocessors.
With the advent of Simultaneous Multithreading (SMT) processors and Chip Multiprocessors (CMPs), the trend in commercial microprocessor design has shifted from exploiting instruction level parallelism (ILP) and ever increasing clock rates to exploiting thread level parallelism (TLP). SMT processors [20] maintain multiple executable thread contexts in hardware. The threads execute simultaneously and share the execution resources of the physical processor, such as the functional units and the caches. However, each thread has its own instance of the architectural state, which includes the program counter and registers. Chip Multiprocessors [13, 2] are built with several processor cores in a single package. They also share some levels of the memory hierarchy between the cores, but the amount of sharing varies by manufacturer and product. Cores tend to have their own individual first level caches. Some CMPs, such as Intel's Core Duo [13] series, share large second level caches between cores, while others, like AMD's Athlon 64 X2 [2], have L2 caches that are private to each core.
The fast interconnects and/or one or more shared levels of the memory hierarchy in these parallel microprocessors present new opportunities for communication between the execution contexts/cores. But these resources could become bottlenecks if there are hidden costs involved in using them, or if their performance is greatly reduced by contention. Moreover, because of limited pin-out, or because some of these parallel microprocessors are pin compatible with their sequential predecessors, their memory interfaces tend not to scale with the number of threads. Often, the same memory interface that was already a bottleneck for a single-threaded CPU is used in the parallel microprocessors, exacerbating the memory wall problem when multiple threads access memory simultaneously.
Executing parallel applications, particularly data-intensive ones, on these parallel microprocessors involves choosing an execution paradigm that takes into account the capabilities of the memory and cache controllers and the expense of maintaining cache coherence in the system. For example, we have shown that for important classes of applications, using the conventional parallel partitioning (the Spatial Decomposition Model, or SDM) tends to overwhelm the cache and memory interface on SMT processors, thus sometimes causing them to run slower than the sequential version of the programs [28]. This effect can also be seen in Figure 1, which compares the instructions per cycle (IPC) rate achieved by the Sequential (Seq) and SDM versions of an extremely memory-intensive Finite Difference Time Domain (FDTD) electromagnetic simulation running on a 3 GHz Intel Pentium 4 and a 2.4 GHz dual-core AMD Athlon X2. As long as the problem size fits in the 512K L2 cache, the SDM version has higher aggregate IPC than the Seq version because it exploits parallelism, but it slows down even more than Seq once main memory accesses are needed for larger problem sizes. This performance degradation occurs because the memory wall is steeper under SDM due to simultaneous accesses from multiple threads. Thus, ineffectively parallelized code can actually run slower on a parallel microprocessor.
To alleviate the problem, we proposed the Synchronized Pipelined Parallelism Model (SPPM), an execution paradigm that uses the shared cache as a high-speed communication channel between producer and consumer pairs. SPPM carefully manages the threads' execution so they communicate via the cache, which significantly reduces the burden on the memory interface and better exploits concurrent execution on parallel microprocessors. Our results [28] [...]
The paper is organized as follows: Section 2 discusses related research. Section 3 defines the SDM and SPPM models and identifies potential bottlenecks that may affect their performance. It introduces PolyThreads as an improvement over SPPM for architectures with private caches. Section 4 describes our benchmarks and enumerates the target architectures. Section 5 presents the results of architecture evaluation and explains how the programming models are impacted, while Section 6 compares the performance of SDM, SPPM, and PolyThreads on the target systems. Finally, Section 7 describes our ongoing research into parallelization of applications for parallel microprocessors.
Related Research
SPPM was inspired by I-Structures [4], used in data-flow architectures for fine-grained producer-consumer synchronization. Others have proposed using an I-Structure-like Synchronization Array [25, 24] to decouple producer and consumer threads in pointer-chasing code. That work targets extremely fine-grained parallelism, whereas we target much coarser-grained parallelism.
In the past, pipelined computation was the basis for systolic arrays [16, 3] and vector processors such as the Cray-1 [26]. More recently, stream processing has been a topic of research in both industry and academia. For example, the StreamIt programming language and compilation infrastructure [12, 11] was developed to characterize large streaming applications and efficiently map their concurrent kernels to a number of target architectures.
Locality enhancement techniques [21] are software optimizations applied with the objective of improving the behavior of memory-intensive applications. By making maximal use of data fetched into the cache before it gets evicted, these optimizations improve both the spatial and temporal locality of applications. For example, tiling [29, 6, 7], also known as blocking [15, 18], transforms loop nests whose working set sizes far exceed the cache size to improve the temporal locality by working on smaller tiles of data. The transformation causes a tile to be fully used before it is evicted from the cache.
Research into cache partitioning to avoid interference and to improve shared cache performance on SMT processors for scientific codes with perfect loops [23] uses tiling and copying to reduce capacity and conflict misses. However, modifications to the operating system or access to performance counters are needed to detect interference.
Cache coherent shared memory machines have been around for decades, and nearly every paper evaluating such architectures contains some performance estimations or measurements of the timing or bandwidth. Early machines such as DASH [19] and Flash [17] significantly influenced how future scalable cache coherent machines were designed. One effort to measure NUMA system performance is found in [9]. The authors develop a tool to evaluate link throughput between caches in scalable NUMA machines. Somewhat less has been done with commodity, bus-based SMP systems. For example, the authors of [14] evaluate the Pentium Pro processors' performance in great detail, but provide much less detail regarding cache coherence behavior and timing. lmbench [22] provides much detail on many performance metrics, including support for multithreading contention. However, unlike C2CBench, it does not measure cache-to-cache performance.
Another performance evaluation approach is to use the performance counters available on modern CPUs. The Performance Application Programming Interface (PAPI) [5, 10] tools provide a standardized interface to counters in commodity architectures running Linux. Oprofile [f] is a tool that uses performance counters to profile applications, though with less control than PAPI provides. Both tools aided our efforts to understand SPPM behavior.
Parallel Program Execution Models
The Spatial Decomposition Model (SDM) exploits data parallelism by partitioning data and associated computation across the available processors. It is relatively easy to program for suitable algorithms and is widely used by both programmers and parallelizing compilers. SDM threads work on their respective data partitions independently of each other, causing the already bandwidth-limited shared memory interface on CMP and SMT processors to be further overburdened with a large number of simultaneous accesses. When the application's working set is larger than the cache, SDM induces a large number of capacity misses that further degrade performance. On systems with shared caches, SDM threads may also cause cache interference, evicting each other's data from the shared cache.
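The partitioning that SDM performs can be sketched as follows. This is only an illustrative sketch, not the benchmark code used in the paper: the function name `sdm_map` and its parameters are hypothetical, and Python threads are used purely to show the structure (they do not reproduce the memory-interface behavior of native threads).

```python
import threading

def sdm_map(data, func, num_threads=2):
    """Apply func to every element, one contiguous partition per thread.

    Hypothetical helper for illustration only.
    """
    n = len(data)
    out = [None] * n
    # Thread t owns the index range [bounds[t], bounds[t + 1]).
    bounds = [(t * n) // num_threads for t in range(num_threads + 1)]

    def worker(lo, hi):
        # Each thread streams through its own partition independently,
        # so all threads access the memory interface at the same time.
        for i in range(lo, hi):
            out[i] = func(data[i])

    threads = [
        threading.Thread(target=worker, args=(bounds[t], bounds[t + 1]))
        for t in range(num_threads)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return out
```

Because the partitions are disjoint, no synchronization is needed inside the loop, which is what makes SDM easy to program; the cost is that every thread pulls its own working set through the shared memory interface.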
The Synchronized Pipelined Parallelism Model (SPPM) [28] was developed to reduce the demand on the memory bus by restructuring suitable programs into producer/consumer pairs that communicate through the cache rather than the memory. For architectures that do not provide a shared cache, the presumed fast cache coherence mechanism should still provide a higher bandwidth communication channel than the main memory, which turns out not to be the case (see Section 5). Restructuring suitable applications to exploit producer-consumer parallelism is often more complex than using SDM, though some streaming applications are naturally suited to producer-consumer parallelism. Work is underway to ease the programming of SPPM applications (see Section 7). Depending on the architecture, the producer and consumer may communicate using either a shared cache or the cache coherence interconnect connecting the cores/processors. In either case, the number of the consumer's memory accesses is greatly reduced, with aggregate memory bandwidth similar to that of the sequential program.
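The producer/consumer structure described above might look roughly like the following sketch. It is not the SPPM runtime itself: `sppm_run`, the block-granular semaphore, and the Python list standing in for cache-resident data are all assumptions made for illustration.

```python
import threading

def sppm_run(num_blocks, block_size, produce, consume):
    """Run a producer and a consumer over a shared buffer, block by block.

    Hypothetical sketch: produce(i) yields element i, consume(block)
    reduces one finished block to a result.
    """
    buf = [0] * (num_blocks * block_size)  # shared buffer both threads touch
    ready = threading.Semaphore(0)         # counts blocks the producer finished
    results = []

    def producer():
        for b in range(num_blocks):
            base = b * block_size
            for i in range(base, base + block_size):
                buf[i] = produce(i)
            ready.release()                # publish block b to the consumer

    def consumer():
        for b in range(num_blocks):
            ready.acquire()                # blocks finish in order, so block b is done
            base = b * block_size
            results.append(consume(buf[base:base + block_size]))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

The consumer trails the producer by at most the number of published blocks, so on real hardware the data it reads would ideally still be in a shared cache or in the producer's cache rather than in main memory.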
The Polymorphic Threads (PolyThreads) model was developed because SPPM did not perform well on CMPs with private caches.
While PolyThreads is derived from SPPM, the communication paradigm is entirely different. Each polymorphic thread contains both the producer and consumer code, which operate on blocks of data, with each thread operating on alternating blocks. When a thread's producer code is finished with a block, it signals the other thread's producer to begin work on the next input block. It then transforms itself into a consumer for the data it has just produced, as illustrated in Figure 3. Because the consumer uses data just produced by its producer persona, it finds the data in its local cache; thus no large cache-to-cache transfers are needed. This allows concurrent execution of polymorphic producer-consumer threads, largely [...] transferred between caches at block boundaries. The block size can be tuned based on the data access pattern and the cache size to achieve the best performance. Polymorphic Threads greatly benefits SMPs and CMPs that have private caches because cache coherence operations always incur some cost. However, it does not yield any benefit on SMT processors and CMPs with one or more shared cache levels, because producer-consumer communication is already as fast as possible and the overhead of synchronizing and switching modes means it runs slightly slower than SPPM on such processors. In addition, PolyThreads programs are more complex and difficult to develop because of the fusion of producer and consumer code with the transformation at a synchronization point in the middle. The added complexity and slight additional overhead mean that it should only be used when needed to overcome the cache coherence cost of processors having private caches. As described in Section 7, efforts are underway to reduce the burden of programming such code.
Figure 3. SPPM and Polymorphic Threads
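The role switching described above can be sketched as a token passed between threads. The names below (`polythreads_run`, the per-block `Event` used as the producer token) are hypothetical, and the sketch assumes `produce(i)` depends only on the element index; a producer with carried state would also have to hand that state over together with the token.

```python
import threading

def polythreads_run(num_blocks, block_size, produce, consume, num_threads=2):
    """Each thread produces a block, passes the producer role on, then
    consumes the block it just produced. Illustrative sketch only."""
    # turn[b] is set when some thread may start producing block b.
    turn = [threading.Event() for _ in range(num_blocks + 1)]
    results = [None] * num_blocks
    turn[0].set()

    def worker(tid):
        # Thread tid handles blocks tid, tid + num_threads, ...
        for b in range(tid, num_blocks, num_threads):
            turn[b].wait()                    # wait to become the producer
            base = b * block_size
            block = [produce(i) for i in range(base, base + block_size)]
            turn[b + 1].set()                 # let the next thread start producing
            results[b] = consume(block)       # switch role: consume local data

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

In this arrangement the produced block never has to leave the thread that created it; only the small synchronization token crosses between threads at block boundaries.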
Benchmarks and Target Architectures
We have currently hand-coded the following four applications as benchmarks for our research. These applications are real-world segments, yet are simple enough to isolate the memory behavior and discover performance differences between the various parallel execution approaches.
Red is used by a pseudo-random number generator to generate a keystream of pseudo-random bits. As each byte of the keystream is generated, it is XORed with the corresponding byte of the plaintext to generate a byte of the ciphertext. Because the generation of each byte of the keystream is dependent on the previous internal state, the process is inherently sequential and, thus, non-parallelizable using SDM. By treating the keystream generation and the production of ciphertext as producer and consumer, SPPM and PolyThreads can exploit the inherent concurrency.
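The producer/consumer split of a stream cipher can be illustrated with the following sketch. It assumes the publicly documented RC4 key schedule and generator as a stand-in for the ARC4 benchmark; the function names, the block size, and the bounded queue used as the communication channel are hypothetical choices, not the authors' implementation.

```python
import queue
import threading

def rc4_keystream_blocks(key, total, block_size):
    """Yield keystream blocks; every byte depends on the prior state S, i, j."""
    S = list(range(256))
    j = 0
    for i in range(256):                      # key-scheduling phase
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0
    produced = 0
    while produced < total:
        n = min(block_size, total - produced)
        block = bytearray(n)
        for k in range(n):                    # sequential generation phase
            i = (i + 1) % 256
            j = (j + S[i]) % 256
            S[i], S[j] = S[j], S[i]
            block[k] = S[(S[i] + S[j]) % 256]
        produced += n
        yield bytes(block)

def pipelined_encrypt(key, plaintext, block_size=4):
    """Producer generates keystream blocks; consumer XORs them with the text."""
    channel = queue.Queue(maxsize=2)          # small channel between the threads
    out = bytearray()

    def producer():
        for ks in rc4_keystream_blocks(key, len(plaintext), block_size):
            channel.put(ks)
        channel.put(None)                     # end-of-stream marker

    def consumer():
        pos = 0
        while True:
            ks = channel.get()
            if ks is None:
                break
            out.extend(p ^ k for p, k in zip(plaintext[pos:pos + len(ks)], ks))
            pos += len(ks)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return bytes(out)
```

Since encryption and decryption are the same XOR, calling `pipelined_encrypt` twice with the same key returns the original input. The keystream loop cannot be split spatially, but the XOR stage can run one block behind it.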
The Pipelined Equation Solver (EQN): The pipelined equation solver (EQN) solves the same sort of problem as Red-Black, described above. Instead of restructuring the original algorithm with its intra-loop dependences, it is pipelined by executing its iterations concurrently. As an iteration produces semantically enough data, a new iteration is spawned on another processor, if available. This forms a linear chain of processors with each acting as a producer for the next one in the chain and as a consumer for the previous one. This form of pipelined computation leads to a few iterations toward the end being executed speculatively.
In our implementation, the processor that achieves convergence first signals the other processors to stop. This benchmark, like ARC4 above, is representative of classes of applications that are not parallelizable at all using SDM. Moreover, it demonstrates one way that SPPM can scale with the availability of more than two hardware threads.
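The chained producer/consumer arrangement of EQN can be sketched on a one-dimensional problem. Everything here is an assumption made for illustration: the averaging update stands in for the solver's actual stencil, each pipeline stage performs exactly one sweep, and convergence detection and speculative iterations are not modeled.

```python
import queue
import threading

def sweep_stage(inq, outq, n):
    """One in-place-style relaxation sweep over n streamed values (n >= 2)."""
    left_new = inq.get()             # x_old[0]; boundary value passes through
    outq.put(left_new)
    last_old = inq.get()             # x_old[1]
    for _ in range(1, n - 1):
        right_old = inq.get()        # x_old[i + 1] from the previous iteration
        left_new = 0.5 * (left_new + right_old)   # x_new[i]
        outq.put(left_new)           # available to the next iteration at once
        last_old = right_old
    outq.put(last_old)               # x_old[n - 1]; boundary value passes through

def pipelined_sweeps(x, num_sweeps):
    """Run num_sweeps iterations as a linear chain of stages."""
    n = len(x)
    queues = [queue.Queue() for _ in range(num_sweeps + 1)]
    stages = [
        threading.Thread(target=sweep_stage, args=(queues[k], queues[k + 1], n))
        for k in range(num_sweeps)
    ]
    for s in stages:
        s.start()
    for v in x:
        queues[0].put(v)
    result = [queues[-1].get() for _ in range(n)]
    for s in stages:
        s.join()
    return result

def sequential_sweeps(x, num_sweeps):
    """Reference: the same sweeps executed one after another."""
    x = list(x)
    for _ in range(num_sweeps):
        for i in range(1, len(x) - 1):
            x[i] = 0.5 * (x[i - 1] + x[i + 1])
    return x
```

Each stage can emit element i as soon as it has received element i + 1 from its predecessor, so later iterations start long before earlier ones finish while still producing the same values as the sequential sweeps.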
Our benchmarks were run on the following mix of CMP and SMT processor based systems. On the P4, Intel's C++ Compiler 7.0 was used, while GCC 4.0 was used on all the other systems. The benchmarks were compiled using the -O3 optimization flag on the P4, Opteron and Athlon and using the -fast optimization flag on the Xeon and Core Duo. The performance tool was compiled using the -O optimization flag on all the systems.
Architecture Evaluation Results
Our memory hierarchy communications performance tool (C2CBench) measures the performance of various levels of the memory hierarchy under different workload conditions. It can determine relative throughput and latency of accessing local and remote caches and memory with and without interference from other threads. C2CBench is based on the SPPM runtime system, which provides producer and consumer threads that perform specified operations (reads or writes) on the elements of a data set divided into blocks. Using appropriate values for the data set size and block size, one can test the performance of the L1 cache, L2 cache, and memory controllers for either local or remote accesses. The producer and consumer can be configured to work either in a lock-step manner or concurrently.
* When working in lock-step, the producer processes a block of data while the consumer waits for it to finish, thus introducing no interference. In turn, the producer waits for the consumer to finish processing that block before proceeding to the next one. This allows us to measure the baseline performance with no contention for access to the storage controllers.
* When working concurrently, the consumer processes a block of data previously processed by the producer while the producer is processing the next block of data. Working concurrently puts additional burden on storage controllers and allows us to measure the degradation in their performance due to contention.
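The two modes differ only in whether the producer waits for the consumer before moving on. The following sketch records the order of block operations under each mode; `run_pair` and its event log are hypothetical and stand in for the timed reads and writes that the real tool performs.

```python
import threading

def run_pair(num_blocks, concurrent):
    """Return the order in which producer (P) and consumer (C) touch blocks."""
    log = []
    log_lock = threading.Lock()
    produced = threading.Semaphore(0)   # blocks finished by the producer
    consumed = threading.Semaphore(0)   # blocks finished by the consumer

    def record(event):
        with log_lock:
            log.append(event)

    def producer():
        for b in range(num_blocks):
            record(("P", b))            # stands in for processing block b
            produced.release()
            if not concurrent:
                consumed.acquire()      # lock-step: wait for the consumer

    def consumer():
        for b in range(num_blocks):
            produced.acquire()          # block b is complete
            record(("C", b))            # stands in for processing block b
            if not concurrent:
                consumed.release()

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log
```

In lock-step mode the log strictly alternates between producer and consumer, so only one thread is active at a time. In concurrent mode the only guarantee is that each block is consumed after it is produced, which is what allows the two threads to contend for the same storage controllers.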
C2CBench also controls buffer placement on NUMA architectures by using CPU affinity and Linux's data placement policies, so the buffer can be placed in physical memory attached to a specified processor. This allows us to measure the effects of local and remote data placement.
Memory Interface Performance
Figure 4 shows the time in microseconds taken to read each cache line in a 32M block of data from memory. On all the architectures, performance decreases as contention increases. On the Opteron, however, the performance is the same for both 2 and 4 threads because each chip in the system has its own integrated memory controller. This test models the simultaneous accesses to the memory interface in applications parallelized using SDM. The figure clearly shows that memory-intensive SDM programs experience degraded performance due to excessive contention for the memory interface.
Memory and Private L2 Performance
Figure 5 [...]
L2 Cache Coherence Overhead
[...] to read from remote memory than from a cache on the other chip. Much of this is due to the fast on-chip memory controllers in these architectures, but the poor cache coherence performance is unexpected. Because SPPM uses the cache to communicate, its performance on these chips is worse than using the memory. Thus SDM actually works better, though it may be slower than sequential code. This test is a useful indicator of SPPM's performance on a target architecture, since many of the consumer's data accesses are satisfied from the producer's cache.
Figure 6. L2 Cache Coherence Overhead
L1 Cache Coherence Overhead
This test measures the cache coherence protocol overhead between L1 caches on shared L2 cache architectures (in our case, the Core Duo and Xeon). Figure 7 shows the processing times for each 64K block of a 256K data set read by the consumer over three iterations of the data set on the Core Duo and Xeon. The processing times of the individual blocks are cumulatively added across an entire iteration, which gives rise to the saw-tooth-like waveform. The tests are run with the producer and consumer running concurrently (CO) and in lock-step (LS). The size of the Level 1 caches on both architectures is 32K, and they apparently use a write-back policy. When executing in lock-step with the producer, the consumer always finds the first 32K of a 64K block in the shared Level 2 cache, to which it must have been evicted from the producer's cache to accommodate the last 32K of the block. The last 32K of the 64K block must, therefore, be read from the producer's Level 1 cache, apparently by first causing it to be written back to the shared Level 2 cache. As a result, reading the last 32K is more expensive than reading the first 32K, and this additional expense has to be incurred for every block read by the consumer. This explains the linear waveform corresponding to lock-step execution. However, when the producer and consumer execute concurrently, the producer does not wait for the consumer to finish reading every block, and keeps evicting newer blocks to the Level 2 cache even as the consumer is reading older blocks that were evicted before. In this case, all but the last 32K of the last block are evicted from the producer's Level 1 cache, and as can be seen from the figures, the consumer processes them very fast. But reading the last 64K block takes as much time as when executing in lock-step, which causes the spike seen in its waveform. Overall, the consumer performs better when executing concurrently than when executing in lock-step.
This test shows that extremely fine-grained data sharing between threads can harm performance, so it is important that the granularity be larger than the L1 cache size. We show this behavior in SPPM for the Red-Black benchmark in Section 6.
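The granularity argument can be made concrete with a deliberately simple model. It assumes that after the producer writes a block, only the most recently written portion that fits in its L1 cache is still held there, and it ignores associativity, other data in the cache, and timing. The function name and constants are hypothetical; the 32K L1 size and the rows of 1000 double-precision values are the figures quoted in this paper.

```python
KB = 1024
L1_BYTES = 32 * KB          # L1 size reported for the Core Duo and Xeon
ROW_BYTES = 1000 * 8        # one row of 1000 double-precision values

def fraction_in_producer_l1(block_bytes, l1_bytes=L1_BYTES):
    """Fraction of a just-written block still held in the producer's L1.

    Idealized model for illustration: the tail of the block that fits
    in L1 stays there, and the rest has been evicted to the next level.
    """
    if block_bytes <= 0 or l1_bytes <= 0:
        raise ValueError("sizes must be positive")
    return min(block_bytes, l1_bytes) / block_bytes

def fraction_for_rows(rows):
    """Same model with the block size given as a number of rows."""
    return fraction_in_producer_l1(rows * ROW_BYTES)
```

Under this model a 64K block leaves half of its data in the producer's L1, matching the lock-step observation above, while any block of 4 rows or fewer sits entirely in the producer's L1, so every consumer access would pay the L1 coherence cost. Larger blocks shrink that fraction.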
Programming Model Evaluation Results
We now present the performance measurements of the Red-Black, FDTD, ARC4, and EQN applications using SDM, SPPM, and Polymorphic Threads.
As we demonstrated in Figure 7, on shared-cache architectures with private L1 caches, there is an overhead involved in maintaining cache coherence when data is dirty in the L1 caches. This overhead penalizes applications with extremely fine-grained data sharing. In Figure 9, we vary this granularity for a problem size of 1000 x 1000 in the SPPM version of the Red-Black equation solver on the Xeon and the Core Duo. Granularity is measured in number of rows of the input array, where each row holds 1000 data elements (≈ 8KB of double-precision floating-point numbers). The size of the L1 cache on both the Xeon and Core Duo is 32KB. For granularities finer than 4 rows (≈ 32KB), the consumer incurs the high overhead of the L1 cache coherence protocol while fetching data from the producer's L1 cache, causing the overall execution time to increase. As the granularity is increased beyond 4 rows, fewer data elements are fetched from the producer's L1 cache by the consumer, so the execution time decreases and performance improves (more pronounced on the Xeon than on the Core Duo). [...]
Because SPPM makes better use of the cache, it performs better than SDM in almost all cases, but is particularly effective on shared cache architectures. Because FDTD accesses so many arrays per iteration, it is extremely data-intensive; thus the SDM version tends to run much more slowly than the sequential version on the P4. Despite the sharing of resources by HyperThreading on the P4, SPPM still manages to extract a significant performance improvement. SPPM also achieves nearly perfect speedup on the Core Duo and Xeon. SPPM-SC ekes out a performance advantage over SDM on the AMD chips, but is not near a perfect speedup, while SPPM-DC is as slow as the sequential version or worse.
PolyThreads, which reduces the cache-to-cache transfers, does provide better speedup on the Athlon and Opteron, though still not as good as on the Intel chips. It is slower than SPPM on the shared cache architectures because of its slight overhead.
[...] (key generator) does much more work than the consumer (XOR the streams). SPPM achieves significant speedup on shared cache architectures, but incurs a performance penalty on the ones with private caches due to the high cache coherence overhead. Even PolyThreads incurs a penalty on the AMD architectures. Because of the high memory bandwidth on these architectures, the sequential version performs roughly as well as PolyThreads. Moreover, the fact that the consumer does much less work than the producer naturally limits the speedup achievable in this specific case, as can be seen from the figure. Nevertheless, this algorithm represents a class of applications that are not parallelizable using SDM but can be run in parallel using SPPM and PolyThreads.
Figure 12 shows the normalized execution times of EQN on the target architectures for a varying number of threads. The execution times are normalized to the execution time using a single thread (1T).
Red-Black Equation Solver
Pipelined Equation Solver (EQN)
