As chip multiprocessors (CMPs) have become the mainstream in processor architectures, Intel and AMD have introduced their dual-core processors. In this paper, performance measurements on an Intel Core 2 Duo, an Intel Pentium D and an AMD Athlon 64 X2 processor are reported. According to the design specifications, key differences exist in the critical memory hierarchy architecture among these dual-core processors. In addition to the overall execution time and throughput measurements using both multiprogrammed and multithreaded workloads, this paper provides a detailed analysis of the memory hierarchy performance and of the performance scalability between single and dual cores. Our results indicate that for better performance and scalability, it is important to have (1) fast cache-to-cache communication, (2) large L2 or shared capacity, (3) fast L2 to core latency, and (4) fair cache resource sharing. The three dual-core processors that we studied show the benefits of some of these factors, but not all of them. Core 2 Duo has the best performance for most of the workloads because of its microarchitecture features such as the shared L2 cache. Pentium D shows the worst performance in many aspects because it is a technology remap of the Pentium 4 that does not take advantage of on-chip communication.
Introduction
Due to advances in circuit integration technology and performance limitations in wide-issue, super-speculative processors, chip-multiprocessor (CMP) or multi-core technology has become the mainstream in CPU designs. It embeds multiple processor cores into a single die to exploit thread-level parallelism and achieve higher overall chip-level instructions per cycle (IPC) [4, 10, 11, 31, 32]. Combined with increased clock frequency, a multi-core, multithreaded processor chip demands higher on- and off-chip memory bandwidth and suffers longer average memory access delays despite an increasing on-chip cache size. Tremendous pressure is put on memory hierarchy systems to supply the needed instructions and data in a timely manner.
In this paper, we report performance measurement results on three available dual-core desktop processors: Intel Core 2 Duo E6400 at 2.13 GHz [11], Intel Pentium D 830 at 3.0 GHz [15] and AMD Athlon 64 X2 4400+ at 2.2 GHz [2]. The Core 2 Duo E6400 was manufactured using 65 nm technology with 291 million transistors [11], while the Pentium D 830 and the Athlon 64 X2 4400+ were manufactured using 90 nm technology with about 230 million transistors [1, 25]. In contrast to existing performance studies [9, 23, 24] that usually provide overall execution time and throughput, this paper emphasizes the memory hierarchy performance. We measure memory access latency and bandwidth as well as cache-to-cache communication delays. We also examine the performance scalability between single and dual cores on the three tested processors.
There are several key design choices for the memory subsystem of the three processors. All three have private L1 caches with different sizes. At the next level, the Intel Core 2 Duo processor adopts a shared L2 cache design, called Intel Advanced Smart Cache, for the dual cores [13]. The shared L2 approach provides a larger cache capacity by eliminating data replication. It also permits natural sharing of cache space among multiple cores. When only one core is active, the entire shared L2 can be allocated to the single active core. However, the downside of the shared L2 cache is that it suffers longer hit latency and may encounter unfair usage of the shared capacity. Both the Intel Pentium D and the AMD Athlon 64 X2 have a private L2 cache for each core, enabling fast L2 accesses but preventing any capacity sharing between the two cores.
The shared L2 cache in the Core 2 Duo eliminates on-chip L2-level cache coherence. Furthermore, it resolves coherence of the two cores' L1 caches internally within the chip for fast access to the L1 cache of the other core [13]. The Pentium D uses an off-chip front-side bus (FSB) for inter-core communications. The Pentium D is basically a technology remap of the Pentium 4 symmetric multiprocessor (SMP), which requires access to the FSB for maintaining cache coherence [15]. AMD Athlon 64 X2 uses HyperTransport interconnect technology for faster inter-chip communication [2]. Given an additional ownership state in the Athlon 64 X2, cache coherence between the two cores can be accomplished without off-chip traffic. In addition, the Athlon 64 X2 has an on-die memory controller to reduce memory access latency.
It would be easier to compare memory performance for the three systems with a uniform measurement tool such as the Intel VTune analyzer [16]. However, VTune cannot run on the AMD Athlon 64 X2. Moreover, the performance counters on AMD provide fewer functions than Intel's sophisticated performance counters, so it is difficult to match the performance counters across the three processors. Therefore, we decided to use a memory microbenchmark suite, lmbench [30], to examine memory bandwidth and latency of the three processors. Lmbench attempts to measure the most commonly found performance bottlenecks in a wide range of system applications. These bottlenecks can be identified, isolated, and reproduced in a set of small microbenchmarks, which measure system latency and bandwidth of data movement among the processor, memory, network, file system, and disk. In addition, we ran STREAM [21] and STREAM2 [22], recreated using lmbench's timing harness. These kernel benchmarks measure memory bandwidth and latency using several common vector operations such as matrix addition and matrix copy.
To understand the data transfer among the individual cores' caches, we used a small lockless program [26]. This lockless program records the duration of ping-pong procedures of a small token bouncing between two caches to get the average cache-to-cache latency. Finally, we ran a set of single- and multi-threaded workloads on the three systems to examine the dual-core speedups over a single core. For single-thread programs, we experimented with a set of mixed SPEC CPU2000 and SPEC CPU2006 benchmarks [28]. For multi-threaded workloads, we selected blastp and hmmpfam from the BioPerf suite [6], SPECjbb2005 [29], as well as a subset of SPLASH2 [34]. Based on the experiment results, we can summarize a few interesting findings.
This paper is organized as follows. Section 2 briefly introduces the architectures of the three processors. Section 3 describes the methodology and the workloads of our experiments. Section 4 reports the detailed measurement results and the comparison between the three processors. Section 5 describes related work. Finally, we give a brief conclusion in Section 6.
Architectures of dual-core processors
The Intel Core 2 Duo E6400 (Fig. 1a) emphasizes cache efficiency rather than clock frequency in order to achieve high power efficiency. Although it is clocked at a slower rate than the Pentium D, its shorter and wider issue pipeline compensates with a higher IPC. In addition, the Core 2 Duo processor has more ALU units [9]. Core 2 Duo employs a shared L2 cache to increase the effective on-chip cache capacity. Upon a miss from the core's L1 cache, the shared L2 and the L1 of the other core are looked up in parallel before sending the request to the memory [14]. A cache block located in the other L1 cache can be fetched without off-chip traffic. Both the memory controller and the FSB are still located off-chip. The off-chip memory controller can adapt to new DRAM technology at the cost of longer memory access latency. Core 2 Duo employs aggressive memory dependence speculation for memory disambiguation: a load instruction is allowed to execute before an earlier store instruction with an unknown address. It also implements macro-fusion, which combines certain instruction pairs into a single micro-operation. Other important features include support for new SIMD instructions called Supplemental Streaming SIMD Extensions 3 (SSSE3), coupled with better power-saving technologies.
The Pentium D 830 (Fig. 1b) glues two Pentium 4 cores together and connects them with the memory controller through the north-bridge. The off-chip memory controller provides flexibility to support the newest DRAM at the cost of longer memory access latency. The MESI coherence protocol from the Pentium SMP is adopted in the Pentium D; it requires a memory update in order to change a modified block to a shared block. The system interconnect for the processors remains the front-side bus (FSB). To accommodate the memory update, the FSB is located off-chip, which increases the latency for maintaining cache coherence.
The Athlon 64 X2 is designed specifically for multiple cores in a single chip (Fig. 1c). Similar to the Pentium D processor, it employs private L2 caches. However, both L2 caches share a system request queue, which connects to an on-die memory controller and a HyperTransport link. HyperTransport removes system bottlenecks by reducing the number of buses required in a system. It provides significantly more bandwidth than current PCI technology [3]. The system request queue serves as an internal interconnection between the two cores without the involvement of an external bus. The Athlon 64 X2 processor employs the MOESI protocol, which adds an "Ownership" state to enable modified blocks to be shared by both cores without the need to keep the memory copy updated.
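The difference between the two coherence protocols described above can be illustrated with a minimal sketch. It models only what happens when the peer core reads a line that this core holds, under the simplifying assumption of exactly two caches. The type and function names (`line_state`, `snoop_result`, `remote_read`) are hypothetical and are not taken from any vendor documentation.

```c
#include <stdbool.h>

/* Cache-line states. OWNED exists only in MOESI. */
typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } line_state;
typedef enum { MESI, MOESI } protocol;

typedef struct {
    line_state holder_next;  /* new state in the cache that held the line */
    line_state reader_next;  /* state installed in the requesting cache   */
    bool memory_update;      /* true if main memory must be written       */
} snoop_result;

/* Hypothetical helper: outcome of a read by the peer core that hits a
 * line held in state `held` by this core (two-cache system assumed).
 * OWNED is not expected as an input under MESI. */
snoop_result remote_read(protocol p, line_state held)
{
    snoop_result r = { held, SHARED, false };
    switch (held) {
    case MODIFIED:
        if (p == MESI) {
            /* MESI: the dirty line is written back before it is shared. */
            r.holder_next = SHARED;
            r.memory_update = true;
        } else {
            /* MOESI: the holder keeps the dirty copy as its owner. */
            r.holder_next = OWNED;
        }
        break;
    case EXCLUSIVE:
        r.holder_next = SHARED;     /* clean copy, now shared */
        break;
    case INVALID:
        r.reader_next = EXCLUSIVE;  /* no peer copy, data comes from memory */
        break;
    case OWNED:
    case SHARED:
        break;                      /* holder state unchanged */
    }
    return r;
}
```

In this model, `remote_read(MESI, MODIFIED)` reports a memory update while `remote_read(MOESI, MODIFIED)` does not, which is the property the text attributes to the additional ownership state.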
An important technique for alleviating the cache miss penalty is data prefetching. According to the hardware specifications, the Intel Core 2 Duo includes a stride prefetcher on its L1 data cache [13] and a next-line prefetcher on its L2 cache [9]. The L2 prefetcher is triggered after detecting consecutive line requests twice. The Pentium D's hardware prefetcher allows stride-based prefetches beyond the adjacent lines. In addition, it attempts to trigger multiple prefetches to stay 256 bytes ahead of the current data access locations [12]. The advanced prefetching in the Pentium D enables more overlapping of cache misses. The Athlon 64 X2 has a next-line hardware prefetcher. However, accessing data in increments larger than 64 bytes may fail to trigger the hardware prefetcher [5].
Table 1 lists the specifications of the three processors examined in this paper. Hyper-threading is not enabled on any of these processors. The Intel Core 2 Duo E6400 has separate 32 kB L1 instruction and data caches per core. A 2 MB L2 cache is shared by the two cores. Both L1 and L2 caches are 8-way set associative and have 64-byte lines. The Pentium D processor has a Trace Cache that stores 12K micro-operations. It is also equipped with a write-through, 8-way 16 kB L1 data cache and a private 8-way 1 MB L2 cache. The Athlon 64 X2 processor's L1 data and instruction caches are 2-way 64 kB, with a private 16-way 1 MB L2 cache for each core. The Athlon 64 X2's L1 and L2 caches in each core are exclusive. All three machines have the same total L2 cache size and the same memory size. The Core 2 Duo and the Pentium D are equipped with DDR2 DRAM using advanced memory controllers in their chipsets. The Athlon 64 X2 has a DDR on-die memory controller. All three machines have 2 GB of memory. The FSB of the Core 2 Duo is clocked at 1066 MHz with bandwidth up to 8.5 GB/s. The FSB of the Pentium D operates at 
Evaluation methodology
We installed SUSE Linux 10.1 with kernel 2.6.16-smp on all three machines. We used GCC at the O3 optimization level to compile all the C/C++ benchmarks, including lmbench, SPEC CPU2000, SPEC CPU2006, SPLASH2, and blastp and hmmpfam from BioPerf. SPECjbb2005 was compiled using Sun JDK 1.5.0.
We ran the lmbench suite on the three machines to measure the bandwidth and latency of the memory hierarchy. Lmbench attempts to measure performance bottlenecks in a wide range of system applications. These bottlenecks have been identified, isolated, and reproduced in a set of small microbenchmarks, which measure system latency and bandwidth of data movement among the processor, memory, network, file system, and disk. In our experiments, we focus on the memory subsystem and measure bandwidth and latency with the various memory operations listed in Table 2 [30]. Among them, we ran variable-stride accesses to get the average memory latency. In addition, we ran multiple copies of lmbench, one on each core, to test the memory hierarchy system. We also ran STREAM [21] and STREAM2 [22], recreated using lmbench's timing harness. Each version has four common vector operations, as listed in Table 3. During execution, a 24 MB array stream was allocated. Each vector operation processed the array elements one by one. The average memory latency of each vector operation was reported, and the total data size processed in one second was reported as the operation bandwidth.
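As a rough illustration of how a vector kernel yields a bandwidth figure, the sketch below implements a triad-style operation and converts bytes moved and elapsed time into MB/s. Triad is one of the kernels in the public STREAM benchmark; whether Table 3 defines it in exactly this form is an assumption, and the function names `triad` and `bandwidth_mb_per_s` are hypothetical. Timing is left to the caller.

```c
#include <stddef.h>

/* Triad-style kernel: a[i] = b[i] + q * c[i].
 * Returns the number of bytes moved, counting two loads and one
 * store of a double per element. */
size_t triad(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + q * c[i];
    return 3 * n * sizeof(double);
}

/* Bandwidth in MB/s (10^6 bytes per second) for a measured interval. */
double bandwidth_mb_per_s(size_t bytes, double seconds)
{
    return (double)bytes / seconds / 1e6;
}
```

A caller would record a timestamp before and after `triad`, then pass the returned byte count and the elapsed seconds to `bandwidth_mb_per_s`. Arrays much larger than the L2 capacity, such as the 24 MB stream mentioned above, make the result reflect main memory rather than cache bandwidth.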
We measured the cache-to-cache latency using a small lockless program [26]. It does not employ expensive read-modify-write atomic instructions. Instead, it maintains a lockless counter for each thread. The C code of each thread is as follows:
    /* pPing and pPong are assumed to be volatile pointers to int counters */
    *pPong = 0;
    for (i = 0; i < NITER; ++i) {
        while (*pPing < i)
            ;                /* spin until the peer's counter catches up */
        *pPong = i + 1;      /* publish this thread's progress */
    }
Each thread increases its own counter pPong and keeps reading the peer's counter by checking pPing. The counter pPong is in a different cache line from the counter pPing. A counter pPong can be increased by one only after verifying the update of the peer's counter. This pure software synchronization code generates heavy read-write sharing between the two cores and produces a ping-pong procedure between the two caches, which tests how the processors handle heavy cache-to-cache traffic.
For multiprogrammed workloads, the cross-product of mixed SPEC CPU2000/2006 benchmarks was run on the three machines to examine the dual-core speedups over a single core. All the SPEC CPU2000/2006 programs were run with their respective ref inputs. In our experiments, when two programs were run together, we guaranteed that each program was repeated at least four times. The shorter program may run more than four iterations until the longer program completes its four full iterations. We discarded the results obtained in the first run and used the average execution time and other metrics from the remaining three runs. We calculated the dual-core speedup for multiprogrammed workloads similarly to the method used in [25]. Each program's running time was first collected individually. Afterwards, the average execution time of each program when run simultaneously with another was recorded. The speedup of each program is the ratio of its average run time when run individually (single core) to its average run time when run together with another program (dual core). Finally, we add the speedups of the two programs run together to obtain the dual-core speedup.
For example, if the speedups of two programs are 0.8 and 0.9 when run simultaneously, the respective dual-core speedup will be 1.7.
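The calculation described above can be written down as a short sketch. The helper names (`mean_discard_first`, `program_speedup`, `dual_core_speedup`) are hypothetical, and the run times used in the usage note are made-up values chosen only to reproduce the 0.8 and 0.9 of the example.

```c
/* Average of the measured run times, discarding the first (warm-up) run.
 * Returns 0.0 if fewer than two runs are supplied. */
double mean_discard_first(const double *times, int n)
{
    if (n < 2)
        return 0.0;
    double sum = 0.0;
    for (int i = 1; i < n; ++i)
        sum += times[i];
    return sum / (n - 1);
}

/* Speedup of one program: time when run alone over time when co-scheduled. */
double program_speedup(double t_alone, double t_together)
{
    return t_alone / t_together;
}

/* Dual-core speedup of a pair: sum of the two per-program speedups. */
double dual_core_speedup(double t1_alone, double t1_together,
                         double t2_alone, double t2_together)
{
    return program_speedup(t1_alone, t1_together)
         + program_speedup(t2_alone, t2_together);
}
```

For instance, if one program takes 80 s alone and 100 s when co-scheduled, and the other takes 90 s alone and 100 s when co-scheduled, `dual_core_speedup(80, 100, 90, 100)` gives 0.8 + 0.9 = 1.7. A value of 2.0 would mean the two programs do not slow each other down at all, and a value below 1.0 would mean running them together is slower than running them back to back on one core.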
We used the same procedure for homogeneous multithreaded workloads, including blastp and hmmpfam from the BioPerf suite, a subset of SPLASH2, and SPECjbb2005. The BioPerf suite contains emerging bioinformatics programs. SPLASH2 is a widely used scientific workload suite. SPECjbb2005 is a Java-based business database program. Table 4 lists the input parameters of the multithreaded workloads. We ran each of these workloads long enough to compensate for the overheads of the sequential portions of the workloads.
Measurement results

Memory bandwidth and latency
Lmbench
We first ran the bandwidth and latency test programs in the lmbench suite. Fig. 2 shows the memory bandwidth of several operations from lmbench. Fig. 2a, c and e show the data collected while running one copy of lmbench on the three machines, while Fig. 2b, d and f plot the bandwidth while running two copies of lmbench. The scale of the vertical axis of these three figures is doubled compared to their one-copy counterparts. We can observe that the memory bandwidth of the Pentium D and the Athlon 64 X2 is almost doubled for all operations. Core 2 Duo has increased bandwidth, but not doubled. This is because of the access contention when two lmbench copies compete for the shared cache. When the array size is larger than the 2 MB L2 cache size, the Athlon 64 X2 provides almost doubled bandwidth for the two-copy lmbench memory read operation compared with its one-copy counterpart. The Athlon 64 X2 benefits from its on-die memory controller and separate I/O HyperTransport. The Intel Core 2 Duo and Pentium D processors suffer FSB bandwidth saturation when the array size exceeds the L2 capacity. Note that the line libc bcopy unaligned coincides with libc bcopy aligned in Fig. 2f.
Next, we examine memory load latency for multiple stride sizes and for random access on all three machines. Fig. 3a, c configurations jump after the array size is larger than 1 MB. This relates to the L2 cache sizes of the measured machines. (2) As described in Section 2, when the hardware prefetchers on all machines are effective, the memory bus bottleneck is not exposed. When the stride size is equal to 128 bytes, the Pentium D still benefits partially from its hardware prefetcher, but the L2 prefetchers of the Core 2 Duo and the Athlon 64 X2 are not triggered. This leads to better performance for the Pentium D. (3) When the stride size is larger than 128 bytes, none of the hardware prefetchers takes effect. Multiple L2 cache misses put pressure on the memory buses. The Athlon 64 X2's on-die memory controller and separate I/O HyperTransport show their advantage. The Pentium D's memory latency has a large jump for these operations, but the Athlon 64 X2's latency remains almost unchanged.
(4) We increased the pressure on the memory hierarchy by running two copies of lmbench simultaneously, as shown in Fig. 3b, d and f. We found that the Core 2 Duo and the Athlon 64 X2 have a slight increase in latency for stride sizes larger than 128 bytes, while the Pentium D's latencies increase substantially. Core 2 Duo benefits from its shared cache, which generates lower external traffic, while the Athlon 64 X2 takes advantage of its on-chip memory controller and separate I/O HyperTransport. The Pentium D's latencies, however, jump because of memory bus saturation.
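The strided load-latency measurements discussed above rest on dependent loads, where each load's address comes from the previous load so that misses cannot overlap. The following is a minimal sketch of that idea using array indices. It is not lmbench's actual implementation, and the names `build_chain` and `chase` are hypothetical.

```c
#include <stddef.h>

/* Build a circular chain in which slot i holds the index of slot
 * i + stride_slots, wrapping at n_slots. The byte stride is
 * stride_slots * sizeof(size_t). */
void build_chain(size_t *chain, size_t n_slots, size_t stride_slots)
{
    for (size_t i = 0; i < n_slots; ++i)
        chain[i] = (i + stride_slots) % n_slots;
}

/* Follow the chain for `steps` dependent loads and return the final
 * index. Returning the index keeps the loads from being optimized away. */
size_t chase(const size_t *chain, size_t start, size_t steps)
{
    size_t idx = start;
    while (steps--)
        idx = chain[idx];
    return idx;
}
```

Dividing the elapsed time of a long `chase` call by `steps` gives an average latency per load. With a stride of at most one cache line, a next-line prefetcher can hide most misses. With larger strides and an array that exceeds the L2 capacity, the figure approaches the raw memory latency, which is the regime where the text reports the largest differences among the three machines.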
STREAM/STREAM2
We also ran eight kernel operations from STREAM and STREAM2. Fig. 4a shows the memory bandwidth when running a single copy of each operation. From this figure, we can see that the Intel Core 2 Duo shows the best bandwidth for all operations because of its L1 data prefetchers and its faster front-side bus. The Pentium D has better bandwidth than the Athlon 64 X2. This is again because the Pentium D system is equipped with better DRAM than the Athlon 64 X2 system. Fig. 4b depicts the memory bandwidth when running two copies of each operation in STREAM/STREAM2, one on each core. From this figure, we can see that the Core 2 Duo and the Athlon 64 X2 have better bandwidth than the Pentium D. This is because the Pentium D's FSB is saturated when running two copies of each operation. The Athlon 64 X2 benefits from its on-die memory controller and separate HyperTransport for I/O, although its main memory DDR bandwidth is worse than that of the Pentium D. Core 2 Duo benefits from its L1 data prefetchers and its faster FSB. Fig. 4c and d show the memory latencies for the three machines. Similar to the bandwidth figures, the memory latencies of the Core 2 Duo and the Pentium D are shorter than that of the Athlon 64 X2 when a single copy of the STREAM/STREAM2 benchmark is running. Apparently, the shorter latency from the on-die memory controller does not pay off in comparison with an off-die controller with better DRAM technology. However, while running the two-copy version, the memory latency of the Pentium D is higher than that of the other two.
Multiprogrammed workload measurements
We measured the execution time of a subset of the SPEC CPU2000 and CPU2006 benchmarks. In Fig. 5a and c, the Core 2 Duo processor runs fastest for almost all workloads executed alone, especially for the memory-intensive workloads art and mcf. Core 2 Duo has a wider pipeline, more functional units, and a shared L2 cache that provides a bigger cache for a single thread running alone. The Athlon 64 X2 shows the best performance for ammp. ammp has a large working set, resulting in a high number of L2 cache misses, so the Athlon 64 X2 benefits from its on-chip memory controller. Fig. 5b and d depict the average execution time of each workload when mixed with another program in the same SPEC suite. Execution time increases for every workload. For the memory-bound programs art, mcf and ammp, the increase is large, while CPU-bound workloads such as crafty, mesa, perl and sjeng show little impact.
The multi-programmed speedup of the cross-product of mixed SPEC CPU2000 and CPU2006 programs for the three machines is given in Fig. 6, where C2D, PNT and ATH denote the measured Core 2 Duo, Pentium D, and Athlon 64 X2, respectively. We can see that the Athlon 64 X2 achieves the best speedup for all workloads. Crafty, eon and mesa in CPU2000 and perl in CPU2006 have the best speedup when run simultaneously with other programs because they are CPU-bound programs. On the other hand, in most cases, art shows the worst speedup because it is a memory-bound program. Its intensive L2 cache misses occupy the shared memory bus and block the other program's execution. In the extreme case, when an instance of art was run against another art, the speedups were 0.82, 1.11 and 1.36 for the Core 2 Duo, Pentium D and Athlon 64 X2, respectively. The other memory-bound programs, ammp and mcf, present similar behaviors.
Comparing the three machines, the multi-programmed Athlon 64 X2 outperforms the Core 2 Duo and the Pentium D for almost all workload mixes. It is interesting to note that even though the Core 2 Duo has better running times than the other two machines, its overall speedup is lower. The reason, again, is its shared L2 cache, which boosts single-core performance.
[Figure: per-benchmark results for C2D, PNT and ATH on the SPEC CPU2000 programs ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, parser and perl, and on SPEC CPU2006 programs including astar, bzip, gcc, h264ref, hmmer and libquantum.]
Multi-threaded program behaviors
We used the lockless program to measure the dual-core cache-to-cache latency. The average cache-to-cache latencies differ significantly among the Core 2 Duo, Pentium D, and Athlon 64 X2, at 33 ns, 133 ns and 68 ns, respectively. This is again because the Core 2 Duo resolves L1 cache coherence within the chip, while the Pentium D requires the external FSB for cache-to-cache transfers. The Athlon 64 X2 uses its on-chip system request interface and the MOESI protocol for cache-to-cache communication.
Next, we experimented with blastp and hmmpfam from the BioPerf suite and a set of the SPLASH2 workloads. Fig. 7a and b illustrate the execution time of the single-thread version of the programs and the speedup when running the 2-thread version. In general, the Core 2 Duo and the Athlon 64 X2 do not show performance advantages over the Pentium D on bioinformatics and scientific workloads because of the limited data communication between the two cores. Similar results were also reported for multimedia programs [9]. Among all applications, Core 2 Duo shows the best speedup over the other processors for ocean because of ocean's high cache-to-cache transfers [34]. We verified this behavior using Intel's VTune Performance Analyzer 8.0 [16]. Fig. 8 illustrates the average number of CMP_SNOOP.ANY events, which represent remote cache accesses, per 1 K retired instructions on the Core 2 Duo. Among all workloads, ocean has the highest number of remote cache accesses per 1 K retired instructions. The Pentium D shows the best speedup for barnes because of the low cache miss rate [34]. Recall that the Pentium D processor also has the best memory read bandwidth when the array size is small. Bioinformatics workloads have high speedups on all three machines because of their small working sets [6].
Workloads with high data sharing, such as SPECjbb2005, also benefit from fast cache-to-cache latency. Fig. 9 shows the transactions per second (TPS) throughput with one (denoted by 1w-1c) and two (denoted by 2w-1c) warehouses of SPECjbb2005 on the three systems. For a fair speedup comparison, we also ran two copies of a single warehouse on two cores (denoted by 1w-2c). These two copies on the two cores compete for the shared L2 cache, so the Core 2 Duo loses its unique advantage of giving the entire L2 capacity to a single warehouse. As can be observed, the Core 2 Duo shows the worst performance degradation (about 20%) from 1w-1c to 1w-2c. Using 1w-2c as the basis, the 2w-1c speedups for the respective three systems are 1.97, 1.80, and 1.87, where the Core 2 Duo is the winner.
Related work
The emergence of Intel and AMD dual-core processors has intrigued hardware analysts. There are many online reports that compare the performance of processors from both companies [9, 23, 24]. Most of them simply present performance metrics such as running time and throughput without detailed analysis. In this paper, we focus on analyzing the memory hierarchy performance and understanding the underlying reasons. Chip multiprocessor (CMP) or multi-core technology was first reported in [10]. Companies such as IBM and Sun applied it to their server processors [18, 31, 32]. In 2005, Intel announced that it would shelve its plan to pursue higher frequencies and instead switch to building multi-core processors [15]. AMD made the same decision at about the same time [4].
Tuck and Tullsen [33] studied thread interactions on an Intel Pentium 4 hyper-threading processor. They used multi-programmed and multi-threaded workloads to measure speedup as well as synchronization and communication throughput. Bulpin and Pratt [7] measured an SMT processor with consideration of fairness between threads. They also showed the performance gap between SMP and hyper-threaded SMT for multi-programmed workloads.
In [20], we presented a case study on the memory performance and scalability of the selected processors. In this journal version, we provide more detailed results and analysis.
Several recent proposals study the issues of CMP shared cache fairness and partitioning. In [19], the authors proposed and evaluated five different fairness metrics, such as shared cache miss rates, that can be correlated with execution time, and proposed static and dynamic cache partitioning algorithms that optimize fairness. The dynamic algorithm can help operating system thread scheduling and avoid thread thrashing. Other works proposed an OS-driven policy [27], a cache management framework (CQoS) [17] and prediction models [8] for inter-thread cache contention in a shared CMP cache.
Conclusion
In this paper, we analyzed the memory hierarchy of selected Intel and AMD dual-core processors. We first measured the memory bandwidth and latency of the Core 2 Duo, Pentium D and Athlon 64 X2 using lmbench. In general, the Core 2 Duo and the Athlon 64 X2 have better memory bandwidth than the Pentium D.
We measured individual execution time of SPEC CPU2000 and CPU2006. We also measured the average execution time of each application when mixed with other programs on the dual cores. In general, Core 2 Duo runs fastest for all single and mixed applications except for ammp. We also observed that memory intensive workloads such as art, mcf and ammp have worse speedups. We measured the cache-to-cache latencies. Core 2 Duo has the shortest, while Pentium D has the longest. This generic memory performance behavior is consistent with the performance measurement results of multi-threaded workloads with heavy data sharing between the two cores.
The Core 2 Duo, with its shared L2, demonstrates distinct advantages when running a single program on one core. However, managing the shared cache resource efficiently is a challenge, especially when the two cores have very different cache demands. In summary, for the best performance and scalability, the following factors are important: (1) fast cache-to-cache communication, (2) large L2 or shared capacity, (3) fast L2 access latency, and (4) fair resource (cache) sharing. The three processors that we studied show the benefits of some of these factors, but not all of them.
