Commercial workloads are an important class of applications for multiprocessor servers, and their simulation is essential for computer architects to evaluate future server designs. However, simulating expensive servers running these large workloads on low-cost personal computers presents many challenges. The workloads must be scaled down and tuned to fit within our simulation environment. Simulation time must be made tractable, since simulations are much slower than native machine execution. Simulators should model a sufficient level of timing detail to allow evaluating research ideas.
Introduction
In this Internet era, database management systems and web servers form an integral part of the business and communications infrastructure. These and other commercial applications store, provide access to, and manipulate critical personal and business data. The more people depend on these applications, the more important it becomes to run them reliably and efficiently. The goal of our group at the University of Wisconsin is to develop innovative approaches for improving the performance, efficiency, and reliability of the multiprocessor servers used to run these important commercial applications.
Computer hardware designers and researchers often use execution-driven simulation to evaluate design alternatives and research ideas. By simulating the hardware, execution-driven simulation captures actual program behavior and detailed system interactions. Simulation is more flexible and less expensive than hardware prototypes and models important system details more accurately than analytic models.
The combination of demanding workloads and large systems is especially difficult to simulate on the inexpensive, less powerful machines available to most researchers. Commercial workloads, unlike simpler workloads, rely heavily on operating system services, such as input/output, process scheduling, and inter-process communication. Therefore, only simulators that model these services will run such workloads correctly. Multiprocessor servers introduce the additional challenges of accurately modeling the interactions between multiple processors, large main memories, and many disks. Computer architecture researchers must therefore balance three goals to make effective use of their limited resources: (1) developing a representative approximation of these large workloads, (2) achieving tractable simulation times, and (3) simulating a sufficient level of timing detail.
To achieve these goals, we developed a simulation methodology based on multiple simulations and careful attention to the effects of scaling on workload behavior. We use Simics [4] from Virtutech AB, a full-system functional simulator, and we extend it with detailed timing models. Our workloads, the Wisconsin Commercial Workload Suite, currently consist of four benchmarks that approximate several important classes of commercial applications. We scaled these workloads down (in both size and time) to allow simulations on a $2K PC.
Workload Scaling and Tuning
Many multiprocessor commercial workloads are too large to simulate on current PCs. Many are also extremely sensitive to workload tuning parameters. The Wisconsin Commercial Workload Suite (see sidebar) is a set of benchmarks that have been scaled down to a size suitable for simulation, and tuned to approximate real workloads. This suite consists of an On-Line Transaction Processing (OLTP) workload, a Java middleware workload, a static web content server workload, and a dynamic web content server workload. We scaled down the workloads to reduce memory and disk usage because our simulation host machines are much less powerful than current servers. In our case, we were limited to PCs that each had 1 GB of RAM, a single disk, and a 32-bit virtual address space. We tuned all of our workloads on a real multiprocessor server to discover and remove performance bottlenecks. The following case study describes the development of our OLTP workload, which highlights the importance of tuning commercial workloads.
Tuning improved our OLTP workload performance by a factor of 12.
Sidebar: Wisconsin Commercial Workload Suite
Online Transaction Processing (OLTP): DB2 with a TPC-C-like workload. The TPC-C benchmark is widely used to evaluate system performance for the on-line transaction processing market. The benchmark itself is a specification that describes the schema, scaling rules, transaction types, and transaction mix, but not the exact implementation of the database. TPC-C defines five transaction types, all related to an order-processing environment. Performance is measured by the number of "New Order" transactions performed per minute (tpmC), subject to certain constraints.
Our OLTP workload is based on the TPC-C v3.0 benchmark. We use IBM's DB2 V7.2 EEE database management system and an IBM benchmark kit to build the database and emulate users. We build an 800 MB 4000-warehouse database on five raw disks and an additional dedicated database log disk. We scaled down the size of each warehouse by maintaining the reduced ratios of 3 sales districts per warehouse, 30 customers per district, and 100 items per warehouse (compared to 10, 30,000 and 100,000 required by the TPC-C specification). Each user randomly executes transactions according to the TPC-C transaction mix specifications, and we set the think and keying times for users to zero. A different database thread is started for each user. We measure all completed transactions, even those that do not satisfy timing constraints of the TPC-C benchmark specification.
Java Server Workload: SPECjbb. Java-based middleware applications are increasingly used in modern e-business settings. SPECjbb is a Java benchmark emulating a 3-tier system with emphasis on the middle-tier server business logic. SPECjbb runs in a single Java Virtual Machine (JVM) in which threads represent terminals in a warehouse. Each thread independently generates random input (tier 1 emulation) before calling transaction-specific business logic. The business logic operates on the data held in binary trees of Java objects (tier 3 emulation). The specification states that the benchmark does no disk or network I/O.
We used Sun's HotSpot 1.4.0 Server JVM and Solaris's native thread implementation. The benchmark includes driver threads to generate transactions. We set the system heap size to 1.8 GB and the new object heap size to 256 MB to reduce the frequency of garbage collection. Our experiments used 24 warehouses, with a data size of approximately 500 MB.
Static Web Content Serving: Apache. Web servers such as Apache represent an important enterprise server application. Apache is a popular open-source web server used in many internet/intranet settings. In this benchmark, we focus on static web content serving.
Case Study: On-Line Transaction Processing. OLTP is one of the most commercially important applications for multiprocessor servers today. OLTP systems form the core of the business computing infrastructure in banking, airline reservations, online stores, and other industries. In large businesses, these systems often process hundreds of thousands of transactions per minute. This high throughput is provided by multiprocessor systems (or clusters of systems) that cost millions of dollars. The primary benchmark to compare the performance and cost of OLTP systems is the Transaction Processing Performance Council's TPC-C benchmark [6]. The published TPC-C results demonstrate the large scale of many commercial workloads. For example, a current non-cluster TPC-C performance leader is a database server with 128 processors (each with 8 megabytes of cache), 256 gigabytes of RAM, and 29 terabytes of disk storage on 1,627 disks [6]. The clients emulated nearly 400,000 users placing orders at about 40,000 warehouses. The system completed over 100 million transactions during the 25-minute warm-up and two-hour measurement periods. The total hardware and software cost of the system is more than $13 million.
Millions of dollars may be a reasonable price for a computer system that runs the core business application of a major company, but it is currently unrealistic for a single research group. Our objective for OLTP was to develop a workload that captures the important characteristics of real-world OLTP systems, but is small enough to use in our simulations.
We started with the TPC-C benchmark specification, IBM's DB2 database management system, and a TPC-C benchmark kit provided by IBM. We set up, scaled down, and tuned the workload on an actual multiprocessor (a Sun E5000 with twelve 167 MHz processors and 2 GB of memory), and we then moved exact disk images of the workload into our simulation environment. Using a real machine allows long measurement intervals, makes benchmark setup and tuning much faster than in simulation, and provides data that we can use to validate simulation results.
Initial scaling. Ideally, we would like to explore systems with large numbers of disks and large database sizes. However, simulating such systems is currently infeasible within our simulation infrastructure. Therefore, we reduced the size of the database to 1 GB so that the whole database would fit in the memory of our real system and simulation target machine. TPC-C models the database activity of a wholesale supplier with a number of geographically distributed sales districts and associated warehouses. The benchmark specifications state that the size of the database should be set by the number of warehouses, keeping the relative sizes of the other tables the same. Since the approximate size of all data associated with one warehouse is 100 MB, we created a 10-warehouse database on a single disk (plus an additional log disk). However, we measured a much lower throughput (in terms of transactions per minute) than expected from similar systems.
Raw device access and other parameter tuning. Next, we tuned several kernel and database configuration parameters (e.g., kernel limits on the number of shared memory segments and semaphores, and database limits on threads and locks). We also reconstructed the database on a raw database-managed disk, since using normal operating system files for database tables increases database overhead and results in buffering of the data in both the operating system file cache and the database buffer pool. Together, these two changes improved performance by 144%.
Multiple disks. Analysis using operating system profiling tools showed that the database disk was now the likely bottleneck. Although the database data was sized to fit in memory, the frequent updates in TPC-C caused substantial disk write traffic. To alleviate this problem, we partitioned the 1 GB database across five raw disks. This change removed the I/O bottleneck, further improving performance by 81%.
Table contention reduction. Although we had dramatically increased performance, the operating system profiling tools still showed a large amount of system idle time. We discovered that the database system was serializing transactions that read and wrote the same entries in the small "warehouse" table, limiting system throughput. To eliminate this bottleneck, we deviated from the standard TPC-C scaling requirements by increasing the number of warehouses without increasing the total database size (see the "Wisconsin Commercial Workload Suite" sidebar). This change resulted in a database with 4000 warehouses and led to another 111% improvement in performance.
Additional concurrency. Our initial setup used 24 client emulators, which meant that we had, on average, two database threads running on each of our twelve processors. Although we eliminated think and keying times for the emulated clients to reduce client overheads, there was not enough concurrency in the system to hide the computational and I/O latency. We increased the number of emulated user threads to eight per processor (96 total), which provided an additional improvement of 29%.
OLTP tuning summary. The remarkable improvement in OLTP performance shows that tuning commercial workloads is essential for obtaining representative workloads. Figure 1 plots the normalized throughput of our OLTP workload at each of our five stages of tuning. It shows that our final configuration has a throughput twelve times that of our original setup. After careful tuning, the throughput of our OLTP workload is close to published TPC-C results for similar hardware. More importantly, our workload is far more representative of a real OLTP system.
Figure 1. Transaction throughput for each of our tuning attempts, normalized to our initial setup.
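The twelvefold figure can be checked by compounding the four per-step improvements reported in the case study (144%, 81%, 111%, and 29%). The short Python sketch below does only that arithmetic; the function and variable names are illustrative and not part of the original study.

```python
# Per-step throughput improvements reported in the case study, in percent,
# each measured relative to the configuration that preceded it.
STEP_IMPROVEMENTS = [
    ("raw device access and parameter tuning", 144),
    ("multiple disks", 81),
    ("table contention reduction", 111),
    ("additional concurrency", 29),
]


def cumulative_speedup(steps):
    """Compound successive percentage improvements into one overall factor."""
    factor = 1.0
    for _name, percent in steps:
        factor *= 1.0 + percent / 100.0
    return factor


if __name__ == "__main__":
    total = cumulative_speedup(STEP_IMPROVEMENTS)
    print(f"overall speedup over the initial setup: {total:.2f}x")
```

Because each percentage is relative to the previous stage, the factors multiply rather than add, which is why four moderate gains compound to roughly a factor of 12.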
Workload Runtime and Variability
Simulation is orders of magnitude slower than real system execution. However, benchmarking commercial workloads often involves running for long warm-up and measurement intervals to avoid cold-start and transient effects. The combined effect of these two factors presents a challenge for commercial workload simulation. For example, we observed an approximately 24,000x slowdown factor when simulating a 16-processor system with our detailed timing model. At this rate, simulating the minimum required measurement interval for the TPC-C benchmark (two hours) would take more than five years, and simulating even one minute would take weeks.
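The simulation-time estimates above follow directly from the slowdown factor. As a quick check, the sketch below converts target (simulated) time into host wall-clock time; the helper name is hypothetical, and the only input taken from the text is the approximate 24,000x slowdown.

```python
SLOWDOWN = 24_000  # approximate slowdown reported for the 16-processor timing model

SECONDS_PER_WEEK = 7 * 24 * 3600
SECONDS_PER_YEAR = 365 * 24 * 3600


def host_seconds(target_seconds, slowdown=SLOWDOWN):
    """Host wall-clock seconds needed to simulate `target_seconds` of target time."""
    return target_seconds * slowdown


if __name__ == "__main__":
    two_hours = host_seconds(2 * 60 * 60)
    one_minute = host_seconds(60)
    print(f"two target hours  -> {two_hours / SECONDS_PER_YEAR:.1f} host years")
    print(f"one target minute -> {one_minute / SECONDS_PER_WEEK:.1f} host weeks")
```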
In order to make evaluating commercial workloads practical, we needed to scale down these long intervals and to develop an economical methodology for dealing with cold-start and transient effects. Our methodology consists of three parts. First, we avoid the warm-up overhead by starting from an already warm workload setup. Second, we sample a small portion of the workload by counting transactions. Third, we handle the variability due to measuring short intervals by averaging measurements from multiple runs.
Starting With Warm Workloads
To avoid the need to simulate the startup and warm-up phases of our commercial workloads (e.g., starting the transaction request generators, creating database processes, warming up the buffer pool or page cache), we use Simics's ability to save a checkpoint (snapshot) of the architected state of the simulated system. Checkpoints include the state of all processors, memory, devices, and disks. For each of our workloads, we simulate a reasonable warm-up period and create a checkpoint that records the state of a warm system. Starting our timing simulations with a warm system reduces the simulation time required and also mitigates cold-start effects.
A Fixed-Transaction-Count Simulation Methodology
Since we are unable to simulate benchmarks from start to finish, we need to limit the length of our measurement interval to keep simulation time reasonable. A standard approach to measure performance on partial benchmark runs for simpler workloads (e.g., the SPECcpu2000 benchmarks) is to record the number of cycles required to execute a fixed number of instructions. The resulting metric, IPC (Instructions Per Cycle), corresponds exactly to performance for user-mode, single-threaded, uniprocessor simulations. Unfortunately, IPC does not correspond to throughput on multiprocessors. For example, spending more time in the operating system's idle loop (or waiting in a loop to acquire a lock), can actually improve the IPC of an application, while reducing its throughput. Therefore, applying this same approach to multiprocessor commercial workloads is inappropriate and can lead to incorrect conclusions about workload performance.
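To make the IPC pitfall concrete, the sketch below compares IPC with cycles per transaction for two executions. All numbers are hypothetical and chosen only to illustrate the effect: the second execution adds a lock-spinning phase that retires many instructions quickly but completes no transactions.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One portion of an execution: instructions retired, cycles taken, work done."""
    instructions: int
    cycles: int
    transactions: int


def ipc(phases):
    """Instructions per cycle over all phases."""
    return sum(p.instructions for p in phases) / sum(p.cycles for p in phases)


def cycles_per_transaction(phases):
    """Inverse throughput over all phases (lower is better)."""
    return sum(p.cycles for p in phases) / sum(p.transactions for p in phases)


# Hypothetical numbers, not measurements from the study.
useful_work = Phase(instructions=1_000_000, cycles=2_000_000, transactions=100)
lock_spinning = Phase(instructions=1_800_000, cycles=1_000_000, transactions=0)

baseline = [useful_work]
contended = [useful_work, lock_spinning]

if __name__ == "__main__":
    for name, run in (("baseline", baseline), ("contended", contended)):
        print(f"{name}: IPC={ipc(run):.2f}, "
              f"cycles/transaction={cycles_per_transaction(run):.0f}")
```

In this example the contended execution has the higher IPC yet needs more cycles per transaction, so ranking the two by IPC would give the opposite answer from ranking them by throughput.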
Instead, we measure the time required to finish a certain number of transactions of a benchmark [1]. We use the number of cycles per transaction as an inverse-throughput metric (footnote 1) to compare the throughput of different configurations.
Since our commercial workloads are all throughput-oriented, the transaction (or request) concept existed in all of these workloads and was readily applicable. We modified the transaction generators for each of our workloads to alert the simulator (using a special instruction with no side effects) whenever a transaction completes. During simulation measurement experiments, the simulator counts the number of transactions completed and stops when it reaches the desired count.
Footnote 1. We use cycles per transaction (instead of transactions per cycle) to allow for easy comparison with non-transaction-based workloads, in which we consider the entire computation a single "transaction".
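The fixed-transaction-count idea can be sketched as a small measurement loop. This is a toy model under stated assumptions, not the Simics interface or our actual instrumentation: `TransactionCounter`, `run_measurement`, and the made-up per-transaction cycle costs are all hypothetical stand-ins for a simulator that is notified by a marker instruction when a transaction completes.

```python
import random


class TransactionCounter:
    """Counts transaction-completion markers and reports when the target is reached."""

    def __init__(self, target_transactions):
        self.target = target_transactions
        self.completed = 0

    def on_marker(self):
        """Called when the workload executes its 'transaction done' marker.

        Returns True once the desired number of transactions has completed.
        """
        self.completed += 1
        return self.completed >= self.target


def run_measurement(target_transactions, seed=0):
    """Advance a toy simulation until `target_transactions` have completed.

    Returns cycles per transaction for the measured interval.
    """
    rng = random.Random(seed)
    counter = TransactionCounter(target_transactions)
    cycles = 0
    done = False
    while not done:
        # Stand-in for simulating one transaction: a made-up cycle cost.
        cycles += rng.randint(80_000, 120_000)
        done = counter.on_marker()
    return cycles / counter.completed


if __name__ == "__main__":
    print(f"cycles per transaction: {run_measurement(50):.0f}")
```

The point of the design is that the stopping condition is expressed in units of completed work rather than instructions or cycles, so time spent idling or spinning lengthens the measured interval instead of being counted as progress.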
Variability of Short Simulations
Since we are practically limited to short simulation runs, the measured workload throughput becomes more dependent on the execution path of the workload, or the exact sequence of instructions executed during the simulation. For example, execution paths differ due to different orders of thread interleaving caused by operating system scheduling decisions and different orders of lock acquisition. This increases the effect of an important phenomenon for short simulation runs, namely variability in workload timing results. Variability refers to the differences between multiple estimates of a workload's performance [2] . Accounting for variability is critical to researchers who evaluate their architectural innovations by comparing the performance of their enhanced designs relative to a base configuration, since a performance difference due to workload variability might otherwise be attributed to a real difference in the relative performance of the enhanced and the base systems.
Variability in many commercial workloads is large enough to affect research conclusions. To solve this problem, we obtain multiple performance estimates by introducing an artificial source of variability (adding a small random delay to each memory access). The average memory latency is the same for all simulations, but each simulation will follow a different execution path by using different random seeds. The simulation results presented in Figure 3 illustrate the risk of using single simulation runs. The average performance for all twenty runs confirms the intuitive conclusion that OLTP performs better on the 4-way set associative L2 cache configuration.
However, if we performed a single experiment for each configuration, we might conclude that the 2-way set associative configuration performs better (e.g., when comparing the minimum runtime of the 2-way configuration with the maximum of the 4-way configuration). If we randomly select one run from each configuration, there is a 31% chance of drawing the wrong conclusion.
Figure caption (truncated): ... Figure 2). Each data point represents the number of cycles per transaction of one execution path. This figure demonstrates that different execution paths occur even for the same workload and system configuration, and can lead to a wrong conclusion.
This experiment shows that we cannot rely upon a single short simulation run to obtain correct conclusions in comparison experiments. We handle variability by using multiple simulations for each configuration. We use the average simulated runtime (or cycles per transaction) to represent the workload's performance, and we use the standard deviation to establish confidence intervals. This approach greatly reduces the probability of reaching a wrong conclusion compared to single-run experiments, at the expense of increased total simulation runtime. However, if multiple simulation hosts are available, running multiple short simulations in parallel is preferable to running one long simulation.
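A minimal sketch of this multiple-run analysis follows. The sample values are hypothetical (they are not the data behind our figures), the function names are illustrative, and the confidence interval uses a simple normal approximation as an assumption; for small numbers of runs a Student's t interval would be wider.

```python
import math
import statistics
from itertools import product


def summarize(runs):
    """Return (mean, sample standard deviation) of cycles-per-transaction samples."""
    return statistics.mean(runs), statistics.stdev(runs)


def confidence_interval(runs, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    mean, stdev = summarize(runs)
    half_width = z * stdev / math.sqrt(len(runs))
    return mean - half_width, mean + half_width


def wrong_conclusion_probability(runs_a, runs_b):
    """Fraction of single-run pairings whose ordering contradicts the ordering of the means.

    Lower cycles per transaction is better. A tie between two runs counts as
    wrong when configuration A is better on average.
    """
    a_better_on_average = statistics.mean(runs_a) < statistics.mean(runs_b)
    pairs = list(product(runs_a, runs_b))
    wrong = sum(1 for a, b in pairs if (a < b) != a_better_on_average)
    return wrong / len(pairs)


# Hypothetical cycles-per-transaction samples (arbitrary units, lower is better).
four_way = [100, 104, 98, 110, 102]
two_way = [105, 99, 112, 108, 103]

if __name__ == "__main__":
    print("4-way mean, stdev:", summarize(four_way))
    print("2-way mean, stdev:", summarize(two_way))
    print("4-way 95% CI:", confidence_interval(four_way))
    print("P(wrong conclusion from one run each):",
          wrong_conclusion_probability(four_way, two_way))
```

Enumerating every pairing of one run from each configuration gives the chance that a single-run comparison picks the configuration that is worse on average, which is the kind of risk the 31% figure above quantifies for our OLTP experiment.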
We also developed a more sophisticated statistical methodology that helps achieve reasonable simulation time limits while, at the same time, reducing the probability of reaching a wrong research conclusion [2] .
Timing Simulation of Commercial Workloads
Simulating commercial workloads running on a multiprocessor is more complicated than simulating single-threaded, user-level benchmarks running on a uniprocessor. In order to run unmodified commercial applications, a simulator must implement all the instructions in the simulated architecture and model all supported devices. Simulating multiprocessor servers requires coordinating events from multiple processors and a more detailed memory system simulation.
To manage this complexity, we leverage Simics, a functional simulator that can execute unmodified operating systems (e.g., Solaris). Simics allows us to study these complex workloads and servers without implementing a complete simulator from scratch. Instead, we can focus our development efforts on writing timing models for those parts of the system that concern our research. We extend Simics with two timing models, a memory simulator and a detailed processor simulator. Our memory simulator implements a two-level cache hierarchy, a cache controller, an interconnection network, and (optionally) a directory controller. The "Full-system Simulation with Detailed Processor Timing" sidebar describes our processor model and provides an example of the way we combine the complete functional simulation of Simics with a specialized timing model.
To reduce complexity, our timing models approximate some details of target systems, but we attempt to capture those aspects that have significant effects on system timing. For example, our memory system simulator models the states and transitions for different cache coherence protocols and the various latencies and bandwidth limitations of caches, memories, and links in the interconnection network. However, it uses approximate models for DRAM, disks, I/O timing, and references to memory-mapped I/O registers.
These approximations increase the importance of validating our simulation results by comparing against real system measurements. We consider validation an important but difficult component in our simulation efforts, one that is far from complete at present. Even though a validated cycle-accurate simulator is necessary for an absolute performance prediction, it may not be necessary for most architecture studies. Validation efforts should focus on the intended use of the workload and simulator, and we think that our current models are sufficient for most of our applications.
Evaluating Research Ideas
The outcome of computer architecture research experiments depends greatly on the workloads used for evaluation. Frequently, these experiments compare the performance of an enhanced system to a base system without the enhancement. However, a design decision that increases performance for one benchmark may have the opposite effect on another. Thus, computer architects should evaluate their ideas with the most relevant workloads. For multiprocessor servers, commercial workloads are most relevant. The following case study describes the development of the Bandwidth Adaptive Snooping Hybrid (BASH) cache coherence protocol, which was motivated by commercial workloads.
Sidebar: Full-system Simulation with Detailed Processor Timing
To develop a detailed processor timing model, we used a decoupled technique called Timing-First Simulation [1]. We implemented Timing-First simulation by using two simulators, a timing simulator (augmented to functionally execute most instructions) and a full-system functional simulator. This arrangement can be viewed as an almost correct integrated simulator followed by a correct functional simulator checker. Timing-First simulation decreases the implementation complexity of the timing simulator, since it neither has to functionally implement the entire instruction set exactly, nor must it implement I/O and other system devices. The simulator is able to skip instructions that are unimportant to timing fidelity without introducing functional errors in the system's simulation.
Our implementation of Timing-First simulation models a detailed out-of-order processor executing the SPARC V9 instruction set. The detailed processor can simulate timing effects including speculative execution of instructions, several branch prediction schemes, multiple outstanding memory references, and limitations on the number of functional units and instruction issue bandwidth. In the Timing-First organization, the timing simulator controls when each processor in the functional simulator can advance. When an instruction retires, the timing simulator steps the appropriate processor, but must verify the results of its functional execution. This immediate, precise feedback helps maintain the correct functional behavior in the timing simulator. The timing simulator can determine the winner when a race to memory occurs (by advancing one processor before another), but it does not modify the actual state of the functional simulator. While a timing error is introduced using this approximation technique, it does not have a significant impact on overall system timing. This error is proportional to the number of instructions that do not match between the functional and timing simulators, which are only 0.001% of all instructions on average for our commercial workloads.
Case Study: BASH
One major design decision for multiprocessor server architects is the choice of a cache coherence protocol. A cache coherence protocol is the mechanism by which a system coordinates reads and writes to a memory location. At the heart of this decision is a trade-off between cache miss latency and traffic on the system interconnect. The two major categories of cache coherence protocols are snooping and directory protocols. Directory-based systems suffer from long latencies for cache-to-cache transfers (i.e., cache misses supplied from another processor's cache), since each data request is first sent to the directory, which then forwards it to a processor that can provide data. Snooping systems reduce the latency of cache-to-cache transfers by having each processor broadcast all its requests to all processors, which allows requests to find the data provider directly. Unfortunately, broadcasting generates significant traffic on the system interconnect, especially for systems with a large number of processors. Our workload characterization, as well as previous characterizations [3], shows that cache-to-cache transfers are prominent in commercial workloads and that these misses have a significant adverse effect on performance. We explored the performance of both protocol categories on several commercial workloads and scientific benchmarks. For experiments with a moderate amount of system interconnection bandwidth (Figure 4), we find that while a directory protocol outperforms a snooping protocol for some workloads, the converse is true for others.
Motivated by this result, we developed a hybrid protocol, Bandwidth Adaptive Snooping Hybrid (BASH) [5], that acts like a snooping protocol if sufficient bandwidth is available, but gracefully degrades to act like a bandwidth-efficient directory protocol when bandwidth is scarce. The system monitors the interconnect utilization and adjusts the rate of broadcast requests accordingly. The broadcast rate is decreased if the interconnect utilization is too high (to avoid congestion delays) and increased if utilization is too low (to reduce latency by broadcasting). Figure 4 shows that our protocol performs as well as or better than the better of the snooping and directory systems for all of our workloads. The benefit is significantly greater for our commercial workloads than for the scientific workload Barnes-Hut from the SPLASH-2 benchmark suite [7], because of their higher frequency of cache-to-cache transfers.
Although BASH performs well for Barnes-Hut, the difference is not compelling. On the other hand, the substantial performance improvements for commercial workloads make BASH an attractive alternative for future multiprocessor server designs.
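The feedback idea behind a bandwidth-adaptive protocol can be illustrated with a toy controller. The sketch below is not the published BASH mechanism: the class name, the single target-utilization threshold, the fixed step size, and the use of a broadcast probability are all assumptions made for illustration. It only shows the general shape of "broadcast less when the interconnect is busy, broadcast more when it is idle."

```python
import random


class AdaptiveBroadcastPolicy:
    """Toy utilization-driven policy; threshold and step size are assumptions."""

    def __init__(self, target_utilization=0.75, step=0.05, broadcast_probability=1.0):
        self.target = target_utilization
        self.step = step
        self.broadcast_probability = broadcast_probability

    def observe(self, utilization):
        """Adjust the broadcast probability after one utilization sample (0.0 to 1.0)."""
        if utilization > self.target:
            # Interconnect is congested: back off toward directory-like behavior.
            self.broadcast_probability -= self.step
        elif utilization < self.target:
            # Bandwidth is available: broadcast more to reduce miss latency.
            self.broadcast_probability += self.step
        # Keep the probability within [0, 1].
        self.broadcast_probability = min(1.0, max(0.0, self.broadcast_probability))

    def should_broadcast(self, rng=random):
        """Decide whether the next request is broadcast or sent only to the directory."""
        return rng.random() < self.broadcast_probability


if __name__ == "__main__":
    policy = AdaptiveBroadcastPolicy()
    for utilization in (0.95, 0.95, 0.90, 0.60, 0.50):
        policy.observe(utilization)
        print(f"utilization={utilization:.2f} -> "
              f"broadcast probability={policy.broadcast_probability:.2f}")
```

With sustained high utilization the probability falls to zero and every request goes through the directory; with sustained low utilization it returns to one and the system behaves like a snooping protocol.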
Figure 4. Performance of three different protocols (directory, snooping, and BASH, a hybrid protocol) for our four commercial workloads and one scientific application. The system has 16 processors, but broadcast requests use 4x their normal bandwidth to approximate performance on a larger system. The endpoint bandwidth available per processor was 1600 MB/second. As discussed earlier, we use multiple simulations to mitigate the effect of variations in the workloads due to short samples; the height of a bar represents the average, and the error bars show the standard deviation.
Conclusion
Commercial workload simulation is essential for computer architects to evaluate future server designs. However, simulating these large workloads (designed to run on multi-million-dollar servers) on low-cost PCs presents many challenges. Despite these challenges, we approximated the behavior of such workloads in the Wisconsin Commercial Workload Suite. We plan to continue expanding the workload suite by including additional middle-tier applications. We are working with Virtutech AB to make simulation checkpoints of our workloads available to the research community.
