We demonstrate the benefits of instruction-set simulation for the evaluation of the Penny system. A simulator greatly helps in understanding Penny's behavior across a varied set of benchmarks, and it serves as a reliable tool for exploring design alternatives that improve performance.
The Penny System
The Penny system [9] is an implementation of AKL [5] on a shared-memory architecture. It automatically extracts parallelism in an AKL program and, at runtime, schedules tasks on a number of workers to achieve speedup. Each worker runs as a Solaris thread, and all threads run in the same Solaris process. The threads share the same address space and have full access to all data structures in the system. This makes the distribution of tasks very easy.
No user annotations are required, nor is there any compiler support to extract parallelism. All detection and scheduling of parallel tasks is done at runtime. This is a big advantage compared to systems where the user must provide explicit information so that the compiler can make a static distribution of tasks. The system can utilize both and- and or-parallelism in the program. It is complete with a parallel garbage collector and, for an experimental system, quite stable. The system currently runs under SunOS 5.4 but should not be hard to port to other operating systems.
An abstract machine
The Penny compiler compiles AKL programs to abstract machine instructions. The Penny compiler is itself written in AKL, and consists of about three thousand lines of AKL code that can in turn be executed on Penny itself, with good parallel performance.
The emulator is a threaded code emulator implemented in C using the GNU C compiler (version 2.x), where labels can be handled as data. The machine defines only sixty-three abstract machine instructions, so the emulator itself is rather small. The instruction set is very similar to the instruction set used in the WAM [10].
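The idea behind threaded code can be illustrated with a minimal sketch. Penny's emulator stores C label addresses in the instruction stream; the Python analogue below stores handler functions instead of opcodes, so dispatch needs no decode step. The three instructions (`push`, `add`, `halt`) are hypothetical and are not part of Penny's instruction set.

```python
def push(stack, operand, pc):
    """Hypothetical instruction: push a constant."""
    stack.append(operand)
    return pc + 1

def add(stack, operand, pc):
    """Hypothetical instruction: replace the top two values by their sum."""
    stack.append(stack.pop() + stack.pop())
    return pc + 1

def halt(stack, operand, pc):
    """Hypothetical instruction: stop the emulator loop."""
    return None

def run(program):
    """Execute threaded code: each cell holds the handler itself, not an opcode."""
    stack, pc = [], 0
    while pc is not None:
        handler, operand = program[pc]
        pc = handler(stack, operand, pc)   # jump straight to the handler
    return stack

program = [(push, 2), (push, 3), (add, None), (halt, None)]
result = run(program)
print(result)
```

In C with the GNU extension, the handler reference would be a label address and the call would be a computed `goto`, which avoids the function-call overhead that this sketch still pays.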
When a program is executed, a fixed set of workers is created. The number of workers determines the level of parallelism, so there is no advantage in creating more workers than there are available processors. Each worker is dynamically assigned work during an execution. If a worker runs out of work, it steals tasks from another worker.
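The stealing policy can be sketched as a small deterministic simulation. This is an illustration only: the text does not describe Penny's scheduler in detail, so the choice of victim (the worker with the most queued tasks) and the stealing of the oldest task are assumptions, and `run_workers` is a hypothetical name.

```python
from collections import deque

def run_workers(initial_tasks, n_workers):
    """Round-robin simulation: an idle worker steals one task from the busiest worker."""
    queues = [deque() for _ in range(n_workers)]
    queues[0].extend(initial_tasks)            # assumption: all work starts at worker 0
    executed = [[] for _ in range(n_workers)]  # tasks run by each worker
    steals = 0
    while any(queues):
        for w in range(n_workers):
            if not queues[w]:
                # Assumed policy: pick the worker with the longest queue as victim.
                victim = max(range(n_workers), key=lambda v: len(queues[v]))
                if queues[victim]:
                    queues[w].append(queues[victim].popleft())  # steal the oldest task
                    steals += 1
            if queues[w]:
                executed[w].append(queues[w].pop())             # run the newest local task
    return executed, steals

executed, steals = run_workers(list(range(8)), 2)
print(executed, steals)
```

In the real system the workers are concurrent threads, so a steal must be synchronized with the owner of the queue; the simulation sidesteps this by running the workers in turn.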
The execution state
During an execution the workers build and modify a shared execution state. The execution state consists of a tree structure of goals and continuations, and a set of AKL terms.
Each goal in the execution state is either ready to be executed or suspended on some AKL variable. When a variable is assigned a value, the goals suspended on the variable are scheduled for execution. The worker that assigned the value is responsible for executing the goals. Continuations represent sequences of unexecuted goals and are in the same way owned by a particular worker. The right to execute a goal or use a continuation can be "stolen" by another worker. The whole execution state is therefore accessible to all workers.
The data structures that are used to represent goals and continuations can be explicitly reclaimed by the worker that executes the goal. This improves cache performance since the same cache lines can be reused immediately. AKL terms can of course not be explicitly reclaimed and are therefore allocated on a heap that is subject to garbage collection. A parallel "stop and copy" garbage collector is used. It is important that the garbage collector is parallel, since the garbage collection time would otherwise increase as a proportion of the execution time. Garbage collection time normally stays well below 10%, even for parallel executions where the garbage collector has a hard time keeping up with the overall increased performance.
The SimICS Simulator
When analyzing the behavior of a software system, we need to solve three problems. First, we need to execute the actual instructions specified by the executable binary. Second, we need to perform the system services required by the program, if any. The first two problems thus involve recreating the execution environment. Third, we need to generate information about the execution over and above the actual program result.
There are essentially three strategies for solving the first problem, and the choice of which to use is the chief characteristic of the analysis method:
Instruction-set simulation, also called instruction-level or program-driven simulation, is the naive brute-force approach, whereby each instruction in the program is simulated one at a time. This provides an accessible and in some sense correct target machine model for instrumentation, and places minimum restrictions on the architectural relationship between the host and target. Program-driven simulation is probably the oldest strategy [3].
Execution-driven simulation, also called program augmentation, involves running a modified program binary. The modifications can be induced at any stage during generation of the binary, either by modifying intermediate program formats (source code, assembly code, object file, or executable binary) or any of the compiler tools (preprocessor, compiler, assembler, or linker).
Host-supported simulation, historically called emulation, requires hardware monitors or other special host hardware features which provide tools to gather statistics or otherwise control the execution of a program.
Each of these three strategies places limits on how well we can solve the remaining two problems, handling system services and extracting statistics. Performance-wise, host-supported simulation is faster than execution-driven, which in turn is faster than instruction-set simulation. The main performance penalty occurs between the execution-driven and instruction-set strategies and can easily be several orders of magnitude. Conversely, instruction-set simulation is more flexible than execution-driven, which in turn is more flexible than host-supported simulation.
The restriction inherent in host-supported simulation, other than sheer availability, is the selection of what statistics to gather: the hardware support dictates what studies can be done. Restrictions common to all execution-driven approaches that the authors have found include lack of support for one or more of: run-time generated code, multiple processes (i.e. workloads), multiple address spaces, system-level (operating system) code, determinism, and interactive control of the execution. Approaches that relax these constraints tend to impose new restrictions on the type of programs that can be studied, including programming paradigms and/or source code modifications.
SimICS [7] [8] is an instruction-set simulator that has borrowed many design principles from g88 [2]. SimICS takes the brute-force approach to all three problems mentioned at the beginning of this section. For this, SimICS takes two penalties. First, we accept a lower performance than specialized approaches. This impact is today on the order of 5-10, or a slow-down of approximately 50 per simulated processor. Second, we need to deal with a significantly more complex software engineering problem in building the simulator. This effect is, of course, difficult to quantify, but it is significant.
SimICS provides full profiles, both of execution (instructions) and data cache events, and does so with a traditional debugger environment. This allows a detailed, interactive analysis of parallel programs.
The contribution of SimICS has been to achieve a competitive performance point while at the same time avoiding all the restrictions listed above for host-supported or execution-driven simulation.
Our Target Machine
Our workhorses for Penny timings (and SimICS runs) have been two SPARCcenter 2000 (SC2000) multiprocessors with 8 and 20 processors, respectively. The SC2000 is a bus-based shared-memory multiprocessor from Sun Microsystems. Figure 1 is a sketch of a generic shared-memory multiprocessor; though the SC2000 is considerably more complex, the figure serves to highlight some principal features.
Each processor has two on-chip caches, one for instructions and one for data. These are generally small, because on-chip area is a scarce resource. On the SC2000's processors, 50MHz SuperSPARCs, the data cache is 16Kbytes, four-way associative with 32-byte cache lines. The instruction cache is 20Kbytes, 5-way associative with 64-byte cache lines. Despite their small size, the first-level caches consume half of the SuperSPARC's 3 million transistors.
The processors connect to an off-chip cache, the second-level cache, which on the SC2000 is 2Mbytes, direct-mapped with 64-byte cache lines. These caches are connected to a bus, whereby they can communicate with the main memory and/or other caches. This communication is controlled by SuperCache controllers and "Bus Watcher" chips. There are actually 2 buses on the SC2000, dual 40MHz XDBuses, with a peak sustainable read/write throughput of 500Mbytes per second.
500Mbytes per second may sound high, but each SuperSPARC processor is capable of executing three instructions per cycle, including supplying 64 bits per cycle from memory, so a 20-processor SC2000 could conceivably request 8 billion bytes per second from the memory system. Hence two layers of caches. The first level, being on-chip, can react with new data in one cycle. The second level takes 5-10 cycles, whereas accessing the main memory takes 20-60 cycles. A high cache hit rate is therefore crucial to good performance.
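The 8 billion bytes per second figure can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text (20 processors, 50MHz, 64 bits per cycle, 500Mbytes per second) and assumes that "Mbytes" means 10^6 bytes.

```python
PROCESSORS = 20
CLOCK_HZ = 50_000_000            # 50MHz SuperSPARC
BYTES_PER_CYCLE = 64 // 8        # 64 bits supplied per cycle
BUS_BYTES_PER_SEC = 500_000_000  # assumption: 1 Mbyte = 10**6 bytes

# Worst-case demand if every processor loads 8 bytes on every cycle.
peak_demand = PROCESSORS * CLOCK_HZ * BYTES_PER_CYCLE
demand_to_bus_ratio = peak_demand / BUS_BYTES_PER_SEC

print(peak_demand, demand_to_bus_ratio)
```

Under these assumptions the worst-case demand exceeds the bus throughput by a factor of sixteen, which is the gap the two cache levels have to absorb.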
SimICS emulates the cache hierarchy in figure 1, which is close enough to real life to give a good prediction of the performance of an application. In fact, during our profiling we initially had significant discrepancies between predicted performance and measured values, until we discovered that both the SC2000 machines we used for timing measurements had faulty SuperSPARC processors with only 4Kbyte caches, not 16Kbyte. The faulty processors had gone unnoticed for several years, despite the machine being used extensively for benchmarking of parallel programs. Therefore, all simulated values in this paper assume a 4Kbyte first-level cache with 32-byte cache lines, and a 2Mbyte second-level cache with 64-byte cache lines, both direct-mapped.
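A two-level direct-mapped hierarchy with these parameters can be modelled in a few lines. This is a minimal sketch and not SimICS code: the class and function names are hypothetical, only reads are modelled, and coherence traffic between processors is ignored.

```python
class DirectMappedCache:
    """Direct-mapped cache: each line address maps to exactly one slot."""

    def __init__(self, size_bytes, line_bytes):
        self.line_bytes = line_bytes
        self.n_lines = size_bytes // line_bytes
        self.tags = [None] * self.n_lines   # line address currently held in each slot
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit; on a miss, install the line and return False."""
        line_addr = addr // self.line_bytes
        index = line_addr % self.n_lines
        if self.tags[index] == line_addr:
            self.hits += 1
            return True
        self.tags[index] = line_addr
        self.misses += 1
        return False

def read(l1, l2, addr):
    """Return which level served the read: 'L1', 'L2', or 'memory'."""
    if l1.access(addr):
        return "L1"
    return "L2" if l2.access(addr) else "memory"

l1 = DirectMappedCache(4 * 1024, 32)          # 4Kbyte, 32-byte lines
l2 = DirectMappedCache(2 * 1024 * 1024, 64)   # 2Mbyte, 64-byte lines
trace = [read(l1, l2, a) for a in (0, 4, 4096, 0)]
print(trace)
```

Addresses 0 and 4096 map to the same first-level slot, so the final read of address 0 misses the small first-level cache but is still served by the second level. This kind of conflict is more frequent with a 4Kbyte cache than with a 16Kbyte one.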
Using SimICS
In this section, we will use SimICS to study Penny from various perspectives. As input to Penny we've selected a small number of AKL programs, listed in table 1. These are used to exercise Penny with a variety of problems. Though running on the same emulator, these programs trigger remarkably different behavior.
We begin by generating global statistics to characterize the input programs. In table 2, we've run SimICS on a base version of Penny, with the four example programs and two sizes of machines, 4 or 16 processors. The first line in the table is execution time as measured on the actual target machine, defined as total runtime minus initialization. These timings have a standard deviation of approximately 1.5%. The remaining numbers are reported by SimICS running the same workloads. The numbers are sums over all CPUs.
Detecting sequential components
We begin by running a very simple AKL program that solves the towers of Hanoi puzzle. Hanoi is an effective test of how well recursive definitions are executed. Very good speed-up is generally obtained for this benchmark if the sequential component is left out.
When Penny ran Hanoi with four workers under SimICS, simulating the first-level cache, we first looked for read cache and TLB misses. (A TLB miss generally causes a cache miss on our target machine, so these events are almost equivalent.) These misses easily stall the CPU if the result is needed soon, which is often the case.
It turned out that a single assembler instruction caused almost 14% of all read misses. The instruction profiling furthermore revealed that roughly half of all instructions executed were spent in the one enclosing line of C (specifically, 4 lines of assembler).
The implicated code was not part of the main Penny machinery but was a clean-up procedure that runs after an execution has completed. During execution Penny builds up a linked list of structures, one for each spawned goal. In the Hanoi program the list is only traversed at the end of the execution to determine that no suspended goals exist. Since the result has already been delivered, the time to traverse the list had never shown up during measurements, which invariably focused on the parallel portions. When the time to do the clean-up was reported as part of the execution time, the existence of a performance bottleneck was obvious.
This is a relatively simple algorithm, but it is hard to parallelize because the synchronization is very fine-grained. The problem solved is matching two 400-element sequences against each other.
As seen, the ld [%i2 + 4], %o0 instruction at address 0x29868 causes a read miss almost every time it is executed. The instruction corresponds to the source code shown on line 479 in figure 4, where the profiling data has been added up for each source line.
The source code is from a procedure that is part of the garbage collector. Many profilers would have identified the procedure as a potential performance problem, but would not have explained why: namely, that one line of assembler traversing a list misses the second-level cache over 80% of the time, and misses the TLB almost 4% of the time. This in turn was caused by the creation of the list being spread across multiple processors, and would not have been a performance problem in a sequential version of Penny. The procedure is executed during garbage collection and is responsible for distributing work among workers. The read misses in the procedure severely decreased the performance of the garbage collector, since the procedure is part of a sequential phase of the garbage collector.
An implementation technique that avoids building the list had been sketched out a year earlier, but the earlier benchmarking techniques had not seen the traversal as a potential problem. Making this correction to Penny improved performance dramatically, as seen in figure 5.
Deciding on prefetching
A collection of abstract machine instructions is used to read and decompose AKL terms. A sequence of instructions is used for decomposing a term, the first instruction of which, called a get instruction, verifies the type of the term. Subsequent instructions, called unify instructions, access the arguments of the term in the case of a compound term. In the case of a simple list term this results in three instructions: one for verifying that the term has a list tag and two for accessing the car and cdr of the list cell. Terms are represented by tagged pointers. A get instruction first looks at the tag of the term and, only after having verified the tag, follows the pointer to verify the functor. A unique tag is used for list terms, so in this case only the tag needs to be verified.
We noticed an exceptionally high read miss rate in the unify instructions that accessed the components in a term. The reason was that when the instructions were used they were often reading AKL terms that had been constructed by another worker. The terms had thus been constructed in one cache but were often read by another processor.
The remedy to this problem was to add a prefetch instruction (coded in C) in all get and unify instructions. The prefetch would read the next argument position so that the following unify instruction would find the value in the cache. The time to read the argument would hopefully overlap with useful work.
The fix did not work as expected. The read misses in the unify instructions did decrease, but we had added a large number of read instructions, most of which contributed nothing since the data was already in the first-level cache. The reason for this was obvious. When a functor or argument of a term was inspected, the following argument would in seven cases out of eight be in the same cache line of the first-level cache. The same would hold for any prefetch instruction in the unify instructions. The only place where a prefetch instruction would make any sense was in the get instruction that was responsible for verifying a list cell. The get instruction itself would only read the tagged pointer to the cell but neither the car nor the cdr of the cell.
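The "seven cases out of eight" figure follows from the cache geometry. The sketch below assumes 4-byte words, one per term argument (an assumption, though consistent with a 32-bit SPARC V8 target), and that the inspected word is equally likely to sit at any word offset within a 32-byte line.

```python
LINE_BYTES = 32   # first-level cache line size from the text
WORD_BYTES = 4    # assumption: one 4-byte tagged word per argument

def same_line_cases(line_bytes, word_bytes):
    """Count word offsets whose successor word lies in the same cache line."""
    words_per_line = line_bytes // word_bytes
    same = sum(
        1
        for off in range(words_per_line)
        if (off * word_bytes) // line_bytes == ((off + 1) * word_bytes) // line_bytes
    )
    return same, words_per_line

same, total = same_line_cases(LINE_BYTES, WORD_BYTES)
print(same, total)
```

Only the last word of a line has its successor in the next line, so a prefetch of the following argument is redundant in all other positions.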
Removing all but the one effective prefetch left us with an overall performance improvement of 3-4% for several of the workloads.
In this example, we were able to use SimICS both to follow where the cache misses moved to, and to quickly quantify the overhead induced by the fix. Note that an optimizing compiler could not have identified this prefetch, since it wouldn't know about the restrictions placed on the sequences of abstract instructions; this required the intervention of the Penny designer, using SimICS to explore trade-offs.
How dangerous is a lock?
The Penny system uses locks in two different situations. The first is in the internals of the Penny system, when workers move between different parts of the execution state or steal tasks from each other. The second situation is when AKL variables are locked in order to add a new binding or suspension.
Both of these situations could cause a hot-spot in the implementation. Since all locks are spin locks, a worker stalls if a lock is held by another worker, so it is interesting to know how often the locks are actually missed and how long it takes for a worker to acquire a missed lock.
To get an idea of how often locks are missed, counters were placed around the lock primitives. Counters are added by modifying the source code, i.e. by adding SimICS macros that turn the counters on and off.
We ran the Smith benchmark with sixteen workers. The AKL variable lock counter statistics for one CPU are listed in figure 6. Counter 5 was entered when a worker examined a potential variable, and counter 6 was entered if there was a collision. As we can see, almost eighty thousand variables were examined, resulting in more than forty thousand lock (xmem) operations. In only one (!) case does a collision occur. The results were similar for all kinds of locks in the system. These figures indicate that the locks in the Penny machinery are not a performance bottleneck. In fact, it might be worthwhile redesigning some of these locks to be more aggressive in assuming low contention.
The use of SimICS counters in this analysis greatly simplified instrumentation of the locks. We added half a dozen different types of counters to a few dozen different procedures and macros. Though it required modification of the source code and re-compilation, the binary can run on the real machine unchanged with an insignificant effect on performance. In fact, in the real execution this instrumentation adds less than 4 no-op instructions for every 10000 "real" instructions.
Read and write operations
Though we managed to improve both the initial performance and speedup (parallelism) of the Penny system, there remains a significant sequential element that could not easily be explained. Table 3 shows statistics gathered from executing the Smith benchmark on the improved Penny system. The reported numbers are from one of the CPUs. The cache statistics are for a 2Mbyte cache, i.e. our target machine's second-level cache.
The number of read and write operations is a good measure of the amount of "work" performed, and the figures in the table indicate an uncanny linearity in the load per processor, i.e. no significant overhead is added. Yet the speedup is not linear. Table 4 shows the improvements for every doubling of the number of workers. The execution time is reduced to 52% (speedup 1.9) when going from one to two workers. The efficiency then drops, and moving from eight to sixteen workers only reduces the execution time to 72% (speedup 1.39).
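The quoted speedups are simply the reciprocals of the relative execution times. The short sketch below reproduces them; `step_speedup` and `step_efficiency` are hypothetical helper names, and efficiency is measured against the ideal factor of 2 for a doubling of workers.

```python
def step_speedup(time_ratio):
    """Speedup for one doubling of workers, given new_time / old_time."""
    return 1.0 / time_ratio

def step_efficiency(time_ratio):
    """Fraction of the ideal speedup of 2 achieved by one doubling."""
    return step_speedup(time_ratio) / 2.0

one_to_two = step_speedup(0.52)        # execution time reduced to 52%
eight_to_sixteen = step_speedup(0.72)  # execution time reduced to 72%

print(round(one_to_two, 2), round(eight_to_sixteen, 2))
print(round(step_efficiency(0.52), 2), round(step_efficiency(0.72), 2))
```

A time ratio of exactly 0.5 would correspond to a perfect doubling, so the 52% and 72% figures correspond to roughly 96% and 69% of the ideal gain for their respective steps.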
The explanation for the decreasing speedup cannot be found in the number of read or write operations or the total number of instructions executed. As seen, the number of read and write operations per processor (these are statistics from one of the processors) is more or less cut in two for every doubling of workers.
Table 4: Improvements of operation count
The explanation is found in the number of read and write misses. Table 5 shows the same relative changes but for cache misses. When we go from one to two workers, the number of read misses increases by a factor of fourteen (!) and the number of write misses stays the same. At the same time the total number of read and write operations has been cut in half. This is the most likely explanation of why the initial speedup is only 1.9 and not closer to 2.0 as indicated by the operation count.
The read and write misses do behave better once we use more than two processors, but do not come close to the figures reported for the instruction counts. It should however be noted that the read and write misses are reduced to 66% or less while the execution time is reduced to 72%. The poor performance when using sixteen workers cannot be explained by the number of read and write misses alone. This is investigated further in the next section.
In this analysis, SimICS reports sufficient detail to help us reconstruct what is causing speedup to begin trickling off. We now know that for larger configurations, we should focus on second-level cache misses. Many of the data structures in Penny have been optimized to be cache-line aligned, etc. The SimICS source-line profiling of second-level cache misses could be used to evaluate different design changes aimed at reducing the amount of data communicated even further.
Figure 7: Results of predicting performance using a simple best-fit
Explaining overall Penny performance
The brief examples so far in this section have been anecdotal, and were intended to underline SimICS' ability to "zoom in" on and study performance problems, or to explore design alternatives. The criteria we used for deciding what was "good" were a small number of characteristics, essentially the type of data listed in the counters example in figure 6. That these relatively simple statistics are good guidelines can be shown by correlating them against a large number of performance measurements from the real target machine. We ran various combinations of Penny, using eight versions of Penny itself (with different improvements and modifications), the four benchmark inputs, and three levels of parallelism (4, 8, and 16 workers). We measured the median time out of 31 runs, giving us 128 timing points. For each point, we used SimICS to generate 12 aggregate values, covering the different cache and TLB miss types in the target system.
We next selected three variables that we presumed to be important for explaining performance, namely the number of memory reads, the number of read misses to the first-level cache, and the number of misses to the second-level cache. A multiple regression of these variables against the database, and fudging, results in:
$$0.1 \cdot \mathit{Reads} + 0.33 \cdot \mathit{ReadMisses}_{L1} + 10 \cdot \mathit{ReadMisses}_{L2}$$
In figure 7 we have plotted the "explained" time against the real time. The correlation is 0.96, not exceptional but clearly a good indicator. Observing the figure, we note that the performance of large configurations (16 processors) is overestimated and, conversely, the small configuration (4 processors) is underestimated. We suspect the reason for this is that we are lacking a fourth coefficient to measure bus contention, an issue that becomes significant as the number of communicating processors increases.
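The fitted expression can be written as a small function for experimenting with how the three terms trade off. The coefficients are those given in the text; the function name and the event counts in the example are hypothetical, and the unit of the result is left unspecified because the text does not state it.

```python
def explained_time(reads, read_misses_l1, read_misses_l2):
    """Evaluate the three-term fit from the text (unit not stated in the source)."""
    return 0.1 * reads + 0.33 * read_misses_l1 + 10 * read_misses_l2

# Hypothetical counts, chosen only to illustrate the relative weight of each term.
reads = 1_000_000
l1_misses = 100_000
l2_misses = 10_000

total = explained_time(reads, l1_misses, l2_misses)
l2_share = (10 * l2_misses) / total
print(total, round(l2_share, 2))
```

With these made-up counts, second-level misses occur on only 1% of the reads yet account for over 40% of the explained time, which illustrates why the second-level term dominates the fit for communication-heavy runs.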
Misses to the second-level cache cause the bus load to increase. The bus is a globally shared resource, so when it approaches saturation it stalls new accesses. As we increase parallelism, the miss rates of reads and writes both increase, which is natural since we are spreading work and communicating more. At the same time, execution time is decreasing, compounding pressure on the bus. This could explain the abnormally high coefficient for read misses to the second-level cache, which of course is the one of the three most closely correlated with bus contention and thus has to "carry" the bus contention load in the regression.
Concluding remarks
Instruction-level simulation is a very powerful tool when a parallel system is constructed. The cache performance is so important for the overall performance of the system that it is very hard to tune a system without having access to cache miss statistics.
A parallel system can behave perfectly with respect to the number of instructions executed and still not show linear speedup. Only when the cache misses are taken into account can the performance of the system be predicted. The importance of good cache performance cannot be overestimated. We have shown a strong correlation between the actual execution time and read operations. It indicates that a second-level read miss is two orders of magnitude more expensive than a first-level hit. This suggests that a read miss rate in the second-level cache of over one percent will dominate the execution time.
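Both the "two orders of magnitude" and the "one percent" figures can be read off the regression coefficients for reads (0.1) and second-level read misses (10). The sketch below does the arithmetic; it interprets the read coefficient as the cost of a first-level hit, which is an assumption, and it ignores the first-level miss term.

```python
READ_COST = 0.1      # coefficient for reads, from the regression
L2_MISS_COST = 10.0  # coefficient for second-level read misses

# How much more a second-level miss costs than an ordinary read.
cost_ratio = L2_MISS_COST / READ_COST

# Miss rate at which the second-level term equals the plain read term:
#   L2_MISS_COST * rate * reads == READ_COST * reads
break_even_rate = READ_COST / L2_MISS_COST

print(round(cost_ratio), round(break_even_rate * 100, 2))
```

Above the break-even rate, the time attributed to second-level misses exceeds the time attributed to all reads together.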
Analysis of cache performance of logic programming systems is not new [4], but our approach is quite different. We have analyzed the performance of a parallel system on an existing parallel architecture. The analysis is done not through a generated trace file of selected read and write operations but from all operations actually performed by the system. The number of read and write operations performed by the larger benchmarks is over 100 million. The total number of instructions executed could be as high as 700 million. SimICS also allows interactive performance debugging, since hot-spots are easily located. We know not only the overall cache performance but can pinpoint the instructions that generate misses.
Acknowledgments
The parallel implementation of AKL has been developed using the AGENTS 1.0 [6] system as a starting point. Haruasu Ueda did much of the implementation and analysis of the scheduler. Gallal Atlam and Kahyri Ali designed and implemented the garbage collector [1].
Bengt Werner co-designed much of the SimICS front-end semantics. Anders Landin has been an enthusiastic supporter of SimICS and has contributed much to the discussion of what a user needs to know about program/architecture interaction. David Samuelsson wrote much of the SPARC V8 interpreter. Henrik Forsberg wrote much of the Unix emulation.
Thanks to Peter Fritzson at Linköping University for access to a 20-processor SC2000 for the Penny timings.
Various parts of this work have been sponsored by Ellemtel in the Entreprise and Hubble projects, Sun Microsystems in the SOS project, the European Commission in the GPMIMD project, ACCLAIM Esprit project, EP 7195 and SICS.
