Abstract-Improving computer system performance is a complex process in which hardware and software manufacturers invest significant human and financial resources. Workload characterization is an essential component of performance analysis. This paper presents a trace-based methodology for evaluating software applications. It introduces a new analysis concept designed to significantly ease this process, and it presents experimental data collected with the new analysis structure on a representative set of scientific and commercial applications. Several conclusions are drawn regarding workload characteristics, classification and runtime behavior. Computer architects use this type of data in their efforts to maximize the performance of the hardware platforms on which these applications will execute.
I. INTRODUCTION
Performance of computer systems has long been a prevalent topic in the high-tech industry. Computer manufacturers invest significant human and equipment resources to improve it at both the hardware and the software level. Microprocessors in particular have incorporated over the years an impressive set of ideas, inventions and innovations, each reflected in a smaller or larger performance gain. In practice, performance improvement covers two aspects:
1. At the hardware level, it requires architectural designs optimized with regard to the number of resources and the chip area occupied by each individual resource (functional units, control units etc.). In addition, microprocessors need to provide mechanisms to overcome hazards and other events that have a negative impact on the execution speed of an application.
2. At the software level, it requires extensive studies of the workload characteristics of representative sets of applications covering the most important areas of the software space. In addition, it requires elaborate compilation techniques meant to improve those areas of an application that lower its performance on certain hardware platforms.
This paper focuses mostly on the second aspect, improving performance at the software level, with emphasis on workload characterization of a broad portfolio of applications ranging from scientific to commercial. It presents the accepted methodology used by major computer manufacturers and highlights its most important steps. On that basis it describes methods for trace file generation and validation. It introduces a new concept that laid the foundation for a new analysis structure capable of complex trace-based workload analysis. Finally, it presents a large set of characterization data for a representative set of benchmarks as well as real compute-intensive scientific and commercial applications.
II. MOTIVATION
Workload characterization is an important step in designing future computer systems, and it influences the design at all system levels, from hardware platforms to operating systems. A thorough understanding of workload properties and of the way workloads execute sheds light on resource utilization and can guide performance optimization at both the hardware and the software level. In addition, knowing the runtime characteristics of an application is useful when making decisions about hardware resource allocation. Details about the structure and properties of an application are needed to understand the reasons for low runtime performance on a given hardware platform.
Although workload behavior has been an area of active research over the years, few studies have provided a comprehensive analysis of all the aspects important for performance. In general, studies focused either on a subclass of software applications or on only a subset of workload characteristics. In addition, there are few studies of compute-intensive applications in which the properties of these applications are compared and analyzed side by side. The experimental data presented in this paper covers several performance characteristics important in the analysis of scientific and commercial workloads: instruction decomposition, memory access properties and memory reuse patterns. Results of this type influence design decisions on a wide range of system components, from microprocessors to memory hierarchies, compilers and operating systems. The applications evaluated in this study were chosen from several workload classes, starting with standard benchmarks with homogeneous characteristics and extending to more irregular scientific and commercial applications, as follows: standard benchmarks HPC Challenge 1 and NAS [1]; real scientific applications covering particle physics GTC [2], computational chemistry NAB [3] and structural analysis LSDyna 2. For completeness, the scientific applications were analyzed in contrast to two commercial applications, TPC-C 3 and SAP SD 4. The experimental results were generated using unmodified versions of these applications, compiled with standard optimizations. The executables are therefore representative of the ones used in practice. The techniques presented in this study are accepted and widely used in the field of computer systems design.
III. WORKLOAD CHARACTERIZATION METHODOLOGY
Instruction traces are an established way to evaluate the characteristics of software applications. Because they expose the interaction between the software and the hardware platform it runs on, traces are well suited to studying workload performance. In general, trace files are collected using functional simulators running both the reference workload and the operating system. This way the trace contains not only the user instructions (instructions that are part of the application itself) but also information about the interaction between the user application and the operating system. The functional simulator therefore records in the trace the operating system instructions executed while running the user application, as well as events such as traps, I/O and DMA.
Trace validation is the process in which the content of a trace is verified to accurately reflect the runtime behavior of an application. It requires several steps, as presented in Figure 1 [2]:
a) The reference application is installed on an existing hardware platform, called the reference system.
b) To ensure correctness, the application is run on the reference system several times. During this process representative regions of the application are identified and marked so that the starting point of trace collection is precisely known. In addition, several workload and operating system metrics are collected to serve as data points for trace validation.
c) Once the application runs correctly on the reference system, a disk image of the entire file system is taken. This image is later used to reproduce exactly, inside the functional simulator, the way the application executes.
d) The functional simulator is configured to mimic the exact configuration of the reference hardware system. This ensures equivalent application behavior in both environments.
e) When the execution hits one of the previously marked points of interest, a checkpoint representing the entire state of the system at that moment is collected.
f) The simulation is restarted from the checkpoint and the writing of the instruction trace begins.
Trace validation represents an important milestone in trace collection. It ensures that the behavior captured in the trace matches that of the reference hardware executing the same portion of the application. Several operating system metrics, such as the number of system calls and traps, are collected together with benchmark results on both the reference and the simulated hardware. These metrics are later compared as part of the validation process. A trace is declared valid if the difference between the values on the reference hardware and the simulated hardware is within 5-10%. Another step in the validation process requires inserting library calls (libcpc 5) inside the reference application to bracket the portion of interest. These libraries come with the operating system and allow the hardware counters on the reference system processors to be initialized to count metrics of interest. This way we can collect, for example, the total number of instructions or the number of cache misses on the reference hardware for the exact portion where the trace was collected. These numbers are later compared with the corresponding numbers generated by memory hierarchy or performance simulators running on the trace. A trace file is said to be valid if the benchmark results, the operating system statistics and the metrics collected by the hardware counters on the reference hardware system are identical, or within a small margin of error, to the same metrics collected from the functional or performance simulators running on the trace. Figure 2 shows an example of validation between the reference hardware system and the simulated machine, using level two (L2) cache misses as the metric. The data presented in the graph shows a very good correspondence for most of the applications, with the exception of RandomAccess.
5 http://developers.sun.com/solaris/articles/hardware counters.html
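The comparison step of the validation described above can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the function names, the metric names and all the numbers are hypothetical, and the 10% tolerance is simply the upper end of the 5-10% range quoted above.

```python
def relative_difference(reference, simulated):
    """Return |simulated - reference| / |reference| (0.0 if both are zero)."""
    if reference == 0:
        return 0.0 if simulated == 0 else float("inf")
    return abs(simulated - reference) / abs(reference)


def validate_trace(reference_metrics, simulated_metrics, tolerance=0.10):
    """Compare every reference metric with its simulated counterpart.

    Returns (is_valid, report), where report maps each metric name to a
    (relative difference, within tolerance) pair. A metric missing from
    the simulated side is treated as a failure.
    """
    report = {}
    for name, ref_value in reference_metrics.items():
        if name not in simulated_metrics:
            report[name] = (float("inf"), False)
            continue
        diff = relative_difference(ref_value, simulated_metrics[name])
        report[name] = (diff, diff <= tolerance)
    is_valid = all(ok for _, ok in report.values())
    return is_valid, report


# Hypothetical values, for illustration only (not measured data).
reference = {"system_calls": 1200, "traps": 5000, "l2_misses": 250000}
simulated = {"system_calls": 1230, "traps": 5200, "l2_misses": 300000}
valid, report = validate_trace(reference, simulated)
```

With these made-up numbers the trace would be rejected, because the L2 miss count differs by 20% even though the operating system metrics agree within a few percent.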
IV. A NEW CONCEPT IN PERFORMANCE EVALUATIONS
Performance prediction and evaluation is a complex process. Depending on the type of analysis and its level of complexity, performance engineers use, together with hardware counters and system tools, a set of software analysis tools designed to ease the process and to reduce the time it takes to obtain the desired information. These tools can operate independently or as part of performance simulators. A wide variety of analysis tools is currently available in the industry, from the ones operating only on instruction traces to the ones providing insight into the way applications perform on both existing and future hardware platforms [4]. An analysis of several such products, including Sun Studio Performance Analyzer [5], Intel VTune Performance Analyzer [6], MemSpy [7], CPROF [8], gprof [9], TraceVis [10] and TAXI [11], found that they suffer from one or more of the following limitations: they can be used only on existing hardware platforms, they need access to the application's source code and require recompiling it, or they require major changes to the performance simulator. A new analysis concept was developed to address some of these shortcomings. This concept was the foundation of a new tree-like analysis structure designed to be modular, flexible and versatile, and easy to understand and modify when specific tasks require enhancements. It was designed to provide workload characterization data on existing as well as future hardware platforms. It can operate on traces directly or as part of performance or memory hierarchy simulators.
The new tool was designed as a modular tree-like structure, built out of two types of interchangeable modules which represent the analysis blocks in every configuration:
• analyzers (modules for analysis and classification) classify the data streams into tokens which are then analyzed by the lower levels of the analysis tree. Classification can be based on processor id, process or context id, function id etc.
• profilers (modules for counting and/or profiling) are the leaves of the analysis tree; they count or profile the data stream tokens passed down by the analyzers.
Unlike analyzers, profilers cannot make classification decisions on the data stream and cannot be connected to each other; a profiler can only be connected to an analyzer. The next paragraphs present a series of experimental results which emphasize the potential of the new structure to generate complex workload characterization data from instruction traces. The following example highlights the characteristics of each module used in the analysis tree. The purpose of this analysis was to gather information about the way workload instructions access the memory segments. The virtual memory space allocated to a process by Solaris or another Unix-like operating system is divided into at least four memory segments [12]: text, data, heap and stack. The text and data segments are mapped directly from the executable stored on disk and contain the instructions and the initialized variables of the application. The heap is allocated dynamically during the run and can grow on demand. On a function call a new frame is added to the stack, containing the return program counter as well as the function arguments and local variables. Because memory segments serve different purposes during a workload run, their access patterns can be quite different.
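The analyzer and profiler roles introduced earlier in this section can be illustrated with a small sketch. It is not the implementation of the tool described in this paper: the class names, the `consume` method and the dictionary-based trace records (`pid`, `ctx`, `op`) are all hypothetical and chosen only to show how classification nodes forward tokens to counting leaves.

```python
from collections import Counter


class Profiler:
    """Leaf of the tree: counts the records it receives, per operation."""

    def __init__(self):
        self.counts = Counter()

    def consume(self, record):
        self.counts[record["op"]] += 1


class Analyzer:
    """Inner node: classifies each record by one field and forwards it
    to the child (analyzer or profiler) responsible for that class."""

    def __init__(self, key, make_child):
        self.key = key                # record field used for classification
        self.make_child = make_child  # factory building the per-class subtree
        self.children = {}

    def consume(self, record):
        label = record[self.key]
        if label not in self.children:
            self.children[label] = self.make_child()
        self.children[label].consume(record)


# Hypothetical tree: process analyzer -> context analyzer -> counting profiler.
tree = Analyzer("pid", lambda: Analyzer("ctx", Profiler))

# Hypothetical trace records, for illustration only.
trace = [
    {"pid": 1, "ctx": 0, "op": "load"},
    {"pid": 1, "ctx": 0, "op": "store"},
    {"pid": 1, "ctx": 1, "op": "load"},
    {"pid": 2, "ctx": 0, "op": "load"},
]
for record in trace:
    tree.consume(record)
```

In this sketch profilers never forward records, which mirrors the rule that a profiler can only be attached below an analyzer. Adding a new classification level means wrapping the existing factory in another `Analyzer`.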
The analysis experiment that follows was designed to provide trace-based information on memory segment access patterns for a subset of the benchmarks in SPEC CPU 2000 7. Figure 4 shows the results gathered for each segment. The characterization data was obtained using the analysis tree shown in Figure 3, which uses the following modules:
• process analyzer: used in multiprocessor systems, it classifies the data streams based on the process which generated them.
• context analyzer: classifies the data streams based on their context identifiers. This makes it possible to analyze parallel applications, since operating systems that allow multiple threads to run in parallel identify them by their context id.
• segment analyzer: classifies the data streams based on the virtual memory segments they access (stack, heap, text, data etc.). The analysis structure can evaluate user as well as kernel memory segments. This module reads a configuration file containing segment mappings for all the active processes in the system and provides a way to correlate events such as TLB or cache misses with the memory segment responsible for them.
• memory footprint profiler: extracts the physical memory address in each instruction and reports the growth of the memory footprint over time for that set of instructions. The memory footprint represents the number of unique words touched by a sequence of instructions.
• working set profiler: extracts the physical memory address in each instruction and reports the growth of the working set over time for that set of instructions. The working set represents the number of unique words touched by the instruction sequence in the n previous accesses.
7 www.spec.org
From the characterization data presented in Figure 4 several conclusions can be drawn. The data, heap and stack memory segments show different growth patterns, a fact that would be hard to discern if the segments were analyzed together.
For example crafty's heap segment doesn't grow much but the application continuously accesses new data in the data segment. While most of the applications show stepwise growth in their text segments their stack remains steady. This shows that although the applications follow different execution paths the maximum number of function calls is reached at the beginning of the execution.
Gap has the longest function call chain, while gcc shows the largest text footprint. Although bzip2 and gzip belong to the same class of applications, data compression, they behave differently: bzip2 uses larger heap and data segments, which grow in steps, while gzip reaches a steady state quickly.
Statistical data of this kind can serve two purposes: first, validating that the instruction traces collected from functional simulators reflect the application behavior on the reference hardware systems and, second, verifying that they capture a representative region of the application. The above example highlights the potential of the analysis structure to extract complex workload information from traces.
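The two profilers used in this example follow directly from the definitions given above: the footprint counts unique words touched so far, and the working set counts unique words among the n previous accesses. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, addresses are byte addresses, and the 8-byte word size is an assumption rather than a value taken from the study.

```python
from collections import Counter, deque


def footprint_growth(addresses, word_size=8):
    """Number of unique words touched after each access."""
    seen = set()
    growth = []
    for addr in addresses:
        seen.add(addr // word_size)
        growth.append(len(seen))
    return growth


def working_set_sizes(addresses, n, word_size=8):
    """Number of unique words among the last n accesses, after each access."""
    window = deque()   # the last n words, oldest first
    live = Counter()   # word -> number of occurrences inside the window
    sizes = []
    for addr in addresses:
        word = addr // word_size
        window.append(word)
        live[word] += 1
        if len(window) > n:
            old = window.popleft()
            live[old] -= 1
            if live[old] == 0:
                del live[old]
        sizes.append(len(live))
    return sizes


# Hypothetical byte addresses, for illustration only.
addresses = [0, 8, 16, 0, 8, 64, 72, 64]
footprint = footprint_growth(addresses)          # [1, 2, 3, 3, 3, 4, 5, 5]
working_set = working_set_sizes(addresses, n=3)  # [1, 2, 3, 3, 3, 3, 3, 2]
```

The footprint never decreases, while the working set can shrink when the window slides past words that are no longer in use. That difference is what distinguishes stepwise growth from steady-state behavior in plots of this kind.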
V. WORKLOAD CHARACTERIZATION STUDY ON A REPRESENTATIVE SET OF SOFTWARE APPLICATIONS
This section presents more complex workload characterization examples obtained with the new analysis structure. A study emphasizing common characteristics as well as differences was conducted over a representative set of applications. Architects use this type of data to make important decisions regarding the runtime behavior of these applications on existing or future hardware platforms. Some of the characterization data that follows was also presented in [22].
A. Instruction decomposition
Instruction decomposition provides critical data about the instruction mix of an application and represents an important step in performance analysis.
The most important instruction categories are: integer instructions, floating point instructions, branches, memory access and prefetch instructions. Figure 5 shows the instruction breakdown for each of the studied applications. The instructions shown are: floating point instructions (add, multiply, load/store); integer instructions (arithmetic and logic, load/store); branch and software prefetch instructions. A few important conclusions can be drawn from the data shown in the graph. The two commercial applications, TPC-C and SAP SD, have a similar instruction breakdown, while the HPC applications are quite diverse. The variability in the instruction mix of the HPC applications shows that the dynamic behavior of this type of workload cannot be characterized as a group by studying the behavior of only a few applications. Another observation is the low percentage of floating point instructions in the compute-intensive scientific applications. Although these applications are known for intense floating point numerical calculations, the percentage of this type of instruction is around 20% and never exceeds 50%. This contradicts the traditional way of characterizing the performance of this kind of application by measuring only the number of FLOPS (floating point operations per second) and completely ignoring the integer component. In Figure 5 these workloads show a significant integer component; several applications, such as RandomAccess, STREAM, NAS CG and GTC, contain more integer than floating point instructions. In many cases, although the majority of the computation in scientific applications operates on floating point data, it is accompanied by array indexing and offset calculations, which are carried out through integer operations. This explains the high number of integer instructions in linear algebra workloads.
An important conclusion can be drawn from this information: scientific applications need a more general method for performance analysis rather than one that over-emphasizes the number of FLOPS. The balance between memory operations (loads and stores) can also be studied based on the data in Figure 5. Integer loads and stores are well balanced in both scientific and commercial applications, while among floating point memory operations the loads are dominant. This can be attributed to the fact that most floating point calculations read two operands and write only one.
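An instruction breakdown of the kind discussed in this subsection reduces to counting dynamic instructions per category. The following sketch shows the idea under stated assumptions: the opcode mnemonics, the category names and the sample trace are hypothetical and do not correspond to any particular instruction set or to the data in Figure 5.

```python
from collections import Counter

# Hypothetical mapping from opcode mnemonics to instruction categories.
CATEGORY = {
    "fadd": "fp_arith", "fmul": "fp_arith",
    "ldf": "fp_load", "stf": "fp_store",
    "add": "int_arith", "and": "int_arith",
    "ld": "int_load", "st": "int_store",
    "br": "branch", "prefetch": "prefetch",
}


def instruction_mix(opcodes):
    """Return the percentage of dynamic instructions in each category."""
    counts = Counter(CATEGORY.get(op, "other") for op in opcodes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in counts.items()}


# Hypothetical ten-instruction trace, for illustration only.
trace = ["ld", "ld", "add", "ldf", "ldf", "fmul", "stf", "add", "br", "st"]
mix = instruction_mix(trace)
```

In this made-up trace the floating point loads outnumber the floating point stores two to one, which is the pattern expected when an operation reads two operands and writes one result.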
B. Memory access analysis
It is well known that scientific workloads are the ones which stress the computational system the most. It is also known that the performance of all software applications is heavily influenced by the way they access memory. This section of the paper therefore focuses on the temporal and spatial locality of memory accesses.
1) Temporal locality
Temporal locality represents the tendency of an application to access memory locations it has recently accessed [13]. Typically this tendency is characterized by measuring the reuse distance, which represents the number of memory accesses between two consecutive accesses to the same location. This study uses an analysis methodology similar to the one presented by Weinberg et al. [14].
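A reuse distance measurement can be sketched with a simple recency stack. The version below assumes that the distance counts unique addresses touched since the previous access to the same location, which is the definition used with the temporal locality score in this section. The function name is hypothetical, and the list-based stack is chosen for clarity rather than speed (each access costs time proportional to the number of distinct addresses).

```python
def reuse_distances(addresses):
    """For each access, return the number of distinct other addresses
    touched since the previous access to the same address, or None if
    the address has not been seen before."""
    stack = []       # addresses ordered by recency, most recent last
    distances = []
    for addr in addresses:
        if addr in stack:
            pos = stack.index(addr)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


distances = reuse_distances(["A", "B", "C", "A", "A", "B"])
# -> [None, None, None, 2, 0, 2]
```

In the example, the second access to A has distance 2 because B and C were touched in between, and the final access to B also has distance 2 because the two intervening accesses to A count only once.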
Temporal locality score is given by (1):
• For an access to memory location A, the reuse distance represents the distance, measured in the number of unique memory addresses accessed, since the last access of A.
• reuse_i represents the fraction of memory accesses with a reuse distance less than or equal to i.
• N represents the maximum reuse distance.
A score = 0 implies a total lack of temporal locality, while a score = 1 implies that all memory accesses have a reuse distance similar to the smallest measured one.
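One form of the temporal locality score that is consistent with the symbols defined above is given below. It is a reconstruction that assumes the power-of-two binning and linear weighting of the methodology of Weinberg et al. [14]; both the binning and the weights are assumptions here and should be checked against that reference.

```latex
\text{score}_{\text{temporal}} =
  \sum_{i=0}^{\log_2 N - 1}
  \frac{\left(\mathit{reuse}_{2^{i+1}} - \mathit{reuse}_{2^{i}}\right)
        \left(\log_2 N - i\right)}{\log_2 N}
```

Under this form, accesses falling in the smallest reuse-distance bin receive a weight of 1 and the weight decreases linearly with the logarithm of the reuse distance. A workload with no reuse within distance N scores 0, which agrees with the interpretation of the extreme scores given above.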
2) Spatial locality
Spatial locality is defined as the tendency of an application to access memory locations close to each other [13]. It can be evaluated by studying the difference between the addresses of memory locations accessed close together in time.
Similar to temporal locality, Weinberg et al. [14] defined a metric for the spatial locality of an application. Spatial locality score is given by (2):
• stride_i represents the fraction of memory accesses for which the difference between the address of the current memory access and the accesses from a window of size W equals i.
Using this metric, a program which accesses consecutive memory locations has a spatial locality score = 1, while a program accessing every other memory location has a score = 0.5. A program whose strides are evenly distributed between 1 and 2 has a score = 0.75, while a program which accesses memory randomly has a score = 0.
Figure 6 shows the spatial and temporal locality of memory accesses for all the applications in the study. For computing the spatial locality score a window W=32 was chosen. The scores for RandomAccess, STREAM and FFT are in line with expectations. By design, RandomAccess and STREAM continuously access new memory locations without reusing data. Because of the randomness of RandomAccess, both the temporal and the spatial locality scores of this application are approximately 0. The spatial locality score of STREAM is high while its temporal score is near 0, because this application does not reuse the data it previously accessed. The figure also shows that several scientific applications, such as NAS BT, GTC and NAB, have spatial locality characteristics closer to those of the commercial applications than to those of the HPC Challenge benchmarks. NAS CG and LSDyna show lower temporal locality scores because they access distributed data. All the other scientific applications tend to access data from memory locations previously accessed, which is reflected in their fairly high temporal locality scores.
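The example scores quoted in this subsection (1, 0.5 and 0.75) can be reproduced with a short sketch. It assumes that the score is the sum over i of stride_i / i, and that the stride of an access is its smallest distance to any of the previous W accesses. Both are assumptions consistent with the values above, not a statement of the exact formula in [14]. The function name is hypothetical, addresses are expressed in words, and repeated addresses (stride 0) are assumed to contribute nothing.

```python
from collections import Counter, deque
from fractions import Fraction


def spatial_score(addresses, window=32):
    """Sum of stride_i / i, where stride_i is the fraction of accesses
    whose smallest distance to the previous `window` accesses equals i."""
    recent = deque(maxlen=window)   # the last `window` addresses
    strides = Counter()             # stride length -> number of accesses
    total = 0                       # accesses that have a predecessor
    for addr in addresses:
        if recent:
            total += 1
            stride = min(abs(addr - prev) for prev in recent)
            if stride > 0:          # assumption: stride 0 adds nothing
                strides[stride] += 1
        recent.append(addr)
    if total == 0:
        return 0.0
    return float(sum(Fraction(n, total * i) for i, n in strides.items()))


consecutive = list(range(100))          # stride 1 everywhere
every_other = list(range(0, 200, 2))    # stride 2 everywhere
mixed = [0]
for k in range(100):                    # strides alternate between 1 and 2
    mixed.append(mixed[-1] + (1 if k % 2 == 0 else 2))

scores = (spatial_score(consecutive),
          spatial_score(every_other),
          spatial_score(mixed))
```

Exact fractions are used internally so that the three patterns evaluate to exactly 1.0, 0.5 and 0.75. A random access pattern over a large address range produces mostly large strides, so its score approaches 0.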
3) Global considerations on data locality

VI. RELATED WORK
Similar to the trace-based instruction decomposition study presented in this paper, Rupnow et al. [15] provide an instruction category breakdown of SPEC-FP2000 8 and an internal benchmark suite from Sandia National Labs. The authors concluded that performance estimations based solely on SPEC-FP analysis undervalue the importance of integer instructions in scientific applications. This paper extends their work by drawing a parallel between scientific applications and commercial OLTP benchmarks. Characterization studies often use standard methodologies for the analysis of the memory access properties of a workload [16]. Regularity of memory accesses can be based on the analysis of striding properties [17]. A particular aspect of memory usage that has received significant attention is the characterization of the locality of memory accesses in an application [18, 19]. This approach can be combined with APEX [20], where benchmarks are developed to set data locality to a desired level. The resulting methodology allows the analysis of spatial and temporal locality of memory accesses of an application in an architecture-independent way [14].
VII. CONCLUSIONS
Workload characterization of applications running on hardware platforms is a critical component of computer systems design. This paper describes the methodology followed by chip manufacturers to obtain application characteristics. It presents the content of instruction traces as well as methods for collecting and validating them. An example of validation for a large set of commercial and scientific applications is shown: the miss rate in the level two (L2) cache measured by the hardware counters on a SunFire 6800 UltraSparc-III microprocessor was compared against the same metric reported by a functional simulator running on traces. This work also introduced a new concept in performance evaluation, which was the foundation for the design of a new tree-like analysis structure. Its potential for extracting complex information about software applications was highlighted through examples. Finally, a detailed study and comparison of representative scientific and commercial applications was conducted. All experimental data was collected using the new analysis structure. Several characteristics were presented as part of the study, starting with instruction decomposition and continuing with the spatial and temporal locality of memory accesses. Similarities and disparities between workload characteristics were highlighted and several important conclusions were drawn. In the future we would like to expand this work to collect similar workload characteristics for other classes of software applications.
8 www.spec.org
