An application's cache miss rate is used in timing analysis, system performance prediction and in deciding the best cache memory for an embedded system to meet tighter constraints. Single-pass simulation allows a designer to find the number of cache misses quickly and accurately on various cache memories. Such single-pass simulation systems have previously relied heavily on cache inclusion properties, which allowed rapid simulation of cache configurations for different applications. Thus far the only inclusion properties discovered were applicable to the Least Recently Used (LRU) replacement policy based caches. However, LRU based caches are rarely implemented in real life due to their circuit complexity at larger cache associativities. Embedded processors typically use a FIFO replacement policy in their caches instead, for which there are no full inclusion properties to exploit.
I. INTRODUCTION
In a computer system, energy consumption, execution time and overall system performance during execution of an application are greatly influenced by both the cache miss rate and the configurations (combinations of different cache parameters such as the number of cache sets (set size), associativity, line size (block size), etc.) of the cache memories in the memory hierarchy. Cache miss rates of the same trace using various cache configurations are generally unpredictable. Hence we need to find the cache miss rates in different cache configurations to decide the most suitable configuration. When the total number of cache misses is known for a particular application and cache memory, the energy consumption of the subject cache memory, the execution time of the application and the overall system performance can be estimated quickly using analytical models such as the one proposed in [19]. Therefore, cache miss rates of applications are widely used in deciding the best cache memory for an embedded system [14] [13] [19] [27] and in timing analysis [26] [28]. Thus, to determine the best cache configuration, given power, area and performance constraints, cache miss rates for all configurations must be found first.
To save time in determining the cache miss rate of an application on a particular cache memory, simulation of the application's memory access trace with the least possible hardware detail is widely used instead of executing the real application on a real cache memory. A robust, time-saving and resource-efficient variant of the trace driven simulator is the single-pass simulator (e.g., [12] [13] [14] [29]). In a single-pass simulator, multiple cache configurations are simulated together while reading one application's trace of memory accesses only once. Besides reducing trace reading time, single-pass simulators deploy several other speedup mechanisms such as customized data structures to represent cache memories and to search and update data in the cache memories quickly (e.g., associativity list in [19], "Wave" in [13], "CLT" in [14], etc.). In addition to custom-tailored data structures, the use of cache inclusion properties, introduced by Mattson et al. in [22], is also popular in reducing cache simulation time. An inclusion property indicates when all the elements within one cache configuration are known to be present in other configurations. Therefore, inclusion properties allow some of the simulation steps to be avoided, saving considerable simulation time when a large group of cache configurations is simulated together.
Cache inclusion properties do not hold for First-In-First-Out (FIFO) caches [22] . Previous studies have predicted the status of single FIFO cache configurations [9] [10] [25] [26] . However, the methods in those articles do not translate well to predict the contents of multiple caches in single-pass simulation simultaneously.
As a cache replacement policy, FIFO has several advantages. Among the replacement policies, caches with the FIFO replacement policy demonstrate lower energy consumption, especially compared to a cache with the LRU replacement policy [4]. Their simple design makes FIFO caches inexpensive to implement. For these reasons, FIFO is widely used as the cache replacement policy in embedded processors (e.g., Tensilica Xtensa LX2 processors [32], Intel XScale [1], ARM9 [3] and ARM11 processors [2]). Therefore, a fast simulator to determine the cache miss rate of an application on various FIFO caches is in great demand. To meet this demand, smart data structure based cache simulators such as [13] [14] are in use. However, the ability to utilize cache inclusion properties in addition to smart data structures would be of great use in reducing simulation time further. To the best of our knowledge, no simulator has ever been proposed that utilizes any FIFO cache property to reduce single-pass simulation time significantly.
In this paper, for the first time, we take the initiative to reduce single-pass simulation time for FIFO caches by utilizing FIFO cache properties. We introduce a cache property, the "Intersection property", that can help to speed up cache simulation using the same principle as inclusion properties. We present three cache intersection properties for the FIFO replacement policy. Utilizing these three intersection properties and custom-tailored space and time saving data structures, we propose a new single-pass FIFO cache simulator, "CIPARSim". CIPARSim is the first of its kind to utilize any FIFO cache property, such as the intersection property, to speed up simulation. Experimental results show that CIPARSim outperforms the available single-pass FIFO cache simulators SCUD [14] and DEW [13] significantly for all the SPEC CPU2000 [16] and Mediabench [20] applications tested. We consider SCUD the state of the art single-pass FIFO cache simulator, as DEW can simulate only caches with varying set sizes in a single pass over the trace file.
Problem statement: Given an application's memory access trace and a set of cache configurations using the FIFO replacement policy, reduce the simulation time to find the cache miss rates of all given cache configurations executing the trace by utilizing FIFO cache properties.
Layout: The rest of the paper is structured as follows. Section II presents the related works, Section III introduces the concept of cache intersection properties and presents three FIFO cache intersection properties, Section IV describes the new rapid single-pass FIFO cache simulator CIPARSim with its custom tailored data structures, Section V describes the experimental setup and discusses the results found for SPEC CPU2000 and Mediabench applications; and Section VI concludes the paper.
II. RELATED WORK
Mechanisms for acceleration of trace driven simulation to find cache miss rate have been studied for a long time for further improvement. Depending on the accuracy of the simulation results, these acceleration techniques can be categorized into two categories. The methods with limited accuracy are called estimation methods (E.g., [7] [18], etc.). These heuristics dependent methods are fast; however, not preferred when accuracy of simulation result is required. Several proposals for acceleration of trace driven cache simulation without affecting the accuracy of the results have been proposed, too. These proposals can be broadly categorized into (i) Compressed trace simulation, (ii) Parallel simulation and (iii) Single-pass simulation.
In a compressed trace simulation, redundant information is pruned to compress the application's memory access trace. As the compressed trace is often considerably shorter than the actual memory access trace, simulation time can be reduced significantly. However, the success of these simulators relies on the compressibility of the trace. In addition, the time needed to compress and decompress the trace file accurately adds overhead to the actual cache simulation time. Some examples of compressed trace simulation approaches are [23] [30] [31].
To reduce the overall simulation time, several proposals, called "Parallel simulation", were made to perform the simulation of a group of cache configurations in parallel on multiple processors. Depending on the source of parallelism, these proposals can be categorized into several subcategories. The proposal in [5] is based on set-parallelism and simulates each cache set of a cache configuration on a different processor. Han et al. proposed a method in [11] that not only exploits set-parallelism but also parallelizes searches for the requested data block in a particular cache set. Similarly, Heidelberger et al. introduced time-parallelism in [15] and Nicol et al. proposed stack distance based parallel simulation in [24]. Parallel simulation methods undoubtedly speed up the simulation process. However, their main limitation is the high resource demand of performing simulations in parallel. Due to this resource-hungry behavior, implementation is costly too.
In contrast to parallel simulation, one processing unit is used as optimally as possible in a single-pass simulation approach. Therefore, single-pass simulation can be combined with parallel or compressed trace simulation for further speedup. Single-pass simulation approaches usually exploit inclusion properties and custom-tailored data structures to reduce processing time without any help of extra hardware. In the article [17], published in 1989, Hill et al. studied the effect of varying associativity in caches in search for a rapid single-pass cache simulation approach. Sugumar et al. [27] made an effort to exploit the cache inclusion properties when they introduced the use of binomial trees to speed up LRU single-pass cache simulation in 1995. Their proposed method improved the method of [17] . Utilizing binary trees and some cache inclusion properties based on the LRU replacement policy, Sugumar's method was able to simulate multiple cache configurations very quickly in a single pass over an application trace. In 2004, Li et al. [21] proposed an advancement to Sugumar's proposal through a compression method to reduce simulation time. In 2006, Janapsatya et al. [19] proposed a method to traverse the binary tree in a top-down fashion to exploit the temporal locality in cache line accesses. In 2009, the CRCB algorithm [29] improved the simulation time of Janapsatya's technique by using a runtime pruning of the trace file and two inclusion properties. In 2009, Haque et al. [12] showed that, instead of top-down traversal, bottom-top traversal of the binary simulation tree will enable the simulator to exploit a different set of inclusion properties for LRU caches. In 2010, DEW [13] and SCUD [14] were proposed to perform rapid single-pass simulation of FIFO caches exploiting custom-tailored data structures. However, the space hungry behavior and time consuming manipulation of those custom tailored data structures left room for further improvement. 
To the best of our knowledge, until today no cache property has been proposed for FIFO caches to be exploited in the single-pass simulation for acceleration of operation.
A. Our contributions
1) For the first time, a cache property called the "Intersection property" has been introduced in this article to predict the existence of a memory block content in multiple FIFO caches during single-pass simulation.
2) Three FIFO cache intersection properties have been proposed that can be used to reduce simulation time in FIFO single-pass simulation.
3) A rapid single-pass FIFO cache simulator "CIPARSim" has been proposed that, utilizing the proposed FIFO intersection properties, shows significantly faster performance than the available FIFO cache simulators.
To the best of our knowledge, CIPARSim is the first single-pass cache simulator that uses FIFO cache properties to reduce simulation time.
This holds between alternative caches that have the same cache line size, do not pre-fetch and have the same number of sets, provided that the replacement policy induces a total priority ordering on all previously referenced memory blocks (that map to each cache set) before each reference and uses only this priority ordering to decide the next replacement cache block. The LRU replacement policy is shown to have this feature.
When inclusion properties hold between two caches, just by simulating the smaller cache, we can determine which memory block contents will be available in the larger cache. Therefore, when any of the contents from the smaller cache is re-accessed, simulation can be avoided in the larger cache for that access.
Similar to inclusion properties, intersection properties we 
A. FIFO intersection properties
We now present and prove several FIFO cache intersection properties useful for the rapid single-pass simulation of FIFO cache configurations. In this context, it is assumed that cache line size is constant for all considered cache configurations. The terms 'larger' or 'smaller' applied to cache configurations refer to the set size and/or associativity. E.g., a 'larger' cache has equal or greater set size and equal or greater associativity, with at least one being greater.
Intersection Property 1: When an element is inserted into a FIFO cache of associativity $A_X$, if the element's location within a larger FIFO cache with associativity $A_Y$ is at least $(2 \times A_X) - 3$ elements away from the replacement pointer of the larger cache in the direction of replacement, it guarantees the existence of the memory block content in the larger cache at least as long as it remains in the smaller cache.
Proof: To prove the first cache intersection property, we have to figure out the maximum number of insertions possible in the larger cache, after insertion of an element in the smaller cache. By analysis, the situation that will cause this to occur is as follows:
Let 'I' be the element that is most recently inserted (MRI) into the smaller cache. Let there also be one element in the smaller cache which is missing in the larger cache. We will call this uncommon element 'U'. In order for 'U' to exist, there must be an element 'R' in both caches that replaced 'U' in the larger cache. The 'remaining' $(A_X - 3)$ elements in the smaller cache appear at and following the replacement pointer in the larger cache. Figure 1(a) shows such a layout using 8-way and 16-way caches. The 'remaining' elements are lettered 'A' through 'E' and the other elements in the larger cache are labeled as '-' as their values do not matter. 'LRI' indicates the least recently inserted element, which is supposed to be replaced at the next insertion.
The access pattern required is to first access element 'U' which replaces the first 'remaining' element ('A' in the example). The 'remaining' elements are then accessed in the order new elements can be accessed which will replace all elements except 'I' in the smaller cache as well as (A X − 1) elements in the larger cache. This is the largest amount of replacements that could occur in the larger cache without replacing 'I' in the smaller cache and the total number of insertions is
. The final situation in our previous example is shown in Figure 1(b) with the new elements that were inserted marked as '+'. Thus, as long as
Proof: Call the MRI element of a set within a 2-way cache 'X' and the other element 'Y'. Call the time immediately before 'X' was inserted $t_1$, and the time when 'X' was inserted $t_2$.
At $t_2$, 'X' is the most recently accessed element and is thus present in all cache configurations. The counter-example to the intersection property can only occur if 'X' is not present in the larger cache at a future time $t_3$. Only 'X' and 'Y' can be accessed between $t_2$ and $t_3$ as otherwise, 'X' would no longer be the MRI. The counter-example requires that 'X' is not present in the larger cache at $t_3$, yet it was present at $t_2$, so it must have been replaced by 'Y' at $t_3$. Thus, at $t_2$, 'X' was due as the next element for replacement and 'Y' was not present in the larger cache. It follows that 'X' is not the MRI of the larger cache at $t_2$ and thus, 'X' was present and 'Y' was not present at $t_1$. At time $t_1$, 'Y' was the MRI of the 2-way cache. Thus, in order for 'X' to fit the requirements of the counter-example at $t_3$, 'Y' must fit the requirements of the counter-example at the previous time, $t_1$. So the counter-example can only exist if it existed at a previous time.
When the cache is first filled, compulsory misses must occur, ensuring all elements from the 2-way cache will be present in the larger configurations. As the counter-example is not present at this point and must have occurred at a previous point in order to occur again, it can never occur.
Intersection Property 3: Using the FIFO cache replacement policy, the non-most recently inserted (NMRI) element of any set in a 2-way associative cache of set size $S_{2\text{-way}}$ must be present in all larger FIFO cache configurations if the NMRI element has been accessed after the insertion of the MRI element.
Proof: We know that the MRI element must be present in all larger FIFO cache configurations due to Intersection Property 2. Furthermore, the most recently accessed (MRA) element of the 2-way set must be present in all larger configurations according to the CRCB algorithms [29]. If the NMRI element has been accessed after the insertion of the MRI element, it would at that time be the MRA element and must be present in all configurations. As there are only 2 elements in the 2-way cache set, no accesses to any other elements which map to the set can occur, as that would cause a new insertion. The same elements which map to the set in the 2-way cache map to the corresponding set(s) in larger cache configurations. Hence, no insertions can occur in the larger configurations, as both the MRI and NMRI elements were present when the NMRI element was accessed and those are the only elements to be accessed since that time. Thus, as long as the NMRI element has been accessed after the insertion of the MRI element, it will be present in all larger cache configurations.
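As an informal sanity check on Properties 2 and 3 (not a substitute for the proofs), the following Python sketch replays a random trace through one 2-way FIFO set and through larger FIFO sets with the same mapping, and asserts both properties after every access. The function name `check_properties`, the associativities 4 and 8, and the random trace are assumptions made for illustration; `deque(maxlen=...)` stands in for a FIFO set because appending to a full deque drops its oldest entry.

```python
import random
from collections import deque

def check_properties(trace, larger_ways=(4, 8)):
    """Assert Intersection Properties 2 and 3 after every access of `trace`."""
    small = deque(maxlen=2)                        # 2-way FIFO set, newest on the right
    larger = [deque(maxlen=w) for w in larger_ways]
    nmri_touched = False                           # NMRI accessed since the MRI was inserted?
    checks = 0
    for block in trace:
        if block in small:
            if block != small[-1]:                 # hit on the NMRI element
                nmri_touched = True
        else:
            small.append(block)                    # miss: block becomes the new MRI
            nmri_touched = False
        for cache in larger:
            if block not in cache:
                cache.append(block)                # FIFO insertion on a miss
        for cache in larger:
            assert small[-1] in cache              # Property 2: MRI is present
            if nmri_touched:
                assert small[0] in cache           # Property 3: touched NMRI is present
            checks += 1
    return checks

random.seed(0)
trace = [random.randrange(10) for _ in range(5000)]  # hypothetical block addresses
performed = check_properties(trace)
```

The check passing on one trace only illustrates the properties; the guarantees themselves rest on the proofs above.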
IV. CIPARSIM SIMULATION APPROACH
CIPARSim utilizes the proposed cache intersection properties with the help of special data structures to reduce single-pass simulation time of FIFO caches with the same cache line size. In this section, we are going to discuss the data structure used in CIPARSim along with the simulation approach.
A. Data structure
CIPARSim maintains a look-up table to store all the memory block addresses which are present in the FIFO caches during single-pass simulation. For each memory block address, one look-up table entry is created which is accessible by using the memory block address as the key. Bit arrays are stored in each look-up table entry to indicate which cache configurations have stored the key memory block; one bit array per cache set size being simulated. Each bit in a bit array represents the presence of the memory block within a particular associativity being simulated. When a bit is set (to 1), it indicates a cache miss; otherwise, a hit. CIPARSim maintains separate look-up tables for data and instruction accesses. During simulation, just by analyzing the look-up table entries/bit arrays for the requested memory block addresses, CIPARSim decides cache hits and misses in all the cache configurations under simulation.
The look-up table entries are arranged into smaller sets and sorted according to their keys. The mapping of memory blocks to look-up table sets is equivalent to the mapping to the set associative cache sets. Binary search is used on the keys/memory block addresses inside a look-up table set to find the appropriate entry.
In Figure 2(a) , an example of a CIPARSim look-up table has been presented. The example Look-up table has two sets and is suitable for simulating two different cache set sizes (1 and 2) with three different associativities (2, 4 and 8 in the example). Set 0 of the look-up table has two binary memory block addresses "10010" and "11010". Each memory block address is associated with two bit arrays of three bits each. The rightmost bit of a bit array reflects the smallest FIFO associativity, which is 2 in the example, and associativity increasing as we move to the left. For example, with address "10010", the bit array '011' is associated for set size 2. The bit array '011' indicates that the content from the memory block address "10010" is absent in the FIFO cache with associativity 2 and set size 2, and in the cache with associativity 4 and set size 2. However, it is present in the cache with associativity 8, set size 2 and same cache line size. As soon as a memory block content is evicted from all the FIFO cache configurations under simulation, the memory block address is also evicted from the look-up table to keep the look-up table at its smallest possible size. For example, the memory block address "10011" in Figure 2 (a) will be evicted from the look-up table.
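The bit-array convention of the example can be illustrated with a short Python sketch. It assumes bit arrays are held as strings of '0' and '1' with the rightmost character belonging to the smallest associativity, and that a set bit means the block is absent, as in the '011' example. The helper names `decode_bit_array` and `can_evict` are hypothetical and not part of CIPARSim.

```python
def decode_bit_array(bits, associativities):
    """Return {associativity: present?} for one bit array.

    Bit 0 (rightmost) belongs to the smallest associativity; a set bit
    means the block is absent from that configuration.
    """
    value = int(bits, 2)
    ordered = sorted(associativities)
    return {a: ((value >> i) & 1) == 0 for i, a in enumerate(ordered)}

def can_evict(bit_arrays):
    """A look-up table entry can go once every bit in every array is set."""
    return all(set(bits) == {"1"} for bits in bit_arrays.values())

# Address "10010", set size 2, associativities 2, 4 and 8 (from the example).
present = decode_bit_array("011", (2, 4, 8))
print(present)   # {2: False, 4: False, 8: True}
```

Here `can_evict` mirrors the eviction rule described above: an entry whose bit arrays are all ones is no longer held by any simulated configuration.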
To assist in updating a particular bit array in a look-up table entry, CIPARSim maintains a binary tree, called a simulation tree. Each level inside the tree represents the FIFO cache configurations of a certain set size (the configurations still vary in associativity). Each tree node represents a FIFO cache set. To simulate fully-associative FIFO caches, the top level of the simulation tree would have only one node. Subsequent child levels continue to double the set size. All the FIFO cache configurations represented by a simulation tree have equal cache line size. Figure 2 (b) illustrates a simulation tree starting with a FIFO cache configuration with set size of two (see top level nodes 0 and 1). The first node on the left, stamped '0', refers to cache set with index 0 in the cache with set size 2. And the second node with token '1' refers to cache set 1. At the second level of the two trees, there are a total of four nodes stamped '00', '10', '01' and '11'. Thus the second level represents a FIFO cache with set size of four, and the numbering within the nodes represent the respective cache sets as shown in Figure 2(b) . Similarly, the third level (illustrated as the bottom level in Figure 2 (b)), will represent a FIFO cache with eight sets. More caches with bigger set sizes can be represented by expanding the tree further. We assume the traditional mapping from memory address to sets is performed by taking the lower bits of the memory address to determine the set. Thus, the elements that map to any node of a tree will be mapped to its child nodes in the next level and only its child nodes.
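The relation between a memory block address and the tree nodes it maps to can be sketched as follows. This is a minimal illustration assuming set sizes are powers of two and that block addresses already exclude the line offset; the function names `node_label` and `child_labels` are hypothetical.

```python
def node_label(block_address, set_size):
    """Label of the tree node (cache set) a block maps to at a given level,
    taken from the low-order bits of the block address."""
    bits = set_size.bit_length() - 1          # log2(set_size) for powers of two
    if bits == 0:
        return ""                             # single node: fully associative level
    return format(block_address & (set_size - 1), f"0{bits}b")

def child_labels(label):
    """Doubling the set size adds one higher-order index bit, so a node's
    blocks can only land in these two child nodes."""
    return ["0" + label, "1" + label]

block = 0b1101100
print(node_label(block, 2), node_label(block, 4), node_label(block, 8))  # 0 00 100
print(child_labels("0"))                                                  # ['00', '10']
```

Because every block of a node lands in one of its two children, a simulator only needs to follow a single root-to-leaf path per requested address.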
To represent associativity, one list for each associativity, with fixed length equal to the associativity, is associated with each tree node. Each node inside a list represents a cache line and has a pointer to the memory block address inside a look-up table set to indicate which memory block resides there. Each node in a simulation tree also stores the most recently inserted (MRI) memory block address of associativity 2 when caches with associativity 2 are simulated. Separate storage of associativity 2's MRI helps to exploit the second intersection property of Section III-A. The order of the nodes inside the associativity lists is maintained as in the FIFO caches. In CIPARSim, a bit called "Track Flag" is also stored with each tree node when caches with associativity 2 are simulated. Whenever a new memory block is inserted into the list for associativity 2, the Track Flag is set to false (or 0). If an existing memory block in the associativity 2 list of the selected tree node is re-accessed, the Track Flag is set to true. The Track Flag helps to exploit the third intersection property of Section III-A. A cache hit in the FIFO associativity 2 with the Track Flag set to true indicates that the memory block content is available in all the larger FIFO cache configurations in the same simulation tree. If the smallest associativity is larger than two, a bit called "Intersection Flag" is associated with each cache line in the smallest associativity list of a simulation tree node. These extra bits help to utilize the first intersection property of Section III-A. Whenever a new memory block tag is inserted in a cache line in the smallest associativity list of a tree node, the Intersection Flag is set to true if the same memory block tag is at least $(2 \times A_X) - 3$ (where $A_X$ is the smallest associativity) elements away from the replacement pointers in the other larger associativity lists in the same tree node.
Therefore, when a memory block content in the smallest associativity in a tree node is re-accessed, simulation can be avoided in the larger associativities if the Intersection Flag is found true. In CIPARSim, the smallest associativity must be 2 or larger if the MRA tag is not saved separately for each simulation tree node to simulate direct-mapped caches.
In Figure 2(c), an example tree node '00' from Figure 2(b) is presented with two FIFO associativity lists illustrated (namely associativity=2 and associativity=4). Node '00' is from the second level in the tree of Figure 2(b). The second level of the tree of Figure 2(b) represents a FIFO cache with set size 4. The first cache line of the list for associativity 2 has a pointer to the memory block address "1101100" in the look-up table of CIPARSim. Using these pointers, CIPARSim can update the look-up table's bit arrays when the address "1101100" is evicted from the associativity 2 list in tree node '00' due to a miss for that node.
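The Intersection Flag test used with the first intersection property can be sketched in Python. This is an illustration under stated assumptions, not CIPARSim's code: each associativity list is treated as a circular buffer whose replacement pointer advances by one slot per insertion, so the distance "in the direction of replacement" is taken as a modular slot difference. The names `distance_from_pointer` and `intersection_flag` are hypothetical.

```python
def distance_from_pointer(slot, pointer, ways):
    """Number of insertions that can still occur before `slot` is overwritten,
    assuming the replacement pointer moves one slot forward per insertion."""
    return (slot - pointer) % ways

def intersection_flag(placements, smallest_ways):
    """True if, in every larger associativity list, the block sits at least
    (2 * A_X) - 3 slots ahead of the replacement pointer.

    `placements` is a list of (slot, pointer, ways) tuples, one per larger list.
    """
    threshold = 2 * smallest_ways - 3
    return all(distance_from_pointer(s, p, w) >= threshold
               for s, p, w in placements)

# 8-way smallest list and one 16-way list, as in the sizes used for Figure 1:
print(intersection_flag([(13, 0, 16)], smallest_ways=8))   # True  (13 >= 13)
print(intersection_flag([(12, 0, 16)], smallest_ways=8))   # False (12 < 13)
```

Whether distances are counted from zero or one at the pointer slot is an assumption of this sketch; the threshold itself comes from Intersection Property 1.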
B. CIPARSim simulation approach
To simulate an application trace, CIPARSim reads one requested memory block address at a time from the trace file and evaluates it in the FIFO cache configurations under simulation. CIPARSim does not simulate consecutive requests for the same address. For a requested memory block address, cache hit/miss evaluation continues from the smallest to the largest FIFO cache set size in the look-up table. For a particular cache set size, cache hit/miss evaluation continues from the smallest to the largest associativity lists before moving to the next set size. In other words, cache simulation starts in the top level's smallest associativity list in a simulation tree and finishes in the bottom level's largest associativity list. Function AddressEvaluation illustrates the process to evaluate an address request (RA) to determine hit and miss for the FIFO cache configurations (with set size $S = 2^L$, associativity $A$ and the same cache line size) under consideration. We assume that the associativities are $2^i$ where $i \geq 1$ and the smallest set size is 1 to simulate fully associative caches. In the function, LT represents the Look-up Table. A textual description of the flow of the Function AddressEvaluation is given below:
1. RA evaluation starts by searching for the address in the appropriate LT set using binary search. If the address is not found in the look-up table, CIPARSim declares a cache miss for all the cache configurations. RA is placed in the LT and a pointer to RA's location in LT is placed in every cache configuration. RA is placed in the MRI of associativity 2 and associativity 2's Track Flag is set to false if the selected cache memory's associativity is 2 (see Section IV-A). If the smallest associativity is not 2, the Intersection Flag is set to true (see Section IV-A).
2. When RA is found in LT, CIPARSim selects the cache set sizes one by one, starting from the smallest cache set size, and evaluates all the different cache configurations with the selected cache set size. For a selected cache set size, associativities are selected for evaluation one by one starting from the smallest associativity. When a cache miss occurs in a cache memory, CIPARSim records a cache miss and places a pointer in the selected configuration to point to the location of RA in LT. On a cache hit, CIPARSim just records a cache hit and continues evaluation to the next cache memory. However, some extra steps are necessary when the selected cache configuration has the smallest associativity. If the smallest associativity is 2 and RA is missing in the cache with the smallest associativity, the Track Flag for that cache is set to false and RA is set as the MRI of that cache. If RA is found in that cache, a cache hit is recorded for all the remaining configurations if the Track Flag is found true (see the third intersection property of Section III-A), and after that evaluation is stopped for RA. If RA was found in the selected cache but the Track Flag is not set to true, the MRI entry is checked. If RA is found as the MRI, a cache hit is recorded for all the remaining cache configurations (see the second intersection property of Section III-A), the Track Flag is set to true, and after that evaluation is stopped for RA. However, if RA is not the MRI, a cache hit is recorded for the selected cache and the next associativity is picked for evaluation after setting the Track Flag to true. When the smallest associativity is larger than two, and RA is found in the selected cache with the smallest associativity, a cache hit is declared for all the configurations with the same set size if the Intersection Flag is true (see the first intersection property of Section III-A). If RA is in the smallest associativity cache but the Intersection Flag is false, CIPARSim records a cache hit for the current cache and continues evaluation to the next configuration. If RA was not found in the selected cache with the smallest associativity, CIPARSim updates the Intersection Flag of the selected cache according to the first intersection property of Section III-A. That means the Intersection Flag is set to true if all other cache configurations with the same set size have RA and, in those larger configurations, RA is at least $(2 \times A_X) - 3$ (where $A_X$ is the smallest associativity) elements away from the replacement pointers in the direction of replacement. Otherwise, the Intersection Flag is set to false.
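The associativity 2 branch of the flow above can be illustrated with a deliberately reduced Python sketch. It is not the AddressEvaluation function itself: it assumes a single fully associative set, associativities 2, 4 and 8, no look-up table or simulation tree, and it uses only the second and third intersection properties. As a conservative simplification of this sketch, the flag is raised only once an NMRI hit has been evaluated in the larger caches. All names (`plain_fifo_misses`, `shortcut_fifo_misses`, `track_flag`, `mri`) are hypothetical.

```python
import random
from collections import deque

def plain_fifo_misses(trace, ways):
    """Reference: simulate one FIFO set of the given associativity on its own."""
    cache, misses = deque(maxlen=ways), 0
    for block in trace:
        if block not in cache:
            misses += 1
            cache.append(block)          # a full deque drops its oldest entry (FIFO)
    return misses

def shortcut_fifo_misses(trace, associativities=(2, 4, 8)):
    """Simulate all associativities together, skipping the larger caches when
    Intersection Property 2 or 3 already guarantees a hit there.
    Assumes the first entry of `associativities` is 2."""
    caches = [deque(maxlen=a) for a in associativities]
    misses = [0] * len(caches)
    skipped = 0                          # evaluations avoided in larger caches
    mri = None                           # most recently inserted block, 2-way set
    track_flag = False                   # True once the NMRI block was re-accessed
    for block in trace:
        small = caches[0]
        if block in small:
            if track_flag or block == mri:
                skipped += len(caches) - 1   # guaranteed hit in every larger cache
                continue
            track_flag = True                # NMRI hit: evaluate larger caches once
        else:
            misses[0] += 1
            small.append(block)
            mri, track_flag = block, False
        for i in range(1, len(caches)):
            if block not in caches[i]:
                misses[i] += 1
                caches[i].append(block)
    return misses, skipped

random.seed(1)
trace = [random.randrange(12) for _ in range(20000)]   # hypothetical block addresses
misses, skipped = shortcut_fifo_misses(trace)
```

Comparing `misses` against `plain_fifo_misses` for each associativity is a simple way to confirm that skipping the larger caches does not change the miss counts, since a FIFO hit leaves the cache state untouched.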
V. EXPERIMENTAL PROCEDURE AND RESULTS
To determine the acceleration gained by CIPARSim, we have compared its simulation time with the available single-pass FIFO cache simulators SCUD [14] and DEW [13]. For this purpose, we have re-implemented both SCUD and DEW following the specifications provided by the reference articles. Like CIPARSim, SCUD can simulate FIFO caches with varying set sizes and associativities in a single pass. However, DEW can simulate FIFO caches with varying set sizes only in a single pass. As there was no parallelization in use, DEW was repeated multiple times on the simulation machine to simulate different associativities. In each repetition, the trace file was read once in DEW.
To compare the performance of these simulators, twenty-six SPEC CPU2000 benchmark applications (which are mainly general purpose/scientific computation applications) and six Mediabench applications (which are mainly embedded system applications) were used. Applications were executed in "SimpleScalar/PISA 3.0d" [6] to generate the memory trace files. We use SimPoint [8] to identify the most relevant stage in the SPEC CPU2000 programs. Each SPEC CPU2000 application trace is then generated by simulating 300 million instructions within the point identified by SimPoint. This reduction was performed due to the large size of the memory traces generated by each SPEC CPU2000 application (SPEC CPU2000 programs ran upwards of 10 billion instructions). Application traces were fed into all the simulators we have implemented. All of these simulators were executed on a machine with a dual core Opteron64 2GHz processor, 8GB of main memory and 1MByte of L2 cache predistributed among the processing cores. Note that trace driven cache simulators are used to find the number of cache misses for an application trace, and they do not produce wrong or different results whether executed on a general purpose processor or an embedded processor, as long as the trace file is the same. Due to space limitations, simulation results are presented only for six SPEC CPU2000 applications and five Mediabench applications in this article. The names of the applications are presented in the first column of Table II. The applications are selected depending on their total number of memory accesses. We have selected some applications with very few memory accesses (e.g., sixTrack and JPEG decode), some applications with many memory accesses (e.g., eon and MPEG2 decode) and the remaining applications in between the extremes (e.g., ammp and G721 decode). The number of memory accesses in each application is presented in the sixth column in Table II.
TABLE I CAC H E C O N FI G U R AT I O N PA R A M E T E R S
To compare CIPARSim's performance with that of SCUD and DEW, 300 FIFO cache configurations (non-unified caches) were simulated on each of the three simulators to obtain each application's total number of cache misses. Table I shows how the 300 FIFO cache configurations were derived from the cache parameters.
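As a rough illustration of how a parameter grid yields 300 configurations, the sketch below enumerates one possible grid. The concrete ranges are assumptions, not the contents of Table I: line sizes of 4, 16 and 64 Bytes and associativities of 4, 8 and 16 appear in the text, whereas line sizes of 8 and 32 Bytes, an associativity of 2 and the fifteen power-of-two set counts are chosen here only so that the totals match the reported 300 configurations (60 per cache line size).

```python
from itertools import product

# ASSUMED ranges for illustration; the actual values are those of Table I.
LINE_SIZES = [4, 8, 16, 32, 64]            # Bytes; 8 and 32 are assumptions
ASSOCIATIVITIES = [2, 4, 8, 16]            # 2 is an assumption
SET_COUNTS = [2 ** i for i in range(15)]   # 1 .. 16384 sets, assumed

# Every combination of line size, associativity and set count is one configuration.
configs = list(product(LINE_SIZES, ASSOCIATIVITIES, SET_COUNTS))
configs_per_line_size = len(ASSOCIATIVITIES) * len(SET_COUNTS)

print(len(configs), configs_per_line_size)  # 300 60
```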
Table II presents the simulation times of DEW, SCUD and CIPARSim. Column 2 presents the cache line size (only 4, 16 and 64 Bytes are shown due to space limitations). Columns 3 to 5 present the simulation times of the three simulators over the configurations of Table I for the particular cache line size. For every application, CIPARSim was significantly faster than DEW and SCUD. Over DEW, CIPARSim showed its highest speedup of 10 times for the application "eon" with a block size of 64 Bytes. In this case, DEW's simulation time was 53.35 min and CIPARSim's time was 5.16 min. Over SCUD, CIPARSim showed a speedup of 5 times at best, for the application "eon" with a cache line size of 4 Bytes. In this case, CIPARSim's simulation time was 27.31 min whereas SCUD's execution time was 1.19 hours. On average, CIPARSim is 5 times faster than DEW and 3 times faster than SCUD. CIPARSim's speedup, defined as (DEW or SCUD simulation time)/(CIPARSim's simulation time), is presented in Figure 3 for all six SPEC CPU2000 and five Mediabench applications.
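The speedup definition can be checked against the DEW figures quoted above for "eon" with 64-Byte blocks. The helper name `speedup` is hypothetical and is used here only for illustration.

```python
def speedup(baseline_minutes, ciparsim_minutes):
    """Speedup = (DEW or SCUD simulation time) / (CIPARSim's simulation time)."""
    return baseline_minutes / ciparsim_minutes


# "eon", 64-Byte blocks: DEW took 53.35 min and CIPARSim took 5.16 min.
dew_eon_64 = speedup(53.35, 5.16)
print(f"{dew_eon_64:.1f}x")  # 10.3x
```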
During the simulation runs of CIPARSim, we recorded the number of cache hits predicted by the intersection properties discussed in this paper. Columns 10 and 11 present the total number of cache hits predicted by the second and third intersection properties respectively while simulating the 60 cache configurations for each cache line size. The total number of cache hits that occurred is presented in column 7. The results show that the second and third intersection properties together can predict up to 90% of the total cache hits (for "sixTrack", the total number of hits is 22 billion, of which 19 billion are predicted by the second intersection property and 178 million by the third intersection property).
To check the effectiveness of the first intersection property, we added the Intersection Flags to the tags in associativity 4. As the first intersection property can be applied only when the smallest associativity is greater than 2, column 8 presents the total number of cache hits that occur when only associativities 4, 8 and 16 are considered. The results show that the first intersection property alone can predict up to 60% of the total hits, which is again observed for "sixTrack" with a block size of 64 Bytes. In this case, the total number of cache hits is 17 billion, 10 billion of which are predicted by the first intersection property.
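The quoted "sixTrack" figures can be turned into prediction ratios as follows. This is only a back-of-the-envelope check: the inputs are the rounded counts given in the text, so the resulting fractions are approximate, and the helper name `predicted_fraction` is hypothetical.

```python
def predicted_fraction(predicted_hits, total_hits):
    """Share of all cache hits that were predicted without full simulation."""
    return sum(predicted_hits) / total_hits


# Second and third intersection properties: 19 billion and 178 million of 22 billion hits.
second_and_third = predicted_fraction([19e9, 178e6], 22e9)

# First intersection property (associativities 4, 8 and 16 only): 10 of 17 billion hits.
first_only = predicted_fraction([10e9], 17e9)

print(f"{second_and_third:.0%} {first_only:.0%}")  # 87% 59%
```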
As a very large number of cache hits were predicted by the intersection properties in CIPARSim (on average, 65% of hits are predicted), many of the time-consuming simulation steps were avoided. In addition to the profound role of the intersection properties in reducing CIPARSim's simulation time, the data structure also played a noticeable role. Unlike SCUD, CIPARSim divides the look-up table into smaller sets. Binary search therefore needs to search only a small set of elements to find the requested memory block. Once the memory block is found in the look-up table, fast bit operations are performed to determine cache hits and misses. Bit arrays not only helped to reduce simulation time, they also made CIPARSim space efficient. Like SCUD, CIPARSim uses a look-up table and a simulation tree; however, CIPARSim's space consumption is almost 55% less than SCUD's, as bit arrays are used in the look-up table entries. DEW consumes much less space than SCUD and CIPARSim; however, its space-efficient data structure does not allow it 
VI. CONCLUSION
To assist in the single-pass simulation of FIFO caches, a new kind of cache property called "Intersection property" has been [...] various FIFO cache configurations.

[...] A data cache optimization system for application processor cores and its experimental evaluation. In IEICE Technical Report, VLD2006-122, ICD2006-213, 2006.
[...] vić, and S. Parameswaran. Finding optimal l1 [...]
