In this project, we investigate the fluctuations in performance caused by changing the instruction cache (I-cache) size and the data cache (D-cache) size in the L1 cache. We employ the Gem5 framework to simulate a system with varying specifications on a single host machine. We use the Freqmine benchmark from the PARSEC suite as the workload for benchmarking our simulated system. The out-of-order (O3) CPU with the Ruby memory model was simulated in a full-system X86 environment running Linux. The chosen metrics cover hit rate, misses, memory latency, instruction rate, and bus traffic within the system. Performance observed while varying the L1 sizes over a range of values was used to compute confidence-interval-based statistics for the relevant metrics. Our expectations, the corresponding experimental observations, and the discrepancies between them are also discussed in this report.
INTRODUCTION
Most modern fast CPUs possess multiple levels of CPU cache. Early CPUs with a cache had only one level; later, the level 1 cache was split into L1D for data and L1I for instructions. We believe it is important to examine the sizes of L1I and L1D for the following reasons: (i) The data stored in L1D differs from that in L1I. L1I holds not only the instructions but also annotations, such as the start location of the next instruction, to help the decoders. Some processors use a "trace cache" that stores the result of decoding an instruction rather than the instruction in its original encoded form. (ii) Keeping L1D and L1I separate simplifies the overall circuitry. L1D handles both read and write operations, while L1I handles only reads, so supporting self-modifying code directly in L1I would be expensive. Instead, a write goes through L1D to the L2 cache, and the corresponding line in L1I is then invalidated and reloaded from L2. This is more efficient than overwriting the data in L1I directly. (iii) Most modern processors can read from L1D and L1I simultaneously; with a single shared cache, such requests would queue at the cache entrance. Separate caches increase the overall bandwidth, such that two reads and one write can be performed in a given cycle. (iv) With separate caches, the circuitry for instructions and data can be powered independently, increasing the chance that a circuit remains unpowered during any given cycle. This matters for saving power: although the memory cells themselves need power to maintain their contents, some processors power down parts of the associated circuitry, such as decoders, when not in use.
The performance benefit of an L1 cache is directly related to its efficiency, i.e., its hit rate, and repeated cache misses can have a catastrophic impact on overall CPU performance. Since L1 sits at the top of the cache hierarchy, we consider minimizing cache misses at the L1 level particularly important, both for references to the levels below and for execution speed as a whole. Hence, we investigate the effect of changing the sizes of the L1 instruction cache (L1I) and the L1 data cache (L1D) in a full-system simulation of an X86 architecture. We use the Freqmine benchmark from the PARSEC suite, which falls under the data mining application domain. We hypothesize that changing the sizes of L1I and L1D has a principal effect on the system, affecting the hit rate, the miss rate, and the penalty metrics of the processor. Specifically, we expect the hit rate to increase as the cache sizes increase and, by the same logic, the miss rate to decrease. For a more detailed analysis, we examined the stats.txt file produced by each Freqmine run. Following are our system specifications per the project guidelines for achieving full, as well as extra, credit:
• Built and simulated on a personal computer instead of using pre-built libraries on the shared system Eustis.
• Full System Linux X86 simulation instead of System-call Emulation.
• O3 CPU for out-of-order, pipelined, and multi-processor execution instead of simpler CPUs.
• Ruby memory model for better flexibility in cache-based systems.
• Used standardized PARSEC benchmark.
• Computed confidence-interval-based statistics after running a large number of experiments (> 500).
This report is organized as follows: we discuss previous work in the next section. We then briefly describe the two previous phases: PP1 (Options) and PP2 (Experiment Design). Later sections focus on the statistical analysis of the experimental results and the conclusions drawn from this empirical study.
RELATED WORK
Instruction and data caches are well-known architectural solutions that significantly improve the performance of high-end processors. Because of their sensitivity to soft errors, they are often disabled in safety-critical applications, sacrificing performance for improved dependability. The work in [2] reports an accurate analysis of the effects of soft errors in the instruction and data caches of a soft core implementing the SPARC architecture. The cache organization in [3] essentially eliminates this penalty; this organizational feature has been incorporated into a cache interface subsystem design, which has been implemented and prototyped. A master-slave cache system has a large, set-associative master cache and two smaller direct-mapped slave caches: a slave instruction cache that supplies instructions to the processor's instruction pipeline, and a slave data cache that supplies data operands to its execution pipeline. The master cache and the slave caches are tightly coupled to each other. This tight coupling [4] allows the master cache to perform most cache management operations for the slave caches, freeing the slave caches to supply a high bandwidth of instructions and operands to the processor's pipelines. The work in [6] presents a method for determining a tight bound on the worst-case execution time (WCET) of a program running on a given hardware system with cache memory. Caches are used to improve average memory performance; however, their presence complicates worst-case timing analysis. An automatic tool-based approach to bounding worst-case data cache performance is given in [5]. This approach works on fully optimized code, performs the analysis over the entire control flow of a program, detects and exploits both spatial and temporal locality within data references, produces results typically within a few seconds, and estimates, on average, 30% tighter WCET bounds than can be predicted without analyzing data cache behavior.
According to the study in [7], a cache memory may contain contents that are susceptible to corruption. A cache controller, using a threshold timer, may employ various operations to flush modified cache contents into main memory and to invalidate cache contents so that they are overwritten. These operations include periodically flushing and invalidating the whole cache memory, periodically flushing and invalidating modified contents, and periodically flushing and invalidating contents based on how long they have been stored in the cache memory. By overwriting cache contents that might otherwise remain in the cache memory indefinitely, the system minimizes the probability of cache contents becoming corrupt. The periodic updating of main memory may also increase the probability of successfully recovering from potential cache parity errors while still maintaining the high performance associated with using a cache memory.
We follow the steps mentioned in the tutorial 1 to run the simulations on gem5. We build a gem5 binary and run a simulation for the X86 processor in Full System (FS) mode 2.
• We first get and run the clone.sh script to clone both gem5 and arm-gem5-rsk repositories from the aforementioned url.
$ wget https://raw.githubusercontent.com/arm-university/arm-gem5-rsk/master/clone.sh
$ bash clone.sh
• Then, using the tutorial's instructions for ARM, we build gem5 from source but for X86.
$ cd gem5
$ scons build/X86/gem5.opt -j8 # parallel build on 8 host cores
• We get the X86 full system disk image, expand the image to fit Parsec, and set $M5_PATH:
$ wget http://www.m5sim.org/dist/current/x86/x86-system.tar.bz2
$ tar xjvf x86-system.tar.bz2
• Next, we install and use Freqmine from the PARSEC Benchmark Suite for benchmarking in FS mode for X86.
• Next, we make some edits to gem5's code for it to work with our setup. More specifically, we update FSConfig.py so that our .img disk image can be read by gem5.
• Finally, we run the simulations for all three simulation scales (simsmall, simmedium, and simlarge) as shown below. The exploration parameters are updated for each experimental setup.
$ ~/gem5/build/X86/gem5.opt -d ../../gem5/fs_results/trial_freqmine16x16 \
    ~/gem5/configs/example/fs.py \
    --disk-image=/home/kartik/gem5/x86_fs_img_files/disks/expanded-linux-x86.img \
    --kernel=/home/kartik/gem5/x86_fs_img_files/binaries/x86_64-vmlinux-2.6.22.9 \
    --script=/home/kartik/arm-gem5-rsk/parsec_rcs/freqmine_simsmall_8.rcS \
    --l1i_size="16kB" --l1d_size="16kB" --ruby --cpu-type="DerivO3CPU"
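Since only the cache sizes and the simulation scale change between setups, the command above can be generated programmatically. The following is a minimal Python sketch, not the script used in this project: the function name build_command, the output-directory naming scheme (assumed here to be L1I size x L1D size), and the rcS file names for simmedium and simlarge (assumed to follow the simsmall pattern) are all assumptions.

```python
import os
import shlex

# Paths copied from the command shown in the report; adjust for your checkout.
GEM5 = os.path.expanduser("~/gem5")
IMG = "/home/kartik/gem5/x86_fs_img_files/disks/expanded-linux-x86.img"
KERNEL = "/home/kartik/gem5/x86_fs_img_files/binaries/x86_64-vmlinux-2.6.22.9"
RCS_DIR = "/home/kartik/arm-gem5-rsk/parsec_rcs"


def build_command(l1i_kb, l1d_kb, scale="simsmall"):
    """Return the gem5 argument list for one (L1I, L1D, scale) setup."""
    # Assumed naming scheme: trial_freqmine<L1I>x<L1D>.
    outdir = f"../../gem5/fs_results/trial_freqmine{l1i_kb}x{l1d_kb}"
    return [
        f"{GEM5}/build/X86/gem5.opt",
        "-d", outdir,
        f"{GEM5}/configs/example/fs.py",
        f"--disk-image={IMG}",
        f"--kernel={KERNEL}",
        # Assumed: medium/large scripts follow the simsmall file-name pattern.
        f"--script={RCS_DIR}/freqmine_{scale}_8.rcS",
        f"--l1i_size={l1i_kb}kB",
        f"--l1d_size={l1d_kb}kB",
        "--ruby",
        "--cpu-type=DerivO3CPU",
    ]


if __name__ == "__main__":
    # Print the shell command for the 16 kB / 16 kB setup.
    print(shlex.join(build_command(16, 16)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues when paths contain spaces.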
EXPERIMENT DESIGN
In this section, we briefly describe our experimental design to investigate the processor performance based on varying L1-I and L1-D cache sizes, as proposed in our project phase 2.
• CPU Parameters: We used O3 as our pipelined, out-of-order CPU model for the simulation. More specifically, we ran the simulation with Gem5's DerivO3CPU SimObject. We ran experiments with both a single core and multiple cores, and report our results for the former.
• Memory Model: We used the Ruby option in Gem5's highly configurable fs.py script because of its fidelity and cache-coherence flexibility. Ruby accurately models both cache coherence and network-related features of the memory system. We experiment with the L1I and L1D cache sizes and evaluate the performance of the system.
• Number of Runs: For each parameter setting, we performed ten runs to average out any effect of randomness. This gives a total of 5 L1I sizes * 5 L1D sizes * 10 runs per setup * 3 simulation scales (small | medium | large) = 750 experiments to run.
• Runtime: The X86 Freqmine simulations were much faster than our initial ARM simulations, which allowed us to run many experiments. Because of the minimal boot overhead of the bare-bones Linux X86 image we used, an average simulation took approximately 45 minutes to run. We ran multiple such simulations in parallel to make full use of all the cores available on the host machine.
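The experiment count and the runtime figure from the list above can be combined into a rough wall-clock estimate. The sketch below is illustrative only: the specific cache sizes are placeholders (the report states only that five values were used per cache), and the number of parallel jobs is a hypothetical parameter.

```python
import math
from itertools import product

# Placeholder size grids in kB: five values per cache, as in the design,
# but these particular values are assumptions.
L1I_SIZES_KB = [16, 32, 64, 128, 256]
L1D_SIZES_KB = [64, 128, 256, 512, 1024]
SCALES = ["simsmall", "simmedium", "simlarge"]
RUNS_PER_SETUP = 10
MINUTES_PER_RUN = 45  # approximate average reported for one simulation


def experiment_grid():
    """Enumerate every (L1I, L1D, scale, run index) combination."""
    return list(product(L1I_SIZES_KB, L1D_SIZES_KB, SCALES, range(RUNS_PER_SETUP)))


def wall_clock_hours(n_experiments, parallel_jobs):
    """Estimate hours needed if jobs run in equal-length parallel batches."""
    batches = math.ceil(n_experiments / parallel_jobs)
    return batches * MINUTES_PER_RUN / 60


if __name__ == "__main__":
    grid = experiment_grid()
    print("experiments:", len(grid))
    for jobs in (1, 4, 8):  # hypothetical host parallelism
        print(jobs, "parallel jobs:", wall_clock_hours(len(grid), jobs), "hours")
```

The estimate treats every run as taking the same 45 minutes, so it is only a first-order guide; larger simulation scales would be expected to take longer than smaller ones.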
The number of cores in the simulated system plays a vital role in the estimated run time of each iteration. In our experiment design phase, we made no change to the parameter ranges for the L1 instruction and data cache sizes, and we ran the simulation on just one core. Hence, our estimate was based on the default settings. We therefore hypothesized that increasing the number of simulated cores would decrease the time the Freqmine benchmark requires per run.
EXPERIMENTAL RESULTS
Based on our experimental design, we performed experiments for 25 cache configurations (5 unique L1D cache sizes * 5 unique L1I cache sizes) * 3 simulation scales (large, medium, and small). The total number of runs is greater than 500. We used an X86 image with which each experiment takes approximately 45 minutes. Simulation results across varying cache sizes are shown in Figures [1, 2, 3, 4, and 5].
We also extract metrics from the stats.txt file of each experiment using a custom Python program. The program walks the gem5 root directory and its sub-directories to find the stats.txt files, extracts the desired metrics, and plots them. We also verify our simulated configurations through the config.ini files. The scripts and results are zipped and uploaded as a submission comment.
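The extraction step described above might look roughly like the following. This is a minimal sketch rather than the program used in this project: the function names are hypothetical, the metric names in WANTED are assumptions that should be checked against an actual stats.txt, and plotting is omitted. It relies only on the usual stats.txt layout of one "name value # description" entry per line.

```python
import os

# Assumed metric names; verify against your own stats.txt.
WANTED = ("system.cpu.ipc", "sim_insts")


def parse_stats(path, wanted=WANTED):
    """Return {metric: value} from the first statistics dump in a stats.txt."""
    found = {}
    with open(path) as fh:
        for line in fh:
            # Stop after the first dump so later dumps do not overwrite it.
            if line.startswith("---------- End Simulation Statistics"):
                break
            parts = line.split()
            if len(parts) >= 2 and parts[0] in wanted:
                try:
                    found[parts[0]] = float(parts[1])
                except ValueError:
                    pass  # skip entries whose value is not numeric
    return found


def collect(root, wanted=WANTED):
    """Walk root and parse every stats.txt, keyed by its run directory."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if "stats.txt" in filenames:
            key = os.path.relpath(dirpath, root)
            results[key] = parse_stats(os.path.join(dirpath, "stats.txt"), wanted)
    return results


if __name__ == "__main__":
    for run, metrics in sorted(collect("fs_results").items()):
        print(run, metrics)
```

Keying the results by run directory keeps the cache configuration (encoded in the directory name) attached to each set of metrics.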
QUANTITATIVE ANALYSIS
Tables [1, 2, 3] report processor performance across various metrics and runs for the large, medium, and small simulation scales, respectively. The metrics reported in Tables [1, 2, 3] are described as follows:
• IPC - Instructions per cycle
• IER - Instruction Execution Rate
Figure 5. Memory Access Latency vs. L1 Cache Sizes. Memory access latency (in cycles) per DRAM burst w.r.t. changing I-cache and D-cache sizes in L1. For a bigger D-cache, even though more data is moved from memory, fewer accesses, on average, lead to a shorter latency. For simulations with an extremely small D-cache, the X86 system crashes during benchmarking, i.e., while running Freqmine.
Data Bus Utilization vs. L1 Cache Sizes
Figure 6. A 95% confidence interval lies approximately within 2 standard deviations of the distribution mean, as shown in the online example on standard deviation.
STATISTICAL ANALYSIS
Our statistical significance based computation for a set of observations of any metric is as follows:
where $y_i$ is the $i$-th element of the sample, $\bar{y}$ is the sample mean, $s$ is the standard deviation of the sample, and $n$ is the number of samples.
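For reference, a standard form of this computation in the same symbols is sketched below. It assumes the sample standard deviation (with the $n-1$ denominator) and the normal critical value 1.96 for 95% confidence, which is consistent with the "approximately 2 standard deviations" rule of thumb; the exact variant used in this project is not stated, and for $n = 10$ a Student-t critical value (about 2.262) would give a slightly wider interval.

```latex
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},
\qquad
\mathrm{CI}_{95\%} = \bar{y} \pm 1.96\,\frac{s}{\sqrt{n}}
```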
We ran every experiment 10 times but, to our surprise, found no change in the values of the metrics we measure. We therefore removed the random number generator seed in the random.cc file, recompiled the entire system, and repeated the experiments with random (strictly speaking, pseudo-random) behavior. We still saw no change in the results across the 10 runs. The deviations that can be observed in the three tables are due to the simulation scale (small vs. medium vs. large). Representative examples for hit rate and IPC are shown below. The minor variations lead us to conclude, with 95% confidence, that the true distribution mean is close to our observed mean. Since the other metrics also show only minor variations, we argue that their true means lie close to the observed values, which allows us to skip the CI computations for them. These calculations can be verified using the online CI calculator 3.
3 Online Confidence Interval Calculator
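As an alternative to an online calculator, the confidence interval can be checked with a few lines of Python. This is a minimal sketch assuming a normal critical value of 1.96 and the sample standard deviation; the function name and the sample values in the usage example are made up for illustration and are not measured results.

```python
import math
import statistics


def confidence_interval(samples, z=1.96):
    """Return (low, high) for the mean of samples at the given critical value."""
    n = len(samples)
    mean = statistics.fmean(samples)
    # Sample standard deviation (n - 1 denominator); zero for a single sample.
    s = statistics.stdev(samples) if n > 1 else 0.0
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    # Ten identical runs (illustrative values): the interval has zero width.
    print(confidence_interval([0.5] * 10))
    # Values that differ (illustrative): the interval widens accordingly.
    print(confidence_interval([0.48, 0.50, 0.52]))
```

When all repeated runs return the same value, the standard deviation is zero and the interval collapses to the observed mean, which matches the behavior reported for fixed parameters.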
CONCLUSION
We performed more than 500 simulations on three simulation scales (small, medium, and large) and draw the following inferences and observations:
• Contrary to our expectations, changing the L1I cache size has no effect whatsoever on any relevant metric for the Freqmine simulation at any scale. We hypothesize that this is because Freqmine is a data-intensive application. Our experiments covered a very wide range of I-cache sizes, from 1 KB to 1024 KB, which makes us confident in the inferences drawn from these observations.
• The size of the L1D cache is vital to the performance of Freqmine simulations on the X86 system, and the hit rate increases with it, which aligns with our expectations. Since this benchmark shows high spatial locality for data together with quick temporal reuse of instructions (highly parallel loops are common in data-mining applications like Freqmine), an increase in L1D size also corresponds to a higher IPC.
• We observed that if the L1D cache size is below a certain threshold (64 KB in our case), the X86 system crashes during simulation. Hence, we believe that data cache size matters more than instruction cache size for our use case, i.e., a data-mining application such as Freqmine.
• Even after going to great lengths to introduce randomness, we observed that performance stays constant across multiple runs when all parameters are fixed. The only variation we observe occurs when the simulation scale is changed (small vs. medium vs. large), which allowed us to compute confidence-interval-based statistics.
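The relationship between data cache size and hit rate noted in the list above can be illustrated with a toy model. The sketch below is not gem5 or Ruby and does not reproduce Freqmine's access pattern: it simulates a direct-mapped cache under repeated sequential sweeps over a hypothetical 64 KB working set, with assumed 64-byte lines and 8-byte accesses.

```python
def hit_rate(cache_bytes, working_set_bytes, passes=4, line_bytes=64, access_bytes=8):
    """Hit rate of a direct-mapped cache for repeated sequential sweeps."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines  # block currently held by each cache line
    hits = accesses = 0
    for _ in range(passes):
        for addr in range(0, working_set_bytes, access_bytes):
            block = addr // line_bytes   # memory block this address belongs to
            index = block % n_lines      # direct-mapped placement
            accesses += 1
            if tags[index] == block:
                hits += 1
            else:
                tags[index] = block      # miss: fill the line, evicting the old block
    return hits / accesses


if __name__ == "__main__":
    working_set = 64 * 1024  # hypothetical working set
    for size_kb in (16, 32, 48, 64, 128):
        print(size_kb, "kB:", hit_rate(size_kb * 1024, working_set))
```

In this model, spatial locality alone gives a baseline hit rate (seven of every eight accesses fall in an already-loaded line), and the hit rate rises further once the cache is large enough to retain lines between sweeps, then saturates when the whole working set fits.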
ACKNOWLEDGMENTS
We would like to thank Prof. Dan Marinescu and Dr. Debashri Roy for their timely and helpful insights, not only on this project but also on computer architecture theory overall.
