I. INTRODUCTION
Cache memory was introduced to reduce the large speed gap between the processor and main memory [1] [2]. Cache memory stores frequently accessed data or instructions close to the processor, which allows fast access and reduces how often the slower memory must be accessed [3]. This approach improves processor performance because programs exhibit the principle of locality, which is divided into spatial and temporal locality [4].
Fuzzy logic has been successfully applied in control systems, universal approximation and pattern recognition [5] [6] [7] [8]. In computer architecture, Hossain et al. [9] proposed using the fuzzy replacement algorithm (FRA) as a cache memory replacement policy. Their improved work has shown promising results and has opened up a new field for exploration [10].
The adaptive neuro-fuzzy inference system (ANFIS) cache memory replacement policy proposed by Chung and Halim [11] attempts to combine the properties of least recently used (LRU) and least frequently used (LFU) into a knowledge base. This knowledge base is used to improve replacement decisions. Similarly, Donghee et al. [12] and Megiddo and Modha [1] exploited recency and frequency of reference for improved replacement decisions. However, despite these attempts, ANFIS has not improved on LRU at the level 1 (L1) cache [11]. Furthermore, previous studies only evaluated ANFIS at a single cache configuration, which does not provide a sufficiently detailed performance evaluation.
In this research, we propose an improved version of the ANFIS replacement policy called the improved ANFIS Replacement Policy (iANFIS-RP). A new data set is used to train iANFIS-RP in order to obtain a more accurate model. The iANFIS-RP Outputs of the blocks in a set are updated when a miss occurs. The performance of iANFIS-RP is evaluated at L1 cache sizes from 4 to 256 kB, with LRU and ANFIS as comparison. This offers a detailed performance evaluation of iANFIS-RP and ANFIS at various cache sizes.
The remainder of this paper is organized as follows. Section II presents the modifications made to ANFIS. Section III discusses data generation, iANFIS-RP training on MATLAB, the implementation of iANFIS-RP as a replacement policy in Sim-outorder and the simulation experiment setup. Section IV presents the results and discussion of iANFIS-RP training and the simulation experiments. Lastly, Section V concludes this paper.
II. MODIFICATION

A. ANFIS
The adaptive neuro-fuzzy inference system (ANFIS) cache memory replacement policy proposed by Chung and Halim [11] is a two-input, one-output Takagi-Sugeno fuzzy inference system with recency and frequency of reference as inputs. Recency of reference refers to how recently the block has been referenced in cache memory. This parameter is updated after a hit occurs in the cache memory. Frequency of reference refers to how many times the block has been referenced in the cache memory. The output of ANFIS is the ANFIS Output, which is used to determine which block is replaced when a miss occurs. When a hit occurs on a block, the ANFIS Output for that block is computed using the inputs and a six-rule knowledge base. The cache block in a set with the highest ANFIS Output is selected for replacement. As shown in the experiment results of Chung and Halim [11], ANFIS performed worse than LRU at the L1 cache memory.
978-1-5090-0925-1/16/$31.00 ©2016 IEEE
B. iANFIS-RP
In iANFIS-RP, a few modifications have been made to ANFIS to improve performance at the L1 cache memory.
• The previous ANFIS was trained using a modified data set, as shown in Chung and Halim [11]. The trained ANFIS model has a training error of 0.069606 when using the modified data set. In contrast, iANFIS-RP is trained with an unmodified data set of 63 data pairs.
• The iANFIS-RP model uses three triangle membership functions (MFs) for each input, namely recency and frequency. The previous ANFIS model used two MFs for the recency input and three MFs for the frequency input.
• The previous ANFIS updated the ANFIS Output of a block when a hit occurs. This creates a problem in which the ANFIS Output is always low because, at the moment of the update, the block is always the most recently used one. In addition, the ANFIS Output values in a set may be identical, which is problematic when ANFIS makes the replacement decision. As a result, the ANFIS replacement policy may evict blocks that have recently been used. To eliminate this problem, the iANFIS-RP Outputs of all blocks in a set are updated when a miss occurs. This gives a proper ranking among the blocks in a set and enables iANFIS-RP to identify the most suitable block for replacement. Under this modification, the recency and frequency values are still updated every time a hit occurs.
With these changes, we propose a two-input, one-output Takagi-Sugeno fuzzy inference system for iANFIS-RP. iANFIS-RP has three MFs for each input, and the output MFs are constants, which gives a fuzzy inference system with a nine-rule knowledge base. iANFIS-RP is adaptive through offline training. The purpose of training is to adjust the membership function parameters, which makes iANFIS-RP more accurate in determining the replacement decision. In the implementation, the trained iANFIS-RP model with adjusted membership functions is used as the replacement policy at L1.
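The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the trained model: the MF breakpoints in RECENCY_MFS and FREQUENCY_MFS and the nine constants in RULE_OUT are hypothetical placeholders, and the product of memberships with a weighted average is assumed as the inference scheme.

```python
from itertools import product

def tri(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical MF parameters (NOT the trained values from the paper).
# Recency spans 0..3 for a 4-way set, frequency spans 0..15.
RECENCY_MFS = [(-1.5, 0.0, 1.5), (0.0, 1.5, 3.0), (1.5, 3.0, 4.5)]
FREQUENCY_MFS = [(-7.5, 0.0, 7.5), (0.0, 7.5, 15.0), (7.5, 15.0, 22.5)]

# Hypothetical constant rule outputs, indexed [recency MF][frequency MF].
# A higher output marks a better eviction candidate.
RULE_OUT = [
    [1.0, 0.8, 0.6],
    [0.6, 0.4, 0.3],
    [0.2, 0.1, 0.0],
]

def ianfis_output(recency, frequency):
    """Zero-order Sugeno inference over the nine rules."""
    mu_r = [tri(recency, *p) for p in RECENCY_MFS]
    mu_f = [tri(frequency, *p) for p in FREQUENCY_MFS]
    num = den = 0.0
    for i, j in product(range(3), range(3)):
        w = mu_r[i] * mu_f[j]        # firing strength of rule (i, j)
        num += w * RULE_OUT[i][j]
        den += w
    return num / den                 # weighted average of the constants
```

With these placeholder values, a block with low recency and low frequency receives a higher output than a recently and frequently used block, so it would be ranked first for eviction.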
III. METHODOLOGY
A. Data generation
Simulation points obtained from Ganesan et al. [13] are used to improve the accuracy of the simulation without running a full simulation of the benchmark, which could take weeks or months to complete [13]. A simulation point is the number of instructions to fast forward before the actual simulation is run for 100 million instructions [14]. These simulation points are used to generate traces, which are used for data generation and simulation experiments. Each simulation point is associated with a weight value, which is used to compute the final miss ratio.
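The text does not state how the weights are combined. One common choice is a weighted average of the per-trace miss ratios, sketched below under that assumption. The function name and the sample numbers are hypothetical.

```python
def weighted_miss_ratio(points):
    """Combine per-trace results into one miss ratio.

    points: iterable of (weight, misses, accesses) tuples, one per
    simulation point. Assumes a plain weighted average, which is an
    assumption and not a formula given in the paper.
    """
    points = list(points)
    total_weight = sum(w for w, _, _ in points)
    weighted = sum(w * misses / accesses for w, misses, accesses in points)
    return weighted / total_weight

# Hypothetical example: two traces with equal weight.
example = [(0.5, 10, 100), (0.5, 30, 100)]
print(weighted_miss_ratio(example))
```

Dividing by the total weight keeps the result meaningful even if the weights of the selected traces do not sum exactly to one.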
SPEC CPU 2006 is a suite of benchmarks used to evaluate the performance of computer systems. The suite consists of 30 benchmarks comprising 12 integer and 18 floating-point benchmarks. Many studies have used SPEC CPU 2006 benchmarks to study cache memory [15] [16] [17] [18]. Of the 30 SPEC CPU 2006 benchmarks, only eight are used in the simulation experiments. The hmmer benchmark is randomly selected as the source for data generation. There are eight traces from hmmer, and one of them is randomly selected to generate the data for iANFIS-RP training on MATLAB.
A short simulation experiment using the trace is run for 10 million instructions on Sim-outorder. Sim-outorder is modified to record the recency and frequency of reference in a text file every time a miss occurs. The pseudocode to record this reference information is shown below.
if miss
    miss_record(cache, set, replace)
        for Head to Tail
            if block == replace
                print address|recency|frequency|1
            else
                print address|recency|frequency|0
When a miss occurs in a set of the cache memory, the function miss_record is invoked. Here, cache refers to the name of the cache selected for data collection. In this research, the L1 data cache is selected for data collection because it has more data than the instruction cache. The argument set represents the set in the cache in which the miss occurs, and replace is the block that is selected for replacement by the replacement policy. LRU is used as the replacement policy at all cache levels when conducting the short simulation experiment. The for loop visits each block (block) in the set from Head to Tail in order to find the replaced block (replace).
For every block visited, Sim-outorder records the memory address (address), recency (recency), frequency (frequency) and a replacement indicator. The indicator is one for the block selected for replacement and zero for every other block. The four fields are separated with | to ensure correct extraction of iANFIS-RP training data from the text file.
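The record format described above can be illustrated with a short Python sketch. It is not the modified Sim-outorder code: blocks are modeled as plain dictionaries, the hexadecimal address format is an assumption, and parse_record is a hypothetical helper for reading the file back.

```python
import io

def miss_record(out, blocks, replace):
    """Write one 'address|recency|frequency|indicator' line per block.

    blocks is the set ordered Head to Tail; replace is the block chosen
    for eviction. The indicator is 1 for that block and 0 otherwise.
    """
    for block in blocks:
        flag = 1 if block is replace else 0
        out.write(f"{block['address']:#x}|{block['recency']}|"
                  f"{block['frequency']}|{flag}\n")

def parse_record(line):
    """Hypothetical helper: split one record back into integer fields."""
    address, recency, frequency, flag = line.strip().split("|")
    return int(address, 16), int(recency), int(frequency), int(flag)

# Hypothetical 4-way set under LRU: the Tail block is the victim.
demo_set = [
    {"address": 0x1A40, "recency": 3, "frequency": 5},
    {"address": 0x2B80, "recency": 2, "frequency": 1},
    {"address": 0x3CC0, "recency": 1, "frequency": 9},
    {"address": 0x4D00, "recency": 0, "frequency": 2},
]
log = io.StringIO()
miss_record(log, demo_set, demo_set[-1])
records = [parse_record(line) for line in log.getvalue().splitlines()]
```

Each parsed record yields one (recency, frequency) input pair and an indicator that can serve as the training target.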
B. iANFIS-RP training
The generated data are separated into training, checking and testing data. Training data consist of 70% of the generated data, while checking and testing data consist of 10% and 20% respectively. The three subsets are mutually exclusive and randomly selected from the generated data.
The two learning algorithms available in MATLAB are backpropagation gradient descent and hybrid. The hybrid learning algorithm combines gradient descent with the least-squares method. The learning algorithm extracts knowledge from the data, and this knowledge is used to optimize the iANFIS-RP input MF and output MF parameters [19]. Training maps the inputs to the desired output using the given data and learning algorithm.
Checking is done to avoid overfitting [20]. If overfitting occurs, the trained iANFIS-RP model may not respond well to independent data. As training proceeds, both the training error and the checking error initially decrease. Overfitting has occurred when the checking error starts to increase while the training error continues to decrease, so training is stopped before the checking error starts to increase. Testing is done to observe the generalization capability of the trained iANFIS-RP model on a new set of data that was not previously used to train or check iANFIS-RP.
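As a rough illustration of the split and the stopping rule, the following Python sketch partitions a data set 70/10/20 and picks the epoch with the lowest checking error. The function names, the fixed seed and the sample error curve are assumptions. The actual training in the paper is done in MATLAB.

```python
import random

def split_data(pairs, seed=0):
    """Randomly split data into disjoint 70% / 10% / 20% subsets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.7 * len(shuffled))
    n_check = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    check = shuffled[n_train:n_train + n_check]
    test = shuffled[n_train + n_check:]   # the remainder, about 20%
    return train, check, test

def stopping_epoch(checking_errors):
    """Return the 1-based epoch at which the checking error is lowest."""
    best = min(range(len(checking_errors)),
               key=checking_errors.__getitem__)
    return best + 1

# 63 placeholder data pairs, standing in for the generated data.
train, check, test = split_data(range(63))
print(len(train), len(check), len(test))

# Hypothetical checking-error curve that rises after the third epoch.
print(stopping_epoch([0.50, 0.30, 0.20, 0.25, 0.40]))
```

For 63 data pairs this split gives 44 training, 6 checking and 13 testing pairs.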
In this research, both learning algorithms mentioned above are used in order to determine which is more suitable. Furthermore, the number and type of MFs are varied to determine the best trained iANFIS-RP model.
C. iANFIS-RP framework
Fig. 1 shows the flow chart of iANFIS-RP operation on Sim-outorder. When a hit occurs on a block, the recency and frequency values associated with that block are updated. The linked list for recency is updated by placing the hit block at the top position of the list. The top position of the linked list represents the most recently used block, while the last position represents the least recently used block. Then, the recency value of each block is updated: the block at the bottom of the list is assigned a value of zero, and the value increases by one for each position up the list. The frequency value of the hit block is incremented by one and is then checked to ensure that it is not more than 15. If it exceeds 15, the frequency value of each block in the set is halved to ensure that saturation does not occur. Then, the cache memory waits for the next read or write operation.
When a miss occurs, the iANFIS-RP replacement policy is invoked. The replacement policy reads the recency and frequency values of each block and computes the iANFIS-RP Output value. Then, the iANFIS-RP Output values are used to arrange the linked list so that the block at the bottom of the list has the highest iANFIS-RP Output and the block at the top of the list has the lowest. The block at the bottom of the list is the candidate for replacement. A pointer is assigned to the bottom of the linked list and is then used to identify the block for replacement. The desired block is brought into the cache memory. Lastly, the cache memory waits for the next read or write operation. The pseudocode to update the recency value when a hit occurs is shown below.
if hit
    update(set, block, Head)
    j = associative - 1
    for Head to Tail
        block->recency = j
        j--
As shown above, update is a function that moves the hit block to the top (Head) of the linked list of its set. Then, the recency value (block->recency) of each block is assigned from the variable j, which is decremented by one at each step from the top (Head) to the bottom (Tail) of the linked list.
if hit
    block->frequency++
    if frequency > 15
        for Head to Tail
            right shift frequency by one
The pseudocode to update the frequency value when a hit occurs is shown above. When a hit occurs on a block, the frequency value associated with that block (block->frequency) is incremented by one. Then, the frequency value (frequency) is checked to ensure that saturation does not occur. In this research, the frequency variable (frequency) is treated as a four-bit counter on Sim-outorder. Therefore, the maximum value that this counter can store before saturation occurs is 15. Saturation here refers to the situation in which a digital counter whose bits are all ones is reset to zero when it is incremented by one. This is undesirable because the most frequently used block would become the least frequently used. By right shifting the counter value of every block in the set by one, this condition is avoided while the relative order of frequency of reference in that set is maintained.
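The two hit-time updates can be combined into one small Python sketch. It models a set as a list ordered from Head (index 0) to Tail rather than as the linked list used in Sim-outorder. The Block class and on_hit function are hypothetical names introduced for illustration.

```python
FREQ_MAX = 15  # largest value a four-bit counter can hold

class Block:
    def __init__(self, tag, recency=0, frequency=0):
        self.tag = tag
        self.recency = recency
        self.frequency = frequency

def on_hit(cache_set, block):
    """Update recency and frequency after a hit on block.

    cache_set is a list ordered Head (index 0) to Tail (last index).
    """
    # Move the hit block to the Head of the list.
    cache_set.remove(block)
    cache_set.insert(0, block)

    # Reassign recency: associativity - 1 at Head down to 0 at Tail.
    j = len(cache_set) - 1
    for b in cache_set:
        b.recency = j
        j -= 1

    # Count the reference; halve every counter in the set if the
    # four-bit range would be exceeded.
    block.frequency += 1
    if block.frequency > FREQ_MAX:
        for b in cache_set:
            b.frequency >>= 1
```

For example, if the hit block already holds 15 and another block holds 8, the counters become 8 and 4 after the hit, so the hit block is still ranked as the more frequently used one.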
if miss
    for Head to Tail
        block->output_iANFISRP = anfis_main(recency, frequency, iANFIS-RP model)
    bubble sorting linked list
    assign pointer to the last position in the linked list
The pseudocode executed when a miss occurs is shown above. When a miss occurs, the iANFIS-RP Output (block->output_iANFISRP) of each block is computed by the function anfis_main from the recency and frequency values, using the iANFIS-RP model created at the beginning of the simulation. The function anfis_main is obtained from the MATLAB standalone code. Next, the linked list is arranged by output_iANFISRP value using the bubble sort algorithm, so that the block at the top of the list has the lowest iANFIS-RP Output and the block at the bottom of the list has the highest. A pointer is assigned to the block at the last position of the linked list. This pointer is used to find the block for replacement.
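The miss path can be sketched in Python as follows. The scoring function toy_score is a hypothetical stand-in for anfis_main and the trained model. Blocks are plain dictionaries, and the set is a list ordered from Head to Tail.

```python
def bubble_sort_by_output(blocks):
    """In-place bubble sort, ascending by 'output' (lowest at Head)."""
    for end in range(len(blocks) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if blocks[i]["output"] > blocks[i + 1]["output"]:
                blocks[i], blocks[i + 1] = blocks[i + 1], blocks[i]
                swapped = True
        if not swapped:
            break  # already ordered

def on_miss(blocks, score):
    """Score every block, sort the set and return the Tail as victim."""
    for b in blocks:
        b["output"] = score(b["recency"], b["frequency"])
    bubble_sort_by_output(blocks)
    return blocks[-1]  # highest output: the block to replace

def toy_score(recency, frequency):
    """Hypothetical stand-in for anfis_main (NOT the trained model)."""
    return 1.0 - (0.7 * recency / 3 + 0.3 * frequency / 15)

demo_set = [
    {"tag": "A", "recency": 3, "frequency": 10},
    {"tag": "B", "recency": 2, "frequency": 1},
    {"tag": "C", "recency": 1, "frequency": 4},
    {"tag": "D", "recency": 0, "frequency": 2},
]
victim = on_miss(demo_set, toy_score)
print(victim["tag"])
```

Because every block in the set is rescored at miss time, the ranking reflects the current recency and frequency of all blocks rather than the state at each block's last hit.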
D. Experiment setup
The performance of iANFIS-RP is evaluated in detail at various sizes of split L1 cache memory. Throughout all experiments, the associativity is fixed at 4 and the block size at 32 bytes. The performance of iANFIS-RP, ANFIS and LRU is evaluated at 4, 8, 16, 32, 64, 128 and 256 kB of L1 cache memory. Eight SPEC CPU 2006 benchmarks are used, namely b-waves, bzip, gobmk, gromacs, hmmer, milc, sjeng and zeusmp. The number of benchmarks is limited because the purpose of these experiments is not to study the behavior of the benchmarks on cache memory in detail, but to evaluate the performance of iANFIS-RP, ANFIS and LRU in terms of miss ratio. Simulation experiments are done on Sim-outorder, an execution-driven simulator that simulates a superscalar, 5-stage pipeline processor with an Alpha-based instruction set [21]. The default simulation experiment configurations are shown in Table I. Each benchmark trace file is run for 100 million instructions.
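With associativity and block size fixed, varying the cache size only changes the number of sets. The short Python sketch below works this out for the evaluated sizes, assuming 1 kB = 1024 bytes. The function name num_sets is introduced here for illustration.

```python
def num_sets(cache_kb, assoc=4, block_bytes=32):
    """Number of sets = cache size / (associativity * block size)."""
    return cache_kb * 1024 // (assoc * block_bytes)

for kb in (4, 8, 16, 32, 64, 128, 256):
    print(f"{kb:>3} kB -> {num_sets(kb)} sets")
```

Under these assumptions the configurations range from 32 sets at 4 kB to 2048 sets at 256 kB, and each set always holds four blocks for the replacement policy to rank.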
IV. RESULTS & DISCUSSION
A. Selecting the best trained iANFIS-RP model
Data generation yields 63 unique data pairs. Of these, 44, 6 and 13 pairs are selected as training, checking and testing data respectively. The types of MF studied are triangle, bell-shaped and trapezium, with 2 or 3 MFs for every input. The number of MFs is limited to 3 because there are not enough data to adjust all the input and output MF parameters when more MFs are used.
In Table II, the value in brackets in the training column represents the epoch at which training is stopped. Of the two learning algorithms, the hybrid learning algorithm gives the lowest training errors, which shows that it is better than backpropagation gradient descent. The iANFIS-RP models with three MFs show low training error when the hybrid learning algorithm is used. The iANFIS-RP models with two and three bell-shaped MFs also show low training error. However, neither is considered the best choice because the time complexity of the logarithm function of bell-shaped MFs can create a large delay in computing the iANFIS-RP Output if it is implemented in hardware. This leaves three triangle MFs or three trapezium MFs as candidates for the best iANFIS-RP. Their training errors are both low and close enough to each other that either could be used. After considering a hardware implementation, three triangle MFs is the best choice for the iANFIS-RP replacement policy, because the trapezium MF requires three comparison operations while the triangle MF requires only two. This reduces the hardware cost and operation time of the triangle MF. Overall, the iANFIS-RP model with triangle MFs is faster than the model with trapezium or bell-shaped MFs. Hence, the iANFIS-RP model with three triangle MFs is selected as the replacement policy to be implemented on Sim-outorder.
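To make the cost comparison concrete, the three MF types can be written in their usual textbook forms, as sketched below in Python. The function names and argument order are assumptions. The sketch is meant only to show the kind of operations each shape needs, not how the paper counts them.

```python
def trimf(x, a, b, c):
    """Triangle MF: the smaller of two linear slopes, clipped at zero."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapmf(x, a, b, c, d):
    """Trapezium MF: two slopes plus a flat top at 1, clipped at zero."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gbellmf(x, a, b, c):
    """Generalized bell MF: needs a power operation, not just comparisons."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

print(trimf(0.75, 0.0, 1.5, 3.0))       # halfway up the rising slope
print(trapmf(2.0, 0.0, 1.0, 3.0, 4.0))  # on the flat top
print(gbellmf(7.0, 2.0, 4.0, 5.0))      # at distance a from the center
```

In this form the trapezium compares one more term than the triangle, while the bell shape needs a non-linear operation that is more expensive than comparisons and linear arithmetic.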
The results of the selected iANFIS-RP training are shown in Fig. 2 and Table II. As shown in Fig. 2, the training error and checking error are 1.52363e-07 and 4.53548e-08 respectively, both far below the training error of 0.069606 of the previous ANFIS. This shows that the data modification applied for the previous ANFIS is not necessary. The average testing error is 1.93e-07, which shows that the trained iANFIS-RP model is highly accurate in mapping the inputs to the output.
... TABLE IV, iANFIS-RP can outperform ANFIS by as much as 99% at the 4 kB instruction cache on b-waves and 82% at the 32 kB data cache on sjeng. It is clear that the modifications have improved on the performance of the previous ANFIS at L1. The performance of iANFIS-RP is similar to that of LRU because the data generated for iANFIS-RP training on MATLAB originate from a simulation that uses LRU as the replacement policy. Therefore, iANFIS-RP behaves similarly to LRU. However, iANFIS-RP improves on LRU at certain cache sizes and benchmarks. For example, iANFIS-RP shows a 99% improvement over LRU at the 4 kB instruction cache on hmmer, as shown in TABLE III, and a 7.9% improvement over LRU at the 8 kB data cache on zeusmp.
V. CONCLUSION
The modifications suggested in this paper have been successfully applied in iANFIS-RP. The iANFIS-RP model with three triangle MFs, trained with the hybrid learning algorithm, is selected as the replacement policy. Based on the simulation results, iANFIS-RP performs better than ANFIS at various L1 cache sizes.
