General trends in computer architecture are shifting towards parallelism. Multicore architectures have proven to be a major step in processor evolution. With the advancement of multicore architectures, researchers are focusing on finding solutions that fully utilize the power of multiple cores. With an ever-increasing number of cores on a chip, the role of cache memory has become pivotal. An ideal memory configuration would be both large and fast; in practice, however, system architects have to strike a balance between the size and access time of the memory hierarchy. It is important to know the impact of a particular cache configuration on the throughput and energy consumption of the system at design time. This paper presents an enhanced version of previously proposed cache energy and throughput models for multicore systems. These models use a significantly smaller number of input parameters than other models. This paper also validates the proposed models using a cycle-accurate simulator and a renowned processor power estimator. The results show that the proposed energy models are accurate to within a maximum error of 10% for single-core processors and around 5% for MPSoCs, and that the throughput models have a maximum error of up to 11.5% for both single-core and multicore architectures.
INTRODUCTION
Cache memories are an integral part of most modern processor architectures. For a processor architect, the choice of components such as cache size and associativity, pipeline depth, number of cores and instruction set design is a critical decision. In most cases a verification methodology based on Transaction Level Modeling (TLM) [4] or virtualized platforms [16] is used to analyze a proposed configuration several times. However, these tools and methodologies are in general unable to evaluate the power consumption of a particular configuration, and a single configuration can take several hours to evaluate completely. This process is called design space exploration and is often considered a design-time, offline technique. On the other hand, energy-aware reconfigurable architectures often need to evaluate the impact of a particular configuration beforehand in order to make an early decision [12] [23]. In this case lightweight analytical models are required to assist the reconfiguration engine. In either case, i.e. for throughput- and energy-aware hardware exploration at design time or for reconfiguration at run time, it is imperative to gauge the performance of cache architectures so as to evaluate their impact on the energy requirement and throughput of the system. This paper presents a multicore extension of previously proposed cache energy and throughput models [17] [21] [22] [23]. These models require fewer inputs, which are obtainable using functional simulations, and provide an accurate estimate of the timing and energy consumption of the cache architecture. The proposed models analyze the energy and throughput of multicore cache hierarchies on a per-application basis, thus providing hardware and software designers with the feedback needed to tune the cache or the application for a given energy budget, either offline or online. This paper extends the state of the art with the following contributions:
1. A multicore extension of the previously proposed models that integrates the cycles per instruction (CPI) of the system.
2. An evaluation of the models on state-of-the-art Intel XEON series processors.
3. A validation of the models using HP Labs' McPAT (Multicore Power, Area and Timing modeling framework) [14] and the MARSS-x86 (Micro Architectural and System Simulator for Multicore Processors) [18] simulators.
The aim of this research is to propose and validate simplified mathematical models for the energy and throughput of multicore, multilevel caches, for application in the proposed multicore reconfigurable architecture [24].
The rest of this paper is organized as follows. Related work is discussed in the following section. The energy and throughput models for multicore caches are introduced in section 3. In section 4 the models are validated using a two-level cache hierarchy in a multicore architecture, and the final section presents the conclusion.
RELATED WORK
This section presents related research on cache performance estimation, its use in various applications, and tools such as full-system simulators and virtual platforms.
Basmadjian et al. [2] presented a methodology for estimating the power consumption of multicore processors that takes resource sharing and power-saving mechanisms into account. The authors proposed component-based power models for multicore processors, but used a fixed capacitance model for the different processor components, and their approach was not extended to processors with more than four cores. Lee et al. [13] proposed a performance and power estimation technique (PET) for multicore systems. The scheme is based on an accurate performance and power transformation model that predicts performance and power consumption. Furthermore, it also gives the runtime configuration of multi-threaded applications. The results were compiled on an Intel Q6600 quad-core processor at two different frequency levels, with average estimation errors of 2.1%-8.3% and 3.2%-6.5% relative to the measured data, respectively. Their work was limited to predicting the power consumption of the processor and does not determine the energy of each component. Kamble et al. [28] also presented a detailed cache energy model. The analytical models for conventional caches were found to be accurate to within 2% error. However, their technique overpredicts the power dissipation of low-power caches by as much as 30%.
Dev et al. [4] devised post-silicon power mapping and modeling of multicore processors using infrared imaging and performance counter measurements. An accurate finite element model that relates power consumption to temperature was devised, which also compensates for the artifacts introduced by infrared-transparent heat removal techniques. A standard numerical technique was proposed to accurately translate thermal maps for the heat sink system. The authors also formulated empirical models that estimate the infrared-based per-block power maps from performance monitoring counter (PMC) measurements. These PMC models accurately estimate the transient power consumption of the different processor blocks; the SPEC CPU2006 benchmarks were used for this purpose. Kasichayanula et al. [11] proposed an Activity-based Model for GPUs (AMG) to identify power consumption accurately. The core idea is real-time power consumption estimation, which is performed using NVIDIA's Management Library (NVML). The model was validated using a Kill-A-Watt power meter. The authors claim that the results are accurate to within 10%. The models presented in their work analyze the power of the embedded system holistically and do not estimate the energy consumption of individual processor components.
Pricopi et al. [20] proposed a software-based modeling technique for multiple types of cores that estimates the performance and power consumption of workloads. The estimation framework was evaluated on a real asymmetric multicore platform, ARM big.LITTLE [6]. Given all the specifications at run time, the model predicts the power and performance behavior of an application on a target core, where the cores share the same ISA but have heterogeneous microarchitectures. However, the work does not address scalability for multi-threaded applications.
Lim et al. [15] proposed a set of equations for accurate worst-case timing analysis (WCTA) of RISC processors. Their models include the effects of pipelining, the instruction cache and the data cache on the real-time behavior of the system. However, the size of the program that can be analyzed is still limited.
Taha et al. [25] presented an instruction throughput model for superscalar processors. Their model includes parameters such as superscalar width, pipeline depth, instruction fetch mechanism (in-order/out-of-order), branch predictor, central issue window width, number of functional units with their latencies and throughputs, re-order buffer width, and cache size and latency. Their model resulted in errors of up to 5.5% when compared with the SimpleScalar simulator [1]. Wada et al. [26] proposed a detailed circuit-level analytical access time model for on-chip cache memories. The model takes inputs such as the number of tag/data arrays per word/bit line. Compared with SPICE results, the model gives a 20% error for a cache memory with an 8 ns access time.
Yourst [27] developed PTLsim, a cycle-accurate full-system simulator, to simulate each component at instruction level. This simulator features a configurable RTL-level architecture and pipelines at the speed of the host system. MARSS-x86 [18] is a cycle-accurate full-system simulator for x86 and x64 architectures, especially for multicore hardware. MARSS-x86 extends the functionality and support of PTLsim to include complete user-space simulation and an unmodified software and OS stack, including an unmodified kernel. For power consumption estimation of the complete system, a tool named McPAT was developed by Li et al. [14] at HP Labs. This tool supports power estimation for various architectures and components, including caches, NoC, multiprocessors, in-order and out-of-order cores, shared caches and integrated memory. The power consumption estimation is done at circuit level, hence it is closer to the real system.
The following section presents the proposed cache energy and throughput models, which can be used to obtain accurate energy consumption and throughput estimates for a multicore architecture.
THE CACHE ENERGY AND THROUGHPUT MODELS
This section presents the energy and throughput models for a two-level cache hierarchy for multicore architectures.
Energy Models
The energy model is built from the following quantities: the energy consumed by instruction cache, data cache and level 2 (L2) cache operations; the energy consumed by instructions that do not require data memory access; the number of cycles per instruction (CPI); and the leakage energy of the processor. In the previous work the CPI was taken to be 1; in real scenarios, however, it varies with parameters such as branching, prediction, parallelism and the number of cores per chip. CPI directly affects the energy consumption in the model. The total energy consumption of the code, in joules [J], is defined in terms of the L2 cache's instruction fetch, data read and data write transactions, the processor's per-cycle energy consumption, and the read/write miss penalties (in number of cycles) with their corresponding miss rates. The energy consumed in transfers from the L2 cache to data memory and to code memory can also be calculated by multiplying the number of memory accesses by their read and write cycle energies.
The idle-mode leakage energy of the processor is calculated as the product of the processor's leakage power and the total time [s] for which the processor was idle.
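As a rough illustration of how an energy model of this kind can be evaluated, the Python sketch below sums per-cache access and miss-stall energies, a CPI-scaled term for instructions that do not access data memory, and idle leakage. The additive form, the function and parameter names, and all numeric values are assumptions made for illustration only; they are not the equations or symbols of the proposed models.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CacheActivity:
    """Hypothetical per-cache inputs; names are illustrative, not the paper's symbols."""
    accesses: int              # number of read/write accesses to this cache
    access_energy_j: float     # energy per access cycle [J], e.g. taken from CACTI
    miss_rate: float           # fraction of accesses that miss (0.0 to 1.0)
    miss_penalty_cycles: int   # processor stall cycles per miss


def cache_energy(c: CacheActivity, cycle_energy_j: float) -> float:
    """Access energy plus processor energy spent stalled on misses (assumed form)."""
    access_part = c.accesses * c.access_energy_j
    stall_part = c.accesses * c.miss_rate * c.miss_penalty_cycles * cycle_energy_j
    return access_part + stall_part


def total_energy(caches: Iterable[CacheActivity],
                 non_memory_instructions: int,
                 cpi: float,
                 cycle_energy_j: float,
                 leakage_power_w: float,
                 idle_time_s: float) -> float:
    """Assumed additive model: caches + CPI-scaled non-memory work + idle leakage."""
    e_caches = sum(cache_energy(c, cycle_energy_j) for c in caches)
    e_misc = non_memory_instructions * cpi * cycle_energy_j
    e_leak = leakage_power_w * idle_time_s
    return e_caches + e_misc + e_leak


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    l1i = CacheActivity(accesses=1000, access_energy_j=1e-9,
                        miss_rate=0.1, miss_penalty_cycles=10)
    print(total_energy([l1i], non_memory_instructions=500, cpi=2.0,
                       cycle_energy_j=2e-9, leakage_power_w=0.5, idle_time_s=1e-6))
```

In this sketch a higher CPI increases only the non-memory term, which is one plausible way for CPI to enter such a model; the placement of CPI in the actual models may differ.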
Throughput Models
The throughput model is built from analogous quantities: the time taken by instruction cache, data cache and level 2 (L2) cache operations; the time taken in the execution of cache access instructions [s]; and, for each cache, the time taken by reads, writes and the miss penalty. The total time taken by an application is estimated from these terms, using the time taken per cache read and write cycle and the processor cycle time in seconds [s].
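The timing side can be sketched in the same spirit. The Python fragment below estimates execution time as the sum of per-cache access and miss-penalty cycles plus a CPI-scaled term for the remaining instructions, and derives throughput as instructions per second. The additive form, the tuple layout and every name and value are assumptions for illustration, not the formulation of the proposed models.

```python
from typing import Iterable, Tuple

# Hypothetical layout for one cache:
# (accesses, access_cycles, miss_rate, miss_penalty_cycles)
CacheTiming = Tuple[int, float, float, float]


def cache_time(cache: CacheTiming, cycle_time_s: float) -> float:
    """Time spent in one cache: access cycles plus expected miss-penalty cycles."""
    accesses, access_cycles, miss_rate, miss_penalty_cycles = cache
    cycles = accesses * (access_cycles + miss_rate * miss_penalty_cycles)
    return cycles * cycle_time_s


def total_time(caches: Iterable[CacheTiming],
               other_instructions: int,
               cpi: float,
               cycle_time_s: float) -> float:
    """Assumed additive model: cache time plus CPI-scaled time of other instructions."""
    t_caches = sum(cache_time(c, cycle_time_s) for c in caches)
    t_other = other_instructions * cpi * cycle_time_s
    return t_caches + t_other


def throughput(instructions: int, elapsed_s: float) -> float:
    """Instructions per second for a given execution time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return instructions / elapsed_s


if __name__ == "__main__":
    # Made-up numbers, for illustration only (1 ns processor cycle).
    l1d = (1000, 2, 0.1, 10)
    t = total_time([l1d], other_instructions=500, cpi=2.0, cycle_time_s=1e-9)
    print(t, throughput(1500, t))
```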
MODEL VALIDATION

Simulation Setup
To validate the accuracy of the proposed models, MARSS-x86 [18] was used to run a number of benchmark applications from the SPLASH-2 [29] benchmark suite (see Table 1). Three different types of Intel XEON processors were used for evaluation: a single-core XEON Foster [10], a dual-core XEON E5503 [8] and a quad-core XEON E5507 [14]. The parameters of each processor are listed in Table 2. The cache energy and throughput models discussed in section 3 require parameters such as the per-cycle cache read and write energies and times, which were obtained using HP Labs' CACTI, an integrated cache timing, power and area modeling tool [3] (see Table 3). Note that MARSS-x86 provides cycle-accurate simulation and timing information, whereas power estimation requires an external tool. HP Labs has developed one such tool, McPAT (Multicore Power, Area, and Timing), an integrated power, area and timing modeling framework for multithreaded, multicore and manycore architectures, which was used to estimate the energy of the various XEON processor models [14]. McPAT accepts simulation results from MARSS-x86 and then provides accurate power consumption estimates for a particular processor model.
Results
The energy model results for all three XEON processor models are shown in Figure 1. The energy models for the XEON Foster platform resulted in an error of up to 10% for the Ocean benchmark application, whereas a minimal error of around 0.5% was observed for Barnes (see Figure 1 (b)). For the multicore configurations, maximum errors of up to 5% and 3% were observed for the XEON E5503 and E5507 processors respectively (see Figure 1 (d,e)). Figure 2 (a,b) shows a comparative analysis of the throughput calculated from the presented throughput model (predicted throughput) and the simulated throughput for the XEON Foster series (single core), whereas Figure 2 (c,d) and Figure 2 (e,f) show the results for the XEON E5503 (dual core) and XEON E5507 (quad core) respectively. The throughput models for the single-core XEON resulted in a maximum error of up to 11.5% for the FMM application, whereas a minimum error of around 3% was observed for the Water-Spatial benchmark (see Figure 2 (b)). For the dual-core and quad-core models, maximum errors of up to 8.5% and 11.5% were observed for the Water-Spatial and Ocean applications respectively (see Figure 2 (d,f)).
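The error figures quoted above compare model predictions with simulator output. The exact error definition is not stated here, so the Python sketch below assumes the usual relative error with the simulated value as the reference; the benchmark names and numbers in it are placeholders, not measured results.

```python
from typing import Dict, Tuple


def percent_error(predicted: float, reference: float) -> float:
    """Absolute relative error of a prediction, in percent of the reference value."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return abs(predicted - reference) / abs(reference) * 100.0


def worst_case(results: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Return (name, error) for the entry with the largest percentage error.

    `results` maps a benchmark name to (predicted, simulated) values.
    """
    errors = {name: percent_error(p, r) for name, (p, r) in results.items()}
    worst = max(errors, key=errors.get)
    return worst, errors[worst]


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    demo = {"app_a": (110.0, 100.0), "app_b": (97.0, 100.0)}
    print(worst_case(demo))
```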
It can be observed that the proposed models are able to estimate the energy and throughput of a multilevel cache hierarchy for both single-core and multicore systems. The parameters obtained from CACTI [3] can be computed in advance with the same tool and stored in a lookup table, so that the models can be used at run time to estimate the effect of a cache on the system's throughput and performance. This scheme can be used in systems that support dynamic reconfiguration of the memory system to make an early decision on cache sizing for a particular application in execution. One example of such a system is proposed by Qadri et al. [24].
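The lookup-table idea can be sketched as follows. This Python fragment assumes a table of precomputed per-configuration cache parameters (of the kind CACTI could supply) and picks the configuration with the lowest estimated energy that still meets a time budget. The table layout, the simplified estimate, the selection policy and all values are hypothetical and are not taken from the reconfigurable architecture in [24].

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class CacheEntry(NamedTuple):
    """One hypothetical lookup-table row for a candidate cache configuration."""
    size_kb: int
    access_energy_j: float   # energy per access [J], precomputed offline
    access_time_s: float     # time per access [s], precomputed offline
    miss_rate: float         # assumed miss rate of the running application


def estimate(entry: CacheEntry, accesses: int,
             miss_penalty_s: float, miss_energy_j: float) -> Tuple[float, float]:
    """Return (energy, time) for `accesses` cache accesses under a simple model."""
    misses = accesses * entry.miss_rate
    energy = accesses * entry.access_energy_j + misses * miss_energy_j
    time = accesses * entry.access_time_s + misses * miss_penalty_s
    return energy, time


def choose_config(table: Iterable[CacheEntry], accesses: int,
                  miss_penalty_s: float, miss_energy_j: float,
                  time_budget_s: float) -> Optional[CacheEntry]:
    """Pick the lowest-energy entry whose estimated time fits the budget."""
    best: Optional[Tuple[float, CacheEntry]] = None
    for entry in table:
        energy, time = estimate(entry, accesses, miss_penalty_s, miss_energy_j)
        if time > time_budget_s:
            continue  # too slow for this budget
        if best is None or energy < best[0]:
            best = (energy, entry)
    return None if best is None else best[1]


if __name__ == "__main__":
    # Made-up table, for illustration only.
    table = [CacheEntry(16, 1e-10, 1e-9, 0.10), CacheEntry(64, 3e-10, 2e-9, 0.02)]
    print(choose_config(table, accesses=1000, miss_penalty_s=50e-9,
                        miss_energy_j=5e-9, time_budget_s=1e-5))
```

Because the table is computed offline, the run-time cost of such a decision is a handful of multiplications per candidate configuration, which is what makes an early sizing decision practical.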
CONCLUSION
In this paper a multicore extension of previously presented cache energy and throughput models was presented. The models require a significantly smaller number of parameters than the existing methods discussed in the related work. Moreover, these parameters can be easily obtained using the techniques adopted in the validation of the models. The models were validated with a two-level cache model of the XEON Foster, E5503 and E5507 processors, using standard benchmark applications and simulation tools. When compared with the simulators, the cache energy model results deviated by up to 10% for XEON Foster, whereas for XEON E5503 and E5507 the error was up to 5% and 3% respectively. For the cache throughput models a maximum error of up to 11.5% was observed for both XEON Foster and E5507. In future work these models will be applied in real-time adaptive memory systems, where an accurate estimate of cache throughput and energy consumption is required for reconfiguration.
