Computer designers utilize the recent huge advances in Very Large Scale Integration (VLSI) to get Chip Multiprocessor (CMP) by placing several processors on the same chip die. The CMP is the dominant architecture to improve the performance of the current computing systems. However, accessing a shared data by several processors is a primary challenge in CMP. The data consistency must be reached among all memory hierarchies to ensure correct behavior and higher performance. This paper, proposed a CMP with an efficient multilevel cache system, which enhances miss rate and latency (penalty) by designing and implementation of different write policies with two levels of cache. The proposed system is implemented and tested using Hardware Description Language (VHDL) on Altera's FPGA chip. The results show that a combination of write-through without buffer for the first level and write-back for the second level offers a clear improvement on the multilevel cache system performance.
Introduction
The technology revolution in Very Large Scale Integration (VLSI) has enabled today's designers to design and implement Chip Multiprocessor (CMP), where two or more processors with a shared memory are integrated on a single chip [1] . In the next few generations, the number of processors that can be implemented on a single chip will significantly increase. Some famous research centers such as ParLab at Berkeley, UPCRCIllinois, and the Pervasive Parallel Laboratory at Stanford assume that future microprocessors will have hundreds of cores [2] .
Related Work
Although there are several proposed designs for CMPs use the private L1 cache structure of traditional multiprocessors, the organization of on-chip L2 resources is not researched well [7] . [8] is a survey that classifies the organizations of multilevel cache into three types. First, a private hierarchy of caches, this paper follows this type. Second, a private first-level with shared multiport second-level. Finally, a private first-level with bus-based shared secondlevel. Some of the proposed CMP designs implement multilevel cache systems with private hierarchy [7] . Other proposed designs implement systems with private L1 caches and shared L2 caches [9] [10] [11] . Other designs implement cache systems with two private levels and a third shared level among all cores [6, 12] . [13] compares three different approaches for ensuring cache coherence in Multiprocessor System on Chip (MPSoC) architectures, which are snoopy-based, software-based, and OSbased approaches. The comparison results of different snoop-based cache coherency schemes, reveals that these schemes have a strong sensitivity to the cache write policies more than the coherency protocol. Write-back schemes appear to be more efficient by reducing the traffic on the shared bus. Nonetheless, these schemes increase the hardware complexity of the cache-coherency, perform out-of-order eviction of data to the lower memory hierarchies, and unacceptably losses for the data consistency [14] . [15] compares the memory write policies in shared memory multicore system. It compares write-through invalidate protocol with write-back MESI protocol. Its results shows that write-through protocol is a simple and possible solution to maintain coherency. Writethrough performs very well in both execution time and generated traffic. However, other proposed papers in this field say that write-through protocol perform well to ensure data consistent with drawback of higher write traffic as compared to write-back, since writethrough requires synchronous updates for every write [14, 16] . [9] explores a novel approach to mitigating multicore power consumption by using dynamic application memory behavior. The main notice is that using only write-through policy causes more contention on the bus. Moreover, using only write-back policy makes the data in L2 stale relative to the newer data in L1 caches. The proposed solution is to allow a fine-grained hybrid interaction between write-through and write-back policies to improve the overall performance. However, it is become more complex to generalize an inclusive state diagram to manage diverse coherence protocols possessed by each core when the number of the cores is increased [17] . [14] develops and evaluates two consistent write-back caching policies, ordered and journaled, that are designed to perform increasingly better than write-through and traditional write-back policies. Paper's results show that ordered write-back performs better than writethrough. Additionally, journaled write-back exceeds conventional write-back performance.
The Proposed Multilevel Cache System Design 3.1 Multilevel Cache System Block Diagram
As shown in figure 1 , the proposed multilevel cache system consists of two private levels of cache associated with each processor. Each level of cache consists of four main parts. Which are data cache entity, tag array, comparator, and cache controller. The data cache entity is direct-mapped architecture to save the frequent used data in its blocks. Each block has additional bits represent the block state and indicate if the block data is valid or not. The tag array contains an address information, which are required to identify whether the data in the cache blocks corresponds the requested data. The comparator part is used to examine the address tag field with the associated tag bits in the cache. The cache controller is used to check the comparator results and the block state bits value, generate the hit/miss signal, and serve the request from the cache blocks or from the lower memory hierarchies.
Cache Coherence Protocol
The selected cache coherence protocol to be applied on the proposed cache system is the MESI [4] , which is illustrated in figure 2 and the explanation of the abbreviation used in MESI state diagram are listed in table 1. Since there is no communicate between the L1 caches via the shared bus, the coherence protocol must be applied on the L2 caches which are in touch with the shared bus. L1 cache takes the required data the block state from L2 cache [5] . 
Write Policies and Hit-Miss Actions

First-Level Cache
 Write Policy: The write policy used in L1 cache is write-through, where writes always update both the cache and the lower level of the hierarchy (either another cache level or main memory) at the same time. Using write-through policy insures that the data is always consistent between cache levels.  Hit State: L1 cache controller monitors every request from the processor. When there is a request (Read/Write), the comparator examines the address, the cache controller checks the comparator results to generate hit or miss signal and serve the request. According to the MESI protocol, hit occurs in one of the following states:  For read request: when the tags are equal and the block state is S, E, or M. Then the requested data is placed on the data bus.  For write request: when the tags are equal and the block state is E or M. Then the data on the data bus is written to the corresponding cache block, the block state is updated to M if it was E, and the Writethrough signal is asserted to write the same data to L2 cache at the same clock.  Miss State: When there is no hit, miss occurs and the Inhibit signal in figure 1 is asserted to freeze the processor datapath during the miss manipulation. If the L1 block has a dirty data belongs to other address, the cache controller asserts the EndInclusion signal and sends the dirty block address to L2 to inform it that this block will not be exist in L1. Then the cache controller asserts the DataReq signal and sends the requested data address to L2 cache. When the requested data is ready, L1 cache puts it with its state in the corresponding block and asserts the Aknow signal.
Second-Level Cache
 Write Policy  The write policy used in L2 cache is write-back, which handles writes by updating values only to the block in the cache, the modified block is writing to the lower level of the hierarchy (the main memory) in two states. First, when it is replaced. Second, when it is invalidated by another processor. Using write-back policy is used to reduce the bus traffic and thus allows more processors on a single bus. During the write process, a write-back buffer is used to allow the processor to continue as soon as possible on a cache miss that causes a write-back. It is better to delay the write-back and instead first service the miss that caused it [4] . Moreover, a cache to cache transfer is supported in the proposed architecture in order to minimize the miss latency. Hit State :L2 cache controller monitors every request from the L1 cache and from the bus side. When there is a request from L1 cache (Writethrough/EndInclusion), the request is served immediately since such requests cannot be happened if the data block does not exist in L2 cache. When L1 cache request is DataReq, L2 cache first comparator examines the incoming address tag field with the associated tag bits in the cache. The cache controller checks the comparator results and the state bits value to generate the hit/miss signal and serve the request. Hit occurs in one of the following states:  The request is done to perform a read process in L1: when the tags are equal and the block state is S, E, or M. Then the requested data is placed on the intermediate data bus.  The request is done to perform a write process in L1: when the tags are equal and the block state is E or M. Then the requested data is placed on the intermediate data bus.
Miss State
When miss occurs, five steps are performed. First, asserting BusReq signal. Second, waiting for bus Grant. Third, putting address and command on the bus. Fourth, waiting for one of the acknowledgment signals (CmdReceive, MDataRdy, and StrRec). Finally, transferring the data between the shared bus, L2, and L1 at the same clock to reduce the miss penalty. 
 Snooping Actions
Since the cache coherence protocol is applied in L2 cache because of its contact with the shared snoopy bus, L2 cache controller monitors (snoop) every transaction on the shared bus. The second comparator checks if the bus transaction belongs to one of its contents. Then the cache controller takes the comparator results and checks the block state bits to decide which action must be applied. The DataAvailable/WBRequest signals have the benefit of reducing the miss latency time. It informs the bus arbitration that the cache has an up-to-date (S or E / M) data, and it can supply the request with it. When there are many cache on different processors assert this signal, bus arbitration will choose one of them to supply the data by asserting StrSend signal to the chosen processor and StrRec with CmdReceive signals to the requester processor, this action is better than bring the requested data from the main memory , which may take more time. Also, WBRequest informs the bus arbitration to write the data to the main memory through the write-back buffer.
If the requested data is not exist in other caches blocks, the request will be supplied from the main memory. The bus arbitration will read the requested data and asserts MDataRdy with CmdReceive signals to the requester processor when the memory data is ready on the bus.
The Proposed CMP With Multilevel Cache System Implementation
To validate the proposed multilevel cache system, a CMP with four pipelining MIPS processors is built. Each processor has two private levels of data cache as shown in figure 3 . A shared bus connects the whole system components with each other and with the off-chip main memory. The management of the shared bus is the responsibility of a central bus arbitration unit. The CMP with the proposed multilevel cache system is implemented using VHDL language on Altera CycloneII (EP2C35F672C6) FPGA. Moreover, many scenarios of multilevel cache system with different write policies and a single level cache system have been designed and implemented. Figure 4 shows the top level Register Transfer Level (RTL) synthesis for the proposed cache system. 
Journal of
Result and discussion
This section compares between the scenarios that are implemented to show the appropriate architecture for the multilevel cache system. These scenarios are:  Cache system with write-back L1 cache and write-back L2 cache.  Cache system with write-through (via buffer) L1 cache and write-back L2 cache.
 The proposed cache system with write-through (without buffer) L1 cache and write-back L2 cache.
 Cache system with only L1 cache. The results are obtained from a VHDL simulation tool from Mentor Graphics Company, which is ModelSim. ModelSim has an ability to illustrate the simulation results as waveform which is an easy way to recognize the required results. The comparison factors are the miss rate, the miss latency (penalty) and the correct behavior of the system. Many tests, such as read/write sequence of data from the main memory and other processors' cache system, read and modify two blocks that are mapped to the same cache block from the main memory, have been applied on the four scenarios to verify them. Table 4 , summarizes the implemented cache system test's result. The simulation results involve the required time transferring data through the bus. From the results, the proposed cache system improves the latency by 10% compare to the first and the second scenario. The fourth scenario is better than the other scenarios. This result is logically correct since the fourth scenario has only one level of cache. However, the fourth scenario suffers from high miss rate, which is solved by the second level of cache on the other scenarios. The required time to complete the write process to both levels is improved in the proposed system by 50% compare to the first scenario and by 66% compare to the second scenario. Test results show that when write-back policy is used between L1 and L2 cache, the data consistency cannot be insured between all processors' cache systems. For example, suppose that a requested data brought to the cache system from the main memory. This data placed in L2 and L1 caches with (E) states. Then if the processor needs to modify that data, the new data will be modified only in L1 cache and the corresponding block state will be modified to (M) state in L1 cache, too. When other processor requests the same address, L2 controller in the owned processor will snoop that request. Then it will change its block state to (S) state and supply the request with a false data since the up-to-date data exists in L1 cache. In addition, using write-through policy with buffer may cause a conflict between writethrough address and the data request address in some cases, such as the following worst case example. Suppose that the memory blocks with addresses 255, 127 which are mapped to the same L1 cache block. In the beginning, both blocks are not exist in the processor's cache system. When the processor request address 255 to perform a write instruction. The request is done and the data is placed in L1 and L2 with (E) or (M) states according to MESI protocol. Then the new data write process is performed to L1 cache and the write-through buffer. During write-through process, written data address is sent to L2 cache to perform the write process. If the processor request address 127, that address will be sent to L2 cache because of the missing station in L1 cache. As long as write-through is not complete, the write-through address conflicts with the data request address. This confliction causes an ambiguity in L2 cache which may provide L1 cache with a false data. Solving these problems, by using more checking circuits or splitting the address bus, can reduce the overall performance through increasing the miss penalty, the design complexity, and the overall chip area.
Other scenarios' problems are solved in the proposed scenario by modifying the write techniques as described in section 3. The simulation results show that the same cases are performed on the proposed system without any confusion. The proposed cache system write the new data to the both levels in the same clock cycle. Moreover, writing the new data to L2 cache is done before writing it to L1 cache by half of clock cycle. That action makes the new data available to others cache system immediately. 
Conclusion and Future Work
In this paper, a CMP with multilevel cache system is designed and implemented using VHDL and FPGA. Using multilevel cache improves the overall performance of CMP. The two-Level directmapped data cache with a snoopy MESI cache coherency protocol is designed and implemented for each processor. The simulation and implementation results show that when the write-through policy without buffer in L1 cache is combined with the write-back policy in L2 cache, is the best combination to get the benefits of multilevel cache. Actually, this combination improves the miss latency by at least 10% compared to other architecture. This paper can be extended by examining the effects of using more flexible placement techniques such as the set-associative to reduce the cache miss.
