Search CORE

4 research outputs found

Performance evaluations for multicore processors

Author: Marpaung Julius Jonggara R. Hot Marisi
Publication venue
Publication date: 01/05/2012
Field of study

Scope and Method of Study: To use and improve a new simulation tool that emulates and studies different cache hierarchies and configurations to evaluate the performance of any chosen processor and cache configurations.Findings and Conclusions: Sharing a L2 cache with more than eight processors may reduce performance. Using a shared L3 cache or hierarchical architecture may result in a better performance. The major factor that contributes to the loss of performance is the bus contention. Increasing the size of shared cache does not have a significant impact on performance

SHAREOK repository

Recommended from our members

Improving virtual memory performance in virtualized environments

Author: Marathe Yashwant
Publication venue
Publication date: 25/02/2021
Field of study

Virtual Memory is a major system performance bottleneck in virtualized environments. In addition to expensive address translations, frequent virtual machine context switches are common in virtualized environments, resulting in increased TLB miss rates, subsequent expensive page walks and data cache contention due to incoming page table entries evicting useful data. Orthogonally, translation coherence, which is currently an expensive operation implemented in software, can consume up to 50% of the runtime of an application executing on the guest. To improve the performance of virtual memory in virtualized environments, two solutions have been proposed in this thesis - namely, (1) Context Switch Aware Large TLB (CSALT), an architecture which addresses the problem of increased TLB miss rates and their adverse impact on data caches. CSALT copes with the increased demand of context switches by storing a large number TLB entries. It mitigates data cache contention by employing a novel TLB-aware cache partitioning scheme. On 8-core systems that switch between two virtual machine contexts executing multi-threaded workloads, CSALT achieves an average performance improvement of 85% over a baseline with conventional L1-L2 TLBs and 25% over a baseline which has a large L3 TLB (2) Translation Coherence using Addressable TLBs (TCAT), a hardware translation coherence scheme which eliminates almost all of the overheads associated with address translation coherence. TCAT overlays translation coherence atop cache coherence to accurately identify slave cores. It then leverages the addressable Part-Of-Memory TLB (POM-TLB) to eliminate expensive Inter Processor Interrupts (IPI) and achieve precise invalidations on the slave core. On 8-core systems with one virtual machine context executing multi-threaded workloads, TCAT achieves an average performance improvement of 13% over the kvmtlb baselineElectrical and Computer Engineerin

Texas ScholarWorks

Bank-aware Dynamic Cache Partitioning for Multicore Architectures

Author: Dimitris Kaseridis
Jeffrey Stuecheli
Lizy K. John
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009
Field of study

Abstract—As Chip-Multiprocessor systems (CMP) have be-come the predominant topology for leading microprocessors, critical components of the system are now integrated on a single chip. This enables sharing of computation resources that was not previously possible. In addition, the virtualization of these computational resources exposes the system to a mix of diverse and competing workloads. Cache is a resource of primary concern as it can be dominant in controlling overall throughput. In order to prevent destructive interference between divergent workloads, the last level of cache must be partitioned. In the past, many solutions have been proposed but most of them are assuming either simplified cache hierarchies with no realistic restrictions or complex cache schemes that are difficult to integrate in a real design. To address this problem, we propose a dynamic partitioning strategy based on realistic last level cache designs of CMP processors. We used a cycle accurate, full system simulator based on Simics and Gems to evaluate our partitioning scheme on an 8-core DNUCA CMP system. Results for an 8-core system show that our proposed scheme provides on average a 70 % reduction in misses compared to non-partitioned shared caches, and a 25 % misses reduction compared to static equally partitioned (private) caches. I

CiteSeerX

Crossref