
    Cache-aware Performance Modeling and Prediction for Dense Linear Algebra

    Countless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation -- without executing it -- is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, one normally has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms to within a few percent of the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes.
    Comment: Submitted to PMBS1
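    To make the cache-tracking idea above concrete, here is a minimal sketch (with entirely hypothetical kernel names, operand sizes, and timings) of how pre-measured per-kernel timings could be combined with an LRU approximation of cache contents to predict the cost of a kernel sequence without executing it; the paper's actual models are considerably more refined.

```python
# A minimal sketch of cache-tracking performance prediction, assuming
# hypothetical kernel names, operand sizes, and pre-measured timings.
from collections import OrderedDict

CACHE_BYTES = 8 * 1024 * 1024  # assumed last-level cache capacity

class CacheTracker:
    """Approximates cache contents with an LRU set of named operands."""
    def __init__(self, capacity=CACHE_BYTES):
        self.capacity = capacity
        self.resident = OrderedDict()  # operand name -> size in bytes

    def touch(self, name, size):
        """Record an access; return True if the operand was resident."""
        hit = name in self.resident
        if hit:
            self.resident.move_to_end(name)
        else:
            self.resident[name] = size
            while sum(self.resident.values()) > self.capacity:
                self.resident.popitem(last=False)  # evict least recently used
        return hit

def predict_time(kernel_seq, timings, tracker):
    """Sum pre-measured kernel timings, picking the warm- or cold-cache
    figure according to the tracked residency of each kernel's operands."""
    total = 0.0
    for kernel, operands in kernel_seq:
        hits = [tracker.touch(name, size) for name, size in operands]
        total += timings[kernel]["warm" if all(hits) else "cold"]
    return total

# Hypothetical timings (seconds) for two BLAS-like kernels, measured once.
timings = {"gemm": {"warm": 0.8e-3, "cold": 1.3e-3},
           "trsm": {"warm": 0.5e-3, "cold": 0.9e-3}}
sequence = [("gemm", [("A", 2**21), ("B", 2**21)]),
            ("trsm", [("A", 2**21), ("X", 2**20)])]
print(predict_time(sequence, timings, CacheTracker()))
```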

    Dynamically Reconfigurable Active Cache Modeling

    This thesis presents a novel dynamically reconfigurable active L1 instruction and data cache model, called DRAC. Employing a cache, particularly at L1, can speed up memory accesses, reduce the effects of the memory bottleneck and consequently improve system performance; however, efficient design of a cache for embedded systems requires fast and early performance modeling. Our proposed model is a cycle-accurate instruction and data cache emulator designed as an on-chip hardware peripheral on an FPGA. The model can also be integrated into a multicore emulation system to emulate the caches of multiple cores. The DRAC model is implemented on a Xilinx Virtex 5 FPGA and validated using several benchmarks. Our experimental results show that the model can accurately estimate the execution time of a program both as a standalone and as a multicore cache emulator. We observed a 2.78% average error and a 5.06% worst-case error when DRAC is used as a standalone cache model in a single-core design. We also observed 100% relative accuracy in design space exploration and less than 13% absolute worst-case timing estimation error when DRAC is used as a multicore cache emulator.
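    DRAC itself is an on-chip FPGA peripheral, but the accounting it performs can be illustrated in software. The sketch below (a simplified model with assumed geometry and latencies, not DRAC's actual design) replays a memory-access trace through a set-associative cache and accumulates hit and miss latencies to estimate execution cycles.

```python
# A software analogue of cycle-accurate cache timing: replay an address
# trace through a set-associative LRU cache and count cycles. The cache
# geometry and hit/miss latencies below are hypothetical.

class SetAssocCache:
    def __init__(self, size=32 * 1024, line=32, ways=4,
                 hit_cycles=1, miss_cycles=38):
        self.line, self.ways = line, ways
        self.sets = size // (line * ways)
        self.tags = [[] for _ in range(self.sets)]  # per-set LRU order
        self.hit_cycles, self.miss_cycles = hit_cycles, miss_cycles
        self.cycles = 0

    def access(self, addr):
        index = (addr // self.line) % self.sets
        tag = addr // (self.line * self.sets)
        lru = self.tags[index]
        if tag in lru:                      # hit: refresh LRU position
            lru.remove(tag)
            lru.append(tag)
            self.cycles += self.hit_cycles
        else:                               # miss: fill, evict LRU if full
            if len(lru) == self.ways:
                lru.pop(0)
            lru.append(tag)
            self.cycles += self.miss_cycles

cache = SetAssocCache()
for addr in range(0, 64 * 1024, 4):  # hypothetical sequential word trace
    cache.access(addr)
print("estimated cycles:", cache.cycles)
```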

    Caching in real-time and embedded systems and Benchmarking the ARM Cortex-M3 and Quark x1000 processors

    The general goal is to compare the performance of two processors for the low-end embedded market, the Intel Quark x1000 vs. the ARM Cortex-M3, with special emphasis on the memory hierarchy. To do that, first we will assess the cache potential by varying sizes, associativities and line sizes by means of CACTI, a cache modeling tool. Then we will review the relevant research literature to draw conclusions about the importance and possibilities of the memory hierarchy in real-time embedded systems. Finally, we will write a specific benchmark suite and use it to test the two referenced processors.
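    The CACTI-based assessment amounts to sweeping the cache design space. A minimal driver for such a sweep might look like the following; `run_cacti` is a hypothetical placeholder and does not reflect CACTI's real interface.

```python
# A minimal cache design-space sweep driver. run_cacti is a hypothetical
# stub: a real version would generate a CACTI configuration, invoke the
# tool, and parse access time and energy from its output.
import itertools

def run_cacti(size_kb, assoc, line_bytes):
    """Placeholder returning a dummy metric so the sweep runs end to end."""
    return 0.0

for size_kb, assoc, line_bytes in itertools.product(
        [4, 8, 16, 32, 64],   # cache sizes in KB
        [1, 2, 4, 8],         # associativities (ways)
        [16, 32, 64]):        # line sizes in bytes
    metric = run_cacti(size_kb, assoc, line_bytes)
    print(f"{size_kb:>3} KB, {assoc}-way, {line_bytes:>2} B line -> {metric}")
```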

    A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels

    It is universally known that caching is critical to attain high-performance implementations: in many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that, at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.
    Comment: Submitted to the Ninth International Workshop on Automatic Performance Tuning (iWAPT2014)
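    The effect is easy to reproduce: timing the same kernel with cold and warm operands typically yields noticeably different results, which is exactly why a model that ignores cache state mispredicts kernel sequences. A small illustrative experiment (the sizes and the eviction trick are assumptions, with numpy's matmul standing in for a BLAS gemm):

```python
# Time the same kernel cold vs. warm; sweeping a large array between runs
# is a crude way to flush the operands out of cache.
import time
import numpy as np

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)

def timed_gemm():
    t0 = time.perf_counter()
    _ = A @ B
    return time.perf_counter() - t0

evict = np.zeros(32 * 1024 * 1024 // 8)  # 32 MB of doubles

timed_gemm()          # warm-up: library initialization, page faults
evict += 1.0          # sweep the large array: A and B are likely evicted
cold = timed_gemm()   # operands (mostly) out of cache
warm = timed_gemm()   # operands just used: likely still resident
print(f"cold: {cold * 1e3:.2f} ms, warm: {warm * 1e3:.2f} ms")
```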

    Online Cache Modeling for Commodity Multicore Processors

    Modern chip-level multiprocessors (CMPs) contain multiple processor cores sharing a common last-level cache, memory interconnects, and other hardware resources. Workloads running on separate cores compete for these resources, often resulting in highly variable performance. It is generally desirable to co-schedule workloads that have minimal resource contention, in order to improve both performance and fairness. Unfortunately, commodity processors expose only limited information about the state of shared resources such as caches to the software responsible for scheduling workloads that execute concurrently. To make informed resource-management decisions, it is important to obtain accurate measurements of per-workload cache occupancies and their impact on performance, often summarized by utility functions such as miss-ratio curves (MRCs). In this paper, we first introduce an efficient online technique for estimating the cache occupancy of individual software threads using only commonly available hardware performance counters. We derive an analytical model as the basis of our occupancy estimation, and extend it for improved accuracy on modern cache configurations, considering the impact of set-associativity, line replacement policy, and memory locality effects. We demonstrate the effectiveness of occupancy estimation with a series of CMP simulations in which SPEC benchmarks execute concurrently on multiple cores. Leveraging our occupancy estimation technique, we also introduce a lightweight approach for online MRC construction, and demonstrate its effectiveness using a prototype implementation in the VMware ESX Server hypervisor. We present a series of experiments involving SPEC benchmarks, comparing the MRCs we construct online with MRCs generated offline in which various cache sizes are enforced via static page coloring.
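    The core of such occupancy estimation is often presented as a linear model: a thread's own misses insert lines into cache space it does not yet hold, while misses from other threads evict its lines in proportion to its current share. The sketch below implements that base linear form with hypothetical counter readings; the paper's extensions for set-associativity, replacement policy, and memory locality are omitted.

```python
# Linear per-thread cache occupancy estimation from miss counts, with
# hypothetical per-interval counter values. E is occupancy in cache lines.

def update_occupancy(E, self_misses, other_misses, cache_lines):
    """One update step over a measurement interval."""
    share = E / cache_lines
    E += self_misses * (1.0 - share)   # our misses land in space we don't own
    E -= other_misses * share          # others' misses evict us proportionally
    return min(max(E, 0.0), cache_lines)

C = 8 * 1024 * 1024 // 64  # assumed 8 MB cache with 64 B lines
E = 0.0
# Hypothetical (self_misses, other_misses) pairs read from counters.
for self_m, other_m in [(5000, 2000), (1000, 8000), (300, 12000)]:
    E = update_occupancy(E, self_m, other_m, C)
    print(f"estimated occupancy: {E:,.0f} lines")
```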