
    Cache-aware Performance Modeling and Prediction for Dense Linear Algebra

    Countless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation -- without executing it -- is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, one normally has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms to within a few percent of the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes.
    Comment: Submitted to PMBS1
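    To make the cache-tracking idea above concrete, here is a minimal sketch (with entirely hypothetical kernel names, operand sizes, and timings) of how pre-measured per-kernel timings could be combined with an LRU approximation of cache contents to predict the cost of a kernel sequence without executing it; the paper's actual models are considerably more refined.

```python
# A minimal sketch of cache-tracking performance prediction, assuming
# hypothetical kernel names, operand sizes, and pre-measured timings.
from collections import OrderedDict

CACHE_BYTES = 8 * 1024 * 1024  # assumed last-level cache capacity

class CacheTracker:
    """Approximates cache contents with an LRU set of named operands."""
    def __init__(self, capacity=CACHE_BYTES):
        self.capacity = capacity
        self.resident = OrderedDict()  # operand name -> size in bytes

    def touch(self, name, size):
        """Record an access; return True if the operand was resident."""
        hit = name in self.resident
        if hit:
            self.resident.move_to_end(name)
        else:
            self.resident[name] = size
            while sum(self.resident.values()) > self.capacity:
                self.resident.popitem(last=False)  # evict least recently used
        return hit

def predict_time(kernel_seq, timings, tracker):
    """Sum pre-measured kernel timings, picking the warm- or cold-cache
    figure according to the tracked residency of each kernel's operands."""
    total = 0.0
    for kernel, operands in kernel_seq:
        hits = [tracker.touch(name, size) for name, size in operands]
        total += timings[kernel]["warm" if all(hits) else "cold"]
    return total

# Hypothetical timings (seconds) for two BLAS-like kernels, measured once.
timings = {"gemm": {"warm": 0.8e-3, "cold": 1.3e-3},
           "trsm": {"warm": 0.5e-3, "cold": 0.9e-3}}
sequence = [("gemm", [("A", 2**21), ("B", 2**21)]),
            ("trsm", [("A", 2**21), ("X", 2**20)])]
print(predict_time(sequence, timings, CacheTracker()))
```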

    Dynamically Reconfigurable Active Cache Modeling

    This thesis presents a novel dynamically reconfigurable active L1 instruction and data cache model, called DRAC. Employing a cache, particularly at L1, can speed up memory accesses, reduce the effects of the memory bottleneck and consequently improve system performance; however, efficient design of a cache for embedded systems requires fast and early performance modeling. Our proposed model is a cycle-accurate instruction and data cache emulator designed as an on-chip hardware peripheral on an FPGA. The model can also be integrated into a multicore emulation system to emulate the caches of multiple cores. The DRAC model is implemented on a Xilinx Virtex 5 FPGA and validated using several benchmarks. Our experimental results show that the model can accurately estimate the execution time of a program both as a standalone and as a multicore cache emulator. We observed a 2.78% average error and a 5.06% worst-case error when DRAC is used as a standalone cache model in a single-core design. We also observed 100% relative accuracy in design space exploration and less than 13% absolute worst-case timing estimation error when DRAC is used as a multicore cache emulator.
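    DRAC itself is an on-chip FPGA peripheral, but the accounting it performs can be illustrated in software. The sketch below (a simplified model with assumed geometry and latencies, not DRAC's actual design) replays a memory-access trace through a set-associative cache and accumulates hit and miss latencies to estimate execution cycles.

```python
# A software analogue of cycle-accurate cache timing: replay an address
# trace through a set-associative LRU cache and count cycles. The cache
# geometry and hit/miss latencies below are hypothetical.

class SetAssocCache:
    def __init__(self, size=32 * 1024, line=32, ways=4,
                 hit_cycles=1, miss_cycles=38):
        self.line, self.ways = line, ways
        self.sets = size // (line * ways)
        self.tags = [[] for _ in range(self.sets)]  # per-set LRU order
        self.hit_cycles, self.miss_cycles = hit_cycles, miss_cycles
        self.cycles = 0

    def access(self, addr):
        index = (addr // self.line) % self.sets
        tag = addr // (self.line * self.sets)
        lru = self.tags[index]
        if tag in lru:                      # hit: refresh LRU position
            lru.remove(tag)
            lru.append(tag)
            self.cycles += self.hit_cycles
        else:                               # miss: fill, evict LRU if full
            if len(lru) == self.ways:
                lru.pop(0)
            lru.append(tag)
            self.cycles += self.miss_cycles

cache = SetAssocCache()
for addr in range(0, 64 * 1024, 4):  # hypothetical sequential word trace
    cache.access(addr)
print("estimated cycles:", cache.cycles)
```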

    Caching in real-time and embedded systems and Benchmarking the ARM Cortex-M3 and Quark x1000 processors

    The general goal is to compare the performance of two processors for the low-end embedded market, the Intel Quark x1000 vs. the ARM Cortex-M3, with special emphasis on the memory hierarchy. To do that, first we will assess the cache potential by varying sizes, associativities and line sizes by means of CACTI, a cache modeling tool. Then we will review the relevant research literature to draw conclusions about the importance and possibilities of the memory hierarchy in real-time embedded systems. Finally, we will write a specific benchmark suite and use it to test the two referenced processors.
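    The CACTI-based assessment amounts to sweeping the cache design space. A minimal driver for such a sweep might look like the following; `run_cacti` is a hypothetical placeholder and does not reflect CACTI's real interface.

```python
# A minimal cache design-space sweep driver. run_cacti is a hypothetical
# stub: a real version would generate a CACTI configuration, invoke the
# tool, and parse access time and energy from its output.
import itertools

def run_cacti(size_kb, assoc, line_bytes):
    """Placeholder returning a dummy metric so the sweep runs end to end."""
    return 0.0

for size_kb, assoc, line_bytes in itertools.product(
        [4, 8, 16, 32, 64],   # cache sizes in KB
        [1, 2, 4, 8],         # associativities (ways)
        [16, 32, 64]):        # line sizes in bytes
    metric = run_cacti(size_kb, assoc, line_bytes)
    print(f"{size_kb:>3} KB, {assoc}-way, {line_bytes:>2} B line -> {metric}")
```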

    A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels

    It is universally known that caching is critical to attain high-performance implementations: in many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that, at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.
    Comment: Submitted to the Ninth International Workshop on Automatic Performance Tuning (iWAPT2014)
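    The effect is easy to reproduce: timing the same kernel with cold and warm operands typically yields noticeably different results, which is exactly why a model that ignores cache state mispredicts kernel sequences. A small illustrative experiment (the sizes and the eviction trick are assumptions, with numpy's matmul standing in for a BLAS gemm):

```python
# Time the same kernel cold vs. warm; sweeping a large array between runs
# is a crude way to flush the operands out of cache.
import time
import numpy as np

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)

def timed_gemm():
    t0 = time.perf_counter()
    _ = A @ B
    return time.perf_counter() - t0

evict = np.zeros(32 * 1024 * 1024 // 8)  # 32 MB of doubles

timed_gemm()          # warm-up: library initialization, page faults
evict += 1.0          # sweep the large array: A and B are likely evicted
cold = timed_gemm()   # operands (mostly) out of cache
warm = timed_gemm()   # operands just used: likely still resident
print(f"cold: {cold * 1e3:.2f} ms, warm: {warm * 1e3:.2f} ms")
```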

    Online Cache Modeling for Commodity Multicore Processors

    Modern chip-level multiprocessors (CMPs) contain multiple processor cores sharing a common last-level cache, memory interconnects, and other hardware resources. Workloads running on separate cores compete for these resources, often resulting in highly variable performance. It is generally desirable to co-schedule workloads that have minimal resource contention, in order to improve both performance and fairness. Unfortunately, commodity processors expose only limited information about the state of shared resources such as caches to the software responsible for scheduling workloads that execute concurrently. To make informed resource-management decisions, it is important to obtain accurate measurements of per-workload cache occupancies and their impact on performance, often summarized by utility functions such as miss-ratio curves (MRCs). In this paper, we first introduce an efficient online technique for estimating the cache occupancy of individual software threads using only commonly available hardware performance counters. We derive an analytical model as the basis of our occupancy estimation, and extend it for improved accuracy on modern cache configurations, considering the impact of set-associativity, line replacement policy, and memory locality effects. We demonstrate the effectiveness of occupancy estimation with a series of CMP simulations in which SPEC benchmarks execute concurrently on multiple cores. Leveraging our occupancy estimation technique, we also introduce a lightweight approach for online MRC construction, and demonstrate its effectiveness using a prototype implementation in the VMware ESX Server hypervisor. We present a series of experiments involving SPEC benchmarks, comparing the MRCs we construct online with MRCs generated offline in which various cache sizes are enforced via static page coloring.
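    The core of such occupancy estimation is often presented as a linear model: a thread's own misses insert lines into cache space it does not yet hold, while misses from other threads evict its lines in proportion to its current share. The sketch below implements that base linear form with hypothetical counter readings; the paper's extensions for set-associativity, replacement policy, and memory locality are omitted.

```python
# Linear per-thread cache occupancy estimation from miss counts, with
# hypothetical per-interval counter values. E is occupancy in cache lines.

def update_occupancy(E, self_misses, other_misses, cache_lines):
    """One update step over a measurement interval."""
    share = E / cache_lines
    E += self_misses * (1.0 - share)   # our misses land in space we don't own
    E -= other_misses * share          # others' misses evict us proportionally
    return min(max(E, 0.0), cache_lines)

C = 8 * 1024 * 1024 // 64  # assumed 8 MB cache with 64 B lines
E = 0.0
# Hypothetical (self_misses, other_misses) pairs read from counters.
for self_m, other_m in [(5000, 2000), (1000, 8000), (300, 12000)]:
    E = update_occupancy(E, self_m, other_m, C)
    print(f"estimated occupancy: {E:,.0f} lines")
```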