
    Software-Based Self-Test of Set-Associative Cache Memories

    Embedded microprocessor cache memories suffer from limited observability and controllability, which creates problems during in-system test. This paper presents a procedure to transform traditional march tests into software-based self-test programs for set-associative cache memories with LRU replacement. Among all the cache blocks in a microprocessor, testing instruction caches represents a major challenge due to limitations in two areas: 1) test patterns, which must be composed of valid instruction opcodes, and 2) test result observability, since faulty behavior can only be observed through the results of executed instructions. For these reasons, the proposed methodology concentrates on the implementation of test programs for instruction caches. The main contribution of this work lies in the possibility of applying state-of-the-art memory test algorithms to embedded cache memories without introducing any hardware or performance overhead, while guaranteeing the detection of typical faults arising in nanometer CMOS technologies.
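    As a point of reference for the march tests the paper builds on, below is a minimal sketch of the classic March C- element sequence applied to a plain RAM array. The array size and fault reporting are illustrative assumptions; the paper's contribution is transforming such tests into cache-targeted self-test programs, which this plain-memory version does not attempt.

```c
/* Minimal sketch of the March C- test applied to a plain RAM array:
   up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); up(r0).
   Array size and reporting are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define N 1024
static volatile uint8_t mem[N];

static int check(uint8_t expected, size_t i) {
    if (mem[i] != expected) { printf("fault at %zu\n", i); return 1; }
    return 0;
}

int march_c_minus(void) {
    int faults = 0;
    size_t i;
    for (i = 0; i < N; i++) mem[i] = 0x00;               /* up: w0       */
    for (i = 0; i < N; i++) { faults += check(0x00, i);  /* up: r0,w1    */
                              mem[i] = 0xFF; }
    for (i = 0; i < N; i++) { faults += check(0xFF, i);  /* up: r1,w0    */
                              mem[i] = 0x00; }
    for (i = N; i-- > 0; )  { faults += check(0x00, i);  /* down: r0,w1  */
                              mem[i] = 0xFF; }
    for (i = N; i-- > 0; )  { faults += check(0xFF, i);  /* down: r1,w0  */
                              mem[i] = 0x00; }
    for (i = 0; i < N; i++) faults += check(0x00, i);    /* up: r0       */
    return faults;
}
```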

    Towards a theory of cache-efficient algorithms

    We describe a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-optimal algorithms for some fundamental problems like sorting, FFT, and an important subclass of permutations in the single-level cache model. We also show that ignoring associativity concerns could lead to inferior performance, by analyzing the average-case cache behavior of mergesort. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and for dealing with the hitherto unresolved problem of limited associativity.
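    To make the idea of designing for the memory hierarchy concrete, here is a hedged sketch of a cache-blocked matrix transpose, one of the problems for which the paper presents optimal algorithms. The tile size B is a hypothetical tuning parameter, not a value from the paper, and the code ignores the limited-associativity subtleties the model captures.

```c
/* Illustrative cache-blocked matrix transpose: traversing the matrix in
   B x B tiles, so each source/destination tile pair fits in cache, is the
   standard way cache models guide algorithm design. B is an assumed
   tuning parameter, not a value from the paper. */
#include <stddef.h>

#define B 32  /* tile edge; pick so two B x B tiles fit in one cache level */

void transpose_blocked(size_t n, const double *a, double *t) {
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            for (size_t i = ii; i < ii + B && i < n; i++)
                for (size_t j = jj; j < jj + B && j < n; j++)
                    t[j * n + i] = a[i * n + j];  /* one tile at a time */
}
```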

    Acceleration by Inline Cache for Memory-Intensive Algorithms on FPGA via High-Level Synthesis

    Using FPGA-based acceleration of high-performance computing (HPC) applications to reduce energy and power consumption is becoming an interesting option, thanks to the availability of high-level synthesis (HLS) tools that enable fast design cycles. However, obtaining good performance for memory-intensive algorithms, which often exchange large data arrays with external DRAM, still requires time-consuming optimization and good knowledge of hardware design. This article proposes a new design methodology based on dedicated application- and data array-specific caches. These caches provide most of the benefits that can be achieved by hand-coding optimized DMA-like transfer strategies into the HPC application code, but require only limited manual tuning (basically the selection of architecture and size), are neutral to the target HLS tool and technology (FPGA or ASIC), and do not require changes to the application code. We show experimental results obtained on five common memory-intensive algorithms from very diverse domains, namely machine learning, data sorting, and computer vision. We test the cost and performance of our caches against both out-of-the-box code originally optimized for a GPU and manually optimized implementations specifically targeted at FPGAs via HLS. The implementations using our caches achieved an 8X speedup and 2X energy reduction on average with respect to the out-of-the-box versions, using only simple directive-based optimizations (e.g., pipelining). They also achieved comparable performance with much less design effort when compared with the versions that were manually optimized to achieve efficient memory transfers specifically for an FPGA.
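    The abstract describes array-specific inline caches inserted between the algorithm and external DRAM. The sketch below illustrates that general idea with a hypothetical direct-mapped, read-only line buffer wrapping a single array; the names, sizes, and structure are assumptions for illustration and are not the paper's cache architecture.

```c
/* Hypothetical sketch of an inline, array-specific cache: accesses to an
   "external DRAM" array go through a small direct-mapped line buffer, so
   the algorithm's code is unchanged. Sizes and layout are assumptions,
   not the paper's architecture. */
#include <stdint.h>
#include <string.h>

#define LINE_WORDS 16          /* words per cache line (assumed) */
#define NUM_LINES  64          /* number of lines (assumed)      */

typedef struct {
    uint32_t data[NUM_LINES][LINE_WORDS];
    uint32_t tag[NUM_LINES];
    int      valid[NUM_LINES];
    const uint32_t *backing;   /* the external array being cached */
} inline_cache;

uint32_t cache_read(inline_cache *c, uint32_t addr) {
    uint32_t line = (addr / LINE_WORDS) % NUM_LINES;
    uint32_t tag  = addr / (LINE_WORDS * NUM_LINES);
    if (!c->valid[line] || c->tag[line] != tag) {   /* miss: fetch line */
        uint32_t base = (addr / LINE_WORDS) * LINE_WORDS;
        memcpy(c->data[line], &c->backing[base],
               LINE_WORDS * sizeof(uint32_t));
        c->tag[line]   = tag;
        c->valid[line] = 1;
    }
    return c->data[line][addr % LINE_WORDS];        /* hit path */
}
```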

    An Associativity Threshold Phenomenon in Set-Associative Caches

    In an $\alpha$-way set-associative cache, the cache is partitioned into disjoint sets of size $\alpha$, and each item can only be cached in one set, typically selected via a hash function. Set-associative caches are widely used and have many benefits, e.g., in terms of latency or concurrency, over fully associative caches, but they often incur more cache misses. As the set size $\alpha$ decreases, the benefits increase, but the paging costs worsen. In this paper we characterize the performance of an $\alpha$-way set-associative LRU cache of total size $k$, as a function of $\alpha = \alpha(k)$. We prove the following, assuming that sets are selected using a fully random hash function:
    - For $\alpha = \omega(\log k)$, the paging cost of an $\alpha$-way set-associative LRU cache is within additive $O(1)$ of that of a fully-associative LRU cache of size $(1-o(1))k$, with probability $1 - 1/\operatorname{poly}(k)$, for all request sequences of length $\operatorname{poly}(k)$.
    - For $\alpha = o(\log k)$, and for all $c = O(1)$ and $r = O(1)$, the paging cost of an $\alpha$-way set-associative LRU cache is not within a factor $c$ of that of a fully-associative LRU cache of size $k/r$, for some request sequence of length $O(k^{1.01})$.
    - For $\alpha = \omega(\log k)$, if the hash function can be occasionally changed, the paging cost of an $\alpha$-way set-associative LRU cache is within a factor $1 + o(1)$ of that of a fully-associative LRU cache of size $(1-o(1))k$, with probability $1 - 1/\operatorname{poly}(k)$, for request sequences of arbitrary (e.g., super-polynomial) length.
    Some of our results generalize to other paging algorithms besides LRU, such as least-frequently used (LFU).
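    For readers who want the model pinned down, the following toy simulator counts misses for an $\alpha$-way set-associative LRU cache of total size $k$, with the set selected by hashing the requested item. The multiplicative hash and all parameters are placeholders standing in for the paper's fully random hash function; the code only illustrates the cache being analyzed, not the proofs.

```c
/* Toy miss counter for an ALPHA-way set-associative LRU cache of total
   size K. The hash and parameters are placeholders; item ids are assumed
   nonzero so 0 can mark an empty slot. */
#include <stdint.h>

#define ALPHA 4
#define K     256                   /* total capacity */
#define SETS  (K / ALPHA)

static uint64_t slot[SETS][ALPHA];  /* cached item ids, 0 = empty   */
static uint64_t stamp[SETS][ALPHA]; /* last-use time, for LRU order */
static uint64_t now;

static uint32_t set_of(uint64_t item) {   /* stand-in for a fully
                                             random hash function   */
    return (uint32_t)((item * 0x9E3779B97F4A7C15ULL) >> 32) % SETS;
}

int access_item(uint64_t item) {          /* returns 1 on a miss */
    uint32_t s = set_of(item);
    int victim = 0;
    now++;
    for (int w = 0; w < ALPHA; w++) {
        if (slot[s][w] == item) { stamp[s][w] = now; return 0; } /* hit */
        if (stamp[s][w] < stamp[s][victim]) victim = w;  /* track LRU  */
    }
    slot[s][victim]  = item;              /* miss: evict LRU way */
    stamp[s][victim] = now;
    return 1;
}
```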

    Applying measurement-based probabilistic timing analysis to buffer resources

    The use of complex hardware makes it difficult for current timing analysis techniques to compute trustworthy and tight worst-case execution time (WCET) bounds. Those techniques require detailed knowledge of the internal operation and state of the platform, at both the software and hardware level, and obtaining that information for modern hardware platforms is increasingly difficult. Measurement-Based Probabilistic Timing Analysis (MBPTA) reduces the cost of acquiring the knowledge needed for computing trustworthy and tight WCET bounds. MBPTA based on Extreme Value Theory requires the execution times of processor instructions to be independent and identically distributed (i.i.d.), which can be achieved with some hardware support. Previous proposals show how those properties can be achieved for caches. This paper considers, for the first time, the implications on MBPTA of using buffer resources. Buffers in general, and first-come first-served (FCFS) buffers in particular, are of paramount importance as the complexity of hardware increases, since they allow managing contention in those resources where multiple requests may be pending. We show how buffers can be used in the context of MBPTA and provide illustrative examples.
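    To fix terminology, the sketch below shows a minimal FCFS ring buffer: pending requests are served strictly in arrival order, so the contention delay a request experiences depends only on buffer occupancy at its arrival. The depth and interface are assumptions for illustration, not the paper's hardware model.

```c
/* Minimal FCFS (first-come first-served) ring buffer: requests are served
   strictly in arrival order. Depth and interface are illustrative
   assumptions, not the paper's hardware model. */
#include <stdint.h>

#define DEPTH 8                              /* buffer depth (assumed) */

typedef struct {
    uint32_t req[DEPTH];
    unsigned head, tail, count;
} fcfs_buffer;

int fcfs_push(fcfs_buffer *b, uint32_t r) {  /* enqueue; 0 if full */
    if (b->count == DEPTH) return 0;
    b->req[b->tail] = r;
    b->tail = (b->tail + 1) % DEPTH;
    b->count++;
    return 1;
}

int fcfs_pop(fcfs_buffer *b, uint32_t *r) {  /* serve oldest; 0 if empty */
    if (b->count == 0) return 0;
    *r = b->req[b->head];
    b->head = (b->head + 1) % DEPTH;
    b->count--;
    return 1;
}
```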