25,621 research outputs found

    Software trace cache

    Get PDF
    We explore the use of compiler optimizations, which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance; the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout optimized, codes have some special characteristics that make them more amenable for high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.Peer ReviewedPostprint (published version

    Data Cache-Energy and Throughput Models: Design Exploration for Embedded Processors

    Get PDF
    Most modern 16-bit and 32-bit embedded processors contain cache memories to further increase instruction throughput of the device. Embedded processors that contain cache memories open an opportunity for the low-power research community to model the impact of cache energy consumption and throughput gains. For optimal cache memory configuration mathematical models have been proposed in the past. Most of these models are complex enough to be adapted for modern applications like run-time cache reconfiguration. This paper improves and validates previously proposed energy and throughput models for a data cache, which could be used for overhead analysis for various cache types with relatively small amount of inputs. These models analyze the energy and throughput of a data cache on an application basis, thus providing the hardware and software designer with the feedback vital to tune the cache or application for a given energy budget. The models are suitable for use at design time in the cache optimization process for embedded processors considering time and energy overhead or could be employed at runtime for reconfigurable architectures

    Managing Alaska’s Coastal Development: State Review of Federal Oil and Gas Lease Sales

    Get PDF
    Contention for shared cache resources has been recognized as a major bottleneck for multicores—especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them. This paper introduces a classification of applications into four cache usage categories. We discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. We also propose a low-overhead method to automatically find the best per-instruction cache management policy. We demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. Practical experiments demonstrate that our software-only method can improve application performance up to 35% on x86 multicore hardware.Coder-mpUPMAR

    MarciaTesta: An Automatic Generator of Test Programs for Microprocessors' Data Caches

    Get PDF
    SBST (Software Based Self-Testing) is an effective solution for in-system testing of SoCs without any additional hardware requirement. SBST is particularly suited for embedded blocks with limited accessibility, such as cache memories. Several methodologies have been proposed to properly adapt existing March algorithms to test cache memories. Unfortunately they all leave the test engineers the task of manually coding them into the specific Instruction Set Architecture (ISA) of the target microprocessor. We propose an EDA tool for the automatic generation of assembly cache test program for a specific architectur

    Efficient instruction and data caching for high-performance low-power embedded systems

    Get PDF
    Although multi-threading processors can increase the performance of embedded systems with a minimum overhead, fetching instructions from multiple threads each cycle also increases the pressure on the instruction cache, potentially harming the performance/consumption ratio. Instruction caches are responsible of a high percentage of the total energy consumption of the chip, which for battery-powered embedded devices becomes a critical issue. A direct way to reduce the energy consumption of the first level instruction cache is to decrease its size and associativity. However, demanding applications, and specially applications with several threads running together, might suffer a dramatic performance slow down, or even increase the total energy consumption of the cache hierarchy, due to the extra misses incurred. In this work we introduce iLP-NUCA (Instruction Light Power NUCA), a new instruction cache that replaces the conventional second level cache (L2) and improves the Energy–Delay of the system. We provided iLP-NUCA with a new tree-based transport network-in-cache that reduces both the cache line service latency and the energy consumption, regarding the former LP-NUCA implementation. We modeled in our cycle-accurate simulation environment both conventional instruction hierarchies and iLP-NUCAs. Our experiments show that, running SPEC CPU2006, iLP-NUCA, in comparison with a state–of–the–art high performance conventional cache hierarchy (three cache levels, dedicated L1 and L2, shared L3), performs better and consumes less energy. Furthermore, iLP-NUCA reaches the performance, on average, of a conventional instruction cache hierarchy implementing a double sized L1, independently of the number of threads. This translates into a reduction of the Energy–Delay product of 21%, 18%, and 11%, reaching 90%, 95%, and 99% of the ideal performance for 1, 2, and 4 threads, respectively. These results are consistent for the considered applications distribution, and bigger gains are in the most demanding applications (applications with high instruction cache requirements). Besides, we increase the performance of applications with several threads without being detrimental for any of them. The new transport topology reduces the average service latency of cache lines by 8%, and the energy consumption of its components by 20%

    Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters

    Full text link
    High Performance and Energy Efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly-coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20\% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17\% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.Comment: 14 page
    • 

    corecore