
    Optimization of instruction fetch for decision support workloads

    Instruction fetch bandwidth is feared to be a major limiting factor in the performance of future wide-issue aggressive superscalars. In this paper, we focus on database applications running decision support workloads. We characterize the locality patterns of a database kernel and find frequently executed paths. Using this information, we propose an algorithm to lay out the basic blocks for improved I-fetch. Our results show a miss reduction of 60-98% for realistic I-cache sizes and a doubling of the number of instructions executed between taken branches. As a consequence, we increase the fetch bandwidth provided by an aggressive sequential fetch unit from 5.8 instructions per cycle for the original code to 10.6 using our proposed layout. Our software scheme combines well with hardware schemes such as a trace cache, providing up to 12.1 instructions per cycle and suggesting that commercial workloads may be amenable to the aggressive I-fetch of future superscalars.
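    The abstract does not reproduce the layout algorithm itself. Purely as an illustration of the general idea, the following Python sketch (all names hypothetical) greedily chains basic blocks along their hottest profiled edges, in the spirit of Pettis-Hansen code positioning, so that the common path becomes straight-line code and the branches along it fall through:

    # Hypothetical sketch of profile-guided basic-block layout: greedily chain
    # blocks along their hottest edges so frequently executed paths become
    # sequential code and their branches are not taken.
    def layout_blocks(edges):
        """edges: list of (src, dst, count) tuples from an edge profile."""
        chain_of = {}  # block -> the chain (a Python list) containing it

        def chain(b):
            if b not in chain_of:
                chain_of[b] = [b]
            return chain_of[b]

        # Visit edges hottest-first; merge two chains only when the edge
        # becomes a fall-through (src ends one chain, dst begins another).
        for src, dst, _count in sorted(edges, key=lambda e: -e[2]):
            a, b = chain(src), chain(dst)
            if a is not b and a[-1] == src and b[0] == dst:
                a.extend(b)
                for blk in b:
                    chain_of[blk] = a

        # Emit each distinct chain once, in discovery order.
        seen, order = set(), []
        for blk in chain_of:
            c = chain_of[blk]
            if id(c) not in seen:
                seen.add(id(c))
                order.extend(c)
        return order

    # Example: block A usually branches to B, rarely to C; B loops back to A.
    print(layout_blocks([("A", "B", 90), ("A", "C", 10), ("B", "A", 80)]))
    # -> ['A', 'B', 'C']: the hot A->B edge is now a fall-through.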

    Software trace cache

    We explore the use of compiler optimizations which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources, regardless of the specific details of the processor/architecture, in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains, trying to make sequentially executed basic blocks reside in consecutive memory positions, and then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance: the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout-optimized codes have some special characteristics that make them more amenable to high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; they also make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.
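    As a rough sketch of the chain-building step described above (not the authors' implementation), one might seed each chain with the hottest unplaced basic block and repeatedly extend it with the most frequent unplaced successor observed in a profile trace; the resulting chains, hottest first, would then be mapped to consecutive memory positions:

    # Illustrative chain building from an execution trace; names hypothetical.
    from collections import Counter

    def build_chains(trace, block_heat):
        """trace: list of basic-block ids in execution order.
           block_heat: Counter of how often each block executed."""
        # Count how often each block directly follows another.
        succ = Counter(zip(trace, trace[1:]))
        placed, chains = set(), []
        # Seed each chain with the hottest block not yet placed.
        for seed, _ in block_heat.most_common():
            if seed in placed:
                continue
            chain, cur = [seed], seed
            placed.add(seed)
            while True:
                # Extend with the most frequent unplaced successor.
                nexts = [(c, d) for (s, d), c in succ.items()
                         if s == cur and d not in placed]
                if not nexts:
                    break
                _, cur = max(nexts)
                chain.append(cur)
                placed.add(cur)
            chains.append(chain)
        return chains  # hottest chains first; map to consecutive addresses

    trace = ["A", "B", "A", "B", "C", "A", "B"]
    print(build_chains(trace, Counter(trace)))  # -> [['A', 'B', 'C']]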

    Hyperheuristics for explicit resource partitioning in simultaneous multithreaded processors


    ADACORE: Achieving Energy Efficiency via Adaptive Core Morphing at Runtime

    Heterogeneous multicore processors offer an energy-efficient alternative to homogeneous multicores. Typically, a heterogeneous multicore refers to a system with more than one core where all the cores use a single ISA but differ in one or more microarchitectural configurations. A carefully designed multicore system consists of cores of diverse power and performance profiles. During execution, an application is run on the core that offers the best trade-off between performance and energy efficiency. Since the resource needs of an application may vary with time, so does the optimal core choice. Moving a thread from one core to another involves transferring the entire processor state and warming up the cache. Frequent migration leads to large performance overhead, negating any benefits of migration; infrequent migration, on the other hand, leads to missed opportunities. Thus, reducing the overhead of migration is integral to harnessing the benefits of heterogeneous multicores.

    This work proposes AdaCore, a novel core architecture which pushes the heterogeneity exploited in heterogeneous multicores into a single core. AdaCore primarily addresses resource bottlenecks in workloads. The design adaptively matches resource demands by reconfiguring on-chip resources at a fine granularity. Adaptive core morphing allows core configurations with diverse power and performance profiles within a single core via adaptive voltage, frequency, and resource reconfiguration. The proposed architecture thus improves performance while providing energy savings, with low-overhead in-core reconfiguration. This thesis further compares AdaCore with a standard out-of-order core capable of Dynamic Voltage and Frequency Scaling (DVFS) designed to achieve energy efficiency. The results presented in this thesis indicate that the proposed scheme improves the performance/Watt of applications, on average, by 32% over a static out-of-order core and by 14% over DVFS, and improves IPS²/Watt by 38% over the static out-of-order core.
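    The abstract does not detail AdaCore's reconfiguration policy. The following is only a loose, hypothetical illustration of what an interval-based morphing controller could look like: sample a few performance counters each interval, diagnose the dominant bottleneck, and pick the cheapest configuration that relieves it. All configurations, counters, and thresholds below are invented for illustration:

    # Hypothetical interval-based morphing controller; not AdaCore's policy.
    CONFIGS = {
        "narrow_lowv": {"width": 2, "rob": 64,  "volts": 0.8},
        "balanced":    {"width": 4, "rob": 128, "volts": 0.9},
        "wide_mem":    {"width": 4, "rob": 256, "volts": 1.0},
    }

    def choose_config(counters):
        """counters: per-interval dict with 'ipc', 'rob_full_frac', 'l2_mpki'."""
        if counters["l2_mpki"] > 20 or counters["rob_full_frac"] > 0.5:
            return "wide_mem"     # memory-bound: enlarge ROB to expose MLP
        if counters["ipc"] < 0.7:
            return "narrow_lowv"  # low ILP: shrink the core, drop voltage
        return "balanced"

    print(choose_config({"ipc": 0.5, "rob_full_frac": 0.1, "l2_mpki": 3}))
    # -> 'narrow_lowv': little ILP to exploit, so save energy instead.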

    Morrigan: A composite instruction TLB prefetcher

    The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of the second-level TLB (STLB) misses in desktop and HPC applications. The address translation cost of instruction accesses has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses. This paper demonstrates that instruction address translation is a performance bottleneck in server workloads. In response, we propose Morrigan, a microarchitectural instruction STLB prefetcher whose design is based on new insights regarding instruction STLB misses. At the core of Morrigan there is an ensemble of table-based Markov prefetchers that build and store variable length Markov chains out of the instruction STLB miss stream. Morrigan further employs a sequential prefetcher and a scheme that exploits page table locality to maximize miss coverage. An important contribution of the work is showing that access frequency is more important than access recency when choosing replacement candidates. Based on this insight, Morrigan introduces a new replacement policy that identifies victims in the Markov prefetchers using a frequency stack while adapting to phase-change behavior. On a set of 45 industrial server workloads, Morrigan eliminates 69% of the memory references in demand page walks triggered by instruction STLB misses and improves geometric mean performance by 7.6%.
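    As a first-order simplification of the variable-length Markov chains the paper describes (this sketch is illustrative, not Morrigan's actual design), a table-based Markov prefetcher might record, for each missing instruction page, which pages tend to miss next, prefetch those successors on a table hit, and evict table entries by lowest access frequency rather than by recency:

    # Hypothetical first-order Markov STLB prefetcher with frequency-based
    # replacement; table sizes and structure are invented for illustration.
    class MarkovTLBPrefetcher:
        def __init__(self, max_entries=64, max_succ=4):
            self.table = {}   # miss page -> {successor page: frequency}
            self.freq = {}    # miss page -> lookup count (frequency stack)
            self.max_entries, self.max_succ = max_entries, max_succ
            self.prev_miss = None

        def on_stlb_miss(self, page):
            prefetches = []
            if page in self.table:
                self.freq[page] += 1
                # Prefetch translations for the most likely successors.
                succ = self.table[page]
                prefetches = sorted(succ, key=succ.get, reverse=True)
            if self.prev_miss is not None:
                self._record(self.prev_miss, page)
            self.prev_miss = page
            return prefetches

        def _record(self, prev, page):
            if prev not in self.table:
                if len(self.table) >= self.max_entries:
                    # Frequency-based replacement: evict the least-used
                    # entry instead of the least-recently-used one.
                    victim = min(self.freq, key=self.freq.get)
                    del self.table[victim], self.freq[victim]
                self.table[prev], self.freq[prev] = {}, 1
            succ = self.table[prev]
            if page in succ or len(succ) < self.max_succ:
                succ[page] = succ.get(page, 0) + 1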

    Cooperative cache scrubbing

    Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10 to 60% of DRAM writes are useless, because the data on these lines are dead: the program is guaranteed to never read them again. Furthermore, reading memory only to immediately zero initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.
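    A minimal sketch of the software half of this idea (the scrubbing instructions are modeled as stubs, since they are proposed hardware rather than an existing ISA feature): a garbage collector that evacuates a region knows every line in it is dead, so the memory manager can hint the cache to drop those lines instead of writing them back, and to install zeroed lines for fresh allocations without fetching from DRAM:

    # Hypothetical memory-manager hooks issuing scrub hints; `scrub_line` and
    # `zero_line` stand in for the paper's proposed instructions (no-op stubs
    # here; real hardware support would be required).
    LINE = 64  # cache line size in bytes

    def scrub_line(addr):
        pass  # invalidate a dead dirty line without writing it back to DRAM

    def zero_line(addr):
        pass  # install a zeroed line without fetching its old contents

    def on_region_evacuated(start, size):
        # All live objects were copied out: these lines are dead and will
        # never be read again, so their writebacks would be useless traffic.
        for addr in range(start, start + size, LINE):
            scrub_line(addr)

    def on_region_allocated(start, size):
        # Fresh allocations must read as zero; avoid the useless DRAM fetch
        # that would otherwise precede zero-initialization.
        for addr in range(start, start + size, LINE):
            zero_line(addr)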