26 research outputs found

    TLB pre-loading for Java applications

    Get PDF
    The increasing memory requirement for today\u27s applications is causing more stress for the memory system. This side effect puts pressure into available caches, and specifically the TLB cache. TLB misses are responsible for a considerable ratio of the total memory latency, since an average of 10% of execution time is wasted on miss penalties. Java applications are not in a better position. Their attractive features increase the memory footprint. Generally, Java applications TLB miss rate tends to be multiples of miss rate for non-java applications. The high miss rate will cause the application to loose valuable execution time. Our experiments show that on average, miss penalty can constitute about 24% of execution time. Several hardware modifications were suggested to reduce TLB misses for general applications. However, to the best of our knowledge, there have been no similar efforts for Java applications. Here we propose a software-based prediction model that relies on information available to the virtual machine. The model uses the write barrier operation to predict TLB misses with an average 41% accuracy rate

    Morrigan: A composite instruction TLB prefetcher

    Get PDF
    The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of the second-level TLB (STLB) misses in desktop and HPC applications. The address translation cost of instruction accesses has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses. This paper demonstrates that instruction address translation is a performance bottleneck in server workloads. In response, we propose Morrigan, a microarchitectural instruction STLB prefetcher whose design is based on new insights regarding instruction STLB misses. At the core of Morrigan there is an ensemble of table-based Markov prefetchers that build and store variable length Markov chains out of the instruction STLB miss stream. Morrigan further employs a sequential prefetcher and a scheme that exploits page table locality to maximize miss coverage. An important contribution of the work is showing that access frequency is more important than access recency when choosing replacement candidates. Based on this insight, Morrigan introduces a new replacement policy that identifies victims in the Markov prefetchers using a frequency stack while adapting to phase-change behavior. On a set of 45 industrial server workloads, Morrigan eliminates 69% of the memory references in demand page walks triggered by instruction STLB misses and improves geometric mean performance by 7.6%.This work is partially supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project, the Generalitat de Catalunya (contract 2017-SGR-1414), the NSF grant CCF-1912617, the Semiconductor Research Corporation grant 2936.001, and generous gifts from Intel Labs. Georgios Vavouliotis has been supported by the Spanish Ministry of Economy, Industry and Competitiveness and the European Social Fund under the FPI fellowship No. PRE2018-087046. Marc Casas has been supported by the Spanish Ministry of Economy, Industry and Competitiveness under the Ramon y Cajal fellowship No. RYC-2017-23269.Peer ReviewedPostprint (author's final draft

    Morrigan: A Composite Instruction TLB Prefetcher

    Get PDF
    The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of the second-level TLB (STLB) misses in desktop and HPC applications. The address translation cost of instruction accesses has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses. This paper demonstrates that instruction address translation is a performance bottleneck in server workloads. In response, we propose Morrigan, a microarchitectural instruction STLB prefetcher whose design is based on new insights regarding instruction STLB misses. At the core of Morrigan there is an ensemble of table-based Markov prefetchers that build and store variable length Markov chains out of the instruction STLB miss stream. Morrigan further employs a sequential prefetcher and a scheme that exploits page table locality to maximize miss coverage. An important contribution of the work is showing that access frequency is more important than access recency when choosing replacement candidates. Based on this insight, Morrigan introduces a new replacement policy that identifies victims in the Markov prefetchers using a frequency stack while adapting to phase-change behavior. On a set of 45 industrial server workloads, Morrigan eliminates 69% of the memory references in demand page walks triggered by instruction STLB misses and improves geometric mean performance by 7.6%.This work is partially supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project, the Generalitat de Catalunya (contract 2017-SGR-1414), the NSF grant CCF-1912617, the Semiconductor Research Corporation grant 2936.001, and generous gifts from Intel Labs. Georgios Vavouliotis has been supported by the Spanish Ministry of Economy, Industry and Competitiveness and the European Social Fund under the FPI fellowship No. PRE2018-087046. Marc Casas has been supported by the Spanish Ministry of Economy, Industry and Competitiveness under the Ramon y Cajal fellowship No. RYC-2017-23269.Peer ReviewedPostprint (author's final draft

    Near-Memory Address Translation

    Full text link
    Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing for translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure

    A Survey of Techniques for Architecting TLBs

    Get PDF
    “Translation lookaside buffer” (TLB) caches virtual to physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of TLB is important for improving performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers

    PROFILE- AND INSTRUMENTATION- DRIVEN METHODS FOR EMBEDDED SIGNAL PROCESSING

    Get PDF
    Modern embedded systems for digital signal processing (DSP) run increasingly sophisticated applications that require expansive performance resources, while simultaneously requiring better power utilization to prolong battery-life. Achieving such conflicting objectives requires innovative software/hardware design space exploration spanning a wide-array of techniques and technologies that offer trade-offs among performance, cost, power utilization, and overall system design complexity. To save on non-recurring engineering (NRE) costs and in order to meet shorter time-to-market requirements, designers are increasingly using an iterative design cycle and adopting model-based computer-aided design (CAD) tools to facilitate analysis, debugging, profiling, and design optimization. In this dissertation, we present several profile- and instrumentation-based techniques that facilitate design and maintenance of embedded signal processing systems: 1. We propose and develop a novel, translation lookaside buffer (TLB) preloading technique. This technique, called context-aware TLB preloading (CTP), uses a synergistic relationship between the (1) compiler for application specific analysis of a task's context, and (2) operating system (OS), for run-time introspection of the context and efficient identification of TLB entries for current and future usage. CTP works by (1) identifying application hotspots using compiler-enabled (or manual) profiling, and (2) exploiting well-understood memory access patterns, typical in signal processing applications, to preload the TLB at context switch time. The benefits of CTP in eliminating inter-task TLB interference and preemptively allocating TLB entries during context-switch are demonstrated through extensive experimental results with signal processing kernels. 2. We develop an instrumentation-driven approach to facilitate the conversion of legacy systems, not designed as dataflow-based applications, to dataflow semantics by automatically identifying the behavior of the core actors as instances of well-known dataflow models. This enables the application of powerful dataflow-based analysis and optimization methods to systems to which these methods have previously been unavailable. We introduce a generic method for instrumenting dataflow graphs that can be used to profile and analyze actors, and we use this instrumentation facility to instrument legacy designs being converted and then automatically detect the dataflow models of the core functions. We also present an iterative actor partitioning process that can be used to partition complex actors into simpler entities that are more prone to analysis. We demonstrate the utility of our proposed new instrumentation-driven dataflow approach with several DSP-based case studies. 3. We extend the instrumentation technique discussed in (2) to introduce a novel tool for model-based design validation called dataflow validation framework (DVF). DVF addresses the problem of ensuring consistency between (1) dataflow properties that are declared or otherwise assumed as part of dataflow-based application models, and (2) the dataflow behavior that is exhibited by implementations that are derived from the models. The ability of DVF to identify disparities between an application's formal dataflow representation and its implementation is demonstrated through several signal processing application development case studies
    corecore