2,479 research outputs found

    The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework

    Full text link
    Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework, the Virtual Block Interface (VBI). We design VBI based on the key idea that delegating memory management duties to hardware can reduce the overheads and software complexity associated with virtual memory. VBI introduces a set of variable-sized virtual blocks (VBs) to applications. Each VB is a contiguous region of the globally-visible VBI address space, and an application can allocate each semantically meaningful unit of information (e.g., a data structure) in a separate VB. VBI decouples access protection from memory allocation and address translation. While the OS controls which programs have access to which VBs, dedicated hardware in the memory controller manages the physical memory allocation and address translation of the VBs. This approach enables several architectural optimizations to (1) efficiently and flexibly cater to different and increasingly diverse system configurations, and (2) eliminate key inefficiencies of conventional virtual memory. We demonstrate the benefits of VBI with two important use cases: (1) reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and (2) two heterogeneous main memory architectures, where VBI increases the effectiveness of managing fast memory regions. For both cases, VBI significantly improves performance over conventional virtual memory.
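
    The delegation described above is easiest to picture from the application side. Below is a minimal, hypothetical C sketch of what allocating each data structure in its own variable-sized virtual block might look like; vb_alloc and vb_protect are illustrative names, not an API defined by the paper, and the hardware-managed side is only hinted at in comments.

        /* Hypothetical sketch of a VBI-style interface: one variable-sized virtual
         * block (VB) per semantically meaningful data structure. Names and
         * signatures are illustrative assumptions, not the paper's actual API. */
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        typedef uint64_t vb_id_t;        /* position of a VB in the global VBI space */

        #define VB_PROT_READ  0x1
        #define VB_PROT_WRITE 0x2

        static vb_id_t next_vb = 1;

        /* Stub: in a real VBI system, physical allocation and address translation
         * for the block would be handled by hardware in the memory controller. */
        static vb_id_t vb_alloc(size_t size) { (void)size; return next_vb++; }

        /* Stub: access protection is the OS's job, decoupled from placement. */
        static int vb_protect(vb_id_t vb, int prot) { (void)vb; (void)prot; return 0; }

        int main(void) {
            vb_id_t hash_table = vb_alloc(1ull << 30);  /* 1 GiB VB for a large table */
            vb_id_t log_buffer = vb_alloc(1ull << 20);  /* 1 MiB VB for a log */

            vb_protect(hash_table, VB_PROT_READ | VB_PROT_WRITE);
            vb_protect(log_buffer, VB_PROT_READ);       /* e.g., read-only consumers */

            printf("allocated VBs %llu and %llu\n",
                   (unsigned long long)hash_table, (unsigned long long)log_buffer);
            return 0;
        }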

    Near-Memory Address Translation

    Full text link
    Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation, MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing for translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.
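
    A rough C sketch of the restricted, set-associative virtual-to-physical placement that enables this parallelism is shown below; the page size, set count, and associativity are assumptions chosen for illustration, not DIPTA's actual parameters.

        /* With associativity K, a virtual page can live in only K candidate frames,
         * all determined directly from the virtual page number, so the data fetch
         * can be issued in parallel with the translation lookup. */
        #include <stdint.h>
        #include <stdio.h>

        #define PAGE_SHIFT    12          /* 4 KiB pages */
        #define NUM_SETS      (1u << 20)  /* illustrative number of frame sets */
        #define ASSOCIATIVITY 4           /* K candidate frames per virtual page */

        /* Frames that could hold this virtual page, computable without a page walk. */
        static void candidate_frames(uint64_t vaddr, uint64_t frames[ASSOCIATIVITY]) {
            uint64_t vpn = vaddr >> PAGE_SHIFT;
            uint64_t set = vpn % NUM_SETS;
            for (int way = 0; way < ASSOCIATIVITY; way++)
                frames[way] = set * ASSOCIATIVITY + way;
        }

        int main(void) {
            uint64_t frames[ASSOCIATIVITY];
            candidate_frames(0x7f3a12345678ULL, frames);
            /* A near-memory unit could now fetch all K candidates (they fall in one
             * memory partition) while DIPTA resolves which way actually holds it. */
            for (int i = 0; i < ASSOCIATIVITY; i++)
                printf("candidate frame %d: %llu\n", i, (unsigned long long)frames[i]);
            return 0;
        }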

    Análisis y evaluación de prestaciones del empleo de técnicas de caché para acelerar el proceso de búsqueda en tablas de páginas

    Full text link
    [EN] The process of translating virtual addresses to physical addresses carried out by the processor's MMU requires multiple memory accesses to retrieve the entry at each level of the tree in which the page table is structured. TLBs (Translation Lookaside Buffers) avoid these accesses by storing the addresses of the pages that have been translated most recently. However, whenever a TLB miss occurs, the complete translation process must be carried out, accessing the page table in memory multiple times. The translation process was initially supported in software by the OS (an exception handler invoked on each TLB miss), which greatly slowed the resolution of TLB misses. Fortunately, current processors include hardware support for the translation process through an automaton called the Page Table Walker, which accesses each of the translation levels of the page table in memory. However, each of these accesses to main memory incurs a high latency compared to accesses to the cache levels, which are becoming ever faster in relative terms. An effective way to accelerate this translation process is to introduce a cache within (or near) the MMU to store the partial translations corresponding to each of the translation levels of the page table. Current proposals in this direction show the advantages of using a Page Table Walk Cache (PTWC) in each core. In this work, in addition to analyzing these advantages, we analyze those derived from a PTWC configuration shared among the different cores that make up the CMP. This analysis aims to show the advantages of a shared PTWC over a private-PTWC configuration.
    Ramis Fuambuena, A. (2014). Análisis y evaluación de prestaciones del empleo de técnicas de caché para acelerar el proceso de búsqueda en tablas de páginas. http://hdl.handle.net/10251/56684
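
    As a rough illustration of the mechanism analyzed here, the C sketch below models a tiny page table walk cache for an x86-64-style four-level walk: each entry caches the physical address of an intermediate page-table page, keyed by the upper bits of the virtual page number, so a hit lets the walker skip the corresponding upper-level memory accesses. Sizes, indexing, and the insertion policy are assumptions for illustration only.

        #include <stdint.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define LEVELS 4                       /* PML4, PDPT, PD, PT */
        #define SETS   64

        typedef struct { bool valid; uint64_t tag; uint64_t table_pa; } ptwc_entry;
        static ptwc_entry ptwc[LEVELS][SETS];  /* one small direct-mapped array per level */

        /* A level-l entry caches the physical address of the page-table page that maps
         * the region identified by this VPN prefix (l = 1 caches PT pages, l = 3 caches
         * PDPT pages); smaller l means a longer prefix, i.e., a deeper hit. */
        static uint64_t prefix(uint64_t vpn, int l) { return vpn >> (9 * l); }

        static void ptwc_insert(uint64_t vpn, int l, uint64_t table_pa) {
            ptwc_entry *e = &ptwc[l][prefix(vpn, l) % SETS];
            e->valid = true; e->tag = prefix(vpn, l); e->table_pa = table_pa;
        }

        /* Returns how many of the four walk accesses a hit lets the walker skip. */
        static int ptwc_lookup(uint64_t vpn, uint64_t *table_pa) {
            for (int l = 1; l < LEVELS; l++) {         /* deepest (longest-prefix) hit first */
                ptwc_entry *e = &ptwc[l][prefix(vpn, l) % SETS];
                if (e->valid && e->tag == prefix(vpn, l)) { *table_pa = e->table_pa; return LEVELS - l; }
            }
            return 0;                                  /* miss: full walk from the root */
        }

        int main(void) {
            uint64_t vpn = 0x123456789ULL, pa;
            ptwc_insert(vpn, 1, 0xABC000);             /* cache the PT page for this region */
            printf("skipped upper-level accesses: %d\n", ptwc_lookup(vpn, &pa));
            return 0;
        }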

    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    Full text link
    The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM.
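
    Since xthreads is described as an extension of pthreads, the flavor of the model can be suggested with a plain pthreads sketch: threads share one virtual address space and pass pointers directly, with no explicit copies, and under CCSVM the same would hold if some of those threads ran on MTTOP cores. The xthreads interface itself is not shown because its exact API is not given in the abstract.

        #include <pthread.h>
        #include <stdio.h>

        #define N 1024

        /* In a CCSVM system, this worker could just as well be scheduled on an MTTOP
         * core: it receives an ordinary pointer into the shared virtual address space,
         * and cache coherence keeps the data consistent without explicit copies. */
        static void *scale(void *arg) {
            int *data = arg;
            for (int i = 0; i < N; i++) data[i] *= 2;
            return NULL;
        }

        int main(void) {
            static int data[N];
            for (int i = 0; i < N; i++) data[i] = i;

            pthread_t worker;
            pthread_create(&worker, NULL, scale, data);   /* pass the pointer directly */
            pthread_join(worker, NULL);                   /* no device-to-host copy needed */
            printf("data[10] = %d\n", data[10]);          /* prints 20 */
            return 0;
        }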

    Prefetched Address Translation

    Get PDF
    With explosive growth in dataset sizes and increasing machine memory capacities, per-application memory footprints are commonly reaching into hundreds of GBs. Such huge datasets pressure the TLB, resulting in frequent misses that must be resolved through a page walk - a long-latency pointer chase through multiple levels of the in-memory radix tree-based page table. Anticipating further growth in dataset sizes and their adverse effect on TLB hit rates, this work seeks to accelerate page walks while fully preserving existing virtual memory abstractions and mechanisms - a must for software compatibility and generality. Our idea is to enable direct indexing into a given level of the page table, thus eliding the need to first fetch pointers from the preceding levels. A key contribution of our work is in showing that this can be done by simply ordering the pages containing the page table in physical memory to match the order of the virtual memory pages they map to. Doing so enables direct indexing into the page table using base-plus-offset arithmetic. We introduce Address Translation with Prefetching (ASAP), a new approach for reducing the latency of address translation to a single access to the memory hierarchy. Upon a TLB miss, ASAP launches prefetches to the deeper levels of the page table, bypassing the preceding levels. These prefetches happen concurrently with a conventional page walk, which observes a latency reduction due to prefetching while guaranteeing that only correctly-predicted entries are consumed. ASAP requires minimal extensions to the OS and trivial microarchitectural support. Moreover, ASAP is fully legacy-preserving, requiring no modifications to the existing radix tree-based page table, TLBs, and other software and hardware mechanisms for address translation. Our evaluation on a range of memory-intensive workloads shows that under SMT colocation, ASAP is able to reduce page walk latency by an average of 25% (42% max) in native execution, and 45% (55% max) under virtualization.
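
    To make the base-plus-offset idea concrete, here is a small C sketch of how the address of a deep page-table entry could be computed directly from the virtual page number once the OS keeps each page-table level's pages contiguous and in virtual-address order; the helper names and the assumption of fully contiguous levels are illustrative simplifications, not ASAP's exact mechanism.

        #include <stdint.h>
        #include <stdio.h>

        #define PTE_SIZE      8u   /* bytes per entry in an x86-64-style radix table */
        #define ENTRIES_SHIFT 9    /* 512 entries per page-table page */

        /* Physical address of the leaf (PT-level) entry for `vpn`, assuming the PT
         * pages sit contiguously in physical memory in virtual-page order. This is
         * the address a prefetch can target as soon as the TLB miss is detected. */
        static inline uint64_t leaf_pte_addr(uint64_t pt_level_base, uint64_t vpn) {
            return pt_level_base + vpn * PTE_SIZE;
        }

        /* Same idea one level up: each PD entry covers 512 virtual pages. */
        static inline uint64_t pd_entry_addr(uint64_t pd_level_base, uint64_t vpn) {
            return pd_level_base + (vpn >> ENTRIES_SHIFT) * PTE_SIZE;
        }

        int main(void) {
            uint64_t vpn = 0x12345;          /* virtual page number of the faulting access */
            printf("prefetch PTE at %#llx, PDE at %#llx\n",
                   (unsigned long long)leaf_pte_addr(0x40000000ULL, vpn),
                   (unsigned long long)pd_entry_addr(0x30000000ULL, vpn));
            return 0;
        }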