The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework
Computers continue to diversify with respect to system designs, emerging
memory technologies, and application memory demands. Unfortunately, continually
adapting the conventional virtual memory framework to each possible system
configuration is challenging, and often results in performance loss or requires
non-trivial workarounds. To address these challenges, we propose a new virtual
memory framework, the Virtual Block Interface (VBI). We design VBI based on the
key idea that delegating memory management duties to hardware can reduce the
overheads and software complexity associated with virtual memory. VBI
introduces a set of variable-sized virtual blocks (VBs) to applications. Each
VB is a contiguous region of the globally-visible VBI address space, and an
application can allocate each semantically meaningful unit of information
(e.g., a data structure) in a separate VB. VBI decouples access protection from
memory allocation and address translation. While the OS controls which programs
have access to which VBs, dedicated hardware in the memory controller manages
the physical memory allocation and address translation of the VBs. This
approach enables several architectural optimizations to (1) efficiently and
flexibly cater to different and increasingly diverse system configurations, and
(2) eliminate key inefficiencies of conventional virtual memory. We demonstrate
the benefits of VBI with two important use cases: (1) reducing the overheads of
address translation (for both native execution and virtual machine
environments), as VBI reduces the number of translation requests and associated
memory accesses; and (2) two heterogeneous main memory architectures, where VBI
increases the effectiveness of managing fast memory regions. For both cases,
VBI significantly improves performance over conventional virtual memory.
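The abstract's central idea is that the OS keeps only protection duties while hardware in the memory controller handles allocation and translation per virtual block. A minimal sketch of that split, in toy Python (all class and method names here are hypothetical illustrations, not the paper's interface):

```python
# Toy model of VBI-style decoupling: the OS decides *who* may access a
# virtual block (VB); memory-controller hardware decides *where* it lives.

class VirtualBlock:
    """A contiguous region of the globally-visible VBI address space."""
    def __init__(self, vb_id, size):
        self.vb_id = vb_id
        self.size = size

class OS:
    """The OS tracks only access protection, not placement."""
    def __init__(self):
        self.acl = {}                       # (program, vb_id) -> allowed
    def grant(self, program, vb):
        self.acl[(program, vb.vb_id)] = True
    def may_access(self, program, vb):
        return self.acl.get((program, vb.vb_id), False)

class MemoryController:
    """Hardware manages physical allocation and translation of VBs."""
    def __init__(self):
        self.next_frame = 0
        self.mapping = {}                   # (vb_id, page) -> frame
    def translate(self, vb, offset, page_size=4096):
        page = offset // page_size
        key = (vb.vb_id, page)
        if key not in self.mapping:         # allocate lazily on first touch
            self.mapping[key] = self.next_frame
            self.next_frame += 1
        return self.mapping[key] * page_size + offset % page_size

os_, mc = OS(), MemoryController()
vb = VirtualBlock(vb_id=1, size=1 << 20)    # e.g. one VB per data structure
os_.grant("app", vb)
pa = mc.translate(vb, 8192 + 16)            # hardware-side translation
```

The point of the sketch is the separation of concerns: `OS.may_access` never consults the mapping, and `MemoryController.translate` never consults the ACL.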
Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems faces virtual memory as the
first big showstopper: without efficient hardware support for address
translation, MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that limiting the associativity of the virtual-to-physical
mapping incurs no penalty and, when combined with careful data placement in
the MPU's memory, can break the translate-then-fetch serialization, allowing
translation and data fetch to proceed independently and in parallel. We
propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages, respectively.
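The enabling observation is that with a set-associative virtual-to-physical mapping, the candidate set of frames is a pure function of the virtual page number, so the data fetch can start before translation completes. A hedged toy model of that idea (parameters and names are illustrative, not the paper's):

```python
# Set-associative virtual-to-physical mapping: a page may live only in one
# of WAYS frames of a set determined by its virtual page number (VPN), so
# all candidate frames can be fetched in parallel with way selection.

NUM_SETS, WAYS = 4, 2

def set_index(vpn):
    # Known from the VPN alone, with no table lookup.
    return vpn % NUM_SETS

# Per-set inverted table (kept near its memory partition, DIPTA-style):
# inverted[set][way] holds the VPN resident in that frame, or None.
inverted = [[None] * WAYS for _ in range(NUM_SETS)]

def place(vpn):
    s = set_index(vpn)
    for w in range(WAYS):
        if inverted[s][w] is None:
            inverted[s][w] = vpn
            return s * WAYS + w            # frame number
    raise MemoryError("set full: would trigger replacement")

def translate(vpn):
    s = set_index(vpn)                     # data fetch to all WAYS frames of
    for w in range(WAYS):                  # set s could already be in flight;
        if inverted[s][w] == vpn:          # translation only picks the way
            return s * WAYS + w
    raise KeyError("page fault")

frame = place(vpn=9)                       # set 9 % 4 = 1, way 0 -> frame 2
```

With full associativity, by contrast, the frame could be anywhere, so the fetch cannot begin until the translation finishes; that serialization is what the limited-associativity design removes.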
Performance Analysis and Evaluation of Caching Techniques to Accelerate Page-Table Lookups
The address translation carried out by the processor's MMU requires multiple memory accesses to retrieve the entry at each level of the tree in which the page table is structured. TLBs (Translation Lookaside Buffers) avoid these accesses by storing the translations of the most recently translated pages. However, whenever the TLB misses, the complete translation process must be carried out, accessing the in-memory page table multiple times. Translation was initially supported in software by the OS (an exception handler run on each TLB miss), which greatly slowed the resolution of TLB misses. Fortunately, current processors incorporate hardware support for translation through an automaton called the Page Table Walker, which accesses each translation level of the page table in memory. However, each of these main-memory accesses incurs a high latency compared to accesses to the cache levels, which are becoming ever faster in relative terms. An effective way to accelerate translation is to introduce a cache within (or near) the MMU to store the partial translations corresponding to each level of the page table. Current proposals along these lines demonstrate the advantages of using a Page Table Walk Cache (PTWC) in each core. This work, in addition to analyzing those advantages, analyzes those derived from a PTWC configuration shared among the cores that make up the CMP. The analysis aims to show the advantages of a shared PTWC versus private PTWCs.
Ramis Fuambuena, A. (2014). Análisis y evaluación de prestaciones del empleo de técnicas de caché para acelerar el proceso de búsqueda en tablas de páginas. http://hdl.handle.net/10251/56684
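The mechanism described, caching partial translations for the upper levels of the page table so a walk can skip them, can be sketched as a toy simulation (level widths and names are generic x86-64-style assumptions, not taken from the thesis):

```python
# Toy 4-level radix page walk with a page table walk cache (PTWC) that
# stores partial translations: a hit on an upper-level prefix lets the
# walker resume at that level, skipping the earlier memory accesses.

LEVELS = 4
ptwc = {}            # (level, index_prefix) -> cached partial translation
mem_accesses = 0     # counts main-memory accesses performed by walks

def walk(vpn):
    """Walk the radix tree, starting from the deepest PTWC-cached prefix."""
    global mem_accesses
    # Split the VPN into one 9-bit index per level (root first).
    indices = [(vpn >> (9 * (LEVELS - 1 - l))) & 0x1FF for l in range(LEVELS)]
    start = 0
    for l in range(LEVELS - 1, 0, -1):     # deepest cached prefix wins
        if (l, tuple(indices[:l])) in ptwc:
            start = l
            break
    for l in range(start, LEVELS):         # remaining levels cost one access each
        mem_accesses += 1
        if l < LEVELS - 1:                 # cache the partial translation
            ptwc[(l + 1, tuple(indices[:l + 1]))] = True
    return indices[-1]                     # toy "frame": last-level index

walk(0x12345678)                           # cold walk: all 4 levels accessed
first = mem_accesses
walk(0x12345000)                           # shares upper-level prefixes: 2 accesses
```

Making `ptwc` a single dictionary shared by all simulated cores, instead of one per core, is exactly the shared-versus-private design point the work evaluates: nearby pages touched by different cores reuse each other's partial translations.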
Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips
The trend in industry is towards heterogeneous multicore processors (HMCs),
including chips with CPUs and massively-threaded throughput-oriented processors
(MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the
cores with cache-coherent shared virtual memory (CCSVM), this is not the
communication paradigm used by any current HMC. In this paper, we present a
CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads
programming model, called xthreads, for programming this HMC. Our goal is to
evaluate the potential performance benefits of tightly coupling heterogeneous
cores with CCSVM.
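The abstract does not show the xthreads API, but the communication paradigm it argues for can be illustrated with ordinary threads sharing one address space (a deliberately simplified stand-in; `mttop_kernel` and the thread roles are hypothetical):

```python
# CCSVM-style communication: "CPU" and "MTTOP" threads see the same memory
# and pass references, with no explicit host<->device copies.
import threading

data = list(range(8))              # one shared address space

def mttop_kernel(buf, lo, hi):
    for i in range(lo, hi):        # accelerator-side thread updates memory
        buf[i] *= 2                # in place, pthreads/xthreads style

threads = [threading.Thread(target=mttop_kernel, args=(data, i * 4, i * 4 + 4))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```

In a non-CCSVM heterogeneous system, `data` would instead be copied into a device buffer, processed, and copied back; eliminating those copies and keeping one coherent view is the tight coupling the paper sets out to evaluate.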
Prefetched Address Translation
With explosive growth in dataset sizes and increasing machine memory capacities, per-application memory footprints commonly reach into hundreds of GBs. Such huge datasets pressure the TLB, resulting in frequent misses that must be resolved through a page walk - a long-latency pointer chase through multiple levels of the in-memory radix-tree-based page table.
Anticipating further growth in dataset sizes and their adverse effect on TLB hit rates, this work seeks to accelerate page walks while fully preserving existing virtual memory abstractions and mechanisms - a must for software compatibility and generality. Our idea is to enable direct indexing into a given level of the page table, eliding the need to first fetch pointers from the preceding levels. A key contribution of our work is showing that this can be done by simply ordering the pages containing the page table in physical memory to match the order of the virtual memory pages they map to. Doing so enables direct indexing into the page table using base-plus-offset arithmetic.
We introduce Address Translation with Prefetching (ASAP), a new approach that reduces the latency of address translation to a single access to the memory hierarchy. Upon a TLB miss, ASAP launches prefetches to the deeper levels of the page table, bypassing the preceding levels. These prefetches happen concurrently with a conventional page walk, which observes a latency reduction due to prefetching while guaranteeing that only correctly-predicted entries are consumed. ASAP requires minimal extensions to the OS and trivial microarchitectural support. Moreover, ASAP is fully legacy-preserving, requiring no modifications to the existing radix-tree-based page table, TLBs, or other software and hardware mechanisms for address translation.
Our evaluation on a range of memory-intensive workloads shows that under SMT colocation, ASAP reduces page walk latency by an average of 25% (42% max) in native execution, and 45% (55% max) under virtualization.
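The base-plus-offset idea above reduces to simple arithmetic once the leaf level of the page table is laid out in virtual-page order. A minimal sketch under that stated assumption (`leaf_base` and `PTE_SIZE` are illustrative values, not from the paper):

```python
# If the pages holding the leaf level of the page table are placed in
# physical memory in the same order as the virtual pages they map, the
# address of any leaf PTE is computable directly from the VPN - no
# pointer chasing through the upper levels is needed to find it.

PTE_SIZE = 8                    # bytes per page-table entry (x86-64-like)
leaf_base = 0x100000            # physical base of the ordered leaf level

def leaf_pte_addr(vpn):
    # Known from the VPN alone, so a prefetch to this address can be
    # issued on a TLB miss, in parallel with the conventional radix walk
    # that later verifies the prediction.
    return leaf_base + vpn * PTE_SIZE
```

A conventional four-level walk must resolve three upper-level pointers before reaching this same address; issuing the prefetch immediately is what collapses the observed translation latency toward a single memory access.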