
    Near-Memory Address Translation

    Virtual memory (VM) is a crucial abstraction in modern computer systems at any scale, from handheld devices to datacenters. VM provides programmers with the illusion of an always sufficiently large, linear memory, making programming easier. Although the core components of VM have remained largely unchanged since early VM designs, the design constraints and usage patterns of VM have radically shifted from when it was invented. Today, computer systems integrate hundreds of gigabytes to a few terabytes of memory, while tightly integrated heterogeneous computing platforms (e.g., CPUs, GPUs, FPGAs) are becoming increasingly ubiquitous. As there is a clear trend towards extending the CPU's VM to all computing elements in the system for an efficient and easy-to-use programming model, the continuous demand for faster memory accesses calls for fast translations to terabytes of memory for any computing element in the system. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of today's translation caches, such as TLBs. In this thesis, we provide fundamental insights into why address translation sits on the critical path of accessing memory. We observe that the traditional fully associative flexibility to map any virtual page to any page frame precludes accessing memory before translating. We study the associativity of VM across a variety of scenarios by classifying page faults using the 3C model developed for caches. Our study demonstrates that the full associativity of VM is unnecessary, and only modest associativity is required: capacity and compulsory misses, which are unaffected by associativity, dominate, while conflict misses rapidly disappear as the associativity of VM increases. Building on these modest associativity requirements, we propose a distributed memory management unit close to where the data resides to reduce or eliminate the TLB miss penalty.
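
    The associativity study lends itself to a small illustration. Below is a minimal sketch in Python, not the thesis's simulator, of classifying page faults with the 3C model under a set-associative virtual-to-physical mapping: a fault is compulsory on the first touch of a page, capacity if a fully associative memory of the same size would also fault, and conflict otherwise. The modulo set-indexing, the LRU replacement policy, and the example trace are illustrative assumptions.

    ```python
    # 3C classification of page faults under limited VM associativity (sketch).
    from collections import OrderedDict, defaultdict

    def classify_faults(trace, num_frames, ways):
        """Classify page faults as compulsory, capacity, or conflict."""
        num_sets = num_frames // ways
        sets = defaultdict(OrderedDict)   # set index -> LRU-ordered resident pages
        full_lru = OrderedDict()          # fully associative memory of the same size
        seen = set()
        counts = {"compulsory": 0, "capacity": 0, "conflict": 0}

        for page in trace:
            s = page % num_sets

            # Maintain the fully associative reference model in parallel.
            full_hit = page in full_lru
            if full_hit:
                full_lru.move_to_end(page)
            else:
                if len(full_lru) >= num_frames:
                    full_lru.popitem(last=False)
                full_lru[page] = True

            if page in sets[s]:            # hit in the set-associative mapping
                sets[s].move_to_end(page)
                continue

            if page not in seen:
                counts["compulsory"] += 1  # first reference ever
            elif not full_hit:
                counts["capacity"] += 1    # would also miss with full associativity
            else:
                counts["conflict"] += 1    # miss caused only by limited associativity

            seen.add(page)
            if len(sets[s]) >= ways:
                sets[s].popitem(last=False)  # evict the LRU page within the set
            sets[s][page] = True

        return counts

    # Direct-mapped (1-way) placement of 4 page frames over a toy trace.
    print(classify_faults([0, 4, 0, 8, 0, 4, 1, 2, 3], num_frames=4, ways=1))
    ```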

    Hardware Prefetching in Commercial Applications

    Prefetching techniques attempt to bridge the gap between memory access latency and processor cycle time. This gap, known as the “Memory Gap” or “Memory Wall”, spans two orders of magnitude and continues to grow; together with thermal and power constraints, it is the main limitation on increasing the performance of current processors. Recently proposed techniques exploit the fact that sequences of memory references repeat over time, and in the same order. The Achilles' heel of these prefetchers is their inability to predict accesses to addresses that have never been visited before; this opportunity cost is magnified in applications where the vast majority of the program's data set is read only once. Another class of techniques builds on the observation that programs access the address space with common patterns aligned to memory regions, which can be predicted by correlating on the accessing code. One of the most important limitations of these prefetchers is that a first access to a region is needed before prediction of the blocks to be referenced within it can begin, and this cost cannot be amortized over more correctly prefetched blocks than the finite region can hold. Another is the triggering mechanism itself: because it predicts for the entire region at once, a single incorrect prediction forfeits every opportunity to predict correctly within that region. The state of the art in data prefetching temporally correlates the accesses that trigger predictions within regions. This final-year project (PFC) exposes the inefficiencies of the most advanced prefetchers and proposes two techniques to address them: (1) prediction by delta-address sequence, which triggers predictions from the most recent sequence of deltas observed in the memory reference stream, and (2) prediction per access, which issues a prediction every time the processor sends a data request to the memory hierarchy. The results indicate that delta-based correlation can predict recurring access patterns to memory addresses never referenced before, improving on temporal prediction of memory accesses, and that per-access predictions improve on region-based prefetchers by removing the opportunity cost of incorrect predictions within a region.
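
    As a concrete illustration of the two proposed techniques, the sketch below implements a toy delta-sequence correlating prefetcher in Python. It is an assumption-laden simplification rather than the project's actual design: a short history of address deltas indexes a table that returns the next expected delta, and a prediction is issued on every access, so recurring delta patterns can be followed even onto addresses that have never been referenced before.

    ```python
    # Toy delta-sequence correlating prefetcher with per-access prediction (sketch).
    from collections import deque

    class DeltaPrefetcher:
        def __init__(self, history_len=2, degree=4):
            self.history = deque(maxlen=history_len)  # last observed deltas
            self.table = {}                           # delta signature -> next delta
            self.last_addr = None
            self.degree = degree                      # prefetches issued per access

        def access(self, addr):
            prefetches = []
            if self.last_addr is not None:
                delta = addr - self.last_addr
                sig = tuple(self.history)
                if len(sig) == self.history.maxlen:
                    self.table[sig] = delta           # learn: signature -> observed delta
                self.history.append(delta)

                # Predict: walk the table forward from the current delta signature.
                pred_addr = addr
                hist = deque(self.history, maxlen=self.history.maxlen)
                for _ in range(self.degree):
                    nxt = self.table.get(tuple(hist))
                    if nxt is None:
                        break
                    pred_addr += nxt
                    prefetches.append(pred_addr)
                    hist.append(nxt)
            self.last_addr = addr
            return prefetches

    # A strided stream: the deltas repeat, so later accesses are predicted
    # even though the predicted addresses were never referenced before.
    pf = DeltaPrefetcher()
    for a in [100, 104, 108, 112, 116, 120]:
        print(a, pf.access(a))
    ```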

    Unlocking Energy

    Locks are a natural place for improving the energy efficiency of software systems. First, concurrent systems are mainstream, and when their threads synchronize, they typically do so with locks. Second, locks are well-defined abstractions, hence changing the algorithm implementing them can be achieved without modifying the system. Third, some locking strategies consume more power than others, so the choice of strategy can have a real effect. Last but not least, as we show in this paper, improving the energy efficiency of locks goes hand in hand with improving their throughput: it is a win-win situation. We make our case for this throughput/energy-efficiency correlation through a series of observations obtained from an exhaustive analysis of the energy efficiency of locks on two modern processors and six software systems: Memcached, MySQL, SQLite, RocksDB, HamsterDB, and Kyoto Cabinet. We propose simple lock-based techniques that improve the energy efficiency of these systems by 33% on average, driven by higher throughput, and without modifying the systems themselves.
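
    The claim that the choice of locking strategy affects both power and throughput can be made tangible with a small sketch. The lock below is an illustrative assumption, not one of the paper's techniques: a single knob trades busy-spinning, which keeps the waiting thread active (fast hand-off, more power), against sleeping on a condition variable, which saves power at the cost of a wake-up.

    ```python
    # Spin-then-sleep lock with a tunable waiting policy (sketch, assumed design).
    import threading

    class SpinThenSleepLock:
        def __init__(self, spin_iters=1000):
            self._held = False
            self._cond = threading.Condition()   # guards _held and provides sleeping
            self.spin_iters = spin_iters         # 0 = always sleep, large = mostly spin

        def acquire(self):
            # Phase 1: spin briefly, hoping the current holder releases soon.
            for _ in range(self.spin_iters):
                with self._cond:
                    if not self._held:
                        self._held = True
                        return
            # Phase 2: sleep until a release notifies us.
            with self._cond:
                while self._held:
                    self._cond.wait()
                self._held = True

        def release(self):
            with self._cond:
                self._held = False
                self._cond.notify()

    # Usage: four threads incrementing a shared counter under the lock.
    lock = SpinThenSleepLock(spin_iters=100)
    counter = 0

    def work():
        global counter
        for _ in range(10_000):
            lock.acquire()
            counter += 1
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # 40000
    ```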

    Design Guidelines for High-Performance SCM Hierarchies

    With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM's. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask how best to introduce SCM into such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized, high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity. We identify the set of memory hierarchy design parameters that plays a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best-performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements.
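
    The provisioning methodology can be pictured with a first-order model like the sketch below. All latencies, hit rates, capacities, and per-gigabyte costs are placeholder assumptions rather than the paper's numbers; the point is only to show how sweeping one design parameter, here the DRAM cache size, exposes the trade-off between average memory access time and memory subsystem cost.

    ```python
    # First-order performance/cost model for an SCM-mostly memory behind a
    # page-based 3D stacked DRAM cache (sketch with made-up parameters).

    def amat_ns(hit_rate, dram_ns, scm_read_ns, scm_write_ns, write_frac):
        """Average memory access time: DRAM cache hit, plus an SCM access on a miss."""
        scm_ns = (1 - write_frac) * scm_read_ns + write_frac * scm_write_ns
        return hit_rate * dram_ns + (1 - hit_rate) * (dram_ns + scm_ns)

    def memory_cost(dram_cache_gb, scm_gb, dram_cost_per_gb, scm_cost_per_gb):
        """Total memory subsystem cost in arbitrary currency units."""
        return dram_cache_gb * dram_cost_per_gb + scm_gb * scm_cost_per_gb

    # Sweep DRAM cache sizes; assume a larger page-based cache captures more
    # of the workload's spatial locality (hit rates here are invented).
    for cache_gb, hit_rate in [(4, 0.80), (8, 0.90), (16, 0.95)]:
        perf = amat_ns(hit_rate, dram_ns=50, scm_read_ns=300,
                       scm_write_ns=1000, write_frac=0.2)
        cost = memory_cost(cache_gb, scm_gb=1024,
                           dram_cost_per_gb=8.0, scm_cost_per_gb=1.5)
        print(f"{cache_gb:>3} GB cache: AMAT {perf:.0f} ns, cost {cost:.0f} units")
    ```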