3 research outputs found

    Near-Memory Address Translation

    Memory and logic integration on the same chip is becoming increasingly cost-effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation, MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations, as contemporary memories exceed the reach of TLBs and make expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that limiting the associativity of the virtual-to-physical mapping incurs no penalty and, when combined with careful data placement in the MPU's memory, breaks the translate-then-fetch serialization, allowing translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation with 4KB and 1GB pages, respectively.
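    To make the mechanism concrete, the host-side sketch below models an inverted page table with limited associativity, in the spirit of DIPTA. The key property is that the candidate frames for a virtual page are a pure function of the virtual page number, so a memory system could fetch all candidate locations in parallel with the translation check instead of serializing translate-then-fetch. The sizes, names, and frame-numbering scheme here are illustrative assumptions, not the paper's actual hardware organization.

```
// Minimal software model of set-associative (limited-associativity)
// address translation; illustrative only, not DIPTA's hardware design.
#include <stdint.h>
#include <stdio.h>

#define ASSOCIATIVITY 4         /* ways per set: "modest" associativity */
#define NUM_SETS      (1 << 16) /* total frames = NUM_SETS * ASSOCIATIVITY */

typedef struct {
    uint64_t vpn;   /* virtual page number that owns this frame */
    int      valid;
} ipt_entry_t;

static ipt_entry_t ipt[NUM_SETS][ASSOCIATIVITY];

/* The candidate frames for a page depend only on its VPN, so hardware
 * could fetch all ASSOCIATIVITY candidate locations in parallel with
 * the translation check, breaking translate-then-fetch serialization. */
static inline uint64_t set_index(uint64_t vpn) { return vpn % NUM_SETS; }

/* Returns the frame number, or -1 if the page is unmapped (page fault). */
int64_t translate(uint64_t vpn) {
    uint64_t set = set_index(vpn);
    for (int way = 0; way < ASSOCIATIVITY; way++)
        if (ipt[set][way].valid && ipt[set][way].vpn == vpn)
            return (int64_t)(set * ASSOCIATIVITY + way);
    return -1;
}

/* Simplistic insert: take the first free way; a real OS would pick a
 * victim among the ASSOCIATIVITY candidates when the set is full. */
int64_t map_page(uint64_t vpn) {
    uint64_t set = set_index(vpn);
    for (int way = 0; way < ASSOCIATIVITY; way++)
        if (!ipt[set][way].valid) {
            ipt[set][way].vpn   = vpn;
            ipt[set][way].valid = 1;
            return (int64_t)(set * ASSOCIATIVITY + way);
        }
    return -1;
}

int main(void) {
    map_page(0xABCDE);
    printf("frame for vpn 0xABCDE: %lld\n", (long long)translate(0xABCDE));
    printf("frame for vpn 0x12345: %lld\n", (long long)translate(0x12345));
    return 0;
}
```

    In the paper's design, these entries are distributed across memory partitions next to the data they map; the software model above only captures the set-associative lookup itself.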

    Implications of Emerging 3D GPU Architecture on the Scan Primitive

    Analytic database workloads are growing in data size and query complexity. At the same time, computer architects are struggling to sustain the meteoric performance increases enabled by Moore's Law. We explore the impact of two emerging architectural trends that may help continue the Moore's Law performance trend for analytic database workloads: 3D die-stacking and tight accelerator-CPU integration, specifically GPUs. GPUs have evolved from fixed-function units to programmable discrete chips, and are now integrated with CPUs in most manufactured chips. Past efforts to use GPUs for analytic query processing have not had widespread practical impact, but it is time to re-examine and re-optimize database algorithms for massively data-parallel architectures. We argue that high-throughput data-parallel accelerators are likely to play a big role in future systems, as they can be easily exploited by database systems and are becoming ubiquitous. Using the simple scan primitive as an example, we create a starting point for this discussion. We project the performance of both CPUs and GPUs in emerging 3D systems and show that the high-throughput data-parallel architecture of GPUs is more efficient in these future systems. We show that if database designers embrace emerging 3D architectures, there is possibly an order-of-magnitude gain in performance and energy efficiency.
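    As a concrete example of the scan primitive on a massively data-parallel architecture, the following minimal CUDA sketch evaluates a selection predicate over a column with one thread per element, producing a match bitmap. The kernel name, predicate, and sizes are illustrative assumptions, not code from the paper.

```
// Data-parallel column scan: each thread tests one element against a
// predicate (value < threshold) and records the result in a bitmap.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scan_lt(const int *col, int n, int threshold,
                        unsigned char *match) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        match[i] = (col[i] < threshold);  // one element per thread
}

int main() {
    const int n = 1 << 20;
    int *col;
    unsigned char *match;
    cudaMallocManaged(&col, n * sizeof(int));
    cudaMallocManaged(&match, n);
    for (int i = 0; i < n; i++) col[i] = i % 100;  // toy column data

    scan_lt<<<(n + 255) / 256, 256>>>(col, n, 42, match);
    cudaDeviceSynchronize();

    long hits = 0;
    for (int i = 0; i < n; i++) hits += match[i];
    printf("%ld of %d rows match\n", hits, n);

    cudaFree(col);
    cudaFree(match);
    return 0;
}
```

    The scan is embarrassingly parallel and bandwidth-bound, which is why it benefits directly from both the GPU's thread count and the extra memory bandwidth that 3D die-stacking provides.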

    Near-Memory Address Translation

    Virtual memory (VM) is a crucial abstraction in modern computer systems at any scale, from handheld devices to datacenters. VM provides programmers the illusion of an always sufficiently large and linear memory, making programming easier. Although the core components of VM have remained largely unchanged since early VM designs, the design constraints and usage patterns of VM have radically shifted from when it was invented. Today, computer systems integrate hundreds of gigabytes to a few terabytes of memory, while tightly integrated heterogeneous computing platforms (e.g., CPUs, GPUs, FPGAs) are becoming increasingly ubiquitous. As there is a clear trend towards extending the CPU's VM to all computing elements in the system for an efficient and easy-to-use programming model, the continuous demand for faster memory accesses calls for fast translations to terabytes of memory for any computing element. Unfortunately, conventional translation mechanisms fall short of providing fast translations, as contemporary memories exceed the reach of today's translation caches, such as TLBs. In this thesis, we provide fundamental insights into why address translation sits on the critical path of accessing memory. We observe that the traditional fully associative flexibility to map any virtual page to any page frame precludes accessing memory before translating. We study the associativity of VM across a variety of scenarios by classifying page faults using the 3C model developed for caches. Our study demonstrates that the full associativity of VM is unnecessary, and only modest associativity is required. We conclude that capacity and compulsory misses, which are unaffected by associativity, dominate, while conflict misses rapidly disappear as the associativity of VM increases. Building on the modest associativity requirements, we propose a distributed memory management unit close to where the data resides, reducing or eliminating the TLB miss penalty.
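    The 3C classification the thesis applies to page faults can be illustrated with a toy simulator: a reference that misses in a set-associative table but would hit in a fully associative table of the same capacity is a conflict miss, a first touch is compulsory, and everything else is a capacity miss. The sketch below is a minimal host-side illustration of that method; the tiny sizes, LRU policy, and access trace are made-up assumptions, not the thesis's experimental setup.

```
// Toy 3C classifier: compare a set-associative LRU table against a
// fully associative LRU table of equal capacity. Illustrative only.
#include <stdint.h>
#include <stdio.h>

#define FRAMES   4              /* total page frames (tiny, for illustration) */
#define WAYS     2              /* associativity under test */
#define SETS     (FRAMES / WAYS)
#define MAXPAGES 64

static int64_t set_lru[SETS][WAYS]; /* VPNs, MRU first; -1 = empty */
static int64_t full_lru[FRAMES];    /* fully associative, same capacity */
static int     seen[MAXPAGES];      /* first-touch tracking */

/* Returns 1 on hit; on miss, inserts vpn with LRU replacement. */
static int touch(int64_t *stack, int ways, int64_t vpn) {
    int hit = 0, pos = ways - 1;
    for (int i = 0; i < ways; i++)
        if (stack[i] == vpn) { hit = 1; pos = i; break; }
    for (int i = pos; i > 0; i--) stack[i] = stack[i - 1]; /* shift to MRU */
    stack[0] = vpn;
    return hit;
}

const char *classify(int64_t vpn) {
    int full_hit = touch(full_lru, FRAMES, vpn);
    int set_hit  = touch(set_lru[vpn % SETS], WAYS, vpn);
    int first    = !seen[vpn];
    seen[vpn] = 1;
    if (set_hit) return "hit";
    if (first)   return "compulsory";
    /* conflict: full associativity would have hit at this capacity */
    return full_hit ? "conflict" : "capacity";
}

int main(void) {
    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++) set_lru[s][w] = -1;
    for (int i = 0; i < FRAMES; i++) full_lru[i] = -1;

    int64_t trace[] = {0, 2, 0, 4, 0, 2, 6, 0}; /* all map to set 0 */
    for (int i = 0; i < 8; i++)
        printf("page %lld: %s\n", (long long)trace[i], classify(trace[i]));
    return 0;
}
```

    On this deliberately adversarial trace the classifier reports conflict misses that higher associativity would turn into hits, which is exactly the miss class the thesis finds to vanish once VM has even modest associativity.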