Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources
Address translation is a performance bottleneck in data-intensive workloads
due to large datasets and irregular access patterns that lead to frequent
high-latency page table walks (PTWs). PTWs can be reduced by using (i) large
hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both
solutions have significant drawbacks: increased access latency, power and area
(for hardware TLBs), and costly memory accesses, the need for large contiguous
memory blocks, and complex OS modifications (for software-managed TLBs). We
present Victima, a new software-transparent mechanism that drastically
increases the translation reach of the processor by leveraging the
underutilized resources of the cache hierarchy. The key idea of Victima is to
repurpose L2 cache blocks to store clusters of TLB entries, thereby providing
an additional low-latency and high-capacity component that backs up the
last-level TLB and thus reduces PTWs. Victima has two main components. First, a
PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on
the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache
replacement policy prioritizes keeping TLB entries in the cache hierarchy by
considering (i) the translation pressure (e.g., last-level TLB miss rate) and
(ii) the reuse characteristics of the TLB entries. Our evaluation results show
that in native (virtualized) execution environments Victima improves average
end-to-end application performance by 7.4% (28.7%) over the baseline four-level
radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art
software-managed TLB, across 11 diverse data-intensive workloads. Victima (i)
is effective in both native and virtualized environments, (ii) is completely
transparent to application and system software, and (iii) incurs very small
area and power overheads on a modern high-end CPU.
Comment: To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202
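The key idea in the abstract above (a last-level TLB miss first probes L2 cache blocks that hold clusters of TLB entries, and only then falls back to a page table walk) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the class `TranslationPath`, the 4 KiB page size, the cluster size of 8, and the FIFO TLB eviction are assumptions. The sketch also omits the PTW cost predictor and the TLB-aware replacement policy, and simply installs a cluster after every walk.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12    # 4 KiB pages (assumed)
CLUSTER_SIZE = 8   # translations per repurposed cache block (assumed)


@dataclass
class TranslationPath:
    page_table: dict                 # vpn -> pfn; stands in for the radix page table
    tlb_capacity: int = 4            # entries in the last-level TLB (assumed)
    tlb: dict = field(default_factory=dict)          # vpn -> pfn
    l2_clusters: dict = field(default_factory=dict)  # cluster id -> {vpn: pfn}
    walks: int = 0                   # number of page table walks performed

    def translate(self, vaddr: int) -> int:
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        pfn = self.tlb.get(vpn)
        if pfn is None:
            # TLB miss: probe the cache block that may hold this page's cluster.
            cluster = self.l2_clusters.get(vpn // CLUSTER_SIZE, {})
            pfn = cluster[vpn] if vpn in cluster else self._walk(vpn)
            self._fill_tlb(vpn, pfn)
        return (pfn << PAGE_SHIFT) | offset

    def _walk(self, vpn: int) -> int:
        # Costly path: walk the page table, then cache the whole cluster.
        self.walks += 1
        pfn = self.page_table[vpn]   # a KeyError models an unmapped page
        base = vpn - vpn % CLUSTER_SIZE
        self.l2_clusters[vpn // CLUSTER_SIZE] = {
            v: self.page_table[v]
            for v in range(base, base + CLUSTER_SIZE)
            if v in self.page_table
        }
        return pfn

    def _fill_tlb(self, vpn: int, pfn: int) -> None:
        if len(self.tlb) >= self.tlb_capacity:
            self.tlb.pop(next(iter(self.tlb)))   # FIFO eviction (assumed policy)
        self.tlb[vpn] = pfn
```

With a one-entry TLB, translating an address in page 0 and then one in page 1 triggers a single walk in this sketch: the second lookup misses in the TLB but hits in the cluster cached by the first walk.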
A Survey of Techniques for Architecting TLBs
The translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used
in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently
and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance
and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and
managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and
distinctions. We believe that this paper will be useful for chip designers, computer architects and system
engineers
Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings
Conventional virtual memory (VM) frameworks enable a virtual address to
flexibly map to any physical address. This flexibility necessitates large data
structures to store virtual-to-physical mappings, which leads to high address
translation latency and large translation-induced interference in the memory
hierarchy. On the other hand, restricting the address mapping so that a virtual
address can only map to a specific set of physical addresses can significantly
reduce address translation overheads by using compact and efficient translation
structures. However, restricting the address mapping flexibility across the
entire main memory severely limits data sharing across different processes and
increases data accesses to the swap space of the storage device, even in the
presence of free memory. We propose Utopia, a new hybrid virtual-to-physical
address mapping scheme that allows both flexible and restrictive hash-based
address mapping schemes to harmoniously co-exist in the system. The key idea of
Utopia is to manage physical memory using two types of physical memory
segments: restrictive and flexible segments. A restrictive segment uses a
restrictive, hash-based address mapping scheme that maps virtual addresses to
only a specific set of physical addresses and enables faster address
translation using compact translation structures. A flexible segment employs
the conventional fully-flexible address mapping scheme. By mapping data to a
restrictive segment, Utopia enables faster address translation with lower
translation-induced interference. Utopia improves performance by 24% in a
single-core system over the baseline system, whereas the best prior
state-of-the-art contiguity-aware translation scheme improves performance by
13%.
Comment: To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202
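The contrast between restrictive and flexible segments described above can be sketched as follows. This is an illustration under stated assumptions, not Utopia's actual design: the class `HybridMapper`, the modulo function standing in for the hash, the set-associative tag array, and the policy of trying the restrictive segment first are all hypothetical. In the restrictive segment a virtual page may live only in one small set of frames, so translation reduces to a hash plus a few tag comparisons. Pages that do not fit fall back to a conventional lookup table.

```python
class HybridMapper:
    def __init__(self, num_sets: int, ways: int):
        self.num_sets = num_sets
        self.ways = ways
        # Restrictive segment: tags[s][w] holds the virtual page stored in
        # frame s * ways + w, or None if that frame is free.
        self.tags = [[None] * ways for _ in range(num_sets)]
        # Flexible segment: any virtual page may map to any frame.
        self.flexible = {}
        self.next_flexible_pfn = num_sets * ways

    def _set_index(self, vpn: int) -> int:
        return vpn % self.num_sets   # stand-in for a real hash function

    def map_page(self, vpn: int) -> int:
        """Place a page in the restrictive segment if its set has room."""
        s = self._set_index(vpn)
        for w in range(self.ways):
            if self.tags[s][w] is None:
                self.tags[s][w] = vpn
                return s * self.ways + w
        # Set is full: fall back to the flexible segment.
        pfn = self.next_flexible_pfn
        self.next_flexible_pfn += 1
        self.flexible[vpn] = pfn
        return pfn

    def translate(self, vpn: int):
        """Return (pfn, segment kind) for a mapped page."""
        s = self._set_index(vpn)
        for w, tag in enumerate(self.tags[s]):
            if tag == vpn:
                return s * self.ways + w, "restrictive"
        return self.flexible[vpn], "flexible"   # KeyError if unmapped
```

In this sketch, with 4 sets of 2 ways, virtual pages 0, 4 and 8 all hash to set 0. The first two land in the restrictive segment and the third overflows into the flexible one.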
Operating System Kernels on Multi-core Architectures
Operating System (OS) kernels have been under research and development for decades, mainly assuming single processor and distributed hardware systems.
With the recent rise of multi-core chips that may incorporate a network on chip (NoC), new challenges have appeared that were not considered before.
Given that a complete multi-core system that works on a single system on chip (SoC) is now the normal case, different cores on a single SoC may
share other physical resources and data. This new sharing scheme on a SoC affects crucial aspects of an overall system like correctness, performance,
predictability, scalability and security. Both hardware and OSs need to cooperate flexibly in order to provide
solutions for such challenges.
An SoC now resembles the internet in some respects: different cores act as computer nodes, and the network medium is provided by advanced digital fabrics such as buses or NoCs, which are
a current research area. However, OSs still assume some (hardware) features, such as a single physical memory, memory sharing for inter-process communication, page-based protection, and cache operations, even when evolving from uniprocessor to multi-core processors.
Such features may not only degrade performance and other system aspects;
some of them also make no sense for a multi-core SoC and introduce barriers and limitations. While new OS research is considering different kernel designs
to cope with multi-core systems, these designs are still limited by the current commercial hardware architectures.
The objective of this thesis is to assess different kernel designs and implementations on multi-core hardware architectures.
Part of the contribution of the thesis is porting
the RTEMS RTOS and the seL4 microkernel to the Epiphany and RISC-V hardware architectures, respectively, and weighing the trade-offs among the design and implementation decisions. This hands-on experience gave a better understanding of the real-world challenges regarding kernel designs and implementations
Software-Managed Address Translation
In this paper we explore software-managed address translation. The purpose of the study is to specify the memory management design for a high clock-rate PowerPC implementation
in which a simple design is a prerequisite for a fast clock and a short design cycle. We show that software-managed address translation is just as efficient as
hardware-managed address translation, and it is much more flexible. Operating systems such as OSF/1 and Mach charge between 0.10 and 0.28 cycles per instruction (CPI) for address translation using dedicated memory-management
hardware. Software-managed translation requires 0.05 CPI. Mechanisms to support such features as shared memory, superpages, sub-page protection, and sparse
address spaces can be defined completely in software, allowing much more flexibility than in hardware-defined mechanisms
A Secure and Formally Verified Commodity Multiprocessor Hypervisor
Commodity hypervisors are widely deployed to support virtual machines on multiprocessor server hardware. Modern hypervisors are complex and often integrated with an operating system kernel, posing a significant security risk as writing large, multiprocessor systems software is error-prone. Attackers that successfully exploit hypervisor vulnerabilities may gain unfettered access to virtual machine data and compromise the confidentiality and integrity of virtual machine data. Theoretically, formal verification offers a solution to this problem, by proving that the hypervisor implementation contains no vulnerabilities and protects virtual machine data under all circumstances. However, it remains unknown how one might feasibly verify the entire codebase of a complex, multiprocessor commodity system. My thesis is that modest changes to a commodity system can reduce the required proof effort such that it becomes possible to verify the security properties of the entire system.
This dissertation introduces microverification, a new approach for formally verifying the security properties of commodity systems. Microverification reduces the proof effort for a commodity system by retrofitting the system into a small core and a set of untrusted services, thus making it possible to reason about properties of the entire system by verifying the core alone. To verify the multiprocessor hypervisor core, we introduce security-preserving layers to modularize the proof without hiding information leakage so we can prove each layer of the implementation refines its specification, and the top layer specification is refined by all layers of the core implementation. To verify commodity hypervisor features that require dynamically changing information flow, we incorporate data oracles to mask intentional information flow. We can then prove noninterference at the top layer specification and guarantee the resulting security properties hold for the entire hypervisor implementation. Using microverification, we retrofitted the Linux KVM hypervisor with only modest modifications to its codebase. Using Coq, we proved that the hypervisor protects the confidentiality and integrity of VM data, including correctly managing tagged TLBs, shared multi-level page tables, and caches. Our work is the first machine-checked security proof for a commodity multiprocessor hypervisor. Experimental results with real application workloads demonstrate that verified KVM retains KVM’s functionality and performance
In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (TLBs)
The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructions. In doing so, the CPU ends up flushing many instructions that have been brought in to the reorder buffer. In particular, these instructions may have reached a very deep stage in the pipeline—representing
significant work that is wasted. In addition, an overhead of several cycles and wastage of energy (per exception detected) can be
expected in refetching and reexecuting the instructions flushed. This paper concentrates on improving the performance of precisely
handling software managed translation look-aside buffer (TLB) interrupts, one of the most frequently occurring interrupts. The paper presents a novel method of in-lining the interrupt handler within the reorder buffer. Since the first level interrupt-handlers of TLBs are usually small, they could potentially fit in the reorder buffer along with the user-level code already there. In doing so, the instructions that would otherwise be flushed from the pipe need not be refetched and reexecuted. Additionally, it allows for instructions independent of the exceptional instruction to continue to execute in parallel with the handler code. By in-lining the TLB
interrupt handler, this provides lock-up free TLBs. This paper proposes the prepend and append schemes of in-lining the interrupt
handler into the available reorder buffer space. The two schemes are implemented on a performance model of the Alpha 21264
processor built by Alpha designers at the Palo Alto Design Center (PADC), California. We compare the overhead and performance
impact of handling TLB interrupts by the traditional scheme, the append in-lined scheme, and the prepend in-lined scheme. For
small, medium, and large memory footprints, the overhead is quantified by comparing the number and pipeline state of instructions
flushed, the energy savings, and the performance improvements. We find that lock-up free TLBs reduce the overhead of refetching and reexecuting the instructions flushed by 30-95 percent, reduce the execution time by 5-25 percent, and also reduce the energy wasted by 30-90 percent