Using SMT to accelerate nested virtualization
IaaS datacenters offer virtual machines (VMs) to their clients, who in turn sometimes deploy their own virtualized environments, thereby running a VM inside a VM. This is known as nested virtualization. VMs are intrinsically slower than bare-metal execution, as they often trap into their hypervisor to perform tasks like operating virtual I/O devices. Each VM trap requires loading and storing dozens of registers to switch between the VM and hypervisor contexts, thereby incurring costly runtime overheads. Nested virtualization further magnifies these overheads, as every VM trap in a traditional virtualized environment triggers at least twice as many traps. We propose to leverage the replicated thread execution resources in simultaneous multithreaded (SMT) cores to alleviate the overheads of VM traps in nested virtualization. Our proposed architecture introduces a simple mechanism to colocate different VMs and hypervisors on separate hardware threads of a core, and replaces the costly context switches of VM traps with simple thread stall and resume events. More concretely, as each thread in an SMT core has its own register set, trapping between VMs and hypervisors does not involve costly context switches, but simply requires the core to fetch instructions from a different hardware thread. Furthermore, our inter-thread communication mechanism allows a hypervisor to directly access and manipulate the registers of its subordinate VMs, given that they both share the same in-core physical register file. A model of our architecture shows up to 2.3× and 2.6× better I/O latency and bandwidth, respectively. We also show a software-only prototype of the system using existing SMT architectures, with up to 1.3× and 1.5× better I/O latency and bandwidth, respectively, and 1.2--2.2× speedups on various real-world applications
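As a rough illustration of the trap-handling idea (a toy software model of our own, not the paper's design; the hw_thread structure and handle_vm_trap function are hypothetical), the C sketch below contrasts the proposed flow with a conventional world switch: on a trap, the VM's hardware thread is merely marked stalled, the hypervisor thread reads and writes the VM's registers in place, and the VM thread is resumed, with no save and restore of dozens of registers.

/* Toy model of trap handling across SMT threads: no context switch, the
 * "hypervisor" thread manipulates the trapping VM's registers directly. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_GPRS 32

struct hw_thread {
    uint64_t gpr[NUM_GPRS];  /* architectural registers, backed by the shared physical register file */
    bool     stalled;        /* fetch stalled while the sibling thread services the trap */
};

/* Hypervisor-side trap handler: direct register access instead of save/restore. */
static void handle_vm_trap(struct hw_thread *vm, struct hw_thread *hyp)
{
    vm->stalled = true;                  /* stall VM fetch instead of switching context */
    uint64_t hypercall_nr = vm->gpr[0];  /* read the trapping VM's registers in place   */
    hyp->gpr[0] = hypercall_nr;
    vm->gpr[1] = 0;                      /* write the result straight back to the VM    */
    vm->stalled = false;                 /* resume VM fetch */
}

int main(void)
{
    struct hw_thread vm = { .gpr = { [0] = 42 } };
    struct hw_thread hyp = { 0 };
    handle_vm_trap(&vm, &hyp);
    printf("hypercall %llu handled; VM result register = %llu\n",
           (unsigned long long)hyp.gpr[0], (unsigned long long)vm.gpr[1]);
    return 0;
}

In a real SMT core the stall and resume would be fetch-steering events and the register accesses would go through the shared in-core register file; the model only mirrors the control flow.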
Improving address translation performance in virtualized multi-tenant systems
With the explosive growth in dataset sizes, application memory footprints are commonly reaching hundreds of GBs. Such huge datasets pressure the TLBs, resulting
in frequent misses that must be resolved through a page walk – a long-latency pointer
chase through multiple levels of the in-memory radix-tree-based page table. Page walk
latency is particularly high under virtualization, where address translation mandates traversing two radix-tree page tables in a process called a nested page walk, performing up to 24 memory accesses (with four-level guest and host page tables, each of the four guest page-table levels requires a four-level host walk plus one access to the guest entry, and the final guest physical address requires one more host walk: 4 × 5 + 4 = 24). Page walk latency can also be amplified by the colocation of applications on the same server, a practice used to increase utilization. Under colocation, cache contention makes cache misses during a nested page walk more frequent, piling up page walk latency. Both virtualization and
colocation are widely adopted in cloud platforms, such as Amazon Web Services and Google Compute Engine. As a result, in cloud environments, page walk latency can reach hundreds of cycles, significantly reducing overall application performance.
This thesis addresses the problem of high page walk latency by (1) identifying the sources of high page walk latency under virtualization and/or colocation, and (2) proposing hardware and software techniques that accelerate page walks by means
of new memory allocation strategies for the page table and data which can be easily
adopted by existing systems.
Firstly, we quantify how dataset size growth, virtualization, and colocation affect page walk latency. We also study how high page walk latency affects performance. Due to the lack of dedicated tools for evaluating address translation overhead
on modern processors, we design a methodology to vary the page walk latency experienced by an application running on real hardware. To quantify the performance impact
of address translation, we measure the application’s execution time while varying the
page walk latency. We find that under virtualization, address translation considerably
limits performance: an application can waste up to 68% of execution time due to stalls
originating from page walks. In addition, we investigate which accesses from a nested
page walk are most significant for the overall page walk latency by examining from
where in the memory hierarchy these accesses are served. We find that accesses to the
deeper levels of the page table radix tree are responsible for most of the overall page
walk latency.
Based on these observations, we introduce two address translation acceleration
techniques that can be applied to any ISA that employs radix-tree page tables and
nested page walks. The first of these techniques is Prefetched Address Translation
(ASAP), a new software-hardware approach for mitigating the high page walk latency
caused by virtualization and/or application colocation. At the heart of ASAP is a
lightweight technique for directly indexing individual levels of the page table radix
tree. Direct indexing enables ASAP to fetch nodes from deeper levels of the page
table without first accessing the preceding levels, thus lowering the page walk latency.
ASAP is fully compatible with the existing radix-tree-based page table and requires
only incremental and isolated changes to the memory subsystem.
The second technique is PTEMagnet, a new software-only approach for reducing
address translation latency under virtualization and application colocation. Initially,
we identify a new address translation bottleneck caused by memory fragmentation
stemming from the interaction of virtualization, application colocation, and the Linux
memory allocator. The fragmentation results in the effective cache footprint of the
host page table being larger than that of the guest page table. The bloated footprint
of the host page table leads to frequent cache misses during nested page walks, increasing page walk latency. In response to these observations, we propose PTEMagnet. PTEMagnet prevents memory fragmentation through fine-grained reservation-based
memory allocation in the guest OS. PTEMagnet is fully legacy-preserving, requiring
no modifications to either user code or mechanisms for address translation and virtualization.
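The reservation idea can be pictured with a small toy allocator (our illustration, not PTEMagnet's guest-kernel code; alloc_page, RESERVATION_PAGES, and the plain frame counter standing in for the buddy allocator are all assumptions): even when page faults from colocated processes interleave, each process keeps being served from its own reserved block, so its pages, and the page-table entries that map them, stay clustered instead of fragmenting.

/* Toy reservation-based page allocator: frames of one process stay adjacent
 * even when allocations from different processes interleave. */
#include <stdint.h>
#include <stdio.h>

#define RESERVATION_PAGES 8   /* assumed reservation granularity */

struct reservation {
    uint64_t base_pfn;  /* first frame of the currently reserved block */
    unsigned used;      /* frames already handed out from this block   */
};

static uint64_t next_free_pfn = 0x1000;  /* toy stand-in for the real buddy allocator */

static uint64_t alloc_page(struct reservation *r)
{
    if (r->used == 0 || r->used == RESERVATION_PAGES) {
        r->base_pfn = next_free_pfn;       /* grab a fresh contiguous block */
        next_free_pfn += RESERVATION_PAGES;
        r->used = 0;
    }
    return r->base_pfn + r->used++;        /* serve the fault from the reservation */
}

int main(void)
{
    struct reservation proc_a = {0}, proc_b = {0};
    for (int i = 0; i < 3; i++)  /* interleaved faults from two colocated processes */
        printf("A -> pfn %#llx, B -> pfn %#llx\n",
               (unsigned long long)alloc_page(&proc_a),
               (unsigned long long)alloc_page(&proc_b));
    return 0;
}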
In summary, this thesis proposes non-disruptive upgrades to the virtual memory
subsystem for reducing page walk latency in virtualized deployments. In doing so,
this thesis evaluates the impact of page walk latency on application performance, identifies the bottlenecks of the existing address translation mechanism caused by virtualization, application colocation, and the Linux memory allocator, and proposes software-hardware and software-only solutions for eliminating those bottlenecks.
Prefetched Address Translation
With explosive growth in dataset sizes and increasing machine memory capacities, per-application memory footprints are commonly reaching into hundreds of GBs. Such huge datasets pressure the TLB, resulting in frequent misses that must be resolved through a page walk - a long-latency pointer chase through multiple levels of the in-memory radix tree-based page table. Anticipating further growth in dataset sizes and their adverse effect on TLB hit rates, this work seeks to accelerate page walks while fully preserving existing virtual memory abstractions and mechanisms - a must for software compatibility and generality. Our idea is to enable direct indexing into a given level of the page table, thus eliding the need to first fetch pointers from the preceding levels. A key contribution of our work is in showing that this can be done by simply ordering the pages containing the page table in physical memory to match the order of the virtual memory pages they map to. Doing so enables direct indexing into the page table using base-plus-offset arithmetic. We introduce Address Translation with Prefetching (ASAP), a new approach for reducing the latency of address translation to a single access to the memory hierarchy. Upon a TLB miss, ASAP launches prefetches to the deeper levels of the page table, bypassing the preceding levels. These prefetches happen concurrently with a conventional page walk, which observes a latency reduction due to prefetching while guaranteeing that only correctly-predicted entries are consumed. ASAP requires minimal extensions to the OS and trivial microarchitectural support. Moreover, ASAP is fully legacy-preserving, requiring no modifications to the existing radix tree-based page table, TLBs, and other software and hardware mechanisms for address translation. Our evaluation on a range of memory-intensive workloads shows that under SMT colocation, ASAP is able to reduce page walk latency by an average of 25% (42% max) in native execution, and 45% (55% max) under virtualization.
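The base-plus-offset arithmetic can be made concrete with a short sketch (an illustration under the assumed contiguous layout, not the paper's hardware or OS code; region_va, l1_base, and l2_base are hypothetical values that the OS would publish for a mapped region): once a level's page-table pages are contiguous and ordered by the virtual pages they map, the address of the entry covering a virtual address is a single multiply-and-add, so entries at several levels can be prefetched in parallel.

/* Direct indexing into a page-table level, assuming that level's pages are
 * laid out contiguously in physical memory in virtual-address order. */
#include <stdint.h>
#include <stdio.h>

#define PTE_SIZE 8ull   /* bytes per page-table entry on x86-64 */

/* Address of the entry covering va at a given level (1 = leaf, 2 = next level up, ...),
 * for a mapping whose page-table pages for that level start at level_base and whose
 * virtual region starts at region_va. */
static uint64_t direct_pte_addr(uint64_t level_base, uint64_t region_va,
                                uint64_t va, unsigned level)
{
    uint64_t index = (va - region_va) >> (12 + 9 * (level - 1)); /* 9 index bits per level, 4 KiB pages */
    return level_base + index * PTE_SIZE;                        /* base plus offset, no pointer chase  */
}

int main(void)
{
    uint64_t region_va = 0x7f3a00000000ull;                 /* hypothetical region start   */
    uint64_t l1_base = 0x100000ull, l2_base = 0x200000ull;  /* hypothetical level bases    */
    uint64_t va = 0x7f3a5c123000ull;

    printf("prefetch leaf PTE at %#llx and next-level entry at %#llx in parallel\n",
           (unsigned long long)direct_pte_addr(l1_base, region_va, va, 1),
           (unsigned long long)direct_pte_addr(l2_base, region_va, va, 2));
    return 0;
}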
Verification of the Tree-Based Hierarchical Read-Copy Update in the Linux Kernel
Read-Copy Update (RCU) is a scalable, high-performance Linux-kernel
synchronization mechanism that runs low-overhead readers concurrently with
updaters. Production-quality RCU implementations for multi-core systems are
decidedly non-trivial. Given the ubiquity of Linux, a rare "million-year" bug
can occur several times per day across the installed base. Stringent validation
of RCU's complex behaviors is thus critically important. Exhaustive testing is
infeasible due to the exponential number of possible executions, which suggests
use of formal verification.
Previous verification efforts on RCU either focus on simple implementations
or use modeling languages, the latter requiring error-prone manual translation
that must be repeated frequently due to regular changes in the Linux kernel's
RCU implementation. In this paper, we first describe the implementation of Tree
RCU in the Linux kernel. We then discuss how to construct a model directly from
Tree RCU's source code in C, and use the CBMC model checker to verify its
safety and liveness properties. To the best of our knowledge, this is the first
verification of a significant part of RCU's source code, and is an important
step towards integration of formal verification into the Linux kernel's
regression test suite. This is a long version of a conference paper published in the 2018 Design, Automation and Test in Europe Conference (DATE).
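For readers unfamiliar with RCU's programming model, the sketch below shows the classic reader/updater pattern written against the standard kernel API (a generic usage illustration with a made-up struct config, not code from the Tree RCU implementation verified in the paper; initialization, update-side locking, and error handling are omitted).

/* Classic RCU usage: lock-free readers, updater publishes a new version and
 * waits for a grace period before reclaiming the old one. */
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config { int threshold; };
static struct config __rcu *cur_config;

int reader_get_threshold(void)
{
    int t;
    rcu_read_lock();                                 /* low-overhead read-side critical section */
    t = rcu_dereference(cur_config)->threshold;      /* safe access to the current version      */
    rcu_read_unlock();
    return t;
}

void updater_set_threshold(int new_threshold)
{
    struct config *newc = kmalloc(sizeof(*newc), GFP_KERNEL);  /* error handling omitted */
    struct config *oldc;

    newc->threshold = new_threshold;
    oldc = rcu_dereference_protected(cur_config, 1); /* caller assumed to hold the update-side lock */
    rcu_assign_pointer(cur_config, newc);            /* publish the new version                     */
    synchronize_rcu();                               /* wait for all pre-existing readers to finish */
    kfree(oldc);                                     /* now safe to reclaim the old version         */
}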
Parallel architectures and runtime systems co-design for task-based programming models
The increasing parallelism of modern computing systems has underscored the need for a holistic vision when designing multiprocessor architectures, one that takes into account the needs of the programming models and applications. Nowadays, system design consists of several layers stacked on top of each other, from the architecture up to the application software. Although this layering provides a separation of concerns, allowing layers to be changed independently thanks to well-defined interfaces between them, it hampers the design of future systems as Moore's Law comes to an end. Current performance improvements in computer architecture are driven by the shrinkage of the transistor channel width, which allows faster and more power-efficient chips to be made. However, technology is reaching physical limits beyond which the transistor size cannot be reduced further, requiring a change of paradigm in systems design.
This thesis proposes to break this layered design and advocates for a system where the architecture and the programming model's runtime system exchange information towards a common goal: improving performance and reducing power consumption. By making the architecture aware of runtime information such as the Task Dependency Graph (TDG) of a dataflow task-based programming model, it is possible to reduce power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support for building such a graph, reducing runtime overheads and enabling the execution of fine-grained tasks that increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system to allow more efficient communication scheduling, creating new opportunities for computation-communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided, along with a methodology to simulate and characterize application behavior.
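As a flavour of how a runtime could exploit such information, the toy sketch below (our example, not the thesis' runtime; the task costs and dependency matrix are invented) computes each task's cost-weighted critical-path length over a small TDG. Tasks whose critical-path length is well below the graph's maximum are natural candidates for slower, more power-efficient cores or lower DVFS states.

/* Critical-path lengths over a small task dependency graph. */
#include <stdio.h>

#define NTASKS 5

static const int cost[NTASKS] = { 4, 2, 3, 1, 5 };   /* per-task execution cost (made up) */
/* dep[i][j] != 0 means task j depends on task i (i must finish first). */
static const int dep[NTASKS][NTASKS] = {
    {0,1,1,0,0}, {0,0,0,1,0}, {0,0,0,1,0}, {0,0,0,0,1}, {0,0,0,0,0},
};

/* Longest cost-weighted path from task t to any sink (memoised recursion). */
static int critical_path(int t, int memo[])
{
    if (memo[t] >= 0)
        return memo[t];
    int best = 0;
    for (int j = 0; j < NTASKS; j++)
        if (dep[t][j]) {
            int len = critical_path(j, memo);
            if (len > best) best = len;
        }
    return memo[t] = cost[t] + best;
}

int main(void)
{
    int memo[NTASKS] = { -1, -1, -1, -1, -1 };
    for (int t = 0; t < NTASKS; t++)
        printf("task %d: critical-path length %d\n", t, critical_path(t, memo));
    return 0;
}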
ClickINC: In-network Computing as a Service in Heterogeneous Programmable Data-center Networks
In-Network Computing (INC) has found many applications for performance boosts
or cost reduction. However, given heterogeneous devices, diverse applications,
and multi-path network topologies, it is cumbersome and error-prone for
application developers to effectively utilize the available network resources
and gain predictable benefits without impeding normal network functions.
Previous work is oriented to network operators more than application developers. We develop ClickINC to streamline INC programming and deployment using a unified and automated workflow. ClickINC provides INC developers with modular programming abstractions, without requiring them to be concerned with device state or network topology. We describe the ClickINC framework,
model, language, workflow, and corresponding algorithms. Experiments on both an
emulator and a prototype system demonstrate its feasibility and benefits.
The Design, Implementation, and Evaluation of Software and Architectural Support for Nested Virtualization on Modern Architectures
Nested virtualization, the discipline of running virtual machines inside other virtual machines, is increasingly important because of the need to deploy workloads that are already using virtualization on top of virtualized cloud infrastructures. However, nested virtualization performance on modern computer architectures is far from native execution speed, which remains a key impediment to further adoption. My thesis is that simple changes to hardware, software, and virtual machine configuration that are transparent to nested virtual machines can provide near-native execution speed for real application workloads. This dissertation presents three mechanisms that improve nested virtualization performance.
First, we present NEsted Virtualization Extensions for Arm (NEVE). As Arm servers make inroads in cloud infrastructure deployments, supporting nested virtualization on Arm is a key requirement. The requirement has recently been met with the introduction of nested virtualization support for the Arm architecture. We built the first hypervisor using Arm nested virtualization support and show that, despite similarities between Arm and x86 nested virtualization support, performance on Arm is much worse than on x86. This is due to excessive traps to the hypervisor caused by differences in non-nested virtualization support. To address this problem, we introduce a novel paravirtualization technique to rapidly prototype architectural changes for virtualization and evaluate their performance impact using existing hardware. Using this technique, we introduce NEVE, a set of simple architectural changes to Arm that can be used by software to coalesce and defer traps by logging the results of hypervisor instructions until the results are actually needed by the hypervisor. We show that NEVE allows hypervisors running real application workloads to provide an order of magnitude improvement in performance over current Arm nested virtualization support and up to three times less overhead than x86 nested virtualization. NEVE is included in the Armv8.4 architecture.
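The coalesce-and-defer idea can be sketched in software (a toy model of our own; NEVE itself is an architectural extension, and trap_log, write_virtual_sysreg, and replay_log are made-up names): writes that would each have trapped to the host hypervisor are appended to an in-memory log and consumed in a single deferred pass when the values are actually needed, for example on the next VM entry.

/* Toy model of coalescing and deferring traps via an in-memory log. */
#include <stdint.h>
#include <stdio.h>

#define LOG_CAP 64
#define NUM_VSREGS 16

struct trap_log {
    struct { uint16_t reg; uint64_t val; } entry[LOG_CAP];
    unsigned n;
};

static uint64_t shadow_sysreg[NUM_VSREGS];  /* host hypervisor's view of the guest hypervisor's virtual sysregs */

/* Guest-hypervisor side: logged instead of trapping on every access. */
static void write_virtual_sysreg(struct trap_log *log, uint16_t reg, uint64_t val)
{
    if (log->n == LOG_CAP)   /* a full log would force an immediate trap; omitted here */
        return;
    log->entry[log->n].reg = reg;
    log->entry[log->n].val = val;
    log->n++;
}

/* Host-hypervisor side: one deferred pass replaces many immediate traps. */
static void replay_log(struct trap_log *log)
{
    for (unsigned i = 0; i < log->n; i++)
        shadow_sysreg[log->entry[i].reg] = log->entry[i].val;
    log->n = 0;
}

int main(void)
{
    struct trap_log log = { .n = 0 };
    write_virtual_sysreg(&log, 2, 0xabc);   /* several guest-hypervisor writes...          */
    write_virtual_sysreg(&log, 5, 0xdef);
    replay_log(&log);                       /* ...handled with a single deferred pass      */
    printf("sysreg[2]=%#llx sysreg[5]=%#llx\n",
           (unsigned long long)shadow_sysreg[2], (unsigned long long)shadow_sysreg[5]);
    return 0;
}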
Second, we introduce virtual-passthrough, a new approach for providing virtual I/O devices for nested virtualization without the intervention of multiple levels of hypervisors. Virtual-passthrough preserves I/O interposition while addressing the performance problem of I/O intensive workloads as they perform many times worse with nested virtualization than without virtualization. With virtual-passthrough, virtual devices provided by a host hypervisor, the hypervisor that runs directly on the hardware, can be assigned to nested virtual machines directly without delivering data and control through multiple layers of hypervisors. The approach leverages the existing direct device assignment mechanism and implementation, so it only requires virtual machine configuration changes. Virtual-passthrough is platform-agnostic and easily supports important virtualization features such as migration. We have applied virtual-passthrough in the Linux KVM hypervisor for both x86 and Arm hardware, and show that it can provide more than an order of magnitude improvement in performance over current KVM virtual device support on real application workloads.
Third, we introduce Direct Virtual Hardware (DVH), a new approach that enables a host hypervisor to directly provide virtual hardware to nested virtual machines without the intervention of multiple levels of hypervisors. DVH is a generalization of virtual-passthrough and does not limit virtual hardware to I/O devices. Beyond virtual-passthrough, we introduce three additional DVH mechanisms: virtual timers, virtual inter-processor interrupts, and virtual idle. DVH provides virtual hardware for these mechanisms that mimics the underlying hardware and, in some cases, adds new enhancements that leverage the flexibility of software without the need for matching physical hardware support. We have implemented DVH in KVM. Our experimental results show that combining the four DVH mechanisms can provide even greater performance than virtual-passthrough alone and provide near-native execution speeds on real application workloads
Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems
Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is a parallel programming model that aims to simplify synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model, but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming.
The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid, with realistic TM benchmarks and the ability to run the same benchmarks on different STM systems. We first introduce RMS-TM, a comprehensive benchmark suite for evaluating HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error-prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends.
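To see why library-based instrumentation is tedious, the fragment below writes a simple transfer against a hypothetical word-based STM interface (stm_begin, stm_read, stm_write, and stm_commit are illustrative declarations, not the API of any particular STM, and the fragment will not link without an STM library behind it): every shared load and store has to be routed through the library by hand, and the whole transaction must be retried on abort.

/* Manual, library-based STM instrumentation of a transaction. */
#include <stdint.h>

typedef struct stm_tx stm_tx;            /* opaque per-thread transaction descriptor (hypothetical) */
stm_tx  *stm_begin(void);
uint64_t stm_read(stm_tx *tx, uint64_t *addr);
void     stm_write(stm_tx *tx, uint64_t *addr, uint64_t val);
int      stm_commit(stm_tx *tx);         /* returns 0 on abort, nonzero on success */

static uint64_t balance_a, balance_b;    /* shared accounts */

void transfer(uint64_t amount)
{
    stm_tx *tx;
    do {
        tx = stm_begin();
        uint64_t a = stm_read(tx, &balance_a);   /* every shared access instrumented by hand */
        uint64_t b = stm_read(tx, &balance_b);
        stm_write(tx, &balance_a, a - amount);
        stm_write(tx, &balance_b, b + amount);
    } while (!stm_commit(tx));                   /* retry the whole transaction on abort */
}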
The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM designs, do not always meet programmers' expectations, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects application and auxiliary threads' computing demands and dynamically partitions hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines.
The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if" serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, is transparent to the underlying TM implementation, and can be used for a broad set of C/C++ TM programs. Based on this new definition, we propose T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past.
Modelling of Information Flow and Resource Utilization in the EDGE Distributed Web System
The adoption of Distributed Web Systems (DWS) into modern engineering design process has dramatically increased in recent years. The Engineering Design Guide and Environment (EDGE) is one such DWS, intended to provide an integrated set of tools for use in the development of new products and services. Previous attempts to improve the efficiency and scalability of DWS focused largely on hardware utilization (i.e. multithreading and virtualization) and software scalability (i.e. load balancing and cloud services). However, these techniques are often limited to analysis of the computational complexity of the algorithms implemented.
This work seeks to improve the understanding of DWS efficiency and scalability by modelling the dynamics of information flow and resource utilization, characterizing DWS workloads through historical usage data (i.e. request type, frequency, access time). The design and implementation of EDGE are described. A DWS model of an EDGE system is developed and validated against theoretical limiting cases. The DWS model is used to predict the throughput of an EDGE system given a resource allocation and workflow. Results of the simulation suggest that proposed DWS designs can be evaluated according to the usage requirements of an engineering firm, ultimately guiding an informed decision on the selection and deployment of a DWS in an enterprise environment. Recommendations for future work related to the continued development of EDGE, DWS modelling of EDGE installation environments, and the extension of DWS modelling to new product development processes are presented.
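As a flavour of how resource allocation and workload characteristics bound throughput, the sketch below applies the standard operational-analysis asymptotic bounds (a generic textbook calculation, not the DWS model developed in this work; the per-resource service demands, think time, and user count are invented).

/* Operational-analysis throughput bounds:
 *   X(N) <= min( N / (sum_i D_i + Z),  1 / max_i D_i )
 * where D_i is the per-request service demand at resource i and Z is user think time. */
#include <stdio.h>

int main(void)
{
    double demand[] = { 0.010, 0.030, 0.004 };  /* seconds per request at CPU, DB, network (assumed) */
    double think_time = 2.0;                    /* seconds between requests per user (assumed)       */
    int    users = 200;

    double total = 0.0, max = 0.0;
    for (int i = 0; i < 3; i++) {
        total += demand[i];
        if (demand[i] > max) max = demand[i];
    }

    double bound_users      = users / (total + think_time);  /* load-limited regime       */
    double bound_bottleneck = 1.0 / max;                      /* bottleneck-limited regime */
    double x = bound_users < bound_bottleneck ? bound_users : bound_bottleneck;

    printf("throughput upper bound: %.1f requests/s\n", x);
    return 0;
}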