
    Physical Address Decoding in Intel Xeon v3/v4 CPUs: A Supplemental Datasheet

    The mapping of the physical address space to actual physical locations in DRAM is a complex multistage process on today’s systems. Research in domains such as operating systems and system security would benefit from proper documentation of that address translation, yet publicly available datasheets are often incomplete. To spare others the effort of reverse-engineering, we present our insights about the address decoding stages of the Intel Xeon E5 v3 and v4 processors in this report, including the layout and the addresses of all involved configuration registers, as far as we have become aware of them in our experiments. In addition, we present a novel technique for reverse-engineering of interleaving functions by mapping physically present DRAM multiple times into the physical address space.
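
    As a rough illustration of the kind of interleaving function such decoding involves, the sketch below computes a channel-select bit as the XOR (parity) of selected physical-address bits; the function name and the bit mask are hypothetical placeholders, not values taken from the report.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical XOR-based interleaving function: the channel-select bit is
     * the parity of the physical-address bits chosen by a mask. The masks used
     * on real Xeon E5 v3/v4 systems depend on the memory configuration. */
    unsigned channel_bit(uint64_t phys_addr, uint64_t bit_mask)
    {
        return __builtin_parityll(phys_addr & bit_mask);
    }

    int main(void)
    {
        /* Illustrative mask selecting bits 7, 14, and 17. */
        uint64_t mask = (1ULL << 7) | (1ULL << 14) | (1ULL << 17);
        uint64_t addr = 0x12345680ULL;

        printf("channel-select bit for %#llx: %u\n",
               (unsigned long long)addr, channel_bit(addr, mask));
        return 0;
    }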

    Virtual InfiniBand Clusters for HPC Clouds

    High Performance Computing (HPC) employs fast interconnect technologies to provide low communication and synchronization latencies for tightly coupled parallel compute jobs. Contemporary HPC clusters have a fixed capacity and static runtime environments; they cannot elastically adapt to dynamic workloads, and provide a limited selection of applications, libraries, and system software. In contrast, a cloud model for HPC clusters promises more flexibility, as it provides elastic virtual clusters to be available on-demand. This is not possible with physically owned clusters. In this paper, we present an approach that makes it possible to use InfiniBand clusters for HPC cloud computing. We propose a performance-driven design of an HPC IaaS layer for InfiniBand, which provides throughput- and latency-aware virtualization of nodes, networks, and network topologies, as well as an approach to an HPC-aware, multi-tenant cloud management system for elastic virtualized HPC compute clusters.

    NTT software optimization using an extended Harvey butterfly

    Software implementations of the number-theoretic transform (NTT) method often leverage Harvey’s butterfly to gain speedups. This is the case in cryptographic libraries such as IBM’s HElib, Microsoft’s SEAL, and Intel’s HEXL, which provide optimized implementations of fully homomorphic encryption schemes or their primitives. We extend the Harvey butterfly to the radix-4 case for primes in the range [2^31, 2^52). This enables us to use the vector multiply sum logical (VMSL) instruction, which is available on recent IBM Z® platforms. On an IBM z14 system, our implementation performs more than 2.5x faster than the scalar implementation of SEAL we converted to native C. In addition, we provide a mixed-radix implementation that uses AVX512-IFMA on Intel’s Ice Lake processor, which happens to be ~1.1 times faster than the super-optimized implementation of Intel’s HEXL. Finally, we compare the performance of some of our implementations using GCC versus Clang compilers and discuss the results.
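
    For context, here is a minimal sketch of the scalar radix-2 Harvey (lazy) butterfly that these libraries build on, using the Shoup-style precomputed quotient. It is a textbook formulation, not the paper's radix-4, VMSL, or AVX512-IFMA code, and it assumes a GCC/Clang toolchain for unsigned __int128.

    #include <stdint.h>

    typedef unsigned __int128 u128;

    /* Precompute the Shoup constant w' = floor(w * 2^64 / p), with w < p < 2^62. */
    uint64_t shoup(uint64_t w, uint64_t p)
    {
        return (uint64_t)(((u128)w << 64) / p);
    }

    /* Radix-2 Harvey lazy butterfly: inputs and outputs stay in [0, 4p). */
    void harvey_butterfly(uint64_t *X, uint64_t *Y,
                          uint64_t w, uint64_t w_shoup, uint64_t p)
    {
        uint64_t x = *X, y = *Y;

        if (x >= 2 * p)                /* lazy reduction: bring x into [0, 2p) */
            x -= 2 * p;

        uint64_t q = (uint64_t)(((u128)w_shoup * y) >> 64);
        uint64_t t = w * y - q * p;    /* computed mod 2^64, lands in [0, 2p) */

        *X = x + t;                    /* in [0, 4p) */
        *Y = x - t + 2 * p;            /* in [0, 4p) */
    }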

    KSM++: Using I/O-based hints to make memory-deduplication scanners more efficient

    Memory scanning deduplication techniques, as implemented in Linux’ Kernel Samepage Merging (KSM), work very well for deduplicating fairly static, anonymous pages with equal content across different virtual machines. However, scanners need very aggressive scan rates when it comes to identifying sharing opportunities with a short life span of up to about 5 min. Otherwise, the scan process is not fast enough to catch those short-lived pages. Our approach generates I/O-based hints in the host to make the memory scanning process more efficient, thus enabling it to find and exploit short-lived sharing opportunities without raising the scan rate. Experiences with similar techniques for paravirtualized guests have shown that pages in a guest’s unified buffer cache are good sharing candidates. We already identify such pages in the host when carrying out I/O operations on behalf of the guest. The target/source pages in the guest can safely be assumed to be part of the guest’s unified buffer cache. That way, we can determine good sharing hints for the memory scanner. A modification of the guest is not required. We have implemented our approach in Linux. By modifying the KSM scanning mechanism to process these hints preferentially, we move the associated sharing opportunities earlier into the merging stage. Thereby, we deduplicate more pages than the baseline system. In our evaluation, we identify sharing opportunities faster and with less overhead than the traditional linear scanning policy. KSM needs to follow about seven times as many pages as we do to find a sharing opportunity.
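
    A minimal sketch of the idea, using hypothetical structures and names rather than the real Linux/KSM interfaces: pages touched by guest I/O are recorded in a hint queue in the host, and the scanner drains that queue before falling back to its linear pass.

    #include <stdbool.h>
    #include <stddef.h>

    struct page;                        /* host page descriptor (opaque here) */

    struct hint_queue {                 /* simple ring buffer of hinted pages;
                                         * overflow handling omitted for brevity */
        struct page *slots[1024];
        size_t head, tail;
    };

    /* Called from the host I/O path: the target/source page of a guest I/O
     * request is very likely part of the guest's unified buffer cache. */
    void record_io_hint(struct hint_queue *q, struct page *pg)
    {
        q->slots[q->tail++ % 1024] = pg;
    }

    static bool hint_pop(struct hint_queue *q, struct page **out)
    {
        if (q->head == q->tail)
            return false;
        *out = q->slots[q->head++ % 1024];
        return true;
    }

    /* One scan step: prefer hinted pages, otherwise continue the linear walk. */
    struct page *next_scan_candidate(struct hint_queue *q,
                                     struct page *(*linear_next)(void))
    {
        struct page *pg;

        if (hint_pop(q, &pg))
            return pg;                  /* catches short-lived sharing early */
        return linear_next();           /* baseline KSM behaviour otherwise */
    }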

    SimuBoost: Scalable Parallelization of Functional System Simulation

    The limited execution speed of current full system simulators restricts their applicability for dynamic analysis to short-running workloads. When analyzing memory contents while simulating a kernel build with Simics, we encountered slowdowns of more than 5000x, resulting in 10 months of total simulation time. Prior work improved the simulation speed by simulating virtual CPU cores on separate physical CPU cores simultaneously or by applying sampling and extrapolation methods to focus costly analyses on short execution windows. However, these approaches inherently suffer from limited scalability or trading accuracy for speed. SimuBoost is a novel idea to parallelize functional full system simulation of single-cores. Our approach takes advantage of fast execution through virtualization, taking checkpoints in regular intervals. The parts between subsequent checkpoints are then simulated and analyzed simultaneously in one job per interval. By transferring jobs to multiple nodes, a parallelized and distributed simulation of the target workload can be achieved, thereby effectively reducing the overall required simulation time. As no implementation of SimuBoost exists yet, we present a formal model to evaluate the general speedup and scalability characteristics of our acceleration technique. We moreover provide a model to estimate the required number of simulation nodes for optimal performance. According to this model, our approach can speed up conventional simulation in a realistic scenario by a factor of 84, while delivering a parallelization efficiency of 94%.
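
    The flavor of such a model can be conveyed with a deliberately simplified calculation; the formula and the parameter values below are illustrative placeholders, not the formal model from the paper. With enough nodes, the interval simulations overlap, so the total time is roughly the checkpointing (virtualized) run plus the simulation of one interval.

    #include <stdio.h>

    /* Simplified speedup estimate for interval-parallel simulation.
     * T   : native runtime of the workload (seconds)
     * s_v : slowdown of the virtualized, checkpointing run
     * s_s : slowdown of functional simulation
     * L   : checkpoint interval (seconds of native time)
     * Startup effects, checkpoint costs, and node limits are ignored. */
    double estimated_speedup(double T, double s_v, double s_s, double L)
    {
        double t_serial   = s_s * T;            /* conventional, serial simulation */
        double t_parallel = s_v * T + s_s * L;  /* virtualized run + last interval */
        return t_serial / t_parallel;
    }

    int main(void)
    {
        /* e.g. one hour of native time, 1.2x virtualization overhead,
         * 100x simulation slowdown, 2 s checkpoint interval */
        printf("estimated speedup: %.1f\n", estimated_speedup(3600, 1.2, 100, 2));
        return 0;
    }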

    LoGA: Low-Overhead GPU Accounting Using Events

    Over the last few years, GPUs have become common in computing. However, current GPUs are not designed for a shared environment like a cloud, creating a number of challenges whenever a GPU must be multiplexed between multiple users. In particular, the round-robin scheduling used by today’s GPUs does not distribute the available GPU computation time fairly among applications. Most of the previous work addressing this problem resorted to scheduling all GPU computation in software, which induces high overhead. While there is a GPU scheduler called NEON which reduces the scheduling overhead compared to previous work, NEON’s accounting mechanism frequently disables GPU access for all but one application, resulting in considerable overhead if that application does not saturate the GPU by itself. In this paper, we present LoGA, a novel accounting mechanism for GPU computation time. LoGA monitors the GPU’s state to detect GPU-internal context switches, and infers the amount of GPU computation time consumed by each process from the time between these context switches. This method allows LoGA to measure GPU computation time consumed by applications while keeping all applications running concurrently. As a result, LoGA achieves a lower accounting overhead than previous work, especially for applications that do not saturate the GPU by themselves. We have developed a prototype which combines LoGA with the pre-existing NEON scheduler. Experiments with that prototype have shown that LoGA induces no accounting overhead while still delivering accurate measurements of applications’ consumed GPU computation time.
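
    The accounting idea can be sketched as follows; the event structure and names are hypothetical, standing in for the GPU-internal context-switch events the prototype actually observes. Each window between two consecutive switches is charged to the process that occupied the GPU during it.

    #include <stdint.h>

    #define MAX_PROCS 64

    /* Hypothetical context-switch event: at time ns, the GPU switched away
     * from context `prev` to context `next`. */
    struct ctx_switch_event {
        uint64_t ns;
        int prev;
        int next;
    };

    static uint64_t gpu_time_ns[MAX_PROCS];  /* accumulated GPU time per process */
    static uint64_t last_switch_ns;

    /* Charge the elapsed window to the process that was resident on the GPU. */
    void account_switch(const struct ctx_switch_event *ev)
    {
        if (last_switch_ns != 0 && ev->prev >= 0 && ev->prev < MAX_PROCS)
            gpu_time_ns[ev->prev] += ev->ns - last_switch_ns;
        last_switch_ns = ev->ns;
    }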

    LoGV: Low-overhead GPGPU Virtualization

    Over the last few years, running high performance computing applications in the cloud has become feasible. At the same time, GPGPUs are delivering unprecedented performance for HPC applications. Cloud providers thus face the challenge to integrate GPGPUs into their virtualized platforms, which has proven difficult for current virtualization stacks. In this paper, we present LoGV, an approach to virtualize GPGPUs by leveraging protection mechanisms already present in modern hardware. LoGV enables sharing of GPGPUs between VMs as well as VM migration without modifying the host driver or the guest’s CUDA runtime. LoGV allocates resources securely in the hypervisor which then grants applications direct access to these resources, relying on GPGPU hardware features to guarantee mutual protection between applications. Experiments with our prototype have shown an overhead of less than 4% compared to native execution.

    GPrioSwap: Towards a Swapping Policy for GPUs

    Over the last few years, Graphics Processing Units (GPUs) have become popular in computing, and have found their way into a number of cloud platforms. However, integrating a GPU into a cloud environment requires the cloud provider to efficiently virtualize the GPU. While several research projects have addressed this challenge in the past, few of these projects attempt to properly enable sharing of GPU memory between multiple clients: To date, GPUswap is the only project that enables sharing of GPU memory without inducing unnecessary application overhead, while maintaining both fairness and high utilization of GPU memory. However, GPUswap includes only a rudimentary swapping policy, and therefore induces a rather large application overhead. In this paper, we work towards a practicable swapping policy for GPUs. To that end, we analyze the behavior of various GPU applications to determine their memory access patterns. Based on our insights about these patterns, we derive a swapping policy that includes a developer-assigned priority for each GPU buffer in its swapping decisions. Experiments with our prototype implementation show that a swapping policy based on buffer priorities can significantly reduce the swapping overhead.
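
    A minimal sketch of such a priority-driven victim selection; the buffer descriptor and the selection rule are illustrative assumptions, not GPUswap's actual data structures. When GPU memory is oversubscribed, the buffer with the lowest developer-assigned priority that still has resident pages is swapped out first.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical buffer descriptor: a developer-assigned priority plus the
     * amount of the buffer currently resident in GPU memory. */
    struct gpu_buffer {
        int      priority;       /* higher value = more latency-critical */
        uint64_t resident_size;  /* bytes currently placed in GPU memory */
    };

    /* Pick the next victim for swapping to host memory. */
    struct gpu_buffer *pick_victim(struct gpu_buffer *bufs, size_t n)
    {
        struct gpu_buffer *victim = NULL;

        for (size_t i = 0; i < n; i++) {
            if (bufs[i].resident_size == 0)
                continue;
            if (!victim || bufs[i].priority < victim->priority)
                victim = &bufs[i];
        }
        return victim;
    }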