
    Increasing and Detecting Memory Address Congruence

    A static memory reference exhibits a unique property when its dynamic memory addresses are congruent with respect to some non-trivial modulus. Extraction of this congruence information at compile time enables new classes of program optimization. In this paper, we present methods for forcing congruence among the dynamic addresses of a memory reference. We also introduce a compiler algorithm for detecting this property. Our transformations do not require interprocedural analysis and introduce almost no overhead. As a result, they can be incorporated into real compilation systems. On average, our transformations achieve a five-fold increase in the number of congruent memory operations, and we are then able to detect 95% of these references. This success is invaluable in providing performance gains in a variety of areas. When congruence information is incorporated into a vectorizing compiler, we can increase the performance of a G4 AltiVec processor by up to a factor of two. Using the same methods, we are able to reduce energy consumption in a data cache by as much as 35%. Singapore-MIT Alliance (SMA)
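    A minimal sketch of the property itself (not the paper's transformations or detection algorithm): if an array's base address is 16-byte aligned and a reference advances in 16-byte strides, every dynamic address of that reference is congruent to the same residue modulo 16, which is exactly the kind of fact a vectorizer or cache-energy optimization can exploit.

        // Every dynamic address of the reference a[4 * i] below is congruent to 0
        // modulo 16: the base is 16-byte aligned and the stride is 16 bytes.
        #include <cstdint>
        #include <cstdio>

        alignas(16) static float a[1024];                    // base address is a multiple of 16

        int main() {
            for (int i = 0; i < 256; i++) {
                std::uintptr_t addr = (std::uintptr_t)&a[4 * i];   // stride: 4 floats = 16 bytes
                std::printf("a[%4d]: address mod 16 = %lu\n",
                            4 * i, (unsigned long)(addr % 16));    // prints 0 for every iteration
            }
            return 0;
        }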

    MMP

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 129-135). Reliability and security are quickly becoming users' biggest concern due to the increasing reliance on computers in all areas of society. Hardware-enforced, fine-grained memory protection can increase the reliability and security of computer systems, but will be adopted only if the protection mechanism does not compromise performance, and if the hardware mechanism can be used easily by existing software. Mondriaan memory protection (MMP) provides fine-grained memory protection for a linear address space, while supporting an efficient hardware implementation. MMP's use of linear addressing makes it compatible with current software programming models and program binaries, and it is also backwards compatible with current operating systems and instruction sets. MMP can be implemented efficiently because it separates protection information from program data, allowing protection information to be compressed and cached efficiently. This organization is similar to paging hardware, where the translation information for a page of data bytes is compressed to a single translation value and cached in the TLB. MMP stores protection information in tables in protected system memory, just as paging hardware stores translation information in page tables. MMP is well suited to improve the robustness of modern software. Modern software development favors modules (or plugins) as a way to structure and provide extensibility for large systems, like operating systems, web servers and web clients. Protection between modules written in unsafe languages is currently provided only by programmer convention, reducing system stability. Device drivers, which are implemented as loadable modules, are now the most frequent source of operating system crashes (e.g., 85% of Windows XP crashes in one study [SBL03]). MMP provides a mechanism to enforce module boundaries, increasing system robustness by isolating modules from each other and making all memory sharing explicit. We implement the MMP hardware in a simulator and modify a version of the Linux 2.4.19 operating system to use it. Linux loads its device drivers as kernel module extensions, and MMP enforces the module boundaries, only allowing the device drivers access to the memory they need to function. The memory isolation provided by MMP increases Linux's resistance to programmer error, and exposed two kernel bugs in common, heavily-tested drivers. Experiments with several benchmarks where MMP was used extensively indicate the space taken by the MMP data structures is less than 11% of the memory used by the kernel, and the kernel's runtime, according to a simple performance model, increases less than 12% (relative to an unmodified kernel). By Emmett Jethro Witchel. Ph.D.
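    As a rough illustration of the organization the abstract describes, the toy table below keeps word-granularity permissions in memory separate from program data, the same way a page table keeps translations apart from the pages themselves. The flat layout, the permission encoding, and all names are invented for this sketch; the actual MMP design uses compressed multi-level tables and a protection lookaside buffer.

        // Conceptual sketch only: a toy, flat permission table indexed by word number.
        #include <cstddef>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        enum Perm : std::uint8_t { PERM_NONE = 0, PERM_READ = 1, PERM_READ_WRITE = 2 };

        struct PermTable {
            std::vector<std::uint8_t> entries;           // one entry per 4-byte word

            explicit PermTable(std::size_t words) : entries(words, PERM_NONE) {}

            void set_range(std::uintptr_t base, std::size_t bytes, Perm p) {
                for (std::uintptr_t w = base / 4; w < (base + bytes + 3) / 4; ++w)
                    entries[w] = p;                      // grant p to every word the range touches
            }

            bool check(std::uintptr_t addr, bool is_write) const {
                std::uint8_t p = entries[addr / 4];      // in hardware this lookup hits a cache
                return is_write ? p == PERM_READ_WRITE
                                : p == PERM_READ || p == PERM_READ_WRITE;
            }
        };

        int main() {
            PermTable table(1 << 20);                    // permissions for a 4 MB toy address space
            table.set_range(0x1000, 256, PERM_READ);     // e.g., export a driver's buffer read-only
            std::printf("%d %d\n", table.check(0x1010, false),   // 1: read allowed
                                   table.check(0x1010, true));   // 0: write denied
            return 0;
        }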

    GPUfs: The Case for Operating System Services on GPUs

    Due to their impressive price/performance and performance/watt curves, GPUs have become the processor of choice for many types of intensively parallel computations, from data mining to molecular dynamics simulations. Unfortunately, GPU programming models are still almost entirely lacking core system abstractions, such as files and sockets, that CPU programmers have taken for granted for decades. Today's GPU is capable of amazing computational feats when fed with the right data and managed by application code on the host CPU, but it is incapable of initiating basic system interactions for itself, such as reading an input file from a disk. Because core system abstractions are unavailable to GPU code, GPU programmers today face many of the same challenges CPU application developers did a half-century ago, particularly the constant reimplementation of system abstractions such as data movement and management operations.

    We feel the time has come to provide GPU programs with the useful system services that CPU code already enjoys. This goal emerges from a broader trend to integrate GPUs more cleanly with operating systems (OS), as exemplified by recent work to support pipeline composition of GPU tasks. Two key GPU characteristics make developing OS abstractions for GPUs challenging: data parallelism and an independent memory system. GPUs are optimized for Single Program Multiple Data (SPMD) parallelism, where the same program is used to concurrently process many different parts of the input data. GPU programs typically use tens of thousands of lightweight threads running similar or identical code with little control-flow variation. Conventional OS services, such as the POSIX file system API, were not built with such an execution environment in mind. In GPUfs, we had to adapt both the API semantics and its implementation to support such massive parallelism, allowing thousands of threads to efficiently invoke open, close, read, or write calls simultaneously.

    To feed their voracious appetites for data, high-end GPUs usually have their own DRAM storage. A massively parallel memory interface to this DRAM offers high bandwidth for local access by GPU code, but GPU access to the CPU's system memory is an order of magnitude slower, because it requires communication over the bandwidth-constrained, higher-latency PCI Express bus. In the increasingly common case of systems with multiple discrete GPUs (standard in Apple's new Mac Pro, for example), each GPU has its own local memory, and accessing a GPU's own memory can be an order of magnitude more efficient than accessing a sibling GPU's memory. GPUs thus exhibit a particularly extreme non-uniform memory access (NUMA) property, making it performance-critical for the OS to optimize for access locality in data placement and reuse across CPU and GPU memories. GPUfs, for example, distributes its buffer cache across all CPU and GPU memories to enable idioms like process pipelines that read and write files from the same or different processors.

    To highlight the benefits of bringing core OS abstractions such as files to GPUs, we show the use of GPUfs in a self-contained GPU application for string search on NVIDIA GPUs. This application illustrates how GPUfs can efficiently support irregular workloads, in which parallel threads open and access dynamically selected files of varying size and composition, and the output size might be arbitrarily large and determined at runtime.
    Our GPUfs version is about seven times faster than a parallel 8-core CPU run over the full Linux kernel source, which is stored in about 33,000 small files. While our current prototype benchmarks represent only a few simple application data points for a single OS abstraction, they suggest that OS services for GPU code are not only hypothetically desirable, but also feasible and efficient in practice.
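    The CUDA sketch below illustrates the programming model being argued for: file calls issued directly from GPU code, with the threads of a block cooperating on each call and each block processing its own chunk of the file. The gpu_open/gpu_read/gpu_close names and signatures are placeholders assumed for illustration, not the GPUfs API; the stubs exist only so the sketch compiles.

        #include <cstdio>
        #include <cstring>

        __device__ int  gpu_open(const char *, int)        { return 0; }   // placeholder stub
        __device__ long gpu_read(int, void *, long, long)  { return 0; }   // placeholder stub
        __device__ int  gpu_close(int)                      { return 0; }   // placeholder stub

        __global__ void count_matches(const char *path, char needle, int *hits) {
            __shared__ int fd;
            __shared__ char chunk[4096];

            if (threadIdx.x == 0)                            // one open per thread block
                fd = gpu_open(path, /*O_RDONLY*/ 0);
            __syncthreads();

            // Collective read: every thread of the block participates in the call,
            // and each block reads its own chunk of the file.
            long got = gpu_read(fd, chunk, sizeof(chunk),
                                (long)blockIdx.x * (long)sizeof(chunk));
            __syncthreads();

            for (long i = threadIdx.x; i < got; i += blockDim.x)   // parallel scan of the chunk
                if (chunk[i] == needle) atomicAdd(hits, 1);

            __syncthreads();
            if (threadIdx.x == 0) gpu_close(fd);
        }

        int main() {
            int *hits;  cudaMallocManaged(&hits, sizeof(int));  *hits = 0;
            char *path; cudaMallocManaged(&path, 32);
            std::strcpy(path, "/data/input.txt");            // hypothetical input file
            count_matches<<<32, 128>>>(path, 'x', hits);      // 32 blocks x 128 threads
            cudaDeviceSynchronize();
            std::printf("matches: %d\n", *hits);
            return 0;
        }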

    GPUfs: Integrating a File System with GPUs

    As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. To make GPUs easier to program and improve their integration with operating systems, we propose making the host’s file system directly accessible to GPU code. GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the host CPU’s buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of the GPUfs approach. For example, a self-contained GPU program that searches for a set of strings throughout the Linux kernel source tree runs over seven times faster than on an eight-core CPU.

    Anon-Pass: Practical Anonymous Subscriptions

    We present the design, security proof, and implementation of an anonymous subscription service. Users register for the service by providing some form of identity, which might or might not be linked to a real-world identity such as a credit card, a web login, or a public key. A user logs on to the system by presenting a credential derived from information received at registration. Each credential allows only a single login in any authentication window, or epoch. Logins are anonymous in the sense that the service cannot distinguish which user is logging in any better than random guessing. This implies unlinkability of a user across different logins. We find that a central tension in an anonymous subscription service is the service provider’s desire for a long epoch (to reduce server-side computation) versus users’ desire for a short epoch (so they can repeatedly “re-anonymize” their sessions). We balance this tension by having short epochs, but adding an efficient operation for clients who do not need unlinkability to cheaply re-authenticate themselves for the next time period. We measure performance of a research prototype of our protocol that allows an independent service to offer anonymous access to existing services. We implement a music service, an Android-based subway-pass application, and a web proxy, and show that adding anonymity adds minimal client latency and only requires 33 KB of server memory per active user.
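    To make the epoch mechanism concrete, the toy server-side check below enforces "one login per authentication window" by remembering the per-epoch tokens it has already seen and forgetting them when the epoch rolls over. It stands in for, and is far weaker than, the actual Anon-Pass protocol, which derives unlinkable credentials cryptographically; all names here are invented for the sketch.

        #include <cstdint>
        #include <cstdio>
        #include <string>
        #include <unordered_set>

        struct EpochGate {
            std::int64_t epoch_seconds;                  // length of one authentication window
            std::int64_t current_epoch = -1;
            std::unordered_set<std::string> seen;        // tokens already used in this epoch

            explicit EpochGate(std::int64_t secs) : epoch_seconds(secs) {}

            // Returns true iff this token has not yet logged in during the current epoch.
            bool try_login(const std::string &epoch_token, std::int64_t now_seconds) {
                std::int64_t epoch = now_seconds / epoch_seconds;
                if (epoch != current_epoch) {            // epoch rolled over:
                    seen.clear();                        // old tokens can no longer be replayed or linked
                    current_epoch = epoch;
                }
                return seen.insert(epoch_token).second;  // second login in the same epoch -> false
            }
        };

        int main() {
            EpochGate gate(300);                                  // 5-minute epochs
            std::printf("%d\n", gate.try_login("tokenA", 1000));  // 1: first login this epoch
            std::printf("%d\n", gate.try_login("tokenA", 1100));  // 0: same epoch, rejected
            std::printf("%d\n", gate.try_login("tokenB", 1600));  // 1: next epoch, fresh unlinkable token
            return 0;
        }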

    Assise: Performance and Availability via NVM Colocation in a Distributed File System

    The adoption of very low latency persistent memory modules (PMMs) upends the long-established model of disaggregated file system access. Instead, by colocating computation and PMM storage, we can provide applications much higher I/O performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated coherence protocol for managing a set of server-colocated PMMs as a fast, crash-recoverable cache between applications and slower disaggregated storage, such as SSDs. Unlike disaggregated file systems, Assise maximizes locality for all file IO by carrying out IO on colocated PMM whenever possible, and minimizes coherence overhead by maintaining consistency at IO operation granularity rather than at fixed block sizes. We compare Assise to Ceph/Bluestore, NFS, and Octopus on a cluster with Intel Optane DC PMMs and SSDs for common cloud applications and benchmarks, such as LevelDB, Postfix, and FileBench. We find that Assise improves write latency up to 22x, throughput up to 56x, fail-over time up to 103x, and scales up to 6x better than its counterparts, while providing stronger consistency semantics. Assise promises to beat the MinuteSort world record by 1.5x.
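    A rough sketch of the write-path locality idea (not Assise's actual coherence protocol): each write is persisted first in a log on colocated persistent memory, mirrored to a replica, and only later drained to slower shared storage, with replication tracked per IO operation rather than per fixed-size block. All structure and function names below are invented for illustration.

        #include <cstddef>
        #include <vector>

        struct LogEntry {                                // one IO operation
            std::size_t file_off;
            std::vector<char> data;
        };

        struct NvmWriteLog {
            std::vector<LogEntry> local_pmm_log;         // stands in for the colocated PMM cache

            void write(std::size_t off, const char *buf, std::size_t len) {
                LogEntry e{off, std::vector<char>(buf, buf + len)};
                local_pmm_log.push_back(e);              // 1. persist in local, colocated PMM (fast path)
                replicate_to_peer(e);                    // 2. mirror the same operation to a replica's PMM
            }

            void replicate_to_peer(const LogEntry &) {   // network send elided in this sketch
            }

            void digest_to_shared_storage() {            // 3. background: drain the log to the SSD tier
                local_pmm_log.clear();
            }
        };

        int main() {
            NvmWriteLog log;
            const char msg[] = "hello";
            log.write(0, msg, sizeof(msg));              // one IO op: persist locally, then replicate
            log.digest_to_shared_storage();              // later, drain to slower shared storage
            return 0;
        }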