3 research outputs found

    CAS-DSM: A Compiler Assisted Software Distributed Shared Memory

    Traditional software Distributed Shared Memory (DSM) systems rely on virtual memory management mechanisms to detect accesses to shared memory locations and maintain their consistency. The resulting involvement of the OS kernel, and the significant overhead associated with it, can be avoided by careful compile-time analysis and code instrumentation. In this paper, we propose such a Compiler Assisted Software support approach (CAS-DSM). In the CAS-DSM implementation, the involvement of the OS kernel is avoided by instrumenting the application code at the source level. The overhead caused by the execution of the instrumented code is reduced through several aggressive compile-time optimizations. Finally, we also address the issue of reducing certain overheads in the polling-based implementation of receiving asynchronous messages. We used SUIF, a public-domain compiler tool, to implement the compile-time analysis, instrumentation, and optimizations. We modified CVM, a publicly available software DSM, to support the instrumentation inserted by the compiler. A detailed performance evaluation of CAS-DSM is reported using a set of Splash/Splash2 parallel application benchmarks on a distributed-memory IBM SP-2 machine. CAS-DSM achieved moderate to good performance improvements for most of the applications compared to the original CVM implementation. Reducing the overheads in the polling-based implementation improves the performance of CAS-DSM significantly, resulting in an overall improvement of 12–52% over the original CVM implementation.
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/44573/1/10766_2004_Article_482234.pd
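    To make the instrumentation idea concrete, the following is a minimal sketch of what compiler-inserted access checks can look like, assuming a page-granularity runtime. All names here (DsmRuntime, check_read, check_write, fetch_page, record_write) are invented for this illustration and are not the CVM or CAS-DSM interface.

```cpp
// Hypothetical sketch: instead of letting the OS raise a page fault on access to
// an invalid shared page, the compiler inserts an explicit check before each
// shared read/write. Names are illustrative only.
#include <cstddef>
#include <cstdint>
#include <unordered_map>

enum class PageState { Invalid, ReadOnly, ReadWrite };

struct DsmRuntime {
    std::unordered_map<std::uintptr_t, PageState> pages;  // state of each shared page
    static constexpr std::size_t kPageSize = 4096;

    std::uintptr_t page_of(const void* addr) const {
        return reinterpret_cast<std::uintptr_t>(addr) / kPageSize;
    }
    void fetch_page(std::uintptr_t page) { pages[page] = PageState::ReadOnly; }    // stand-in for a network fetch
    void record_write(std::uintptr_t page) { pages[page] = PageState::ReadWrite; } // stand-in for write bookkeeping

    // Inserted by the compiler before a shared read.
    void check_read(const void* addr) {
        std::uintptr_t p = page_of(addr);
        if (pages[p] == PageState::Invalid) fetch_page(p);
    }
    // Inserted by the compiler before a shared write.
    void check_write(void* addr) {
        std::uintptr_t p = page_of(addr);
        if (pages[p] != PageState::ReadWrite) { fetch_page(p); record_write(p); }
    }
};

// Original loop:        for (i...) a[i] = b[i] + 1;
// Instrumented version; redundant checks could later be hoisted out of the loop.
void scaled_copy(DsmRuntime& rt, double* a, const double* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        rt.check_read(&b[i]);
        rt.check_write(&a[i]);
        a[i] = b[i] + 1.0;
    }
}

int main() {
    DsmRuntime rt;
    double a[4] = {0}, b[4] = {1, 2, 3, 4};
    scaled_copy(rt, a, b, 4);
}
```

    In this style, the per-access check replaces the OS page fault as the detection mechanism, and the "aggressive compile-time optimizations" mentioned in the abstract correspond to eliminating or hoisting redundant checks, for example moving a single check out of the loop shown above.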

    Silkroad: A system supporting DSM and multiple paradigms in cluster computing

    Ph.D. (Doctor of Philosophy)

    Memory-Constrained Computing

    University of Minnesota Ph.D. dissertation. November 2017. Major: Computer Science. Advisor: George Karypis. 1 computer file (PDF); x, 126 pages.

    The growing disparity between data set sizes and the amount of fast internal memory available in modern computer systems is an important challenge facing a variety of application domains. This problem is partly due to the incredible rate at which data is being collected, and partly due to the movement of many systems towards increasing processor counts without proportionate increases in fast internal memory. Without access to sufficiently large machines, many application users must balance a trade-off between utilizing the processing capabilities of their system and performing computations in memory. In this thesis we explore several approaches to solving this problem.

    We develop effective and efficient algorithms for compressing scientific simulation data computed on structured and unstructured grids. A paradigm for lossy compression of this data is proposed in which the data computed on the grid is modeled as a graph, which is decomposed into sets of vertices that satisfy a user-defined error constraint, epsilon. Each set of vertices is replaced by a constant value with reconstruction error bounded by epsilon. A comprehensive set of experiments is conducted comparing these algorithms with other state-of-the-art scientific data compression methods. Over our benchmark suite, our methods obtained compression to 1% of the original size with an average PSNR of 43.00, and to 3% of the original size with an average PSNR of 63.30. In addition, our schemes outperform other state-of-the-art lossy compression approaches and require, on average, 25% of the space they need for similar or better PSNR levels.

    We present algorithms and experimental analysis for five data structures for representing dynamic sparse graphs. The goal of the presented data structures is twofold. First, the data structures must be compact, as the size of the graphs being operated on continues to grow to less manageable sizes. Second, the cost of operating on the data structures must be within a small factor of the cost of operating on the static graph, or else these data structures will not be useful. Of these five data structures, three are approaches, one is semi-compact but suited for fast operation, and one is focused on compactness and is a dynamic extension of an existing technique known as the WebGraph Framework. Our results show that for well-intervalized graphs, like web graphs, the semi-compact data structure is superior to all other data structures in terms of memory and access time. Furthermore, we show that in terms of memory, the compact data structure outperforms all other data structures at the cost of a modest increase in update and access time.

    We present a virtual memory subsystem which we implemented as part of the BDMPI runtime. Our new virtual memory subsystem, which we call SBMA, bypasses the operating system virtual memory manager to take advantage of BDMPI's node-level cooperative multi-tasking. Benchmarking using a synthetic application shows that for the use cases relevant to BDMPI, the overhead incurred by the BDMPI-SBMA system is amortized such that it performs as fast as explicit data movement by the application developer. Furthermore, we tested SBMA with three different classes of applications, and our results show that with no modification to the original program, speedups of 2x–12x over a standard BDMPI implementation can be achieved for the included applications.

    We present a runtime system designed to be used alongside data-parallel OpenMP programs for shared-memory problems requiring out-of-core execution. Our new runtime system, which we call OpenOOC, exploits the concurrency exposed by the OpenMP semantics to switch execution contexts during non-resident memory accesses and perform useful computation, instead of having the thread wait idle. Benchmarking using a synthetic application shows that modern operating systems support the necessary memory and execution-context switching functionality with high enough performance that it can be used to effectively hide some of the overhead incurred when swapping data between memory and disk in out-of-core execution environments. Furthermore, we tested OpenOOC with a practical computational application, and our results show that with no structural modification to the original program, runtime can be reduced by an average of 21% compared with the out-of-core equivalent of the application.
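    As a rough illustration of the epsilon-bounded compression idea from the first part of this abstract, the sketch below applies the same invariant to a 1-D array: consecutive values are grouped into runs that can each be replaced by a single constant while keeping every reconstruction error within a user-supplied epsilon. The thesis operates on graphs built from structured and unstructured grids; this simplified greedy run-length variant only demonstrates the error-bound invariant, and none of the names here come from the dissertation.

```cpp
// Simplified 1-D analogue of epsilon-bounded lossy compression: replace each run
// of consecutive values by one constant, with per-value error at most epsilon.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Run {
    std::size_t length;  // number of consecutive values in the run
    double value;        // constant used to reconstruct all of them
};

std::vector<Run> compress(const std::vector<double>& data, double epsilon) {
    std::vector<Run> runs;
    std::size_t start = 0;
    while (start < data.size()) {
        double lo = data[start], hi = data[start];
        std::size_t end = start + 1;
        // Extend the run while (max - min) / 2 <= epsilon still holds.
        while (end < data.size()) {
            double nlo = std::min(lo, data[end]);
            double nhi = std::max(hi, data[end]);
            if ((nhi - nlo) / 2.0 > epsilon) break;
            lo = nlo; hi = nhi; ++end;
        }
        // The midpoint minimizes the worst-case reconstruction error of the run.
        runs.push_back({end - start, (lo + hi) / 2.0});
        start = end;
    }
    return runs;
}

int main() {
    std::vector<double> field = {1.00, 1.01, 0.99, 3.20, 3.25, 3.22, 7.0};
    double epsilon = 0.05;
    for (const Run& r : compress(field, epsilon))
        std::cout << r.length << " values -> " << r.value << "\n";
    // Every reconstructed value differs from its original by at most epsilon.
}
```

    The graph-based algorithms in the thesis generalize this grouping step to arbitrary grid connectivity, where a "run" becomes a connected set of vertices whose values all stay within the epsilon bound of the chosen constant.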