
    Plate : persistent memory management for nonvolatile main memory

    Over the past few years, nonvolatile memory has been actively researched and developed. Studying operating system (OS) designs that assume nonvolatile main memory, and methods for managing persistent data in virtual memory, is therefore crucial to encouraging the widespread use of nonvolatile memory. However, the main memory in most computers today is volatile, and replacing high-capacity main memory with nonvolatile memory is cost-prohibitive. This paper proposes an OS structure for nonvolatile main memory. The proposed OS structure consists of three functions for studying and developing OSs for computers with nonvolatile main memory. First, a structure called a plate is proposed, whereby persistent data are managed on the assumption that nonvolatile main memory is present in the computer. Second, we propose a persistent-data mechanism that makes volatile memory function as nonvolatile main memory, which serves as a basis for developing OSs for computers with nonvolatile main memory. Third, we propose continuous operation control using the persistent-data mechanism and plates. This paper describes the design and implementation of the OS structure based on these three functions on The ENduring operating system for Distributed EnviRonment, and describes the evaluation results of the proposed functions.

    TransForm: Formally Specifying Transistency Models and Synthesizing Enhanced Litmus Tests

    Memory consistency models (MCMs) specify the legal ordering and visibility of shared memory accesses in a parallel program. Traditionally, instruction set architecture (ISA) MCMs assume that relevant program-visible memory ordering behaviors only result from shared memory interactions that take place between user-level program instructions. This assumption fails to account for virtual memory (VM) implementations that may result in additional shared memory interactions between user-level program instructions and both 1) system-level operations (e.g., address remappings and translation lookaside buffer invalidations initiated by system calls) and 2) hardware-level operations (e.g., hardware page table walks and dirty bit updates) during a user-level program's execution. These additional shared memory interactions can impact the observable memory ordering behaviors of user-level programs. Thus, memory transistency models (MTMs) have been coined as a superset of MCMs to additionally articulate VM-aware consistency rules. However, no prior work has enabled formal MTM specifications, nor methods to support their automated analysis. To fill the above gap, this paper presents the TransForm framework. First, TransForm features an axiomatic vocabulary for formally specifying MTMs. Second, TransForm includes a synthesis engine to support the automated generation of litmus tests enhanced with MTM features (i.e., enhanced litmus tests, or ELTs) when supplied with a TransForm MTM specification. As a case study, we formally define an estimated MTM for Intel x86 processors, called x86t_elt, that is based on observations made by an ELT-based evaluation of an Intel x86 MTM implementation from prior work and available public documentation. 
Given x86t_elt and a synthesis bound as input, TransForm's synthesis engine successfully produces a set of ELTs including relevant ELTs from prior work.
    Comment: This is an updated version of the TransForm paper that features updated results reflecting performance optimizations and software bug fixes. 14 pages, 11 figures, Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA
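
For readers unfamiliar with litmus tests, the following is a minimal sketch of a classic consistency-only litmus test (store buffering), checked by brute-force enumeration under sequential consistency. It is not TransForm's synthesis engine and it does not model any virtual memory operations, so it is not an ELT. The thread encoding and the names `THREADS`, `interleavings`, `run`, and `sc_outcomes` are assumptions made for illustration.

```python
# Store-buffering litmus test, encoded as per-thread operation lists.
# ("st", var, value) writes memory; ("ld", var, reg) reads memory into a register.
# This encoding is a hypothetical one chosen for the sketch.
THREADS = [
    [("st", "x", 1), ("ld", "y", "r0")],  # thread 0
    [("st", "y", 1), ("ld", "x", "r1")],  # thread 1
]


def interleavings(threads):
    """Yield every merge of the threads that preserves per-thread program order."""
    if all(len(t) == 0 for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def run(schedule):
    """Execute one interleaving against a single shared memory; return (r0, r1)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "st":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return (regs["r0"], regs["r1"])


def sc_outcomes(threads):
    """Collect all outcomes observable under sequential consistency."""
    return {run(s) for s in interleavings(threads)}


if __name__ == "__main__":
    print(sorted(sc_outcomes(THREADS)))
```

Under this sequentially consistent model the outcome `r0 = 0, r1 = 0` never appears; a model that allows it (as store buffering on real hardware can) is weaker than sequential consistency. An ELT, as described in the abstract, would additionally include system-level and hardware-level VM operations in the test.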

    Software and hardware methods for memory access latency reduction on ILP processors

    While microprocessors have doubled their speed every 18 months, performance improvement of memory systems has continued to lag behind. To address the speed gap between CPU and memory, a standard multi-level caching organization has been built for fast data accesses before the data have to be accessed in the DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality will be automatically exploited. The effective use of the memory hierarchy mainly depends on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit data locality and to significantly reduce memory access latency.

    We first present a case study at the application level that restructures memory-intensive programs by utilizing program-specific knowledge. This study identifies the problem of bit-reversals, a set of data reordering operations extensively used in scientific computing programs such as FFT, whose special data access pattern can cause severe cache conflicts. We propose several software methods, including padding and blocking, to restructure the program to reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems.

    The access latency to the DRAM core has become increasingly long relative to CPU speed, causing memory accesses to be an execution bottleneck. In order to reduce the frequency of DRAM core accesses and thereby shorten the overall memory access latency, we have conducted three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance role and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme to reduce or even eliminate row buffer conflicts. Second, we propose a fine-grain priority scheduling scheme to reorder the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives associated with ILP processors. We then propose a new memory hierarchy integration that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-level L3 cache for memory-intensive applications.

    Memory access latency has become a major performance bottleneck for memory-intensive applications. As long as DRAM technology retains its position as the most cost-effective choice for main memory, the memory performance problem will continue to exist. The studies conducted in this dissertation attempt to address this important issue. Our proposed software and hardware schemes are effective and applicable, and can be directly used in real-world memory system designs and implementations. Our studies also provide guidance for application programmers to understand memory performance implications, and for system architects to optimize memory hierarchies.
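
The bit-reversal conflict problem mentioned in this abstract can be illustrated with a small sketch. The code below computes a bit-reversal permutation and then counts how many sets of a toy direct-mapped cache the destination addresses fall into, with and without padding. The cache model, the padding layout (a pad of `pad` elements after every `block` elements), and all function names are assumptions made for illustration; they are not the dissertation's actual method or parameters.

```python
def bit_reverse(i, bits):
    """Return i with its lowest `bits` bits reversed."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversal_permute(data):
    """Return a copy of `data` reordered so that out[bit_reverse(i)] = data[i]."""
    n = len(data)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    out = [None] * n
    for i in range(n):
        out[bit_reverse(i, bits)] = data[i]
    return out


def distinct_sets(dests, line_elems, num_sets, block=None, pad=0):
    """Count the cache sets touched by element indices `dests` in a toy
    direct-mapped cache. If `block` is given, the array is assumed to be laid
    out with `pad` unused elements after every `block` elements."""
    sets = set()
    for a in dests:
        if block is not None:
            a += (a // block) * pad
        sets.add((a // line_elems) % num_sets)
    return len(sets)


if __name__ == "__main__":
    print(bit_reversal_permute(list(range(8))))
    # Destinations written while reading 8 consecutive source elements.
    dests = [bit_reverse(i, 16) for i in range(8)]
    print(distinct_sets(dests, line_elems=8, num_sets=64))
    print(distinct_sets(dests, line_elems=8, num_sets=64, block=8192, pad=8))
```

In this toy configuration, the eight destinations are all multiples of 8192 elements apart and collide in a single cache set; shifting each 8192-element block by one extra cache line spreads them over eight sets. Real caches are set-associative and the effective padding depends on the machine, so the numbers here only show the mechanism.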

    High performance large sparse PDEs with parabolic and elliptic types using AGE method on DPCS

    This paper presents a computational analysis of three case studies using a parallelized Alternating Group Explicit (AGE) solver. Based on a (2×2) block system and a splitting strategy, AGE with the Douglas-Rachford and Brian variants is applied to simulate large sparse PDE applications of parabolic and elliptic types. The applications are the heat equation, food dehydration for preservation, and breast cancer growth. The AGE method has proved to be stable and suitable for parallel computing because its computations can be carried out separately and independently. The performance of AGE is compared with classical iterative methods such as the Red Black Gauss Seidel (RBGS) and Jacobi (JB) methods. Since the PDE applications are large sparse problems, we apply the AGE method in three different applications with three different mathematical models. The parallel implementation is based on the SIMD model and supported by a distributed memory architecture. Numerical analysis and parallel performance indicators are used to validate the superiority of the parallel AGE method in terms of execution time, speedup, efficiency and effectiveness. As a result, the numerical analysis and parallel evaluation show AGE to be effective for solving the three case studies, reducing data storage accesses and minimizing communication time on a distributed parallel computer system.
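
As context for the comparison in this abstract, here is a minimal sketch of the Red Black Gauss Seidel baseline (not the AGE method itself) applied to a 1D elliptic model problem, -u'' = f with Dirichlet boundaries on a uniform grid. The model problem, the function name `rbgs_poisson_1d`, and its parameters are assumptions chosen for illustration and are not taken from the paper.

```python
def rbgs_poisson_1d(f, h, sweeps, left=0.0, right=0.0):
    """Approximate the solution of -u'' = f at len(f) interior grid points
    with spacing h, using Red Black Gauss Seidel sweeps.

    Returns the full grid including both boundary values."""
    n = len(f)
    u = [left] + [0.0] * n + [right]
    for _ in range(sweeps):
        # First update odd-indexed ("red") points, then even-indexed ("black").
        # Points of one colour depend only on points of the other colour,
        # so each half-sweep could be executed in parallel.
        for start in (1, 2):
            for i in range(start, n + 1, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i - 1])
    return u


if __name__ == "__main__":
    n = 7
    h = 1.0 / (n + 1)
    # With f = 2 and zero boundaries the exact solution is u(x) = x (1 - x).
    u = rbgs_poisson_1d([2.0] * n, h, sweeps=500)
    for i, value in enumerate(u):
        x = i * h
        print(f"x={x:.3f}  u={value:.6f}  exact={x * (1 - x):.6f}")
```

The colouring is what makes this baseline parallel-friendly: within one half-sweep no updated point reads another point of the same colour. The abstract attributes a similar independence property to AGE's block updates.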