Abstract Models of Memory Management
Most specifications of garbage collectors concentrate on the low-level algorithmic details of how to find and preserve accessible objects. Often, they focus on bit-level manipulations such as "scanning stack frames," "marking objects," "tagging data," etc. While these details are important in some contexts, they often obscure the more fundamental aspects of memory management: what objects are garbage and why? We develop a series of calculi that are just low-level enough that we can express allocation and garbage collection, yet are sufficiently abstract that we may formally prove the correctness of various memory management strategies. By making the heap of a program syntactically apparent, we can specify memory actions as rewriting rules that allocate values on the heap and automatically dereference pointers to such objects when needed. This formulation permits the specification of garbage collection as a relation that removes portions of the heap without affecting the outcome of the evaluation. Our high-level approach allows us to specify in a compact manner a wide variety of memory management techniques, including standard trace-based garbage collection (i.e., the family of copying and mark/sweep collection algorithms), generational collection, and type-based, tag-free collection. Furthermore, since the definition of garbage is based on the semantics of the underlying language instead of the conservative approximation of inaccessibility, we are able to specify and prove the idea that type inference can be used to collect some objects that are accessible but never used.
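The core idea, that collection is simply deleting heap bindings that cannot affect evaluation, can be illustrated with a small sketch. This is not the paper's calculus: the `Heap`, `Value`, and `collect` names are illustrative, and reachability here stands in for the more general, semantics-based notion of garbage the paper works with.

```cpp
#include <map>
#include <set>
#include <vector>

// A heap maps abstract locations to values; a value may reference
// other locations (think cons cells or closures).
using Loc = int;
struct Value { std::vector<Loc> refs; };
using Heap = std::map<Loc, Value>;

// Locations reachable from the roots of the running program.
std::set<Loc> reachable(const Heap& h, const std::vector<Loc>& roots) {
    std::set<Loc> seen;
    std::vector<Loc> work(roots);
    while (!work.empty()) {
        Loc l = work.back(); work.pop_back();
        if (!seen.insert(l).second) continue;  // already visited
        auto it = h.find(l);
        if (it == h.end()) continue;
        for (Loc r : it->second.refs) work.push_back(r);
    }
    return seen;
}

// "Collection" is just a smaller heap: every binding outside the kept
// set is removed, and evaluation of the program is unaffected.
Heap collect(const Heap& h, const std::vector<Loc>& roots) {
    std::set<Loc> keep = reachable(h, roots);
    Heap out;
    for (const auto& [loc, v] : h)
        if (keep.count(loc)) out.emplace(loc, v);
    return out;
}
```

Note that under the paper's semantics-based definition of garbage, a correct collector may drop even some reachable bindings (those never used again), which a purely reachability-based `collect` like this one cannot justify.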
Divided we stand: Parallel distributed stack memory management
We present an overview of the stack-based memory management techniques that we used in our non-deterministic and-parallel Prolog systems: &-Prolog and DASWAM. We believe that the problems associated with non-deterministic and-parallel systems are more general than those encountered in or-parallel and deterministic and-parallel systems, which can be seen as subsets of this more general case. We build on the previously proposed "marker scheme", lifting some of the restrictions associated with the selection of goals while keeping (virtual) memory consumption down. We also review some of the other problems associated with the stack-based management scheme, such as the handling of forward and backward execution, cut, and roll-backs.
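As a rough intuition for what a marker scheme looks like (the structures below are illustrative, not DASWAM's actual data structures): markers delimit the stack sections owned by individual and-parallel goals, so a section whose goal has finished can be reclaimed independently, or flagged as trapped until the sections above it are also released.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a worker's stack partitioned by markers, one
// section per and-parallel goal scheduled onto this stack.
struct Marker {
    int goal_id;        // which parallel goal owns the section
    std::size_t base;   // offset where the section starts
    bool released;      // goal done, but section may still be trapped
};

struct WorkerStack {
    std::vector<char> data;      // the actual stack storage
    std::vector<Marker> markers; // section boundaries, oldest first

    std::size_t push_section(int goal_id) {
        markers.push_back({goal_id, data.size(), false});
        return markers.size() - 1;
    }

    // On backward execution, a section can only be popped if it is
    // topmost; otherwise it stays as a trapped "hole" until every
    // section above it has also been released.
    void release_section(std::size_t idx) {
        markers[idx].released = true;
        while (!markers.empty() && markers.back().released) {
            data.resize(markers.back().base);  // pop the section
            markers.pop_back();
        }
    }
};
```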
Garbage collection auto-tuning for Java MapReduce on Multi-Cores
MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests.
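A minimal sketch of the flavor of auto-tuning described (not MRJ's actual learner; the feature names and decision rule are assumptions): observe simple runtime features of a benchmark run, then pick the next heap size from a model trained offline over prior runs.

```cpp
#include <cstddef>

// Features observed while running a MapReduce benchmark; illustrative.
struct GcFeatures {
    double alloc_rate_mb_s;   // allocation pressure
    double live_ratio;        // live data / heap size after GC
    double gc_time_fraction;  // fraction of run time spent in GC
};

// Stand-in for a model learned offline: here a hand-written decision
// stump cascade, where a real system would consult e.g. a decision
// tree or nearest-neighbour lookup over previously measured runs.
std::size_t tuned_heap_mb(const GcFeatures& f, std::size_t current_mb) {
    if (f.gc_time_fraction > 0.10)         // GC-bound: grow the heap
        return current_mb * 2;
    if (f.gc_time_fraction < 0.02 &&
        f.live_ratio < 0.30)               // over-provisioned: shrink
        return current_mb / 2;
    return current_mb;                     // otherwise leave it alone
}
```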
QTM: computational package using MPI protocol for quantum trajectories method
The Quantum Trajectories Method (QTM) is one of the frequently used methods for studying open quantum systems. The main idea of this method is the evolution of wave functions which describe the system as functions of time. Then, so-called quantum jumps are applied at randomly selected points in time. The obtained system state is called a trajectory. After averaging many single trajectories, we obtain an approximation of the behavior of a quantum system. This also allows us to use parallel computation methods. In the article, we discuss the QTM package, which is supported by the MPI technology. Using MPI allows the trajectories to be calculated and averaged in parallel, so the calculations take less time. In spite of using the C++ programming language, the presented solution is easy to use and does not require any advanced programming techniques. At the same time, it offers higher performance than other packages realizing the QTM. This is especially important for harder computational tasks, and the use of MPI improves performance on particular problems that can be solved in the field of open quantum systems.
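The division of labour described, computing independent trajectories on separate ranks and then averaging, follows a standard MPI pattern; a minimal sketch (not the QTM package's actual code, and `simulate_trajectory` is a placeholder solver):

```cpp
#include <mpi.h>
#include <cmath>
#include <vector>

// Stand-in for the real solver: a damped observable with a
// seed-dependent "quantum jump" time, purely illustrative.
std::vector<double> simulate_trajectory(int seed, int n_steps) {
    std::vector<double> obs(n_steps);
    int jump = seed % n_steps;  // pseudo-random jump point
    for (int i = 0; i < n_steps; ++i)
        obs[i] = std::exp(-0.01 * i) * (i < jump ? 1.0 : 0.5);
    return obs;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_traj = 10000, n_steps = 512;
    std::vector<double> local(n_steps, 0.0);

    // Each rank computes its share of the independent trajectories.
    for (int t = rank; t < n_traj; t += size) {
        std::vector<double> traj = simulate_trajectory(t, n_steps);
        for (int i = 0; i < n_steps; ++i) local[i] += traj[i];
    }

    // Sum the partial results on rank 0, then divide to average.
    std::vector<double> total(n_steps, 0.0);
    MPI_Reduce(local.data(), total.data(), n_steps,
               MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (double& x : total) x /= n_traj;

    MPI_Finalize();
    return 0;
}
```

Because the trajectories are mutually independent, this pattern needs no communication beyond the final reduction, which is why the method parallelizes so well.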
The economics of garbage collection
This paper argues that economic theory can improve our understanding of memory management. We introduce the allocation curve as an analogue of the demand curve from microeconomics. An allocation curve for a program characterises how the amount of garbage collection activity required during its execution varies in relation to the heap size associated with that program. The standard treatment of microeconomic demand curves (shifts and elasticity) can be applied directly and intuitively to our new allocation curves. As an application of this new theory, we show how allocation elasticity can be used to control the heap growth rate for variable-sized heaps in Jikes RVM.
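By analogy with price elasticity of demand, allocation elasticity can be read as the percentage change in GC activity per percentage change in heap size; a small worked sketch (the threshold and growth factors below are assumptions for illustration, not the actual Jikes RVM policy):

```cpp
#include <cmath>

// Elasticity of GC activity A with respect to heap size H, measured
// between two operating points, analogous to arc elasticity of demand:
//   e = (dA / A) / (dH / H)
double allocation_elasticity(double h0, double a0, double h1, double a1) {
    return ((a1 - a0) / a0) / ((h1 - h0) / h0);
}

// Illustrative heap-growth policy: where the curve is elastic
// (|e| > 1), a little extra heap buys a large drop in GC work, so
// grow aggressively; where it is inelastic, growing barely helps.
double next_heap_size(double h, double elasticity) {
    return std::fabs(elasticity) > 1.0 ? h * 1.5 : h * 1.1;
}
```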
Lock-Free and Practical Deques using Single-Word Compare-And-Swap
We present an efficient and practical lock-free implementation of a concurrent deque that is disjoint-parallel accessible and uses atomic primitives which are available in modern computer systems. Previously known lock-free deque algorithms are either based on atomic synchronization primitives that are not available on current hardware, implement only a subset of the functionality, or are not designed for disjoint accesses. Our algorithm is based on a doubly linked list and requires only single-word compare-and-swap atomic primitives, even for dynamic memory sizes. We have performed an empirical study using full implementations of the most efficient lock-free deque algorithms known. For systems with low concurrency, the algorithm by Michael shows the best performance. However, as our algorithm is designed for disjoint accesses, it performs significantly better on systems with high concurrency and a non-uniform memory architecture.
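The single-word compare-and-swap the algorithm relies on corresponds to C++'s `std::atomic<T*>::compare_exchange_weak`. The sketch below shows the CAS retry-loop idiom on the simplest possible structure, a Treiber-style stack; it is not the paper's deque, which layers prev-pointer fix-up and memory reclamation on top of this same primitive.

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> top{nullptr};

// Push with a single-word CAS: if another thread changed `top` since
// we read it, the CAS fails, reloads the current value into n->next,
// and we retry.
void push(int v) {
    Node* n = new Node{v, top.load(std::memory_order_relaxed)};
    while (!top.compare_exchange_weak(n->next, n,
               std::memory_order_release,
               std::memory_order_relaxed)) {}
}

bool pop(int& out) {
    Node* t = top.load(std::memory_order_acquire);
    while (t && !top.compare_exchange_weak(t, t->next,
                     std::memory_order_acquire,
                     std::memory_order_acquire)) {}
    if (!t) return false;
    out = t->value;
    // NOTE: freeing t here is unsafe without a reclamation scheme
    // (hazard pointers, epochs); deliberately leaked in this sketch.
    return true;
}
```

A full deque must apply this idiom at both ends of a doubly linked list while keeping the two link directions consistent, which is precisely the hard part the paper solves with only single-word CAS.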
PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development
This paper describes PlinyCompute, a system for development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. However, in the small, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are constructed on top of the Java Virtual Machine (JVM) and hence must at least partially cede performance-critical concerns such as memory management (including layout and de/allocation) and virtual method/function dispatch to the JVM. This hybrid approach, declarative in the large while trusting the programmer's ability to use the PC object model efficiently in the small, results in a system that is ideal for the development of reusable, data-intensive tools and libraries. Through extensive benchmarking, we show that implementing complex object manipulation and non-trivial, library-style computations on top of PlinyCompute can result in a speedup of 2x to more than 50x compared to equivalent implementations on Spark.
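The performance argument, keeping layout and de/allocation under the programmer's control rather than the JVM's, is the same one that motivates arena allocation in C++; a minimal sketch of that style (illustrative only, not the actual PC object model API):

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// A bump-pointer arena: objects are placement-new'd into one flat
// buffer, giving dense layout and O(1) bulk deallocation, the kind of
// control a JVM-hosted system cedes to its garbage collector.
// Destructors are not run on reset; intended for trivially
// destructible types.
class Arena {
    std::vector<char> buf_;
    std::size_t used_ = 0;
public:
    explicit Arena(std::size_t bytes) : buf_(bytes) {}

    template <typename T, typename... Args>
    T* make(Args&&... args) {
        // Align the bump pointer for T, then carve out sizeof(T).
        std::size_t off = (used_ + alignof(T) - 1) & ~(alignof(T) - 1);
        if (off + sizeof(T) > buf_.size()) throw std::bad_alloc{};
        used_ = off + sizeof(T);
        return new (buf_.data() + off) T(std::forward<Args>(args)...);
    }

    void reset() { used_ = 0; }  // releases everything at once
};

struct Point { double x, y; };

int main() {
    Arena a(1 << 20);
    Point* p = a.make<Point>(Point{1.0, 2.0});
    (void)p;
    a.reset();  // all arena objects released in one step
}
```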
LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic [Extended Version]
CORRECTION: The authors for entry [4] in the references should have been "E. S. Chung, J. C. Hoe, and K. Mai".

Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays, and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management. Two uses of FPGA scratchpads are analyzed: buffer management in an H.264 decoder and memory management within a processor microarchitecture timing model.
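The "plug-in replacement" claim amounts to scratchpads exposing the same request/response interface as an on-die RAM block; a rough C++ rendering of that idea (LEAP itself targets hardware description, and these type names are illustrative):

```cpp
#include <cstdint>
#include <unordered_map>

// The common interface both an on-die RAM block and a scratchpad
// expose: the client cannot tell which one it is talking to.
struct MemoryIfc {
    virtual void write(uint64_t addr, uint64_t data) = 0;
    virtual uint64_t read(uint64_t addr) = 0;
    virtual ~MemoryIfc() = default;
};

// On-die RAM block: small, fast, fixed capacity.
struct BlockRam : MemoryIfc { /* direct array access ... */ };

// Scratchpad: same interface, but accesses go through a private
// cache, then a shared on-board cache, then the large backing store.
struct Scratchpad : MemoryIfc {
    std::unordered_map<uint64_t, uint64_t> cache_;  // stand-in cache
    void write(uint64_t addr, uint64_t data) override {
        cache_[addr] = data;           // real impl: write to a cache
    }                                  // level, evict to backing store
    uint64_t read(uint64_t addr) override {
        auto it = cache_.find(addr);   // hit: served locally
        return it != cache_.end() ? it->second
                                  : 0; // miss: fetch from backing store
    }
};
```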
DSM64: A Distributed Shared Memory System in User-Space
This paper presents DSM64: a lazy release consistent software distributed shared memory (SDSM) system built entirely in user-space. The DSM64 system is capable of executing threaded applications implemented with pthreads on a cluster of networked machines without any modifications to the target application. The DSM64 system features a centralized memory manager [1] built atop Hoard [2, 3]: a fast, scalable, and memory-efficient allocator for shared-memory multiprocessors.
In my presentation, I present an SDSM system written in C++ for Linux operating systems. I discuss a straightforward approach to implementing SDSM systems in a Linux environment using system-provided tools and concepts available entirely in user-space. I show that the SDSM system presented in this paper is capable of resolving page faults over a local area network in as little as 2 milliseconds.
In my analysis, I present the following. I compare the performance characteristics of a matrix multiplication benchmark using various memory coherency models. I demonstrate that the matrix multiplication benchmark using an LRC model performs orders of magnitude quicker than the same application using a stricter coherency model. I show the effect of the coherency model on memory access patterns and memory contention. I compare the effects of different locking strategies on execution speed and memory access patterns. Lastly, I provide a comparison of the DSM64 system to a non-networked version using a system-provided allocator.
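User-space SDSM systems typically detect accesses to remote pages by mprotecting them and catching the resulting SIGSEGV; a stripped-down sketch of that standard mechanism (the handler and `fetch_page_from_network` are illustrative, not DSM64's code):

```cpp
#include <csignal>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

static char* shared_region = nullptr;
static long page_size = 0;

// Illustrative stand-in: a real SDSM asks the page's current owner
// over the network for its latest contents (DSM64 reports this
// completing in as little as 2 ms on a LAN).
static void fetch_page_from_network(void* page) {
    std::memset(page, 0, page_size);  // pretend we received the data
}

static void on_fault(int, siginfo_t* si, void*) {
    // Round the faulting address down to its page.
    char* page = (char*)((uintptr_t)si->si_addr & ~(page_size - 1));
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  // allow access
    fetch_page_from_network(page);                      // fill it in
    // Returning restarts the faulting instruction, which now succeeds.
}

int main() {
    page_size = sysconf(_SC_PAGESIZE);
    // PROT_NONE: any touch of this region traps into our handler.
    shared_region = (char*)mmap(nullptr, 16 * page_size, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa{};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_fault;
    sigaction(SIGSEGV, &sa, nullptr);

    shared_region[42] = 'x';  // faults; handler maps the page in
    return 0;
}
```

The consistency model then determines when a fetched page must be re-invalidated; lazy release consistency defers that work to synchronization points, which is why it so strongly outperforms stricter models in the benchmarks described.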