An object-aware memory architecture  by Wright, Greg et al.
Science of Computer Programming 62 (2006) 145–163
www.elsevier.com/locate/scico
An object-aware memory architecture
Greg Wright∗, Matthew L. Seidl, Mario Wolczko
Sun Microsystems, Inc., Menlo Park, CA 94025, USA
Received 29 April 2005; received in revised form 22 December 2005; accepted 24 February 2006
Available online 11 July 2006
Abstract
Despite its dominance, object-oriented computation has received scant attention from the architecture community. We propose
a novel memory architecture that supports objects and garbage collection (GC). Our architecture is co-designed with a Java™
Virtual Machine (JVM™)¹ to improve the functionality and efficiency of heap memory management. The architecture is based
on an address space for objects accessed using object IDs mapped by a translator to physical addresses. To support this, the
system includes object-addressed caches, a hardware GC barrier to allow in-cache GC of objects, and an exposed cache structure
cooperatively managed by the JVM. These extend a conventional architecture, without compromising compatibility or performance
for legacy binaries.
Our innovations enable various improvements such as: a novel technique for parallel and concurrent garbage collection, without
requiring any global synchronization; an in-cache garbage collector, which never accesses main memory; concurrent compaction of
objects; and elimination of most GC store barrier overheads. We compare the behavior of our system against that of a conventional
generational garbage collector, both with and without an explicit allocate-in-cache operation which eliminates many write misses.
Our scheme additionally trades L2 misses for in-cache operations, and provides the mapping indirection required for concurrent
compaction.
© 2006 Sun Microsystems Inc. Published by Elsevier B.V. All rights reserved.
Keywords: Garbage collection; Virtual machines; Object memory
1. Introduction and motivation
Object-oriented programming is the dominant software development paradigm, and has been so for the last decade.
Object-oriented programming languages, such as Java™ and C#, have converged on a common object model whose
roots can be found in Smalltalk [1]. For our purposes we can summarize the model with three properties. Firstly,
objects are small, cheap and plentiful. Secondly, object storage is reclaimed automatically through garbage collection
(GC). Thirdly, although object references are unforgeable, all permissions are managed through checks done at class-
loading or compile time (the Java language combines static type safety with package-based access controls). Although
our work focuses on Java it is equally applicable to other languages with a similar model; we note that it may not
provide any benefits for some other language models (most notably C++ and the unsafe subset of C#, which allow
arbitrary pointer arithmetic).
∗ Corresponding author. Tel.: +1 650 786 6568.
E-mail addresses: Greg.Wright@Sun.com (G. Wright), Matthew.Seidl@Sun.com (M.L. Seidl), Mario.Wolczko@Sun.com (M. Wolczko).
¹ Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries.
doi:10.1016/j.scico.2006.02.007
Computer architecture has mostly ignored this object model. In particular the requirements of, and constraints
on, the memory system are quite different from those of a flat, paged virtual memory — for example, small objects
reduce the effectiveness of long cache lines and TLBs [2] because inter-object locality is not assured. Memory system
performance is important in the commercial server market, where Java applications may require many gigabytes of
heap space. The primary architectural interface to Java applications is at the virtual machine (JVM) level, and this
gives considerable freedom to innovate across the hardware/software boundary. We are investigating how hardware
support for objects, co-designed with the virtual machine, can lead to better memory system performance and also
enable new memory management algorithms which cannot currently be implemented efficiently in software.
We assume that changes can be made to the virtual machine, operating system and hardware, but we do not consider
a completely ‘clean sheet’ design: legacy applications must run unchanged on the same system. We therefore extend
a conventional instruction set architecture where appropriate, retaining backwards compatibility. The architectural
modifications are only support for the software algorithms; we propose hardware structures only where they offer
functionality or efficiency that software cannot match, and leave as much flexibility as possible for later software
innovation. In particular, we are not proposing hardware implementations of complete garbage collectors [3,4],
bytecode execution [5,6], or fine-grained protection [7–9].
Our approach is centered around two ideas: support for objects as first-class entities in the memory system,
and collaboration between hardware and software for memory management. The fundamentals of the proposed
architecture are presented in Section 2, and GC-related extensions are described in Section 3. Comparisons to related
work are in Section 4, then Section 5 describes our evaluation methodology and Section 6 presents the results of our
initial evaluation. In Section 7 we present conclusions, and Section 8 suggests future work.
2. A memory hierarchy for objects
Current production JVMs use a direct pointer representation for object references: each object is referenced by its
base virtual address. The conventional virtual memory system maps virtual to physical addresses. This is in contrast
to early virtual machines, such as Smalltalk VMs of the ’80s [1], which used an indirect representation: objects were
referred to by a location-independent Object ID (OID), actually an index into an object table. An object’s object
table entry (OTE) contained the object’s virtual memory address. The direct representation saves a (dependent) load
instruction on field access by the mutator (application code, as distinct from the garbage collector); the indirect
representation makes relocation easier because an object’s memory address is stored in only one place. Relocation
is an important component of a variety of memory management functions, most notably heap compaction which
overcomes fragmentation. However, since the late ’80s direct access has been the method of choice, because the
indirection overhead of the object table has been too high [10].
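The trade-off between the two representations can be illustrated with a minimal Python sketch. The `IndirectHeap` class, its dictionary-based memory and all method names are assumptions made purely for illustration; they are not taken from any particular virtual machine.

```python
class IndirectHeap:
    """Toy heap in which references are OIDs resolved through an object table."""

    def __init__(self):
        self.memory = {}        # address -> list of field values (hypothetical model)
        self.object_table = {}  # OID -> address, i.e. the object table entry (OTE)
        self.next_oid = 0

    def allocate(self, address, fields):
        oid = self.next_oid
        self.next_oid += 1
        self.memory[address] = list(fields)
        self.object_table[oid] = address
        return oid

    def load_field(self, oid, index):
        address = self.object_table[oid]     # extra, dependent load of the OTE
        return self.memory[address][index]   # the load a direct pointer would also need

    def relocate(self, oid, new_address):
        old_address = self.object_table[oid]
        self.memory[new_address] = self.memory.pop(old_address)
        self.object_table[oid] = new_address  # the only place the address is stored


heap = IndirectHeap()
point = heap.allocate(0x1000, [3, 4])
holders = [point, point, point]   # three references to the same object
heap.relocate(point, 0x8000)      # none of the holders needs updating
values = [heap.load_field(ref, 1) for ref in holders]
```

With a direct pointer representation, `relocate` would instead have to find and rewrite every entry of `holders`, while `load_field` would need only the second of its two loads.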
The temporal properties of memory management algorithms are important when scaling to large heaps: it is
desirable to have good average mutator throughput, i.e., low memory management overhead, and also to ensure
that pauses, during which mutator work is suspended, are kept within acceptable bounds. Concurrent, or mostly-
concurrent [11], garbage collection is very attractive for avoiding lengthy pauses when operating on multi-gigabyte
heaps: here the collector runs at the same time as the mutator, with some synchronization mechanisms (barriers,
mutator thread suspension, etc.) to ensure correctness. Concurrent compaction is more difficult: an object must be
relocated with apparent atomicity to the mutator. With a direct pointer representation this means updating all references
to an object “at once” (and is strictly harder than merely determining that an object is reachable, which requires only
reading a single reference). An indirect representation implemented on a conventional architecture still requires extra
synchronization in the mutator because the load–load or load–store indirection through the object table is not atomic.
Architectural support for objects can provide the best of both worlds: fast access to objects in the common case,
with easy relocation. With a little hardware support we can also perform synchronized relocation with no overhead in the
mutator: this allows truly concurrent compaction. With concurrent relocation we can move objects while all mutator
threads are active, thus allowing various optimizations to be performed on the heap and improving the maximum heap
sizes that can be handled without inducing unacceptable garbage collection pauses.
Fig. 1. Finding a physically-tagged datum in a conventional cache.
2.1. Object-addressed caches
In a conventional system the instruction to load an object field takes two source operands: the reference to the object
(i.e., its base virtual address) and the field offset within the object. The pointer and offset are summed to produce the
field’s virtual address which is fed to the TLB and (physically-tagged) L1 cache (Fig. 1).
An object-addressed cache [12] instead has cache lines tagged directly with (OID, offset) addresses, and thus
contains parts of objects rather than blocks of physical memory (Fig. 2). This organization is similar to a virtually-
addressed cache, except that the OID and offset bits are concatenated rather than summed (a slight simplification
for now; see Section 2.2 below for details); there is no aliasing of object addresses. An object-load instruction
takes the same two source operands as the conventional load, but rather than being added they bypass the TLB and
go directly to the cache index/tag match hardware. The tags of an object-addressed cache must be longer to handle
the desired (virtual) size of object space. The maximum object size can be restricted: 10 offset bits (256 fields of 4
bytes each) cover the majority of objects in practice, and longer objects can be composed by the virtual machine in
much the same way that a file system provides the illusion of large contiguous files using small disk blocks. Object
addresses and tags may therefore be only a few bits longer than the physical addresses already handled in upcoming
systems. The OID address space is a shared (non-virtualized) resource, but multiple application-level processes may
reside within a single JVM [13]. We assume that the operating system coordinates the partitioning of the OID space
if several JVMs are required. Alternatively, the OID space could be virtualized with the addition of extra hardware, at
some penalty in cost and performance.
Fig. 2. An object-addressed cache (simplified).
Each object cache line contains part of only one object; objects can span multiple cache lines, but most objects will
occupy only one (given 64-byte cache lines). Some space at the ends of lines will be unused (internal fragmentation).
The fragmentation has two potentially negative effects on the system. Firstly, it reduces the effective cache capacity:
with denser packing more objects would fit into the cache. Secondly, it removes the possibility that a cache miss
will prefetch an adjacent object. Both the packing and prefetching rely on spatial locality between the objects: in its
absence the adjacent objects are not used and provide no benefit. In our system, objects are allocated directly in the
cache (Section 2.3); the prefetching effect is only relevant for objects old enough to have been previously evicted
from the cache. Some studies have demonstrated spatial locality between recently-allocated objects but little for older
objects [14], thus we would not expect much benefit from prefetching older objects in any case. Both of these effects
are of course reflected in our results: they are costs which we pay in exchange for the other benefits of our system. An
additional benefit of an object cache is that it precludes false sharing between different objects.
Following our principle that object functionality merely extends an existing system, we allow ordinary physically-
tagged cache lines to coexist in the object cache with the object-tagged cache lines. As long as the tags do not collide,
ensured for example by embedding the object addresses into a higher address range than physical memory (e.g., a set
high-order bit), the cache hardware operates in the usual manner after the address generation stage and is unaware
that it is dealing with objects. We will use ‘extended physical address’ (EPA) to mean an address (bit string) which is
either a physical memory location or an encoded (OID, offset) address.
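As an illustration of how object addresses and physical addresses can share one tag space, the following Python sketch places encoded (OID, offset) pairs above the physical range by setting a high-order bit. The field widths (a 40-bit OID and a 12-bit offset), the position of the flag bit and the function names are assumptions chosen for the example; the index mixing of Section 2.2 is deliberately left out.

```python
OID_BITS = 40      # assumed width of an object ID
OFFSET_BITS = 12   # assumed width of the offset within an object
OBJECT_FLAG = 1 << (OID_BITS + OFFSET_BITS)  # hypothetical high-order "object" bit


def object_epa(oid, offset):
    """Concatenate (OID, offset) and set the high-order bit (no index mixing)."""
    if not 0 <= oid < (1 << OID_BITS):
        raise ValueError("OID out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return OBJECT_FLAG | (oid << OFFSET_BITS) | offset


def physical_epa(physical_address):
    """Physical addresses keep their value and stay below the object flag."""
    if not 0 <= physical_address < OBJECT_FLAG:
        raise ValueError("physical address collides with the object range")
    return physical_address


def is_object_address(epa):
    return (epa & OBJECT_FLAG) != 0
```

Because the two ranges are disjoint, a tag comparison can never confuse an object cache line with a physically-tagged one, which is what lets the cache hardware remain unaware of objects.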
Most cache coherence mechanisms work unchanged when using EPAs instead of straightforward physical
addresses; snoops from other CPUs destined for object addresses will match on the encoded tags and change coherence
states in the usual manner. The unusual cases are cache misses in all levels of the hierarchy, which necessitate a
memory access (see Section 2.4), and the directories in directory-based protocols. Although directory-based object
coherence is possible, details are beyond the scope of this paper.
2.2. Object address encoding
A system implemented exactly as described above would have poor placement of objects within the cache. Non-
fully-associative caches are conventionally indexed using a subset of the (physical) address bits. This has the effect
of distributing consecutive addresses through the cache in the hope of reducing conflicts, and it is cheap to compute.
With objects in the cache the index should neither consist solely of OID bits (the several cache lines of a large
object would then conflict within one cache set), nor only of offset bits (small objects would then be restricted to a
portion of the cache). Instead, we encode (OID, offset) addresses into the EPA range using an invertible mapping which combines
part of the OID and offset bits into the index bits. The inclusion of the offset bits maximally spreads out the cache
lines of a single object, and cache lines from different objects are distributed through the cache by the OID bits. An
example using exclusive-or combination is illustrated in Fig. 3. Indexing by bit extraction will have the desired effect,
and the ‘real’ OID and offset (when needed) can be recovered by inverting the encoding.
Fig. 3. Object address encoding (example with 40 bit OID + 12 bit offset).
Fig. 4. A small object with and without skewing.
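One possible invertible mapping of this kind is sketched below in Python. It is an assumption about how an exclusive-or encoding could work, not a transcription of Fig. 3: the low OID bits are XORed into the line-number bits of the offset, and the cache index is then taken by plain bit extraction. The cache geometry (64 sets) and all names are hypothetical, and the high-order bit that separates object addresses from physical ones is omitted.

```python
LINE_BITS = 6     # 64-byte cache lines
INDEX_BITS = 6    # assumed: 64 cache sets
OFFSET_BITS = 12  # 12-bit offset, so an object spans at most 64 lines
INDEX_MASK = (1 << INDEX_BITS) - 1
LINE_MASK = (1 << LINE_BITS) - 1


def encode(oid, offset):
    """Mix low OID bits into the line-number bits of the offset with XOR."""
    byte_in_line = offset & LINE_MASK
    line_number = offset >> LINE_BITS            # which line of the object
    mixed = line_number ^ (oid & INDEX_MASK)
    return (oid << OFFSET_BITS) | (mixed << LINE_BITS) | byte_in_line


def decode(encoded):
    """Invert the encoding: the OID is stored whole, so the XOR can be undone."""
    oid = encoded >> OFFSET_BITS
    mixed = (encoded >> LINE_BITS) & INDEX_MASK
    line_number = mixed ^ (oid & INDEX_MASK)
    return oid, (line_number << LINE_BITS) | (encoded & LINE_MASK)


def cache_index(encoded):
    """Conventional indexing by bit extraction, applied to the encoded address."""
    return (encoded >> LINE_BITS) & INDEX_MASK


small_object_sets = {cache_index(encode(oid, 0)) for oid in range(64)}
large_object_sets = {cache_index(encode(7, line * 64)) for line in range(64)}
```

In this sketch the first lines of 64 consecutively numbered small objects fall into 64 different sets, and the 64 lines of a single large object are likewise spread over all sets, which are the two properties the text asks of the index function.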
One other operation is performed as part of the encoding. A bifurcated object layout [15,16] supports fast
identification of references for the garbage collector: the object header is in the middle of the object with references at
negative offsets and primitive values at positive offsets (Java’s fields are declared as containing one or the other, unlike
Smalltalk’s). In a straightforward implementation even small objects would occupy two cache lines, an intolerable
overhead. We therefore skew objects within cache lines: the OID contains a small number of bits specifying where
within its cache line the object header (logical offset zero) should lie. Fig. 4 shows how an object is skewed so that the
reference fields fit within the same cache line as the object header. The skew value is added to the requested logical
offset as part of the address generation, when the (OID, offset) address is encoded to form the EPA.
Fig. 5 shows how Fig. 3 is modified to include skew; here three skew bits are taken from the high-order bits of the
OID, allowing skewing to any 8-byte boundary within the 64-byte cache line. The skewing is mostly transparent to
software; only the allocator and garbage collector, which must know how fields are organized within cache lines, are
aware. The desired skew value is constant for a particular class.
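The effect of skewing can be checked with a small Python sketch. It assumes a 40-bit OID whose top three bits hold the skew in units of 8 bytes, and models logical offsets as signed integers; the helper names and the example object (three 4-byte references in front of the header) are hypothetical.

```python
LINE_SIZE = 64
SKEW_UNIT = 8      # three skew bits select one of eight 8-byte boundaries
SKEW_BITS = 3
OID_BITS = 40      # assumed OID width


def make_oid(skew_units, serial):
    """Hypothetical helper: build an OID carrying a skew value in its top bits."""
    assert 0 <= skew_units < (1 << SKEW_BITS)
    return (skew_units << (OID_BITS - SKEW_BITS)) | serial


def skew_of(oid):
    """Skew in bytes, taken from the high-order bits of the OID."""
    return (oid >> (OID_BITS - SKEW_BITS)) * SKEW_UNIT


def line_of(oid, logical_offset):
    """Cache line (relative to the object) that holds a logical byte offset."""
    return (logical_offset + skew_of(oid)) // LINE_SIZE


plain = make_oid(0, 42)     # header at the start of its cache line
skewed = make_oid(2, 42)    # header 16 bytes into its cache line
offsets = (-12, -8, -4, 0, 4)            # three references, header, one primitive
plain_lines = {line_of(plain, off) for off in offsets}
skewed_lines = {line_of(skewed, off) for off in offsets}
```

Without skew the references land in a different line from the header, so the object occupies two cache lines; with a skew of 16 bytes all five fields share a single line.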
Fig. 5. Object address encoding (example with skew).
2.3. Cache management: Object allocation and deallocation
Caches in conventional systems are designed to operate mostly transparently to application code, although some
parts of the operating system may need to be aware of the cache organization. Hints, such as prefetch instructions,
may be provided for performance but are semantically neutral. In our case, co-design of the hardware and virtual
machine allows better communication of intent. For example, when allocating a new object there is no need to fetch
the contents from memory into cache—the constructor will overwrite all the data. A zero-and-allocate instruction, like
the PowerPC DCBZ (data cache block zero) [17], requests both that a cache line be allocated writable in the cache
and that the contents be set to zero. Similarly, once an object is known to be garbage there is no need to write back its
contents from the cache to memory; freeing its cache lines immediately will prevent unnecessary memory traffic and
enable better placement of new objects in the cache. Variants of both of these operations were proposed in [18].
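The traffic saved by these two operations can be seen in a toy model. The Python sketch below counts memory reads and writes for a single line in a write-back cache; the `TinyCache` class and its method names are hypothetical and ignore capacity, associativity and coherence.

```python
class TinyCache:
    """Counts memory traffic for a toy write-back cache (hypothetical model)."""

    def __init__(self):
        self.lines = {}          # tag -> dirty flag
        self.memory_reads = 0
        self.memory_writes = 0

    def store_with_fetch(self, tag):
        # Conventional write miss: fetch the line before modifying it.
        if tag not in self.lines:
            self.memory_reads += 1
        self.lines[tag] = True

    def zero_and_allocate(self, tag):
        # The line is created zeroed and writable; nothing is read from memory.
        self.lines[tag] = True

    def evict(self, tag):
        # Dirty lines must be written back when they leave the cache.
        if self.lines.pop(tag):
            self.memory_writes += 1

    def invalidate(self, tag):
        # Known garbage: drop the line without writing it back.
        self.lines.pop(tag)


conventional = TinyCache()
conventional.store_with_fetch("new object")
conventional.evict("new object")

cooperative = TinyCache()
cooperative.zero_and_allocate("new object")
cooperative.invalidate("new object")
```

For an object that is allocated and dies in the cache, the conventional sequence costs one read and one write to memory, while the cooperative sequence costs neither.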
2.4. Translation
An object cache allows fast access to objects, in the common (hit) case, using only the location-independent
object ID. In the case of a cache miss or eviction the system must still provide a means of retrieving or storing the
object’s state; translation involves looking up the object’s physical address in the object table and then reconciling the
contents of the object cache line with the in-memory representation of the object. We accomplish this with a hardware
mechanism logically interposed between the cache hierarchy and main memory (Fig. 6). The translator intercepts
cache misses to object EPAs and, using a simple state machine, reads the object table entry to get the physical address
and generates (cache coherent) fetches of the necessary physical cache lines; the object table lookup is similar to a
hardware page-table walk. A single object cache line may overlap one or two physical cache lines—objects in memory
are word-aligned, not cache-line aligned, to avoid the loss of heap memory to fragmentation. Heap fragmentation is
less acceptable than fragmentation in the caches because the caches can contain only a small fraction of a multi-gigabyte
heap, and main-memory DRAM accounts for a very significant share of a machine’s power and cost budgets.
The object table entry also contains size information so that the ends of objects are handled correctly. A logical view
of the object table is shown in Fig. 7, but it is not a physically separate structure; as with conventional page tables, the
object table is stored in coherent memory and the VM manipulates it with loads and stores.
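The address arithmetic performed by the translator can be sketched as follows. This Python fragment is a simplified model, assuming 64-byte lines and ignoring skew; `ote_address` and `ote_size` stand for the physical address and size fields of an object table entry, and the function name is hypothetical.

```python
LINE_SIZE = 64


def physical_lines(ote_address, ote_size, object_line):
    """Return the physical cache-line addresses backing one object cache line.

    ote_address and ote_size model the physical address and size held in the
    object table entry; skewing is ignored in this sketch.
    """
    start = ote_address + object_line * LINE_SIZE
    end = min(start + LINE_SIZE, ote_address + ote_size)   # stop at the object's end
    if start >= end:
        raise ValueError("object cache line lies beyond the end of the object")
    first = (start // LINE_SIZE) * LINE_SIZE
    last = ((end - 1) // LINE_SIZE) * LINE_SIZE
    return list(range(first, last + LINE_SIZE, LINE_SIZE))
```

For example, a 64-byte object stored at the line-aligned address 0x1000 needs one physical line, whereas the same object stored at the word-aligned address 0x1010 straddles the lines at 0x1000 and 0x1040. The size field matters at the end of an object: a 16-byte object at 0x1030 stops at the line boundary and needs only the line at 0x1000.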
It should be emphasized that the object cache line representation of an object is not kept coherent with the view
of its physical memory representation as seen through conventional memory accesses. Coherence works completely
independently on the object cache lines and physically-tagged cache lines; the translator only reconciles the two when
an object cache line is constructed on a cache miss or written back on eviction. The mutator’s view of the object is
solely through the object cache lines. The virtual machine must take care not to modify an object using conventional
loads and stores, which is easily achieved by never mapping the heap into the conventional virtual address space. The
CPUs have coherent access to an object’s physical storage only for internal VM functions such as relocation.
Fig. 6. Schematic of the proposed architecture. The unshaded boxes represent new units; dashed lines are new data paths. Upper diagram represents
a 2-CPU single-chip system, as simulated. Lower diagram shows extensions: multiple-CMP system, caches outside the GC boundary.
Fig. 7. The organization of an object within the cache and in memory, and its associated object table entry.
2.5. Concurrent object relocation
A long-running object system must relocate live objects to avoid heap fragmentation; this heap compaction should
be concurrent or incremental to avoid lengthy mutator pauses. As discussed above, indirect references through an
object table can make relocation much easier because the physical address of an object is stored in only one place.
Here we describe how to synchronize translation and relocation in the object memory hierarchy so that mutation, with
its implied translations, can proceed concurrently with compaction.
The basic model for copying relocation is that the compactor reads the contents of the physical memory backing
the object, copies those data to the new location, and then updates the object table entry to point to the new area.
The reading and writing are both performed using ordinary loads and stores operating on physical addresses; these
operations do not pass through the translator. While the copying is taking place there are two concurrency problems
with respect to translation: firstly, a write-translation (cache eviction) of an object cache line must be reflected in the
copy, and, secondly, a read-translation must use the old location until the copy is complete. The read problem is easier:
in the absence of write translations the two versions are identical when the copy is done, so apparent atomicity with
respect to the OTE update is not critical.
We solve the write-translation problem by using a single ‘evicted’ bit in each OTE. Whenever the translator
performs a write-translation on part of an object it sets this bit atomically with respect to the update of the physical
cache lines, for example by holding the cache line containing the OTE until the updates are complete. The compactor
clears the bit before starting a relocation, copies the physical memory, and then swaps the new physical address into the
OTE using a compare-and-swap (CAS) operation. The CAS will fail and the copy be retried if the translator meanwhile
processed an eviction of any part of that object; if the CAS succeeds the new location is atomically switched in, and
at that point the copies are identical.
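The protocol can be modelled in a few lines of Python. In this sketch a lock stands in for the hardware atomicity (holding the OTE's cache line, and the CAS itself); the class, the function names and the `during_copy` hook, which exists only to inject an eviction at a chosen moment, are all assumptions made for illustration.

```python
import threading


class ObjectTableEntry:
    def __init__(self, address):
        self.lock = threading.Lock()   # stands in for hardware atomicity
        self.address = address
        self.evicted = False

    def compare_and_swap(self, expected, new):
        with self.lock:
            if (self.address, self.evicted) != expected:
                return False
            self.address, self.evicted = new
            return True


def write_translation(memory, ote, index, value):
    """Translator side: write back evicted data and set the 'evicted' bit."""
    with ote.lock:                     # models holding the OTE's cache line
        memory[ote.address][index] = value
        ote.evicted = True


def relocate(memory, ote, new_address, during_copy=None):
    """Compactor side: copy, then CAS the new address in; retry on eviction."""
    attempts = 0
    while True:
        attempts += 1
        with ote.lock:
            ote.evicted = False                  # clear the bit before copying
            old_address = ote.address
        memory[new_address] = list(memory[old_address])
        if during_copy is not None:
            during_copy(attempts)                # hypothetical hook for the example
        if ote.compare_and_swap((old_address, False), (new_address, False)):
            del memory[old_address]
            return attempts


memory = {100: [1, 2, 3]}
ote = ObjectTableEntry(100)


def evict_once(attempt):
    if attempt == 1:                             # an eviction races with the first copy
        write_translation(memory, ote, 0, 99)


attempts = relocate(memory, ote, 200, evict_once)
```

The first copy is invalidated by the racing eviction, so the CAS fails and the copy is retried; the second attempt sees the updated contents and succeeds.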
2.6. Summary
We have described a memory hierarchy for objects, offering concurrent relocation and the flexibility of an object
table with no code size or speed penalty (in the common case). An object cache holds parts of objects, tagged directly
with the location-independent object ID and offset, alongside conventional physically-tagged cache lines. The cache’s
index function for object addresses is determined by the choice of encoding into the extended address range. Object
cache lines are kept coherent using the normal cache coherence mechanism. A hardware translator handles conversion
of object cache lines to and from the representation in physical memory in the case of a cache miss or eviction.
The memory hierarchy for objects forms a base for future work on highly scalable, concurrent memory management
algorithms. The next section presents one possibility, building on the features of the object cache.
3. In-cache garbage collection
Garbage collection is an inefficient process for caches: typically the object graph is traversed by a mark-sweep
collector touching reference fields in each live object exactly once; although there is some spatial locality within
objects there is little locality between objects referenced in this pattern [19]. If most accesses are to recently allocated
objects, the caches and the youngest generation of a generational garbage collector hold similar portions of the heap.
Objects are certainly allocated in the cache; short-lived objects will also die in the cache. With some hardware support,
we can build a young generation garbage collector which operates entirely within the object caches, with almost no
external memory references required: in-cache garbage collection. This gives fast, predictable collections without
global memory traffic or synchronization and saves the writing back of garbage objects to memory.
3.1. Garbage collection boundaries
For the purposes of in-cache garbage collection we group the CPUs and their associated caches within one or
more GC boundaries. A GC boundary delimits a group of CPUs on which mutation will be paused and the garbage
collector will run, and the caches within which collection will take place; we expect fast communication within the GC
boundary. In the case of a single-chip multiprocessor (CMP) with several on-chip L1s and a shared L2 the natural GC
boundary coincides with the chip boundary and contains all the on-chip CPUs and caches (Fig. 6). A multiprocessor
built from several such CMPs would have one GC boundary per chip. For concreteness we will assume this organization,
but others are possible.
3.2. Local objects and the GC barrier
An object is eligible for in-cache collection if it was created inside the GC boundary and a reference to it has never
left the boundary, i.e., all references to the object are known to remain in the on-chip caches [20]. The liveness of
such a local object can be determined without global memory operations by examining the contents of the caches; no
thread executing on a remote CPU can get hold of a reference without communication through shared memory. An
object is said to be non-local if this is no longer true: a reference to it has left the cache. Non-local objects are assumed
to be live; any non-local objects in the caches are roots for the in-cache collection.
We track the local state of objects with one extra state bit per cache line. The non-local bit of an object’s header
cache line (the cache line containing offset zero) is set when a reference to that object leaves the GC boundary.
Fig. 8. Extra GC state maintained in an object cache.
To maintain this barrier a piece of hardware, which we call the warden, examines outgoing cache lines to find
the references. In a broadcast-non-local request, the warden broadcasts the referenced OIDs to caches within the
boundary, rather like a snoop-invalidation; a tag match on the header cache line will set the non-local bit. The warden
processes references from other objects (object cache lines), and from VM data structures and stacks in physically-
tagged cache lines. The warden also sets the non-local bit of any incoming cache lines: we cannot track references
when an object is outside the cache, so objects which have left the cache are conservatively marked non-local.
The set of local objects roughly corresponds to the youngest generation of a generational garbage collector [19].
An object is promoted out of the in-cache generation when a reference to it escapes the GC boundary, and the warden
maintains the associated GC barrier.
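The behavior of the warden can be summarized in a short Python model. It tracks only the non-local bit of each header line; the `ObjectCache` and `Warden` classes and their method names are hypothetical, and the references carried by an outgoing line are simply passed in as a list of OIDs.

```python
class ObjectCache:
    """Toy cache tracking only the non-local bit of each header line."""

    def __init__(self):
        self.non_local = {}   # OID -> non-local bit of the header cache line

    def allocate(self, oid):
        self.non_local[oid] = False          # new objects start out local

    def fill(self, oid):
        self.non_local[oid] = True           # incoming lines are marked non-local

    def broadcast_non_local(self, oid):
        if oid in self.non_local:            # tag match on the header line
            self.non_local[oid] = True


class Warden:
    def __init__(self, caches):
        self.caches = caches

    def line_leaving(self, referenced_oids):
        """Called for each outgoing cache line with the OIDs it references."""
        for oid in referenced_oids:
            for cache in self.caches:
                cache.broadcast_non_local(oid)


l1_a, l1_b = ObjectCache(), ObjectCache()
warden = Warden([l1_a, l1_b])
l1_a.allocate(1)
l1_b.allocate(2)
l1_b.fill(3)                  # object 3 re-enters the cache from memory
warden.line_leaving([2, 9])   # an evicted line refers to objects 2 and 9
```

After these steps object 1 is still local, object 2 has been promoted because a reference to it escaped, object 3 is non-local because it came from outside, and the broadcast for object 9 matched nothing.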
3.3. Locating references and roots
References inside objects can be identified using knowledge of the bifurcated object layout. References in
physical memory (i.e., on the stack or in other VM data structures and hence roots for the in-cache collection) are
indistinguishable from integers, and are conventionally located using stack maps, custom data structure parsers, etc.,
but having the warden process these complex data structures is undesirable. A more general solution involves tagged
memory: an extra bit added to each word in the system distinguishes references from integers.
We can get the benefits of tagged memory for in-cache collection but without the expense of 33- or 65-bit memory
throughout the system using modified reference bits only within the caches; the cost is one extra bit per word of cache.
This works because the warden and in-cache collector only need to find references to local objects; a cache line outside
the GC boundary cannot contain any local references because of the action of the warden. A store-reference instruction
sets the modified-reference bit, and a store-integer instruction clears it, whether in object or non-object memory. The
warden issues broadcast-non-local requests for outgoing words with set modified-reference bits. Similarly, the in-
cache collector can find roots outside objects by looking for modified references within the GC boundary.
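A minimal Python model of the modified-reference bits is given below. The `TaggedLine` class, its eight-word line and the method names are assumptions for illustration; only the rule that store-reference sets the bit and store-integer clears it comes from the text.

```python
class TaggedLine:
    """A cache line with one modified-reference bit per word (toy model)."""

    def __init__(self, words=8):           # assumed: eight words per line
        self.values = [0] * words
        self.is_reference = [False] * words

    def store_reference(self, index, oid):
        self.values[index] = oid
        self.is_reference[index] = True     # store-reference sets the bit

    def store_integer(self, index, value):
        self.values[index] = value
        self.is_reference[index] = False    # store-integer clears the bit

    def modified_references(self):
        """OIDs the warden or collector would act on for this line."""
        return [v for v, tagged in zip(self.values, self.is_reference) if tagged]


frame = TaggedLine()
frame.store_reference(0, 17)
frame.store_integer(1, 17)       # same bit pattern, but not a reference
frame.store_reference(2, 23)
frame.store_integer(2, 5)        # overwriting a reference clears its bit
found = frame.modified_references()
```

Only word 0 is reported, even though word 1 holds the same bit pattern: this is the distinction that stack maps or data structure parsers would otherwise have to provide.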
There is some redundancy in using the modified-reference bits along with the bifurcated object layout, at least for
the in-cache collector. The modified-reference bits’ main purpose is locating references outside objects, but they still
serve as an optimization within objects: they reduce the number of references processed by the warden and collector.
Given the modified-reference bits, the bifurcated layout is not required, but it reduces the number of cache lines
examined by the collector and is also useful for old-space collection.
3.4. The in-cache collection algorithm
The in-cache garbage collector (ICGC) is a simple mark-sweep garbage collector but with the hardware introduced
above it is both parallel, using all the CPUs within the GC boundary during a collection, and concurrent, in that
mutation may continue on other CPUs outside the GC boundary. The GC uses two extra bits per cache line (Fig. 8),
labeling them as the mark bit and the done bit. The mark bit of the object header line designates a live object and the
done bit indicates an object line that has been scanned. In the common tri-color terminology, the mark bit implies grey
or black (will be or has been scanned), and the done bit indicates black.
Each CPU within the boundary iterates over any caches private to it and a portion of any shared cache. Every cache
line that contains a piece of an object is scanned if that object’s header has either of the mark or non-local bits set.
Any modified references within the object piece are then broadcast-marked; this is the same operation triggered by
the warden except that it sets the mark bit instead of the non-local bit and is exposed to software control. During the
GC phase, any broadcast which hits the unmarked header of an object places the OID on a queue, associated with
that cache, for newly marked (grey) objects. The responsible CPU then pops the OID from this mark queue and scans
that object if it resides in the cache. The queue is an optimization: the alternative is to re-scan (parts of) the cache to
find the newly marked lines. In the simulations described later we assume an unlimited queue size; a more detailed
discussion of queue overflow is beyond the scope of this paper.
Roots for the in-cache collection are non-local objects, object fragments (parts of objects whose header is outside
the GC boundary), and modified references contained in physically-tagged cache lines. With the modified reference
bit we do not need to use stack maps or other more complex techniques for locating references on the stack or locations
outside the heap. A single pass through the cache locates all the roots; recursive reachability is handled through the
broadcast mechanism.
We expose the contents of the cache for inspection by the collector: a mechanism is provided to iterate over the
cache sets returning the tag and state (non-local, marked, etc.). Based on this state the GC issues ordinary loads and
stores to read the contents of those cache lines where necessary.
The mark phase ends when all non-local and marked objects have been scanned. We can then reclaim the OIDs and
cache lines of any objects that do not have either of these bits set. We reclaim the cache lines by simply invalidating
their contents during another pass through the cache. Invalidation prevents those contents, which are no longer valid,
from being written back to main memory, frees space for new in-cache allocations, and saves bus traffic.
This algorithm is parallel; every CPU within the boundary can be scanning objects simultaneously. GC threads
communicate through the mark queues described earlier. A final synchronization phase ensures that all GC threads are
ready to move on to the sweep phase.
The algorithm is also concurrent in that mutation outside the GC boundary can continue without any explicit
synchronization with the GC threads. Threads running outside the GC boundary can continue to use objects and
even request and obtain objects from within the GC boundary during a collection. The warden handles any objects
leaving the GC boundary during a collection just as it would during normal operation, thus ensuring that any
escaping reference is accounted for. Additionally, mutator threads that were suspended within the GC boundary can
be rescheduled to other CPUs. As each thread accesses objects and stacks it will depopulate the GC boundary being
collected, but all the objects leaving will be live—dead objects will be left behind for collection.
3.5. Limitations of in-cache GC
In-cache GC has some limitations not suffered by conventional collectors. The most fundamental is that the size
of the collectable region is fixed in hardware. A software young generation may be increased in size to fit the specific
properties of the application and improve the efficiency of GC. It is always possible to choose how frequently to run the
in-cache GC, or even disable it completely without penalizing mutator performance, but the hardware implementation
cost has already been paid.
3.6. Summary
In-cache garbage collection operates on newly-allocated objects with no cache misses and no software barrier
overhead. Although mutation is suspended in some partition of the machine, the collection is concurrent with other
partitions, and collection proceeds in parallel on all CPUs within the partition. Table 1 summarizes the hardware and
ISA extensions introduced.
4. Related work
The MUSHROOM project introduced object caches [12] and in-cache GC [20], originating some of the ideas
described here. However, MUSHROOM was strictly a single-processor design, and cache filling, evictions and barrier
processing (similar to our warden and translator) were managed in software by trap handlers with no concurrency
required. Memory was tagged throughout, as a pure Smalltalk system, and mixed object/conventional caches were not
needed.
Table 1
Summary of ISA extensions
Feature | Function
Store reference at virtual address | Allows references to be located within the stack or non-object data structures
Load/store primitive/reference at object address (OID, offset) | Basis of object-addressed memory
Inspect cache tag & state within a particular cache set (for caches within the GC boundary) | In-cache garbage collector scans caches to find root references
Modify cache GC state | Mark cache lines Done for in-cache GC
Broadcast non-local or GC mark bits | Root finding and recursion within the in-cache GC
Pop OID from mark queue (optimization) | GC recursion from objects found by other CPUs
Zero and allocate object cache line | In-cache object allocation
Invalidate object cache line | In-cache garbage collection
Object caches with delayed translation are closely related to virtually-addressed caches, for example the Berkeley
SPUR [21]. The object case is actually easier; we do not have problems with address aliasing or reverse (physical to
virtual) translations.
In-cache allocation and the tagging of garbage was proposed in [18]; the authors argued that garbage should be
detected whilst still in the cache, but did not couple GC with the contents of the cache as we do.
Various hardware GC schemes have been proposed [3,4,22,23]; our scheme provides hardware assistance for a
software-based collector.
Our system shares motivation with the co-designed virtual machines of [24] but the specific memory and GC
organization is very different.
Various studies of the interaction of GC and conventional caches have been published [25–27]. None has related
the hardware-level activity back to the high-level activity as we do in Section 6.
Our cache inspection and manipulation operations are similar in spirit to those described in [28].
Azul Systems has developed a CMP-based architecture for Java which includes hardware support for GC barriers
atop a conventional memory hierarchy [29]. Their collector requires all of the mutator threads to take several short
pauses during each collection cycle, although they are not all suspended simultaneously; relocation is handled by page-
protecting portions of the heap, causing mutators to trap and fix up references if they try to gain access. In contrast, our
in-cache GC suspends mutation within one particular GC boundary for the duration of the collection, with no impact
on mutators running elsewhere in the system, and relocation of objects may proceed completely concurrently on a
per-object basis. Our barrier processing takes place completely in hardware, while Azul’s Pauseless algorithm uses
custom instructions compiled into the instruction stream to test the barrier with a software trap handler to perform the
processing if necessary.
4.1. Unrelated work
Protection in Java is managed completely at the virtual machine level. References in our system are unprotected by
the ISA, and the CPU has no special knowledge of object references other than the store-reference instructions. We
do not require hardware protection or controlled sharing based on access control [9] or capabilities [8,30]. The CPU
core is minimally changed; it does not process objects, object layout or object-related behavior other than the load-
and store-to-object-space instructions. Unlike the Intel 432 [7], we do not support a full language-level object model
in hardware.
5. Simulation methodology
In the next section we present a preliminary evaluation of the proposed architecture. To evaluate the architecture
we have constructed a simulation driven by traces obtained from a modified JVM.
Our proposed architecture incorporates new instructions (e.g., loads and stores to object space) and hence we cannot
take instruction-level traces from an existing JVM. It would have been possible to modify an existing JVM to emit
these instructions, but then we would also have needed a matching instruction-level simulator and an implementation
of the new GC algorithm within the JVM. For a preliminary evaluation we decided to adopt a different strategy.
Rather than building a complete GC implementation we chose to construct a simulation of both the relevant hardware
structures and the proposed GC algorithm. This allows us to finesse many of the details of the implementation and
gives a much more malleable framework for experimentation.
5.1. Trace generation and simulation
The simulator is driven by traces of Java applications running on a modified JVM, called the Tracing VM (TVM)
[31]. The TVM emits object-level traces of application behavior, including all object creations, accesses to objects,
static variables and threads’ stacks (as defined in the JVM specification [32]), the loading and resolution of classes,
etc. The trace is a complete description of the execution of the program in terms of its effects on the objects, stacks
and static data, but is mostly independent of the actual implementation within the TVM. For example, each object
named in the trace is identified by a unique ID unrelated to the memory address of the object within the TVM. This
allows us to simulate diverse memory management strategies.
With this approach to simulation we are able to attribute hardware level events (such as cache misses) to their high-
level causes (application behavior, GC activity, etc.) with relative ease. The hardware-level measurements summarized
in the next section are broken out by high-level activity; we believe this is the first time this has been done.
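A minimal sketch of this attribution, assuming a hypothetical trace record (`TraceEvent`) whose fields are invented for illustration; the actual TVM trace format is not given here. The point is only that each event carries a trace-level object ID and a high-level activity label, so any cache model plugged in as `is_miss` can have its misses counted per activity.

```python
from collections import Counter
from typing import NamedTuple


class TraceEvent(NamedTuple):
    """Hypothetical object-level trace record (field names are assumptions)."""
    kind: str       # e.g. "load" or "store"
    object_id: int  # trace-level ID, unrelated to any memory address
    offset: int
    activity: str   # high-level cause, e.g. "mutator" or "gc"


def attribute_misses(events, is_miss):
    """Count misses per high-level activity; is_miss is supplied by a cache model."""
    misses = Counter()
    for ev in events:
        if is_miss(ev):
            misses[ev.activity] += 1
    return misses


demo_events = [
    TraceEvent("load", 7, 0, "mutator"),
    TraceEvent("store", 7, 4, "mutator"),
    TraceEvent("load", 9, 0, "gc"),
]

# Toy miss rule for the demo only: the first touch of each object ID misses.
_seen = set()


def first_touch_misses(ev):
    miss = ev.object_id not in _seen
    _seen.add(ev.object_id)
    return miss


demo_misses = attribute_misses(demo_events, first_touch_misses)
```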
In addition to a simulator of our proposed architecture and JVM, we also constructed a reference simulation of a
conventional architecture with a generational heap and a scavenger for the new generation, as used in many production
JVMs. This simulation resides in the same simulation framework, and is driven by the same traces, enabling a direct
comparison of the two systems. Neither system includes an old-space collector or compactor; here we are interested
in the behavior of allocation and the youngest GC generation.
One drawback of this simulation framework is that we have insufficient information to derive execution time
estimates for a plausible implementation. Our simulator models only the memory system activity induced by the
virtual machine in manipulating data; it has no notion of instructions, pipelines or latencies and hence we can only
compare memory system effects (such as cache misses). However, memory system activity is a major component of
the time spent executing most Java applications, and, since our work is focused on this area alone, we believe it is
reasonable to limit the study to this area for now. In Section 8 we will describe some possible next steps.
A consequence of our simplified simulation framework is that it does not model a variety of effects which are
important in determining real-world performance but which we believe are neutral in our comparison (or, if anything,
penalize our proposed architecture):
(1) The TVM traces describe execution in terms of simple Java bytecode interpretation: e.g., an “add” bytecode results
in two pops and a push to a thread’s stack, whereas a production JVM would use a dynamic compiler to optimize
this into a register-based operation. Hence there are many more memory operations to the stacks in our simulation
than would be present in a production system. However, these operations are the same in both simulated models;
we have subtracted out stack activity so that other activity can be compared without being polluted by the stack
data.
(2) Our simulator does not model any instruction-side activity (I-caches, or memory system activity due to fetching
instructions) or just-in-time compilation. The instruction sequences in the mutator are shorter in our proposed
architecture due to elimination of a software GC barrier, so this effect favors the reference system. The only part
of the virtual machine which would be significantly different is the GC code, which tends to be rather small and
has good locality; our proposed modifications should make it more cache-friendly.
(3) Our simulator does not model memory system activity outside the heap, stacks and static data, i.e., to JVM data
structures or native code. Our GC techniques will reduce the memory traffic to JVM data structures (e.g., there
are no data structures required to locate references on stacks or within objects). Our modifications are neutral with
respect to native code.
We model the allocation of OIDs, in the object system, in some detail. OID management is provided by building
a linked list of available objects using some of the free objects themselves. To pack more available OIDs within the
cache, we utilize the space within the free list objects to store additional free OIDs. Thus when a new OID is needed
the head of the free list is queried for a free OID value; if none is present then the head itself is popped from the
list and used. The objects containing the free list may occasionally be evicted from the cache, so it is possible that
allocation may cause a cache miss.
Table 2
201 compress 0.143 110.3 4,936 6.2 3.2 238 23.1 34,469 215 23
202 jess 145.9 128.9 7,024 5,316 2,618 390 27.4 49.2 360 20
209 db 50.2 27.7 4,882 3,068 146 234 16.4 189.7 210 24
213 javac 100.0 104.3 9,156 4,036 2,185 412 24.8 47.7 379 33
222 mpegaudio 0.168 0.779 5,824 7.7 4.7 284 21.8 165.7 255 29
227 mtrt 108.7 36.8 5,212 5,319 1,323 259 20.4 27.8 230 29
228 jack 114.3 101.0 6,044 3,888 2,954 289 29.4 34.2 264 25
JBB 21.2 82.9 15,484 656.6 884.0 621 28.1 93.7 569 52
Telco app 11.3 35.0 62,520 419 261 2,467 27.0 134.1 2,352 115
Table 3
Trace event statistics
201 compress 43.6 2796.5 391.7 450.7 247.1 14,998.9 11,538.7 33,429 2.3
202 jess 6.61 491.1 37.1 103.6 22.9 2,125.6 1,626.7 2,309 46
209 db 13.9 1051.7 18.1 256.8 32.4 4,676.3 3,458.7 81.6 73
213 javac 6.89 451.0 89.3 70.1 43.0 2,244.1 1,677.8 4,650 82
222 mpegaudio 41.0 2516.9 148.3 1422.4 258.2 13,888.4 10,820.1 155,637 25
227 mtrt 8.67 654.7 53.8 77.7 8.69 2,709.9 2,122.4 74.5 5.3
228 jack 7.00 348.9 51.4 152.5 24.6 2,359.7 1,722.1 1,612 1454
JBB 4.15 171.4 21.8 66.4 55.6 1,349.1 1,090.2 14,839 1.9
Telco app 1.65 88.1 8.99 31.3 14.9 558.8 427.0 1,530 9.1
When objects are collected the first collected object is used as a new free list element, with subsequent free OIDs
placed within this object. When that object fills the next free object is used as a free list element, and so on. As these
objects were recently cache resident for collection, the process of building the free list encounters no cache misses.
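The free-list scheme just described can be sketched in a few lines. This is a toy model under assumptions: the class name `OidFreeList` and the `capacity` parameter (how many spare OIDs fit inside one free object) are hypothetical, and the linked list is represented as a Python list used as a stack rather than by links stored in the objects themselves.

```python
class OidFreeList:
    """Toy model: each list element is itself a free OID that stores further free OIDs."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed number of OID slots inside one free object
        self.nodes = []           # stack of (node_oid, stored_oids); the last entry is the head

    def release(self, oid):
        """Return a collected OID to the free list."""
        if self.nodes and len(self.nodes[-1][1]) < self.capacity:
            self.nodes[-1][1].append(oid)  # room left: store it inside the head object
        else:
            self.nodes.append((oid, []))   # no head, or head full: this object becomes the head

    def allocate(self):
        """Hand out a free OID, preferring those stored inside the head."""
        if not self.nodes:
            raise MemoryError("no free OIDs")
        head_oid, stored = self.nodes[-1]
        if stored:
            return stored.pop()
        self.nodes.pop()  # head holds no spare OIDs, so the head itself is used
        return head_oid


demo_list = OidFreeList(capacity=2)
for freed in (10, 11, 12, 13):
    demo_list.release(freed)
demo_order = [demo_list.allocate() for _ in range(4)]
```

With a capacity of two, OIDs 11 and 12 are stored inside object 10, and object 13 starts a new head, so the demo allocates 13, 12, 11 and finally 10.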
The memory operations of our additional hardware features (the warden and translator) are included in the
simulations. We may anticipate some increase in latency when processing cache misses, but as stated above we do not
attempt to reduce our data to estimates of time.
5.2. Workloads used
To drive our simulator we took TVM traces from the following programs:
(1) The SPEC JVM98 suite, using size 100 runs [33].
(2) SPEC JBB2000 [34], using 4 warehouses, a half-size item table (10,000 entries), and an extended measurement
interval to counteract the slowdown caused by tracing.
(3) A multi-threaded telecommunications soft-switching application (‘telco app’), driven by a synthetic workload.
Table 2 contains measurements of the allocation behavior of each trace: the number of allocated bytes and objects
in the heap (for arrays and non-array objects) and static areas, average object and array sizes, and the number of
classes loaded (object and array classes). Object sizes are uniformly small; array sizes vary more. Table 3 lists the
length of each trace (measured by the number of TVM events), and the number of load and store events for objects,
arrays, stacks and static areas.
5.3. Simulation parameters
Our simulated machine (illustrated in Fig. 6) consists of two identical CPUs each with 64 KB L1 caches (8-
way associative with 64 byte cache lines). These CPUs share an 8 MB L2 cache (also 8-way associative). The cache
replacement policy used is pseudo-LRU. We present results only from this configuration as we believe it representative
of possible implementations several years out.
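For orientation, the line and set counts implied by these parameters can be worked out directly. The helper name `cache_geometry` is hypothetical, and the 64-byte line size for the L2 is an assumption here (it is stated explicitly only for the L1).

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Number of lines and sets in a set-associative cache."""
    lines = size_bytes // line_bytes
    return {"lines": lines, "sets": lines // ways}


l1 = cache_geometry(64 * 1024, ways=8, line_bytes=64)        # per-CPU L1
l2 = cache_geometry(8 * 1024 * 1024, ways=8, line_bytes=64)  # shared L2, line size assumed
```

Under these assumptions each L1 has 1,024 lines in 128 sets and the L2 has 131,072 lines in 16,384 sets.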
The reference garbage collector used two 16 MB semi-spaces with an unlimited tenured generation. Objects were
tenured after they survived three collection cycles. Smaller semi-spaces (4 MB and 8 MB) were also tried; we report
results only for 16 MB as it performed the best and is also a typical setting for production JVMs. Larger semi-spaces
lead to less frequent garbage collections, thus potentially reclaiming more garbage for less work, but lead to a bigger
total footprint and longer individual GC pauses. All threads allocate into the same semi-space, but to prevent false
sharing on allocation they claim large (64 KB) blocks which are then subdivided into objects.
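The block-claiming allocation used by the reference collector might look roughly like the following. This is a sketch only: `SemiSpace` and `ThreadAllocator` are hypothetical names, addresses are plain integers, a real JVM would claim blocks with an atomic operation, and requests larger than one block are not handled.

```python
BLOCK_BYTES = 64 * 1024  # block size claimed by each thread, as in the text


class SemiSpace:
    """Shared allocation space from which threads claim whole blocks."""

    def __init__(self, size_bytes):
        self.size = size_bytes
        self.top = 0

    def claim_block(self):
        # A real implementation would do this atomically.
        if self.top + BLOCK_BYTES > self.size:
            return None  # semi-space exhausted; a scavenge would be needed
        start = self.top
        self.top += BLOCK_BYTES
        return start


class ThreadAllocator:
    """Per-thread bump-pointer allocation inside the thread's current block."""

    def __init__(self, space):
        self.space = space
        self.cursor = 0
        self.limit = 0

    def allocate(self, nbytes):
        """Return an address for nbytes (assumed <= BLOCK_BYTES), or None if space is full."""
        if self.cursor + nbytes > self.limit:
            start = self.space.claim_block()
            if start is None:
                return None
            self.cursor, self.limit = start, start + BLOCK_BYTES
        addr = self.cursor
        self.cursor += nbytes
        return addr


demo_space = SemiSpace(16 * 1024 * 1024)
t1, t2 = ThreadAllocator(demo_space), ThreadAllocator(demo_space)
demo_addrs = [t1.allocate(24), t2.allocate(24), t1.allocate(24)]
```

The two threads allocate from different 64 KB blocks, so their objects never share a cache line.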
The object cache’s zero-and-allocate operation prevents cache misses when allocating objects. As originally
proposed [18], such an operation could also be used in a conventional system such as our reference. We isolate its
effect by also simulating our reference system using zero-and-allocate during both object allocation and GC (copying
objects into the new semi-space).
The ICGC is initiated more frequently than the reference collector, so that allocations do not fill the cache and
begin to evict potentially reclaimable objects. We chose to have the ICGC collect every 4 MB of allocation, as that
was infrequent enough to allow garbage to accrue, but often enough that the garbage objects were not forced from the
cache by new allocations. In general, we found that, for caches of 8 MB or more, it worked well to collect after new
allocations used roughly half of the cache. As cache sizes increase the ICGC runs less often and less garbage eludes it.
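The initiation policy described here amounts to a byte counter. The sketch below is one way to express it, with hypothetical names (`IcgcTrigger`, `interval_for_cache`); the half-of-cache rule is the rule of thumb from the text, not a tuned policy.

```python
MB = 1024 * 1024


def interval_for_cache(cache_bytes):
    """Rule of thumb from the text: collect after about half the cache is newly allocated."""
    return cache_bytes // 2


class IcgcTrigger:
    """Request an in-cache collection after every interval_bytes of allocation."""

    def __init__(self, interval_bytes=4 * MB):
        self.interval_bytes = interval_bytes
        self.allocated_since_gc = 0

    def on_allocate(self, nbytes):
        """Record an allocation; return True when a collection should start."""
        self.allocated_since_gc += nbytes
        if self.allocated_since_gc >= self.interval_bytes:
            self.allocated_since_gc = 0
            return True
        return False


demo_trigger = IcgcTrigger(interval_for_cache(8 * MB))
demo_decisions = [
    demo_trigger.on_allocate(3 * MB),
    demo_trigger.on_allocate(1 * MB),
    demo_trigger.on_allocate(1 * MB),
]
```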
6. Preliminary results
We took traces from the workloads and simulated the reference system (hereafter known as the ‘conventional
system’), with and without zero-and-allocate, and the object cache system. For each simulation we counted cache hits
and misses, reads and writes, coherence traffic, and translations. Each event was ascribed to one of the collector, the
mutator’s stack activity (which we do not report, for the reasons enumerated earlier), the mutator’s heap activity, or,
for the object cache, the extra operations involved in managing the object table (‘OID management’).
Fig. 9 shows the results at each level of the memory hierarchy, normalized by the number of events in each trace
(analogous to time). For each workload the left bar is for the conventional system (C), the middle for the conventional
system with zero-and-allocate (Z), and the right for the object system (O). Each bar is further decomposed into work
performed during mutation and GC, and in the object system an additional component is due to OID management.
We have included GC Boundary (GCB) operations, which represent zero-and-allocate, cache inspection and object
management operations, in the chart with L2 hits: our GCB encloses the L2, so we expect GCB operations to have
about the same cost as L2 hits. In the case of zero-and-allocate this may be unduly generous, depending on the
implementation. The raw data for each operation are available in our technical report [35].
Examining the graphs from top to bottom, we see that the object system typically has slightly fewer L1 hits in the
mutator. The number of mutator accesses is lower because zero-and-allocate replaces writes on allocation (seen going
from C to Z), and also because card table writes are eliminated; the L1 hit ratio is slightly worse in O than Z because
of the less dense packing of objects (hence the slightly greater number of mutator L2 hits and, in some cases, misses
in the object system). The geometric mean of the ratios of L1 hits in the object system to L1 hits in the conventional
system is 0.98.
The ICGC effectively trades more L2 or GCB operations for a reduced number of L2 misses (except for
201 compress and 209 db). The geometric mean of the ratios of object L2 hits and GCB operations to conventional
L2 hits is 7.5, while for L2 misses it is 0.24. The in-cache collector takes essentially no L2 cache misses, leading
to a dramatic reduction in total L2 misses on the programs where they were significant (213 javac, JBB). Zero-and-
allocate also helps the reference system by reducing the L2 mutator (allocation) misses, but the ICGC reduces the
total misses even further. As the relative cost of L2 misses increases we expect this to result in a net performance gain
for the object system.
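The summary figures quoted above are geometric means of per-workload ratios. As a reminder of how such a figure is formed, the helper below (a hypothetical name, `geometric_mean_of_ratios`) computes it from two lists of counts. The demo inputs are made up purely to exercise the function and are not measurements from this study.

```python
import math


def geometric_mean_of_ratios(object_counts, conventional_counts):
    """Geometric mean of object/conventional ratios, one ratio per workload."""
    ratios = [o / c for o, c in zip(object_counts, conventional_counts)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Made-up inputs: ratios of 0.5 and 2.0 average to 1.0 geometrically.
demo_gm = geometric_mean_of_ratios([50.0, 200.0], [100.0, 100.0])
```

The geometric mean is the natural choice here because a ratio of 0.5 on one workload and 2.0 on another should cancel out, which an arithmetic mean would not show.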
During collection most of the activity in the object cache is due to GCB operations (as the collector is doing
considerable cache inspection and invalidation work). Note that 222 mpegaudio does not allocate enough objects to
trigger a single conventional collection.
Fig. 9. Cache activity for each workload on conventional (C), conventional with zero-and-allocate (Z), and object (O) caches, normalized to trace
events. Top: L1 cache hits; Middle: L2 cache hits and GC boundary operations; Bottom: L2 cache misses. Each bar is further decomposed by
software activity (mutation, garbage collection and OID management).
6.1. Cache line utilization
As expected, our cache miss rates (in particular in the L1) are slightly worse because we do not pack objects
densely in cache lines. The worst benchmark in this respect is 209 db, which uses on average only 34.1% of each
cache line (32-bit references, 64-byte cache lines); in this case it interacts badly with the application’s reference
pattern, explaining the increase in L2 misses for the object system. The mean over all benchmarks is 55.5%. These
numbers improve by 10%–15% with a move to 64-bit references. Table 4 contains the complete utilization results.
This effect will be more significant for small benchmarks (such as the SPEC JVM98 suite), where even the
relatively long-lived data structures may still reside in new space. We expect less inter-object spatial locality in the
tenured data of large, long-running applications, where this effect should be less significant (research is continuing in
this direction).
Table 4
Object cache line utilization
Benchmark 32-bit references 64-bit references
Words allocated Object cache lines Utilization (%) Words allocated Object cache lines Utilization (%)
201 compress 27,608,523 1,731,333 99.7 27,632,787 1,731,898 99.7
202 jess 68,710,003 8,179,072 52.5 105,270,870 9,911,440 66.4
209 db 19,486,391 3,575,779 34.1 31,243,806 3,925,003 49.8
213 javac 51,038,971 6,739,117 47.3 68,210,751 7,110,693 60.0
222 mpegaudio 236,777 22,555 65.6 267,229 23,348 71.5
227 mtrt 36,357,334 6,670,505 34.1 45,754,612 6,742,354 42.4
228 jack 53,835,899 7,393,164 45.5 72,050,778 7,682,244 58.6
JBB 26,042,033 2,291,054 71.0 29,747,369 2,914,978 63.8
Telco app 11,855,703 1,141,336 64.9 14,491,698 1,241,535 73.0
Mean 55.5 65.2
Table 5
213 javac detailed results
Bytes allocated (collectable) 204,219,612
Reads ×10^6 70.4 526 70.4 526 19.5 526 18.7
L1 read miss ratio 8.87% 1.01% 8.87% 1.03% 42.9% 1.94% 4.15%
Writes ×10^6 38.1 202 38.1 146 3.50 135 20.9
L1 write miss ratio 4.13% 1.83% 0.06% 0.13% 12.4% 0.21% 3.10%
L1 explicit tag checks ×10^6 1.53 0.00 0.89
L2 reads ×10^6 6.25 5.33 6.25 5.40 8.34 16.6 2.41
L2 read miss ratio, local % 51.8 14.2 52.07 14.2 0.00 16.6 0.10
L2 read miss ratio, global % 4.59 0.14 4.62 0.14 0.00 0.32 0.65
L2 writes ×10^6 1.57 3.71 0.02 0.19 0.43 0.29 0.65
L2 write miss ratio, local % 99.95 95.97 0.51 9.99 0.00 20.6 43.3
L2 write miss ratio, global % 4.13 1.76 0.00 0.01 0.00 0.04 1.34
Combined L2 misses ×10^6 4.81 4.31 3.25 0.78 0.00 1.75 0.30
Zero & allocate ×10^6 1.57 3.54 0.00 6.51 1.16
L2 modify mark ×10^6 13.6 0 0
L2 explicit tag checks ×10^6 45.8 0 0.62
Translations (reads) ×10^6 0.00 1.75 0.27
Translations (evictions) ×10^6 0.01 4.41 1.14
We do not believe that the fragmentation in the cache is a serious issue. Clearly, it negatively affects the cache
miss rates for some applications. However, our results include the costs of this effect and still show an improvement
in overall performance. The cost of fragmentation (fewer objects fit in the cache) is more than compensated for by
the improvement in GC efficiency. In Section 6.3 we further investigate the effect of using cache lines which are
significantly larger than objects.
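The utilization figures in Table 4 can be reproduced from the other two columns. The sketch below assumes that "words" are 4-byte words in both the 32-bit and 64-bit columns; this is an inference that matches the published percentages, not something stated in the table. The function name `utilization_percent` is hypothetical.

```python
WORD_BYTES = 4   # assumption: Table 4 counts 4-byte words in both columns
LINE_BYTES = 64  # object cache line size used in the main configuration


def utilization_percent(words_allocated, cache_lines):
    """Percentage of the allocated cache-line bytes actually occupied by object data."""
    return 100.0 * words_allocated * WORD_BYTES / (cache_lines * LINE_BYTES)


db_32bit = utilization_percent(19_486_391, 3_575_779)   # 209 db, 32-bit references
jbb_64bit = utilization_percent(29_747_369, 2_914_978)  # JBB, 64-bit references
```

Rounded to one decimal place these give 34.1% and 63.8%, matching the corresponding Table 4 entries.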
6.2. Detailed study—213 javac
Table 5 presents a detailed look at the results for one of our benchmark programs, 213 javac. The table presents
cache read/write counts and miss ratios for the two L1 caches (aggregated) and the L2 cache, as well as the number of
translations and other cache statistics. Each metric is broken down by activity (GC vs. mutator vs. OID management).
Results for other benchmarks are also in our technical report [35].
Overall, the number of L2 cache misses is considerably smaller with the ICGC. The conventional collector makes
very poor use of the L2 cache, with a 52% read miss ratio and over 99% write miss ratio (caused by copying objects
into the new semi-space or promoting them), while the ICGC read miss ratio is close to zero and there are no write
misses at all. The mutator thread in the conventional system sees high write miss rates as well, as it fetches cache lines
from memory during object allocation. The use of zero-and-allocate in the conventional system eliminates most of the
write misses, and in fact saves 56M allocation writes (28%), but the GC still takes many read misses. The in-cache
GC in the object system successfully eliminates all L2 misses during GC, resulting in a total of half the L2 misses of
the conventional system with zero-and-allocate, or less than one quarter without.
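The ratios quoted in this paragraph follow from the "Combined L2 misses" and "Writes" rows of Table 5. The arithmetic below assumes that the seven values in those rows are, in order, conventional GC and mutator, conventional with zero-and-allocate GC and mutator, and object system GC, mutator and OID management; that column order is inferred from the surrounding text, since the table header is not reproduced here.

```python
# Combined L2 misses (x10^6) from Table 5, under the assumed column order.
conventional = 4.81 + 4.31          # GC + mutator
with_zero_alloc = 3.25 + 0.78       # GC + mutator
object_system = 0.00 + 1.75 + 0.30  # GC + mutator + OID management

ratio_vs_zero_alloc = object_system / with_zero_alloc  # about one half
ratio_vs_conventional = object_system / conventional   # under one quarter

# Mutator writes (x10^6) with and without zero-and-allocate.
writes_saved = 202 - 146
fraction_saved = writes_saved / 202  # about 28%
```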
6.3. Sensitivity studies
The results reported above, for the configuration described in Section 5.3, explore only a single point in a very large
parameter space. We have performed some limited sensitivity studies to determine how our results may be affected by
different choices of cache size, cache line size, and collection interval. To save space we do not include the raw data
for these experiments.
6.3.1. Collection interval
The object system reclaims more garbage when it collects more frequently; the conventional system, given the
tenuring policy in use, reclaims more garbage by collecting less frequently (i.e., by using larger semi-spaces). For the
conventional system, our results show a small but consistent improvement in garbage reclaimed as semi-space size
is increased from 2 to 4, 8, and then 16 MB. On some benchmarks the L2 cache misses rise to a peak with 8 MB
semi-spaces and then start to improve, although this is not universal. When the semi-spaces do not fit in the cache
larger semi-spaces should only improve matters, and the standard trade-off of throughput versus heap footprint must
determine the choice of semi-space size.
In the object system the number of L2 misses is very insensitive to the collection interval. The cost/benefit decision
is therefore between reclaimed garbage and the number of in-cache operations. Collecting every 2 MB of allocation
reclaims about 10%–20% more garbage than our default choice of 4 MB, at the cost of about twice as many in-
cache operations. On the other hand, collecting every 8 MB (the size of the L2 cache) results in a similar decrease
in reclamation for an approximately 40% reduction in in-cache operations. Given the sensitivity to this parameter,
we conclude that our choice of 4 MB is fair but that more research is needed into in-cache GC initiation policies,
including adaptive techniques.
6.3.2. Cache size
A smaller, 4 MB, L2 cache results in the reclamation of only around 75% of the garbage that is collected in the
8 MB object cache. This number could be improved by collecting more frequently, but the high cost was discussed
above and the smaller cache will result in more garbage escaping in any case. The cache size should certainly be
factored into the GC initiation policy.
6.3.3. Cache line size
The most interesting result is the variation with respect to cache line size. We do not see a penalty, either in cache
misses or in garbage reclaimed, for 128-byte instead of 64-byte cache lines in the object system, keeping the cache
size constant at 8 MB in each case. As might be expected, the short non-array objects suffer a few more cache misses,
but overall a decrease in misses for arrays more than compensates. This is further evidence that some fragmentation in
the object cache, which is worse with the longer cache lines, is acceptable. Cache lines of 256 bytes and longer start
to cause significant increases in cache misses on some benchmarks.
In the conventional system, without the zero-and-allocate instruction, a substantial fraction of misses are due to
allocation, and very long cache lines, up to 512 bytes, continue to show improvements in L2 misses. Note that in
this case, although the number of misses is reduced, more bytes are in fact being moved around through the memory
system.
7. Summary and conclusions
We have described a novel memory architecture for objects which extends a classical system. The architecture is
based on a hybrid object-conventional cache, a warden to police the GC barrier, and a translator to map object names
to physical addresses using an object table. Objects are addressed using an object ID and offset; the object cache is also
indexed and tagged using encoded forms of these. An in-cache garbage collector exploits the architecture to provide
cheap, low-pause collections, entirely within the caches in a GC boundary, without requiring explicit synchronization
with processing outside the GC boundary. A bifurcated object layout allows exact GC without requiring additional
mapping structures. Skewing results in bifurcated objects residing (usually) in the minimal number of cache lines.
The architecture enables concurrent relocation of objects and hence concurrent compaction of the heap.
Our results indicate that the performance of the object cache and in-cache garbage collection compares favorably
with a classical system, and yet the object cache allows us to perform incremental compaction and relocation. The
object cache and in-cache GC reduce cache misses considerably, at the cost of extra operations to manage the cache.
We conclude that object caching is a promising technique for scaling systems to very large heaps. The data for in-
cache GC are less conclusive: whether a performance advantage would be seen depends on the costs and latencies in a
particular implementation, and our simulations are not intended to model an implementation at that level of accuracy.
The conventional semi-space scavenging collector has two advantages: pointer-bumping allocation is cheaper than
management of an OID free list, and the time efficiency of GC can be improved by using larger semi-spaces because
the scavenging GC does not touch the dead objects. The in-cache GC offers hardware-supported concurrency and
reclamation without L2 cache misses, but whether there is an advantage in throughput depends on the “constant
factors”. Our simulations show a rough parity in performance if the cost of an L2 cache miss is equated to about 10
cache hits.
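The parity condition can be phrased as a simple break-even calculation: if every cache hit or in-cache operation costs one unit and every L2 miss costs k units, the two systems cost the same when k equals the extra in-cache operations divided by the L2 misses saved. The function name and the demo inputs below are hypothetical; the inputs are made up to show the calculation and are not taken from the measurements.

```python
def break_even_miss_cost(extra_in_cache_ops, l2_misses_saved):
    """Miss cost, in units of one cache hit, at which the two systems cost the same."""
    if l2_misses_saved <= 0:
        raise ValueError("no L2 misses saved, so there is no break-even point")
    return extra_in_cache_ops / l2_misses_saved


# Made-up example: 30 extra in-cache operations for every 3 misses avoided.
demo_break_even = break_even_miss_cost(30.0, 3.0)
```

If a real L2 miss costs more than the break-even value, the object system comes out ahead under this simple model; if it costs less, the conventional system does.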
8. Future work
Many details remain to be completed, and there is much scope for further investigation.
Our results show that the object cache does not penalize small-scale applications in their cache and new-space GC
behavior. We have yet to demonstrate the performance and scaling advantages that the object cache should
facilitate for large applications; this will require the modeling of long-running applications with large heaps.
Our results show that a large component of GC activity in the object cache is due to OID management. This code
is not heavily optimized and it should be possible to reduce these overheads.
We are investigating GC beyond the caches (“old-space GC”), and the contribution the warden and translator can
make to this. Unlike other generational schemes, the initiation of in-cache collection affects only performance and not
correctness, so we plan to investigate heuristics for when to initiate in-cache GC.
A more complete evaluation to obtain performance estimates requires the implementation of a realistic processor
model and the porting and modification of a state-of-the-art JVM to this model.
In addition, we plan to explore the new applications that these additions can enable or simplify in the virtual
machine. Some of the applications we have been considering are support for transparent persistent object stores and
allowing objects to be relocated within a cluster of servers. Both of these techniques may be enabled at a much lower
overhead through the translator hardware.
Acknowledgements
We thank Dave Ungar, Gary Lauterbach, Adam Talcott, Bernd Mathiske and the anonymous referees for their
comments.
References
[1] A. Goldberg, D. Robson, Smalltalk-80: The Language and its Implementation, Addison-Wesley, 1983.
[2] Y. Shuf, M.J. Serrano, M. Gupta, J.P. Singh, A study of memory behavior of Java workloads, in: Vijaykrishnan and Wolczko [36], pp. 19–39.
[3] J.M. Chang, W. Srisa-an, C.-T.D. Lo, E.F. Gehringer, DMMX: Dynamic memory management extensions, Journal of Systems and Software
63 (3) (2002) 187–199.
[4] K.D. Nilsen, W.J. Schmidt, A high-performance hardware-assisted real time garbage collection system, Journal of Programming Languages
2 (1) (1994) 1–40.
[5] D.S. Hardin, A.P. Mass, M.H. Masters, N.M. Mykris, An efficient hardware implementation of Java bytecodes, threads and processes for
real-time applications, in: Vijaykrishnan and Wolczko [36], pp. 41–51.
[6] M. O’Connor, M. Tremblay, picoJava-1: The Java virtual machine in hardware, IEEE Micro 17 (2) (1997) 45–53.
[7] Intel Corp., iAPX 432 GDP Architecture Reference Manual, 171860-001, 1981.
[8] H. Levy, Capability-based Computer Systems, Digital Press, 1984.
[9] E. Witchel, J. Cates, K. Asanović, Mondrian memory protection, in: Proceedings of the 10th International Symposium on Architectural
Support for Programming Languages and Operating Systems, ACM, San Jose, CA, 2002, pp. 304–316.
[10] D.M. Ungar, The Design and Evaluation of a High-Performance Smalltalk System, MIT Press, 1987.
[11] H.-J. Boehm, A.J. Demers, S. Shenker, Mostly parallel garbage collection, in: Proceedings of the ACM SIGPLAN 1991 Conference on
Programming Language Design and Implementation, Toronto, Ontario, Canada, 1991, pp. 157–164.
[12] I.W. Williams, M.I. Wolczko, An object-based memory architecture, in: Proceedings of the Fourth International Workshop on Persistent
Object Systems, Morgan Kaufmann, Martha’s Vineyard, MA, 1990, pp. 113–130.
[13] G. Czajkowski, L. Daynès, Multitasking without compromise: a virtual machine evolution, in: Proceedings of the 16th Annual ACM
SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications, Tampa Bay, FL, 2001, pp. 125–138.
[14] S.M. Blackburn, P. Cheng, K.S. McKinley, Myths and realities: the performance impact of garbage collection, in: SIGMETRICS
2004/PERFORMANCE 2004: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems,
ACM, New York, NY, 2004, pp. 25–36.
[15] M.I. Wolczko, D.M. Ungar, Method and apparatus for optimizing exact garbage collection using a bifurcated data structure, U.S. Patent
5,900,001 (May 1999).
[16] E. Gagnon, L. Hendren, SableVM: A research framework for the efficient execution of Java bytecode, in: Java Virtual Machine Research and
Technology Symposium, JVM’01, USENIX, Monterey, CA, 2001, pp. 27–40.
[17] IBM Corp., Assembler language reference – AIX 5L for POWER-based Systems, April, 2001.
[18] C.J. Peng, G.S. Sohi, Cache memory design considerations to support languages with dynamic heap allocation, Tech. Rep., 860, University
of Wisconsin Computer Science Dept., 1989.
[19] R. Jones, R. Lins, Garbage Collection: Algorithms for Automatic Dynamic Memory Management, Wiley, 1996.
[20] M.I. Wolczko, I.W. Williams, Multi-level garbage collection in a high-performance persistent object system, in: Proceedings of the Fifth
International Workshop on Persistent Object Systems, Springer-Verlag, Pisa, Italy, 1992, pp. 396–418.
[21] M. Hill et al., SPUR: A VLSI multiprocessor workstation, IEEE Computer 19 (11) (1986) 8–22.
[22] T. Chiueh, An architectural technique for cache-level garbage collection, in: Record of the 1991 Conference on Functional Programming
Languages and Computer Architecture, in: Lecture Notes in Computer Science, vol. 523, Springer-Verlag, Cambridge, MA, 1991,
pp. 520–537.
[23] S. Dieckmann, U. Hölzle, A case for using active memory to support garbage collection, in: Proceedings of the Workshop on Hardware
Support for Objects and Microarchitectures for Java, Austin, TX, 1999, in conjunction with ICCD’99.
[24] J.E. Smith, S. Sastry, T. Heil, T. Bezenek, Achieving high performance via co-designed virtual machines, in: Proceedings of the 1998
International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, IEEE, 1999, pp. 77–84.
[25] A. Diwan, D. Tarditi, J.E.B. Moss, Memory subsystem performance of programs using copying garbage collection, in: Conference Record of
the 21st Annual ACM Symposium on Principles of Programming Languages, Portland, OR, 1994, pp. 1–14.
[26] M. Karlsson, K.E. Moore, E. Hagersten, D.A. Wood, Memory system behavior of Java-based middleware, in: Proceedings of the Ninth
International Symposium on High Performance Computer Architecture, IEEE, Anaheim, CA, 2003, pp. 217–228.
[27] P.R. Wilson, M.S. Lam, T.G. Moher, Caching considerations for generational garbage collection, Tech. Rep. EECS-90-5, University of Illinois
at Chicago, 1990.
[28] M. Horowitz, M. Martonosi, T.C. Mowry, M.D. Smith, Informing memory operations: memory performance feedback mechanisms and their
applications, ACM Transactions on Computer Systems 16 (2) (1998) 170–205.
[29] C. Click, G. Tene, M. Wolf, The Pauseless GC algorithm, in: Proceedings of the First ACM/USENIX International Conference on Virtual
Execution Environments, VEE’05, Chicago, IL, 2005, pp. 46–56.
[30] M.V. Wilkes, Hardware support for memory protection: Capability implementation, in: Proceedings of the First International Symposium on
Architectural Support for Programming Languages and Operating Systems, ACM, Palo Alto, CA, 1982, pp. 107–116.
[31] M. Wolczko, Using a tracing Java virtual machine to gather data on the behavior of Java programs,
http://research.sun.com/people/mario/tracing-jvm/, 1999.
[32] T. Lindholm, F. Yellin, The Java Virtual Machine Specification, Addison-Wesley, 1997.
[33] Standard Performance Evaluation Corporation, SPEC JVM98, http://www.spec.org/jvm98/.
[34] Standard Performance Evaluation Corporation, SPEC JBB2000, Java Business Benchmark, http://www.spec.org/jbb2000/.
[35] G. Wright, M.L. Seidl, M. Wolczko, An object-aware memory architecture, Tech. Rep. SMLI-TR-2005-143, Sun Microsystems Laboratories,
February, 2005.
[36] N. Vijaykrishnan, M.I. Wolczko (Eds.), Java Microarchitectures, Kluwer, 2002.
