Abstract. Object-oriented systems must implement message dispatch efficiently in order not to penalize the object-oriented programming style. We characterize the performance of most previously published dispatch techniques for both statically- and dynamically-typed languages with both single and multiple inheritance. Hardware organization (in particular, branch latency and superscalar instruction issue) significantly impacts dispatch performance. For example, inline caching may outperform C++-style "vtables" on deeply pipelined processors even though it executes more instructions per dispatch. We also show that adding support for dynamic typing or multiple inheritance does not significantly impact dispatch speed for most techniques, especially on superscalar machines. Instruction space overhead (calling sequences) can exceed the space cost of data structures (dispatch tables), so that minimal table size may not imply minimal run-time space usage.
Introduction
Message dispatch is a central feature of object-oriented languages. Given a receiver object and a selector (i.e., operation name), message dispatch finds the method implementing the operation for the particular receiver object. Since message dispatch is performed at run-time and is a very frequent operation in object-oriented programs, it must be fast. Therefore, the efficient implementation of message dispatch has been the subject of much previous work. Unfortunately, this research has often presented particular dispatch implementations in isolation, without comparing them to other methods. This paper presents several dispatch techniques in a common framework and compares their cost on modern computer architectures. The study includes most previously published dispatch techniques for both statically- and dynamically-typed languages with both single and multiple inheritance. Any comparative study of dispatch mechanisms must be a compromise between breadth and depth since it is impossible to explore the entire design space in a single paper. While the present study considers several aspects of dispatch mechanisms (such as speed and space efficiency), the main focus is on run-time dispatch performance. But even when considering only run-time dispatch speed, a myriad of issues must be addressed. The remainder of this introduction briefly discusses and justifies the issues we address as well as those we don't.
Specific measurements vs. analytical models
Previous studies have evaluated the run-time performance of specific dispatch implementations relative to specific systems, languages, and applications; some have not evaluated run-time performance at all. While specific empirical measurements are useful and desirable, they are also limited in scope. Different languages or applications may have different dispatch characteristics, and an implementor who is trying to choose between dispatch techniques may not yet know how the new system relates to the system used for the specific measurements. As discussed below, different processor implementations also change the relative speed of dispatch mechanisms. Any performance comparison using specific measurements will therefore be relative to the particular processors, languages, applications, and run-time systems used. Instead of giving concrete measurements, we therefore chose to characterize the performance of each dispatch mechanism as a function of several configuration parameters that depend on the hardware and software environment of the system using the dispatch mechanism. To compare dispatch performance in a new system, an implementor merely needs to measure (or approximate) the values of these performance parameters in that system. By keeping our performance analysis abstract, we hope that this study will be helpful to implementors of a wide range of systems, languages, and applications on a range of hardware platforms. To help illustrate specific points or trends in the analysis, we also present absolute performance numbers which were obtained by using typical values (taken from previous studies) for the parameters.
Processor architecture
Dispatch cost is intimately coupled with processor implementation. The same dispatch sequence may have different cost on different processor implementations, even if all of them implement the same architecture (e.g., the SPARC instruction set). In particular, processor pipelining and superscalar execution make it impossible to use the number of instructions in a code sequence as an accurate performance indicator. This paper characterizes the run-time performance of dispatch mechanisms on modern pipelined processors by determining the performance impact of branch latency and superscalar instruction issue. In addition to providing specific numbers for three example architectures, our analysis allows dispatch performance to be computed for a wide range of possible (future) processors. With the rapid change in processor design, it is desirable to characterize performance in a way that makes the dependence on certain processor characteristics explicit, so that performance on a new processor can be estimated accurately as long as the processor's characteristics are known.
Influence of dynamic typing
In dynamically-typed languages, a program may try to invoke an operation on some object for which the operation is undefined ("message not understood" error).
Therefore, each message dispatch usually needs to include some form of run-time check to guarantee that such errors are properly caught and reported to the user. Most techniques that support static typing can be extended to handle dynamic typing as well. Our study shows the additional dispatch cost of dynamic typing for all dispatch mechanisms that can support it.
Single versus multiple inheritance
A system using multiple inheritance (MI) introduces an additional difficulty if compiled code uses hard-coded offsets when addressing instance variables. For example, assume that class C inherits directly from classes A and B (Figure 1). In order to reuse compiled code of class A, instances of C would have to start with the instance variables of A (i.e., A's memory layout must be a prefix of C's layout). But the compiled code in class B requires a conflicting memory layout (B's instance variables must come first), and so it seems that compiled code cannot be reused if it directly addresses instance variables of an object.
Hard-coded offsets can be retained if the receiver object's address is adjusted just before a B method is executed, so that it points to the B subobject within C [Kro85, ES90] (see footnote 1). The adjustment can be different for every class that has B as a (co-)parent. (If dynamic typing is combined with multiple inheritance, it is necessary to keep track of both the unadjusted address and the adjustment.) Strictly speaking, this extra code is not part of method lookup, but if multiple inheritance is allowed, every method invocation is preceded by a receiver address adjustment, and thus we chose to include the cost of this adjustment in our study.
Limitations
This study is limited to single (i.e., receiver-based) dispatch. Since multi-method dispatch techniques (e.g., [KR90, AGS94]) are similar and include single dispatch as an important (very frequent) special case, we hope that the results will nevertheless be useful for implementors or designers of multiple dispatch techniques. Also, we do not consider dynamic inheritance (as used in SELF [CU+91]), i.e., inheritance hierarchies that can change their structure at run-time. Furthermore, we focus on dispatch performance and only briefly discuss other issues such as space overhead and the closed- vs. open-world assumption (see section 6). For space reasons, we consider only the main variant of each technique.
Footnote 1: Alternatively, a system could duplicate code (e.g., as in SELF [CUL89]) or access instance variables indirectly (as is done in Eiffel and Sather [MS94]).
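[Figure 1: layout of an object of class C with its A and B parts; the adjusted pointer pb refers to the B subobject.]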
A further simplification on the hardware side is that we do not consider every possible processor implementation feature. However, as will be explained in section 5.1, the features we consider represent a very large fraction of past and current processors. Further hardware-related limitations are discussed in section 5.5.
Overview of the paper
The remainder of this paper is organized as follows. Sections 2 to 4 briefly review the dispatch techniques evaluated. Section 5 presents the results of our performance analysis, and section 6 discusses space costs and other issues.
Method lookup techniques
Message lookup is a function of the message name (selector) and the receiver class. If lookup speed were unimportant, lookup could be performed by searching class-specific dispatch tables. When an object receives a message, the object's class is searched for the corresponding method, and if no method is found the lookup proceeds in the superclass(es). Since it searches dispatch tables for methods, this technique is called Dispatch Table Search (DTS). The right-hand side of Figure 2 shows the dispatch tables of the class hierarchy on the left. Each entry in a dispatch table contains the method name and its address. As in all other figures, capital letters (A, B, C) denote classes and lowercase letters denote methods.
Since the memory requirements of DTS are minimal (i.e., proportional to the number of methods in the system), DTS is often used as a backup strategy which is invoked when faster methods fail. If desired, DTS can employ hashing to speed up the table search.
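The search described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `ClassInfo` record and the `dts_lookup` function are hypothetical names, and a Python dictionary (which already hashes its keys) stands in for each class-specific dispatch table.

```python
class ClassInfo:
    """Hypothetical class record: a per-class dispatch table plus superclasses."""
    def __init__(self, name, methods, supers=()):
        self.name = name
        self.methods = dict(methods)   # selector -> method (this class only)
        self.supers = list(supers)     # direct superclasses


def dts_lookup(cls, selector):
    """Search the receiver's class first, then its superclass(es), depth-first."""
    if selector in cls.methods:
        return cls.methods[selector]
    for sup in cls.supers:
        found = dts_lookup(sup, selector)
        if found is not None:
            return found
    return None  # the caller would report "message not understood"


A = ClassInfo("A", {"a": "A::a", "b": "A::b"})
B = ClassInfo("B", {"b": "B::b", "c": "B::c"}, [A])
```

Here `dts_lookup(B, "a")` finds the inherited method in A, while `dts_lookup(B, "b")` stops at the overriding definition in B.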
Dynamic techniques
Dynamic techniques speed up message lookup by using various forms of caching at run-time. Therefore, they depend on locality properties of object-oriented programs: caching will speed up programs if the cached information is used often before it is evicted from the cache. This section discusses two kinds of caching: global caching (one large cache per system) and inline caching (one small cache per call site).
Global lookup caches (LC)
First-generation Smalltalk implementations relied on a global cache to speed up method lookup [GR83, Kra83]. The class of a receiver, combined with the message selector, hashes into an index in a global cache. Each cache entry consists of a class, a selector and a method address. If the current class and selector match the ones found in the entry, the resident method is executed. Otherwise, a dispatch table search finds the correct method, and the new class-selector-method triple replaces the old cache entry (direct-mapped cache). Any hash function can be used; to obtain a lower bound on lookup time, we assume a simple exclusive OR of receiver class and selector.
To allow hard-coded instance variable offsets in a multiple inheritance context, the receiver is adjusted by #delta. LC still has to compute a hash function for each dispatch. As we shall see, this computation renders LC too slow compared to other techniques. Nevertheless, LC is a popular fallback method for inline caching.
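A minimal Python sketch of such a direct-mapped cache is given below. It assumes that classes and selectors are represented by small integers; the cache size, the class name `GlobalLookupCache`, and the `slow_lookup` callback (standing in for a dispatch table search) are illustrative assumptions, not details taken from any particular Smalltalk system.

```python
CACHE_SIZE = 8  # assumed power of two so the hash can be masked


class GlobalLookupCache:
    def __init__(self, slow_lookup):
        # Each entry holds a (class_id, selector_id, method) triple.
        self.entries = [None] * CACHE_SIZE
        self.slow_lookup = slow_lookup   # fallback, e.g. dispatch table search
        self.misses = 0

    def lookup(self, class_id, selector_id):
        # Simple exclusive OR of receiver class and selector.
        index = (class_id ^ selector_id) & (CACHE_SIZE - 1)
        entry = self.entries[index]
        if entry is not None and entry[0] == class_id and entry[1] == selector_id:
            return entry[2]              # hit: use the resident method
        self.misses += 1
        method = self.slow_lookup(class_id, selector_id)
        # Direct-mapped: the new triple replaces the old entry.
        self.entries[index] = (class_id, selector_id, method)
        return method
```

Two class/selector pairs that hash to the same index evict each other, which is why the hit ratio of the cache depends on the locality of the program.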
Inline caches (IC)
Often, the type of the receiver at a given call site rarely varies; if a message is sent to an object of type X at a particular call site, it is likely that the next send will also go to an object of type X. For example, several studies have shown that the receiver type at a given call site remains constant 95% of the time in Smalltalk code [DS84, Ung87, UP87]. This locality of type usage can be exploited by caching the looked-up method address at the call site. Because the lookup result is cached "in line" at every call site (i.e., no separate lookup cache is accessed in the case of a hit), the technique is called inline caching [DS84, UP87]. The previous lookup result is cached by changing the call instruction implementing the send, i.e., by modifying the compiled program on the fly. Initially, the call instruction calls the system's lookup routine. The first time this call is executed, the lookup routine finds the target method. Before branching to the target, the lookup routine changes the call instruction to point to the target method just found (Figure 4). Subsequent executions of the send directly call the target method, completely avoiding any lookup. Of course, the type of the receiver could have changed, and so the prologue of the called method must verify that the receiver's type is correct and call the lookup code if the type test fails.
Inline caches are very efficient in the case of a cache hit: in addition to the function call, the only dispatch overhead that remains is the check of the receiver type in the prologue of the target. The above code sequence works for both static and dynamic typing; in the MI case, an inline cache miss updates both the call instruction and the add instruction adjusting the receiver address. The dispatch cost of inline caching critically depends on the hit ratio. In the worst case (0% hit ratio) it degenerates to the cost of the technique used by the system lookup routine (often, a global lookup cache), plus the extra overhead of the instructions updating the inline cache. Fortunately, hit ratios are usually very good, on the order of 90-99% for typical Smalltalk or SELF code [Ung87, HCU91]. Therefore, many current Smalltalk implementations incorporate inline caches.
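The behavior of a single call site can be simulated in Python as follows. This is only a sketch: Python cannot patch call instructions, so a hypothetical `InlineCacheSite` object stands in for the patchable call, and the type check that a real system performs in the target method's prologue is performed in `send` instead. The classes `Square` and `Rect` are invented for the example.

```python
class InlineCacheSite:
    """Simulates one call site whose call instruction can be re-targeted."""
    def __init__(self, selector, system_lookup):
        self.selector = selector
        self.system_lookup = system_lookup  # slow path, e.g. a global lookup cache
        self.cached_class = None
        self.cached_target = None
        self.misses = 0

    def send(self, receiver, *args):
        cls = type(receiver)
        # Stand-in for the receiver type check in the method prologue.
        if cls is not self.cached_class:
            self.misses += 1
            self.cached_target = self.system_lookup(cls, self.selector)
            self.cached_class = cls      # "patch" the call site
        return self.cached_target(receiver, *args)


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rect:
    def __init__(self, w, h):
        self.w, self.h = w, h

    def area(self):
        return self.w * self.h


site = InlineCacheSite("area", getattr)  # getattr(cls, selector) as slow lookup
results = [site.send(x) for x in (Square(2), Square(3), Rect(2, 5), Square(1))]
```

In this run the first send misses, the second hits, and each change of receiver class causes another miss, so `site.misses` ends at 3 for the four sends.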
Polymorphic inline caching (PIC)
Inline caches are effective only if the receiver type (and thus the call target) remains relatively constant at a call site. Although inline caching works very well for the majority of sends, it does not speed up a polymorphic call site with several equally likely receiver types because the call target switches back and forth between different methods, thus increasing the inline cache miss ratio. The performance impact of inline cache misses can become severe in highly efficient systems. For example, measurements of the SELF-90 system showed that it spent up to 25% of its time handling inline cache misses [HCU91]. Polymorphic inline caches (PICs) [HCU91] reduce the inline cache miss overhead by caching several lookup results for a given polymorphic call site using a dynamically generated PIC routine. Instead of just switching the inline cache at a miss, the new receiver type is added to the cache by extending the stub routine. For example, after encountering receiver classes A and B, a send of message m would look as in Figure 5.
A system using PICs treats monomorphic call sites like normal inline caching; only polymorphic call sites are handled differently. Therefore, as long as the PIC's dispatch sequence (a sequence of ifs) is faster than the system lookup routine, PICs will be faster than inline caches. However, if a send is megamorphic (invokes many different methods), it cannot be handled efficiently by PICs. Fortunately, such sends are the exception rather than the rule.
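The "sequence of ifs" can be simulated in Python in the same style. The sketch below is hypothetical: the class `PolymorphicInlineCache`, the `max_entries` limit (beyond which a megamorphic site is no longer extended), and the example classes are assumptions made for illustration.

```python
class PolymorphicInlineCache:
    """Simulates a PIC stub: a growing list of (class, target) type tests."""
    def __init__(self, selector, system_lookup, max_entries=4):
        self.selector = selector
        self.system_lookup = system_lookup
        self.entries = []                # tested in order, like a chain of ifs
        self.max_entries = max_entries   # assumed cut-off for megamorphic sites
        self.misses = 0

    def send(self, receiver, *args):
        cls = type(receiver)
        for cached_class, target in self.entries:
            if cls is cached_class:
                return target(receiver, *args)
        self.misses += 1
        target = self.system_lookup(cls, self.selector)
        if len(self.entries) < self.max_entries:
            self.entries.append((cls, target))   # extend the stub routine
        return target(receiver, *args)


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rect:
    def __init__(self, w, h):
        self.w, self.h = w, h

    def area(self):
        return self.w * self.h


pic = PolymorphicInlineCache("area", getattr)
results = [pic.send(x) for x in (Square(2), Rect(2, 5), Square(3), Rect(1, 1))]
```

Unlike a plain inline cache, the alternating receiver classes cause only one miss per class here, so `pic.misses` ends at 2.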
Static techniques
Static method lookup techniques precompute their data structures at compile time (or link time) in order to minimize the work done at dispatch time. Typically, the dispatch code retrieves the address of the target function by indexing into a table and performing an indirect jump to that address. Unlike lookup caching (LC), static methods usually don't need to compute a hash function since the table index can be computed at compile time. Also, dispatch time usually is constant, i.e., there are no "misses" as in inline caching.
Selector Table Indexing (STI)
The simplest way of implementing the lookup function is to store it in a two-dimensional table indexed by class and selector codes. Both classes and selectors are represented by numeric codes (Figure 6). Unfortunately, for c classes and s selectors the resulting dispatch table is very large (O(c*s)) and very sparse, since most messages are defined for only a few classes. For example, about 95% of the entries would be empty in a table for a Smalltalk image [Dri93b]. With multiple inheritance, every entry consists of a method code address and a delta (the adjustment to the receiver address). To avoid cluttering the graphics, we do not show the latter in any figure.
STI works equally well for static and dynamic typing, and its dispatch sequence is fast. However, because of the enormous space cost, no real system uses selector table indexing. All of the static techniques discussed below try to retain the idea of STI (indexing into a table of function pointers) while reducing the space cost by omitting empty entries in the dispatch table.
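As an illustration, the following Python sketch builds such a two-dimensional table for a toy hierarchy. The function names and the `understands` mapping are hypothetical; the mapping is assumed to list every (class, selector) pair a class understands, including inherited methods.

```python
def build_sti_table(classes, selectors, understands):
    """Build a c x s table; `understands` maps (class, selector) -> method."""
    class_code = {c: i for i, c in enumerate(classes)}
    selector_code = {s: i for i, s in enumerate(selectors)}
    table = [[None] * len(selectors) for _ in classes]   # None = empty entry
    for (c, s), method in understands.items():
        table[class_code[c]][selector_code[s]] = method
    return class_code, selector_code, table


def sti_dispatch(table, class_code, selector_code, cls, selector):
    """Dispatch is a single two-dimensional index operation."""
    return table[class_code[cls]][selector_code[selector]]


classes = ["A", "B"]
selectors = ["a", "b", "c"]
understands = {("A", "a"): "A::a", ("B", "a"): "A::a", ("B", "b"): "B::b"}
cc, sc, table = build_sti_table(classes, selectors, understands)
empty = sum(entry is None for row in table for entry in row)
```

Even in this tiny example half of the six entries are empty, which hints at why the table becomes impractically sparse for a full class library.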
Virtual function tables (VTBL)
Virtual function tables were first used in Simula [DM73] and today are the preferred C++ dispatch mechanism [ES90]. Instead of assigning selector codes globally, VTBL assigns codes only within the scope of a class. In the single-inheritance case, selectors are numbered consecutively, continuing after the highest selector number used in the superclass. In other words, if a class C understands m different messages, the class's message selectors are numbered 0..m-1. Each class receives its own dispatch table (of size m), and all subclasses will use the same selector numbers for methods inherited from the superclass. The dispatch process consists of loading the receiver's dispatch table, loading the function address by indexing into the table with the selector number, and jumping to that function. With multiple inheritance, keeping the selector code correct is more difficult. For the inheritance structure on the left side of Figure 7, functions c and e will both receive a selector number of 1 since they are the second function defined in their respective class. D multiply inherits from both B and C, creating a conflict for the binding of selector number 1. In C++ [ES90], the conflict is resolved by using multiple virtual tables per class. An object of class D has two dispatch tables, D and Dc (see Figure 7 and footnote 1).
Message sends will use dispatch table D if the receiver object is viewed as a B or a D and table Dc if the receiver is viewed as a C. As explained in section 1.4, the dispatch code will also adjust the receiver address before calling a method defined in C. VTBL depends on static typing: without knowing the set of messages sent to an object, the system cannot reuse message numbers in unrelated classes (such as using 0 for the first method defined in a top-level class). Thus, with dynamic typing, VTBL dispatch tables would degenerate to STI tables since any arbitrary message could be sent to an object, forcing selector numbers to be globally unique.
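For the single-inheritance case, the numbering scheme can be sketched in Python as below. This is a simplified illustration, not the layout algorithm of any particular C++ compiler: the function `build_vtables` and its input format are assumptions, and multiple inheritance (with its additional tables and receiver adjustments) is deliberately left out.

```python
def build_vtables(hierarchy):
    """hierarchy: list of (class, superclass or None, {selector: method}),
    with every superclass listed before its subclasses."""
    numbers = {}   # class -> {selector: table index}
    vtables = {}   # class -> list of methods
    for name, parent, methods in hierarchy:
        index = dict(numbers[parent]) if parent else {}
        table = list(vtables[parent]) if parent else []
        for selector, method in methods.items():
            if selector in index:
                table[index[selector]] = method   # override: keep inherited slot
            else:
                index[selector] = len(table)      # new selector: next number
                table.append(method)
        numbers[name], vtables[name] = index, table
    return numbers, vtables


hierarchy = [
    ("A", None, {"a": "A::a", "b": "A::b"}),
    ("B", "A", {"b": "B::b", "c": "B::c"}),
]
numbers, vtables = build_vtables(hierarchy)
```

A send of `b` to any receiver statically known to be an A (or a subclass of A) then indexes slot 1 of the receiver's table, whichever class the receiver actually has.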
Selector coloring (SC)
Selector coloring [D+89, AR92] is a compromise between VTBL and STI. SC is similar to STI, but instead of using the selector to index into the table, SC uses the selector's color. The color is a number that is unique within every class where the selector is known, and two selectors can share a color if they never co-occur in a class. SC allows more compaction than STI, where selectors never share colors, but less compaction than VTBL, where a selector need not have a single global number (i.e., where the selector m can have two different numbers in unrelated classes).
Optimally assigning colors to selectors is equivalent to the graph coloring problem (footnote 2), which is NP-complete. However, efficient approximation algorithms can often approach or even reach the minimal number of colors (which is at least equal to the maximum number of messages understood by any particular class). The resulting global dispatch table is much smaller than in STI but still relatively sparse. For example, 43% of the entries are empty (i.e., contain "message not understood") for the Smalltalk system [Dri93b]. As shown in Figure 8, coloring allows the sharing of columns of the selector table used in STI.
Footnote 1: Due to limited space, we ignore virtual base classes in the discussion of VTBL. They introduce an extra overhead of a memory reference and a subtraction [ES90].
Footnote 2: The selectors are the nodes of the graph, and two nodes are connected by an arc if the two selectors co-occur in any class.
Compared to VTBL, SC has two potential advantages. First, since selector colors are global, only one dispatch table is needed per class, even in the context of multiple inheritance. Second, and for the same reason, SC is applicable to a dynamically-typed environment since any particular selector will have the same table offset (i.e., color) throughout the system and will thus invoke the correct method for any receiver. To guard against incorrect dispatches, the prologue of the target method must verify the message selector, and thus the selector must be passed as an extra argument. Otherwise, an erroneous send (which should result in a "message not understood" error) could invoke a method with a different selector that shares the same color. For example, in Figure 8, message c sent to an E object would invoke b without that check.
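A simple greedy heuristic for assigning colors is sketched below in Python. It is not the algorithm of [D+89] or [AR92]; it merely illustrates the constraint that two selectors may share a color only if they never co-occur in a class. The function name and the input format (each class mapped to the full set of selectors it understands) are assumptions.

```python
def color_selectors(class_selectors):
    """class_selectors: dict class -> set of selectors understood (incl. inherited)."""
    # Build the conflict graph: selectors co-occurring in a class are connected.
    conflicts = {}
    for sels in class_selectors.values():
        for s in sels:
            conflicts.setdefault(s, set()).update(sels - {s})
    colors = {}
    # Greedy heuristic: color the most constrained selectors first.
    for s in sorted(conflicts, key=lambda sel: (-len(conflicts[sel]), sel)):
        used = {colors[t] for t in conflicts[s] if t in colors}
        color = 0
        while color in used:
            color += 1
        colors[s] = color
    return colors


class_selectors = {
    "A": {"a", "b"},
    "B": {"a", "b", "c"},
    "C": {"a", "d"},
}
colors = color_selectors(class_selectors)
```

In this example `b` and `d` never co-occur and end up sharing a color, so three colors suffice for four selectors, which equals the largest number of messages understood by a single class (B).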
Row displacement (RD)
Row displacement [Dri93a] is another way of compressing STI's dispatch table. It slices the (two-dimensional) STI table into rows and fits the rows into a one-dimensional array so that non-empty entries overlap only with empty ones (Figure 9). Row offsets must be unique (because they are used as class identifiers), so no two rows start at the same index in the master array. The algorithm's goal is to minimize the size of the resulting master array by minimizing the number of empty entries; this problem is similar to parse table minimization.
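The fitting step can be illustrated with a first-fit sketch in Python. This is an assumption-laden simplification rather than the algorithm of [Dri93a]: rows are placed in the order given, each at the first unused offset where its non-empty entries land only on empty slots, and no attempt is made to order rows cleverly.

```python
def row_displacement(rows):
    """rows: dict class -> list with a method or None per selector code."""
    master = []            # the one-dimensional master array
    offsets = {}           # class -> row offset (doubles as class identifier)
    used_offsets = set()
    for cls, row in rows.items():
        offset = 0
        while True:
            fits = offset not in used_offsets and all(
                entry is None
                or offset + i >= len(master)
                or master[offset + i] is None
                for i, entry in enumerate(row))
            if fits:
                break
            offset += 1
        # Grow the master array if the row extends past its current end.
        master.extend([None] * (offset + len(row) - len(master)))
        for i, entry in enumerate(row):
            if entry is not None:
                master[offset + i] = entry
        offsets[cls] = offset
        used_offsets.add(offset)
    return offsets, master


rows = {
    "A": ["A::a", None, None],
    "B": [None, "B::b", None],
    "C": ["C::a", None, "C::c"],
}
offsets, master = row_displacement(rows)
```

A dispatch then reads `master[offsets[cls] + selector_code]`; in this example the three rows of three entries fit into a master array of six slots.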
Compact Selector-Indexed Dispatch Tables (CT)
The third table compaction method [VH94], unlike the two previous methods, generates selector-specific dispatch code sequences. The technique separates selectors into two categories. Standard selectors have one main definition and are only overridden in the subclasses (e.g., a and b in Figure 10). Conflict selectors have multiple definitions in unrelated portions of the class hierarchy (e.g., e in Figure 10, which is defined in the unrelated classes C and D). CT uses two dispatch tables, a main table for standard selectors and a conflict table for conflict selectors.
Standard selectors can be numbered in a simple top-down traversal of the class hierarchy; two selectors can share a number as long as they are defined in different branches of the hierarchy. Such sharing is impossible for conflict selectors, and so the conflict table remains sparse (Figure 10). But the allocation of both tables can be further optimized. First, tables with identical entries (such as the conflict tables for C and E) can be shared. Second, tables meeting a certain similarity criterion (a parameter to the algorithm) can be overloaded; divergent entries refer to a code stub which selects the appropriate method based on the type (similar to PIC). In Figure 10(a), the entry for selectors c and b of tables (A, C, E) is overloaded. The required level of similarity affects the compression rate (stricter requirements decrease the compression rate) as well as dispatch speed (stricter requirements decrease the number of overloaded entries and thus improve dispatch speed). Finally, dispatch tables are trimmed of empty entries and allocated onto one large master array as shown in Figure 10. In dynamically-typed languages, a subtype test is needed in the method prologue. For statically-typed languages, only the code stubs of overloaded entries need such a test. Subtype tests are implemented with a simple series of logical operations (a bit-wise AND and a comparison) [Vit95]. Figure 11 shows the code for a call through a CT dispatch table.
This version of the algorithm (from [VH94]) handles only single inheritance because of the lack of a fast type inclusion test for multiple inheritance.
Analysis

Parameters influencing performance
To evaluate the performance of the dispatch mechanisms, we implemented the dispatch instruction sequence of each technique on a simple RISC-like architecture.
Table A-2 in the Appendix lists the resulting instruction sequences. Then, we measured the cost of the dispatch sequences for three hypothetical processor implementations. P92 represents a scalar implementation as it was typical of processor designs in 1992. P95 is a superscalar implementation that can execute up to two integer instructions concurrently, representative of current state-of-the-art processor designs. Finally, P97 is an estimate of a 1997 superscalar processor with four-instruction issue width and a deeper pipeline. Table 2 lists the detailed processor characteristics relevant to the study. In essence, these processors are abstractions of current commercial processors that have been reduced to their most important performance features, namely
• Superscalar architecture. The processor can execute several instructions in parallel as long as they are independent. Since access paths to the cache are expensive, all but P97 can execute at most one load or store per cycle.
• Load latency. Because of pipelining, the result of a load started in cycle i is not available until cycle i + L (i.e., the processor will stall if the result is used before that time).
• Branch penalty. The processor predicts the outcome of a conditional branch; if the prediction is correct, the branch incurs no additional cost. However, if the prediction is incorrect, the processor will stall for B cycles while fetching and decoding the instructions following the branch [HP90]. We assume that indirect calls or jumps cannot be predicted and always incur the branch penalty.
Virtually all processors announced since 1993 exhibit all three characteristics. We also assumed out-of-order execution for the superscalar machines (P95 and P97). To determine the number of cycles per dispatch, we hand-scheduled the dispatch instruction sequences for optimal performance on each processor. In most cases, a single instruction sequence is optimal for all three processors.
The performance of some dispatch techniques depends on additional parameters (listed in Table 3). In order to provide some concrete performance numbers in addition to the formulas, we chose typical values for these parameters (most of them based on previous studies).
Table notes: (a) To simplify the analysis, we assumed L > 1; to the best of our knowledge, this assumption holds for all RISC processors introduced since 1990. (b) No penalty if the branch's delay slot can be filled. (To improve readability, the instruction sequences in Table A [...]
Tables 4 to 6 show dispatch costs as a function of processor parameters (L and B) and algorithmic parameters such as miss ratios.
In the VTBL schedule for P92, instruction 2 cannot execute before cycle L because it depends on instruction 1 (Figure 12). Similarly, instruction 5 can execute at L + L or L + 2 (one cycle after the previous instruction), whichever is later. Since we assume L > 1, we retain 2L. The schedule for P92 also shows that instruction 3 (which is part of the multiple inheritance implementation) is free: even if it were eliminated, instruction 5 still could not execute before 2L since it has to wait for the result of instruction 2. Similarly, instruction 4 is free because it executes in the delay slot of the call (instruction 5) (see footnote 1). As a result, VTBL incurs no overhead for multiple inheritance: both versions of the code execute in 2L + 2 cycles (see Table 4).
P95 (middle part of Figure 12) can execute two instructions per cycle (but only one of them can be a memory instruction, see Table 2). Unfortunately, this capability doesn't benefit VTBL much since its schedule is dominated by load latencies and the branch latency B. Since VTBL uses an indirect call, the processor does not know its target address until after the branch executes (in cycle 2L). At that point, it starts fetching new instructions, but it takes B cycles until the first new instruction reaches the EX (execute) stage of the pipeline [HP90], resulting in a total execution time of 2L+B+1. Finally, P97 can execute up to 4 instructions per cycle, but again this capability is largely unused, except that instructions 2 and 3 (two loads) can execute in parallel. However, the final cycle count is unaffected by this change.
Footnote 1: Recall that P92 machines had a branch latency B = 1, which can be eliminated using explicit branch delay slots; see [HP90] for details. Since we use a fixed branch penalty for P92, B does not appear as a parameter in Table 4.
Overview of dispatch costs
Figure 13a shows the execution time (in processor cycles) of all dispatch implementations on the three processor models, assuming static typing and single inheritance. Not surprisingly, all techniques improve significantly upon lookup caching (LC) since LC has to compute a hash function during dispatch. The performance of the other dispatch mechanisms is fairly similar, especially on P95 which models current hardware. VTBL and SC are identical for all processors; RD and VTBL are identical for all but the P92 processor. Among these techniques, no clear winner emerges since their relative ranking depends on the processor implementation. For example, on P92 VTBL performs best and IC worst, whereas on P97 IC is best and VTBL is worst. (Section 5.4 will examine processor influence in detail.) For dynamic typing, the picture is qualitatively the same (Figure 13b).
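The two VTBL cost formulas quoted in the schedule discussion, 2L + 2 cycles on P92 and 2L + B + 1 cycles on the superscalar models, are easy to evaluate for a given processor. The short Python sketch below does so; the function names are hypothetical, and the load latency L = 3 used in the example is an assumed value chosen for illustration, not a figure quoted from Table 2.

```python
def vtbl_cycles_p92(load_latency):
    """VTBL dispatch cost on the scalar P92 model: 2L + 2 cycles."""
    return 2 * load_latency + 2


def vtbl_cycles_superscalar(load_latency, branch_penalty):
    """VTBL dispatch cost on P95/P97: 2L + B + 1 cycles."""
    return 2 * load_latency + branch_penalty + 1


# Assumed load latency of 3 cycles; B = 3 and B = 6 are the branch
# penalties the text gives for P95 and P97.
p95_cost = vtbl_cycles_superscalar(3, 3)
p97_cost = vtbl_cycles_superscalar(3, 6)
```

With these inputs the deeper pipeline alone raises the cost from 10 to 13 cycles, even though the instruction sequence is unchanged.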
Cost of multiple inheritance and dynamic typing
A closer look at Tables 4 to 6 and Figure 13 shows that supporting dynamic typing is surprisingly cheap for all dispatch methods, especially on superscalar processors like P95 and P97. In several cases (LC, IC, PIC), dynamic typing incurs no overhead at all. For the other techniques, the overhead is still low since the additional instructions can be scheduled to fit in instruction issue slots that would otherwise go unused. Typical overheads are two cycles per dispatch on P95 and one or two cycles on P97. Thus, on superscalar processors dynamic typing does not significantly increase dispatch cost. The cost of supporting multiple inheritance is even lower. On P97, no technique incurs additional overhead for multiple inheritance, and only LC, RD, and IC incur a one-cycle overhead on P95. (However, recall that we have simplified the discussion of VTBL for C++ by ignoring virtual base classes. Using virtual base classes can significantly increase dispatch cost in VTBL.) Since the performance variations between the four scenarios are so small and do not qualitatively change the situation, we will only discuss the case using static typing and single inheritance in the remainder of the paper. The data for the other variations can be obtained from Table A-3 in the Appendix. (Of course, dynamic typing and multiple inheritance can affect other aspects of dispatch implementation; these will be discussed in section 6.)
Influence of processor implementation
According to Figure 13 , the cost (in cycles) of many dispatch techniques drops when moving from a scalar processor like P92 to a superscalar implementation like P95. Apparently, all techniques can take advantage of the instruction-level parallelism present in P95. However, when moving to the more aggressively superscalar P97 processor, dispatch cost rises for many dispatch techniques instead of falling further as one would expect.
Figure 14a shows that the culprit is the penalty for mispredicted branches. It rises from 3 cycles in P95 to 6 cycles in P97 because the latter processor has a deeper pipeline in order to achieve a higher clock rate and thus better overall performance [HP90]. Except for the inline caching variants (IC and PIC), all techniques have at least one unpredictable branch even in the best case, and thus their cost increases with the cost of a branch misprediction. IC's cost increases only slowly because it has no unpredicted branch in the hit case, so that it suffers from the increased branch miss penalty only in the case of an inline cache miss. PIC's cost also increases slowly since monomorphic calls are handled just as in IC, and even for polymorphic sends its branches remain relatively predictable.
[Figure 14 panels: (a) all techniques; (b) inline caching variants vs. VTBL]
Based on this data, it appears that IC and PIC are attractive dispatch techniques, especially since they handle dynamically-typed languages as efficiently as statically-typed languages. However, one must be careful when generalizing this data since the performance of IC and PIC depends on several parameters. In particular, the dispatch cost of IC and PIC is variable, unlike that of most of the static techniques. Figure 14b compares VTBL with PIC and IC for several inline cache miss ratios. As expected, IC's cost increases with decreasing hit ratio. If the hit ratio is 91% or better, IC is competitive with static techniques such as VTBL as long as the processor's branch miss penalty is high (recall that P97's branch miss penalty is 6 cycles). In other words, if a 91% hit ratio is typical of C++ programs, IC would outperform VTBL for C++ programs running on a P97 processor. PIC outperforms VTBL independently of the processor's branch penalty, and it outperforms IC with less than a 95% hit ratio. The performance advantage can be significant: for P97's branch miss penalty of 6 cycles, PIC is twice as fast as VTBL.
Again, this result is dependent on additional parameters that may vary from system to system. In particular, PIC's performance depends on the percentage of polymorphic call sites, the average number of receiver types tested per dispatch, and the frequency and cost of "megamorphic" calls that have too many receiver types to be handled efficiently by PICs. On the other hand, PIC needs only a single cycle per additional type test on P97, so that its efficiency is relatively independent of these parameters. For example, on P97 PIC is still competitive with VTBL if every send requires 5 type tests on average. As mentioned in section 3.3, the average degree of polymorphism is usually much smaller. Therefore, PIC appears to be an attractive choice on future processors like P97 that have a high branch misprediction cost. Nevertheless, the worst-case cost of PIC is higher than that of VTBL, and PIC doesn't handle highly polymorphic code well, so some system designers may prefer to use a method with lower worst-case dispatch cost.
One way to achieve low average-case dispatch cost with low worst-case cost is to combine IC with a static technique like VTBL, SC, or RD. In such a system, IC would handle monomorphic call sites, and the static technique would handle polymorphic sites. (Another variant would add PIC for moderately polymorphic call sites.) The combination's efficiency depends on the percentage of call sites that are handled well by IC. Obviously, call sites with only one target fall in this category, but so do call sites whose target changes very infrequently (so that the rare IC miss doesn't have a significant performance impact). The scheme's dispatch cost is a linear combination of the two techniques' costs. For example, Calder's data [CG94] suggest that at least 66% of all virtual calls in C++ could be handled without misses by IC, reducing dispatch cost on P97 from 13 cycles for a pure VTBL implementation to 13 * 0.34 + 4 * 0.66 ≈ 7.1 cycles for VTBL+IC.
In reality, the performance gain might be even higher since calls from call sites incurring very few misses could also be handled by IC. Even though this data is by no means conclusive, the potential gain in dispatch performance suggests that implementors should include such hybrid dispatch schemes in their list of dispatch mechanisms to evaluate.
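The linear combination used for the hybrid scheme can be written out as a short calculation. This is a sketch under the assumptions stated in the text (66% of calls served by IC without misses, 13 cycles per VTBL dispatch on P97, and the 4 cycles that the text's formula uses for IC); the function name `hybrid_dispatch_cost` is hypothetical.

```python
def hybrid_dispatch_cost(ic_fraction, ic_cost, table_cost):
    """Average cycles per dispatch when IC serves `ic_fraction` of all calls
    without misses and a table-based technique serves the remaining calls."""
    if not 0.0 <= ic_fraction <= 1.0:
        raise ValueError("ic_fraction must be between 0 and 1")
    return ic_fraction * ic_cost + (1.0 - ic_fraction) * table_cost


# Figures quoted in the text for P97.
cost = hybrid_dispatch_cost(ic_fraction=0.66, ic_cost=4, table_cost=13)
print(f"VTBL+IC: {cost:.2f} cycles per dispatch")  # prints "VTBL+IC: 7.06 cycles per dispatch"
```

Raising `ic_fraction` models the case mentioned above in which call sites with very few misses are also handled by IC, although the small cost of those misses is not part of this model.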
Limitations
The above analysis leaves a number of issues unexplored. Three issues are particularly important: cache behavior, application code surrounding the dispatch sequence, and hardware prediction of indirect branches.
We do not consider memory hierarchy effects (cache misses); all results assume that memory references will always hit the first level memory cache. If all dispatch techniques have similar locality of reference, this assumption should not distort the results. However, without thorough benchmarking it remains unsubstantiated.
Application instructions surrounding the dispatch sequence (e.g., instructions for parameter passing) can be scheduled to fit in the "holes" of the dispatch code, lowering the overall execution time, and thus effectively lowering dispatch overhead. Therefore, measuring dispatch cost in isolation (as done in this study) may overestimate the true cost of dispatch techniques. Unfortunately, the effect of co-scheduling application code with dispatch code depends on the nature of the application code and thus is hard to determine. Furthermore, the average basic block length (and thus the number of instructions readily available to be scheduled with the call) is quite small, usually between five and six [HP90]. On superscalar processors (especially on P97) most dispatch sequences have plenty of "holes" to accommodate that number of instructions. Thus, we assume that most techniques would benefit from co-scheduled application code to roughly the same extent.
A branch target buffer (BTB) [HP90] allows hardware to predict indirect calls by storing the target address of the previous call, similar to inline caching. This study assumes that processors do not use BTBs; for most current RISC processors, this assumption holds because BTBs are relatively expensive (since they have to store the full target address, not just a few prediction bits) and because indirect calls are very infrequent in procedural programs.
However, future processors like P97 might incorporate BTBs since they will have enough transistors available to accommodate a reasonably-sized BTB; some processors (most notably, Intel's Pentium processor and its successor P6) have small BTBs today. Interestingly, BTBs behave like inline caches: they work well for monomorphic call sites but badly for highly polymorphic call sites. For example, the performance of VTBL on such a processor would be similar to the VTBL+IC scheme discussed above. The impact of BTBs on dispatch performance can be estimated by reducing the value of branch penalty B in the formulas of Tables 5 and 6, but the extent of the reduction depends on the BTB miss ratio (i.e., inline cache miss ratio) of the application.
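One simple reading of this estimate is to scale B by the BTB miss ratio, on the assumption that a correctly predicted indirect branch pays no penalty and a mispredicted one pays the full penalty. The sketch below encodes only that assumption; the function name and the miss ratios are hypothetical, and the formulas of Tables 5 and 6 are not reproduced here.

```python
def effective_branch_penalty(branch_penalty, btb_miss_ratio):
    """Average penalty per indirect branch, assuming the full penalty is paid
    only when the BTB mispredicts the target (a simplifying assumption)."""
    if not 0.0 <= btb_miss_ratio <= 1.0:
        raise ValueError("btb_miss_ratio must be between 0 and 1")
    return btb_miss_ratio * branch_penalty


# P97's branch miss penalty is 6 cycles; the miss ratios are illustrative.
for miss_ratio in (1.0, 0.5, 0.1):
    b_eff = effective_branch_penalty(6, miss_ratio)
    print(f"BTB miss ratio {miss_ratio:.0%}: effective B = {b_eff:.1f} cycles")
```

A miss ratio of 100% corresponds to a processor without a BTB, which is the case assumed throughout this study.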
Besides the actual speed of message sends, other considerations influence the choice between dispatch techniques. This section discusses the memory costs of dispatch schemes, and how amenable the schemes are to incremental change.
Memory cost
The space overhead of method dispatch falls into two categories: program code and dispatch data structures. Code overhead consists of the instructions required at call sites and in method prologues; stub routines (PIC & CT) are counted towards the data structure cost. The analysis below ignores per-instance memory costs (such as keeping a type field in each instance), although such costs can possibly dominate all other costs (e.g., if more than one VTBL pointer is needed for a class with a million instances).
The space analysis uses the parameters shown in Table 3. Most parameter values are taken from the ParcPlace Visualworks 1.0 Smalltalk system and thus model a fairly large application. For the multiple inheritance overhead we do not give typical values because there are none. The few samples in which multiple inheritance is extensively used in [DH95] show that the overhead varies much more than with single inheritance hierarchies (between 215% and 330% for VTBL), and that it is extremely dependent on how frequently MI is used. Therefore we do not give example space data for multiple inheritance. However, it is obvious in [DH95] that $P_{RD} < P_{VTBL} < P_{SC}$ for samples with [...]
Table note a: As shown in [AR92], the use of multiple inheritance introduces conflicts between selector colors that are hard to deal with and that substantially increase the overhead.
Table note b: Tables are harder to fit together because multiple inheritance causes more irregular empty regions to appear.
Table note c: Every time a class inherits from more than one superclass, overridden method entries are stored together with the appropriate deltas. This overhead depends entirely on the way multiple inheritance is used and is not quantifiable without appropriate code metrics.
[...] IC) is not counted since it only appears once and thus should be negligible. Table 8 shows the space cost computation for all techniques.
In the formulas, the symbols D and C refer to data and code cost; $D_{LC}$, for instance, refers to the data structure cost of LC in the same column. Figure 15 shows the space costs for single inheritance versions of the dispatch techniques, using the classes and methods of the ParcPlace Visualworks 1.0 Smalltalk system as an example. Surprisingly, the code space overhead dominates the overall space cost for six of the eight techniques. Most of that overhead consists of the per-call dispatch code sequence. Much of the literature has concentrated on the size of dispatch tables, treating call code overhead as equivalent among different techniques ([Ung87] is a notable exception). As demonstrated by the above data, minimizing dispatch tables may not reduce the overall space cost if it lengthens the calling sequence, especially in languages with a high density of message sends, like Smalltalk.
Code size can be reduced for most techniques by moving some instructions from the caller to the callee, but only at the expense of a slower dispatch. (LC's code size requirements could be dramatically reduced by doing the lookup out-of-line.) The size of the immediate field in an instruction significantly impacts the code cost of SC, RD, and CT. This study assumes a 13-bit signed immediate field, limiting the range [...]
Footnote 1: We choose not to include the call instruction in each dispatch sequence in the space cost since this instruction is required for direct function calls as well. To include the call instructions, just add c to each entry in Table 8.
Table note a: Two instructions (sethi and setlo) to pass the selector to the lookup routine that actually implements the dispatch table search.
Table note b: Actually, there is a small overhead involved. Every class needs to store the representational offsets of its ancestors. This is much cheaper than storing an offset for every method understood.
The Smalltalk system measured had 5087 selectors, and thus the selector number fits into an immediate. SC needs only one instruction to load the selector code into a register (see Table A-2), but RD takes two instructions for the same action because the selector offset needs two more bits (both zero) to address a word-aligned method. The same phenomenon increases the method prologue overhead in both RD and CT.
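The bit-width argument can be checked with a few lines of arithmetic. This is a sketch that only counts bit patterns: it assumes that all 13 bits of the immediate field are available to encode a selector number and does not model how the sign bit is used. The helper names are hypothetical.

```python
IMMEDIATE_BITS = 13  # width of the immediate field assumed in this study


def bits_needed(distinct_values):
    """Smallest number of bits that can encode `distinct_values` different numbers."""
    if distinct_values < 1:
        raise ValueError("need at least one value to encode")
    return max(1, (distinct_values - 1).bit_length())


def fits_in_immediate(bits, immediate_bits=IMMEDIATE_BITS):
    """True if a quantity of the given width can be loaded with one immediate."""
    return bits <= immediate_bits


selector_bits = bits_needed(5087)  # selector number of the measured Smalltalk system
offset_bits = selector_bits + 2    # word-aligned offset: two extra low-order zero bits

print(f"selector number: {selector_bits} bits, one immediate: {fits_in_immediate(selector_bits)}")
print(f"selector offset: {offset_bits} bits, one immediate: {fits_in_immediate(offset_bits)}")
```

With these numbers the selector number needs 13 bits and the word-aligned offset 15 bits, which is consistent with the one-instruction versus two-instruction difference described above.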
In RD, the reduction in data structure size relative to SC is almost offset by a corresponding increase in code size. The data in Figure 15 are thus relative to the processor architecture. For example, for an architecture with larger immediates (or for smaller applications), CT's space advantage over VTBL would double. Of course, the data also depend on application characteristics such as the proportion of call sites versus number of classes, selectors, and methods.
Given these admonitions, IC and PIC apparently combine excellent average speed with low space overhead. The bounded lookup time of SC and RD is paid for with twice as much memory; VTBL is about one third smaller than those two. CT's small data structure size is offset by its code cost. VTBL, RD, and SC require significantly more data space than DTS because they duplicate information: each class stores all the messages it understands, instead of all the messages it defines. For example, in the Smalltalk system a class inherits 20 methods for each one it defines [Dri93b], so the number of entries stored in the class's dispatch table increases by a factor of 20.
Dynamic typing makes a relatively small difference in space cost. Dynamic techniques have no extra overhead because each dispatch already contains a run-time check to test for the cache hit. Static techniques (see footnote 3) perform the run-time type check in the method prologue, so the overhead grows linearly with the number of defined methods, which is much smaller than the number of call sites.
Footnote 1: The size of immediates varies from architecture to architecture: for example, SPARC has 13 bits, Alpha 8 bits, and MIPS 16 bits.
Footnote 2: Here the crucial quantity is the number of bits necessary to represent a cid (16 bits for the Smalltalk example). For the same reason, CT's dynamic typing cost is higher.
Footnote 3: Excluding VTBL, which only works for statically-typed languages.
Other aspects
The choice of a dispatch technique is influenced by considerations other than space cost and execution speed. A detailed discussion of these factors is beyond the scope of this paper, so we will only briefly mention some of them.
• Sharing code pages. Some operating systems (e.g., Unix) allow processes executing the same program to share the memory pages containing the program's code. With shared code pages, overall memory usage is reduced since only one copy of the program (or shared library) need be in memory even if it is used concurrently by many different users. For code to be shared, most operating systems require the code pages to be read-only, thus disallowing techniques that modify program code on-the-fly (e.g., IC and PIC).
Many of the dispatch techniques discussed here can be modified to address problems such as those outlined above. For example, static techniques can be made more incremental by introducing extra levels of indirection at run-time (e.g., by loading the selector number rather than embedding it in the code as a constant), usually at a loss in dispatch performance. For this study, only the simplest and fastest version of each technique was considered, but any variant can be analyzed using the same evaluation methodology.
Related work
Rose [Ros88] analyzes dispatch performance for a number of table-based techniques, assuming a RISC architecture and a scalar processor. The analysis included both dispatch and tag checking code sequences. The study considers some architecture-related performance aspects such as the limited range of immediates in instructions. Other studies have analyzed the performance of one or two dispatch sequences. For example, Ungar [Ung87] analyzes the performance of IC, LC, and no caching on SOAR, a RISC processor designed to run Smalltalk. Driesen [Dri93b] analyzes algorithmic issues of a number of dispatch techniques for dynamically-typed languages, but without taking processor architecture into account. Hölzle et al. [HCU91] compare IC and PIC for the SELF system running on a scalar SPARC processor. Milton and Schmidt [MS94] compare the performance of VTBL-like techniques for Sather. None of these studies takes superscalar processors into account.
Calder et al. [CG94] discuss branch misprediction penalties for indirect function calls in C++. Their measurements of several C++ programs indicate that inline caching might be effective for many C++ programs (although measurements by Garrett et al. [G+94] are somewhat less optimistic). Calder et al. propose to improve performance with "if-conversion," an inline cache with a statically determined target. For each call site the address of the most frequently called function is determined from execution profiles.
We have considered single dispatch only; multiple dispatch techniques are discussed in [KR90] and [AGS94]. However, singly-dispatched calls are so frequent even in systems offering multiple dispatch that implementations usually special-case these calls. Ingalls [Ing86] shows how to implement multiple dispatch with a sequence of single dispatches, but such implementations may not be optimal [AGS94].
Dispatch overhead can also be reduced by eliminating dispatches (rather than just making them fast). For example, the SELF-93 system inlines 95% of all dispatches [Höl94] with compiler optimizations such as customization [CUL89] and type feedback [HU94]. Similarly, concrete type inference [OPS92, VHU92, APS93, PC94, AH95] or link-time optimizations [App88, Fer95] can determine the concrete receiver types of calls, possibly eliminating dynamic dispatch for many sends.
Conclusions
We have evaluated the dispatch cost of a range of dispatch mechanisms, taking into account the performance characteristics of modern pipelined superscalar microprocessors. On such processors, objectively evaluating performance is difficult since the cost of each instruction depends on surrounding instructions and the cost of branches depends on dynamic branch prediction. In particular, some instructions may be "free" because they can be executed in parallel with other instructions, and unpredictable conditional branches as well as indirect branches are expensive (and likely to become more expensive in the future). On superscalar architectures, counting instructions to estimate performance is highly misleading. We have studied dispatch performance on three processor models designed to represent the past (1992), present (1995), and future (1997) state of the art in processor implementation. We have analyzed the run-time performance of dispatch mechanisms as a function of processor characteristics such as branch latency and superscalar instruction issue, and as a function of system parameters such as the average degree of polymorphism in application code. The resulting formulas allow dispatch performance to be computed for a wide range of possible (future) processors and systems. In addition, we also present formulas for computing the space cost of the various dispatch techniques. Our study has produced several results:
• The relative performance of dispatch mechanisms varies strongly with processor implementation. Whereas some mechanisms become relatively more expensive (in terms of cycles per dispatch) on more aggressively superscalar processors, others become less expensive. No single dispatch mechanism performs best on all three processor models.
• Mechanisms employing indirect branches (i.e., all table-based techniques) may not perform well on current and future hardware since indirect branches incur multicycle pipeline stalls, unless a branch target buffer is present. Inline caching variants pipeline very well and do not incur such stalls. On deeply pipelined superscalar processors like the P97, inline caching techniques may substantially outperform even the most efficient table-based techniques.
• Hybrid techniques combining inline caching with a table-based method may offer both excellent average dispatch cost as well as a low worst-case dispatch cost.
• On superscalar processors, the additional cost of supporting dynamic typing or multiple inheritance is small (often zero) because the few additional instructions usually fit into otherwise unused instruction issue slots.
• Instructions (in particular, per-call code) can contribute significantly to the overall space requirements of message dispatch. In our example system, many techniques spend more memory on dispatch code sequences than on dispatch data structures. Thus, minimizing dispatch table size may not always be the most effective way to minimize the overall space cost, and may in some cases even increase the overall space cost.
Even though selecting the best dispatch mechanism for a particular system is still difficult since it involves many factors, the data presented here should allow dispatch speed and space costs to be accurately estimated for a wide range of systems. Therefore, we hope that this study will be helpful to system implementors who need to choose the dispatch mechanism best suited to their needs.
