Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this article, we study the computation of option pricing using the binomial and trinomial models on processors with a multilevel memory hierarchy. We derive lower bounds on memory traffic between different levels of the hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have near-optimal memory traffic between levels. We have implemented these algorithms on an UltraSparc IIIi processor with a 4-level of memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70% of peak performance.
INTRODUCTION
An option contract is a financial instrument that gives the right to its holder to buy or sell a financial asset at a specified price referred to as strike price, on or before the expiration date. The current asset price, volatility of the asset, strike price, expiration time, and prevailing risk-free interest rate determine the value of an option. Binomial and trinomial option valuation are two popular approaches that value an option using a discrete time model [Kwok 1998; Cox triop with depth n = 4 and 2n + 1 = 9 leaves. et al. 1979] . The binomial option pricing computation is modeled by the directed acyclic pyramid graph G (n) biop with depth n and n + 1 leaves shown in Figure 1 . Here the expiration time is divided into n intervals (defined by n+ 1 endpoints), the root is at the present time, and the leaves are at expiration times. We use G (n) biop to determine the price of an option at the root node iteratively, starting from the leaf nodes.
The trinomial model improves over the binomial model in terms of accuracy and reliability [Kwok 1998 ]. The trinomial option pricing computation is represented using the directed acyclic graph with in-degree 3 denoted G (n) triop of depth n on 2n + 1 leaves shown in Figure 2 . As in the binomial model, we divide the time to expiration into n intervals and let the root be at the present time and the leaves be at expiration times. As in the binomial model, we use G (n) triop to determine the price of an option at the root node iteratively, starting from the leaf nodes. The trinomial model assumes that the price of an asset can go three ways: up, down, and remain unchanged. This is in contrast to the binomial model where the price can only go two ways: up and down.
The number of floating-point operations for pricing an option in both models with n intervals is of the order of n 2 . We compute a better estimate of the option price with increasing value of n [Kwok 1998 ]. Gerbessiotis [2004] , Higham [2002] , and Thulasiram and Bondarenko [2002] have proposed sequential and The multicore processors add another dimension to the memory hierarchy. They have varying degrees of sharing of caches by cores at different levels. Most architectures have cores with a private Level-1 cache. Depending on the architecture, a Level-2 cache is shared by two or more cores and a Level-3 cache is shared by four or more cores. The main memory is typically shared by all cores. The degree of sharing at a level varies from one multicore processor to another. These characteristics have been captured in a new model, called the Unified Memory Model, which is described in Zubair [2008, 2009] .
The rest of the article is organized as follows. The general question of designing algorithms for efficient use of memory hierarchies is discussed in Section 2. Section 3 describes the computational requirements for the two option pricing models. Section 4 develops memory traffic bounds using the memory hierarchy model. In Section 5 we propose optimal algorithms for valuing options using the binomial and trinomial option pricing models. Section 6 discusses implementation and experimental results for the proposed algorithms. In Section 7 we examine cache-oblivious algorithms for memory management. Finally, in Section 8 we draw conclusions.
MEMORY HIERARCHIES
Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy [Hennessy and Patterson 2007; Kumar et al. 1996] . A typical processor today consists of five levels of memory. The level closest to the CPU is Level-0, which is a set of registers, typically 32 to 128; Level-1 and Level-2 are cache memories; Level-3 is the main memory followed by hard disk at Level-4. Some example processors with two level caches are the Intel Pentium III, Sun Ultrasparc IIIi, and IBM Power 3. (See Table I .) As we consider problem sizes that easily fit in main memory available on current systems, we focus on a 4-level memory hierarchy (Level-0 to Level-3). Additionally, our experiments were on a machine that uses multilevel inclusion policy for its memory hierarchy. In memory hierarchies, either the multilevel inclusion or exclusion policy is enforced. In the former, a copy of the value in each location in a Level-l cache is maintained in all higher-level caches. These copies may be dirty, that is, not currently consistent with the value in the lowest level cache containing the original, and are updated as needed. The exclusion policy, which applies to the above rules, does not reserve space for values held in lower-level caches. The results are derived for this case. However, they also hold for the inclusion policy when the memory associated with a cache in the lower bounds is the difference between the capacity of a cache and that of all its subcaches. The cache blocking technique is used to reduce memory traffic to slower memories in the hierarchy [Hennessy and Patterson 2007] . Cache blocking partitions a given computation such that the data required for a partition fits in a processor cache. For computations, where data is reused many times, this technique reduces memory traffic to slower memories in the hierarchy [Hennessy and Patterson 2007] . The cache blocking technique has been extensively applied to linear algebra applications [Dongarra et al. 1990; Anderson et al. 1999; Kågström et al. 1998; Gupta et al. 1998; Goto and van de Geijn 2008; Agarwal et al. 1994a] . Since accessing data from a slower memory is expensive, an algorithm that rarely goes to slower memory performs better. Level-0 blocking helps in reducing the number of load/store instructions by bringing the data into registers and reusing it. Blocking for Level-1 and Level-2 caches increases the reuse from the respective caches and helps in reducing the traffic to the slower level of memory. The amount of memory traffic that can be reduced between different levels of memory depends on the application, memory hierarchy architecture, and the effectiveness of the blocking algorithm.
Another factor that has an influence on the memory traffic is the storage of data. Gustavson [2003] has shown that the inefficiency due to the storage of two-dimensional arrays both in Fortran and C programming languages can be addressed using a new generalized data structure for storing these arrays. One issue with most of these algorithms is that they need to be parameterized to be able to work on different processors with different cache sizes. For this reason various researchers have explored cache-oblivious algorithms [Frigo et al. 1999; Penner 2004] . However, portability comes at a price. Yotov et al. [2007] have experimentally demonstrated that even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-aware programs for dense linear algebra applications. They point to two major reasons for this performance gap: ineffective utilization of the pipeline by cache-oblivious algorithms; and the inability to effectively hide memory latency by cache-oblivious algorithms.
•
7:5
Another reason that portability reduces performance is the difficulty in blocking for Level-0 memory (registers) by cache-oblivious algorithms. A typical cache-oblivious algorithm works by recursively partitioning the computational domain until a computation size is reached that is determined by the call overheads. Stopping the recursion of a cache-oblivious algorithm without being aware of the number of registers available on the processor can lead to ineffective blocking for registers. Modern compilers are capable of unrolling and performing tiling to block for registers. However, an explicit blocking for registers is required in many cases.
OPTION PRICING MODELS
In this section, we describe the computational requirements for option pricing for two standard models. In particular, we describe the computation for pricing a put option contract that gives the right to its holder to sell an asset whose current price is Q at a strike price K ≤ Q with the expiration time T . We assume that the prevailing risk-free interest rate is r, and volatility of the asset is ν. To illustrate the computation for both models, we divide the expiration time into n intervals with each time interval d t = T/n. For more details on these models, please refer to Kwok [1998] and Cox et al. [1979] .
Binomial Model
We use a G (n) biop with time interval d t = T/n to illustrate the computation. In G (n) biop the level increases as we go up the tree. We identify the ith node at level j by ( j, i) , where 1 ≤ j ≤ n + 1 and 1 ≤ i ≤ n + 2 − j . As part of initialization we define asset and option prices at leaf nodes ( j = 1). Asset price q , is simply the option payoff at the node, which is given by c
. Next we iteratively compute option prices at nodes at level j + 1 using prices at level j as defined below:
Here, c j i and q j i = Qd n u 2(i−1)+ j −1 are the option price and asset price, respectively, at ( j, i) . Also, p u and p d are pseudoprobabilities given by
The final output, c n+1 1 is the option price at the root node. Note that computations (2) and (3) are only required for American options. From the memory traffic perspective, the difference between American and European options is that American option requires access to an additional array that stores asset prices.
The computation for a call option is similar except that the expression for payoff (3) is replaced by
In this article, we do not make a distinction between call and put options because from the computation and memory perspective they are identical. It is possible, though to optimize computations further for special cases, for example, American call options without dividends are never exercised early [Kwok 1998 ] and are treated as European options.
Trinomial Model
We identify the ith node at level j by ( j, i) , where 1 ≤ j ≤ n + 1 and 1 ≤ i ≤ 2n + 1 − 2( j − 1). As part of initialization, we define asset and option prices at leaf nodes ( j = 1). Asset price q Here, λ is a free parameter and a value of 1 reduces this model to a binomial model. The initial price of the option at node (1, i) is simply the option payoff at the node, which is given by M AX (K − q 1 i , 0). Next we iteratively compute option prices at nodes at level j + 1 using prices at level j as defined below:
Here, c j i and q j i are the option price and asset price, respectively, at ( j, i); and p u , p m , and p d are pseudoprobabilities given by
Note that computations (5) and (6) are only required for American options. The final output, c n+1 1 is the option price at the root node. The computation for a call option is similar except that the expression for payoff is changed as described in binomial computation for the call option. 
BOUNDS ON MEMORY TRAFfiC
The memory hierarchy model described below was introduced to develop lower bounds on the memory traffic between adjacent levels in a memory hierarchy [Savage 1995] . (See also Savage [1998] , Chapter 11.) This model is an extension of the red-blue model introduced by Hong and Kung [1981] , a game played on directed acyclic graphs with red and blue pebbles in which red (blue) pebbles denote primary (secondary) storage locations. An I/O operation occurs when a blue pebble is placed on a vertex carrying a red pebble or vice versa. The memory hierarchy game described below extends this model to multiple levels. These models have been applied to matrix multiplication, FFT, and applications involving permutations. We use the hierarchical model to derive lower bounds on a 4-level memory hierarchy for option pricing using the binomial and trinomial models. This model captures the details of memory hierarchies that suffice to make explicit the essential dependence of algorithms on the sizes of multiple caches Other models have been proposed to capture similar aspects of this problem. See, for example, Savage and Vitter [1987] , Aggarwal et al. [1987a Aggarwal et al. [ , 1987b , and Vitter [2006] .
The Memory Hierarchy Game
We assume that the capacity of memory at level-l is σ l , for 0 ≤ l ≤ 2, which is the number of words it can hold. We further assume that the caches use the multilevel inclusion policy, which implies that data in the Level-l cache is a subset of the data in the Level-(l + 1) cache.
Let T l (σ , G) denote the memory traffic between levels l and l − 1 in the hierarchy, where traffic is measured by the number of words that move between the levels and σ is a vector denoting the amount of memory available at each level. Later we derive lower bounds on T l (σ , G) for G (n) biop and G (n) triop . The key to deriving these bounds is to compute the S-span of G (n) biop and G (n) triop , which is defined in Section 4.2. 4.1.1 Rules of the Memory Hierarchy Game. The 4-level Memory Hierarchy Game (MHG) is played on directed acyclic graphs (DAGs) with σ l pebbles at Level-l , 0 ≤ l ≤ 2, and an unlimited number of pebbles at level 3. Placement of a Level-l pebble on a vertex corresponds to moving the data associated with the vertex to the Level-l memory. Computations are done on data in the firstlevel memory. Zero-level computations can only be done at a vertex if the data needed is in zero-level memory, that is, the predecessors of the vertex carry zero-level pebbles. Data is moved to Level-l on a vertex only from Level-l − 1 or Level-l + 1. This is possible only if a pebble at Level-l − 1 or Level-l + 1 resides on the vertex. The full set of rules of the MHG is given below.
-R1 (Computation step). A zero-level pebble can be placed on any vertex all of whose immediate predecessors carry zero-level pebbles. -R2 (Pebble deletion). Except for Level-3 pebbles on output vertices, a pebble at any level can be deleted from any vertex. The MHG has resource vector σ = (σ 0 , σ 1 , σ 2 ), where σ j ≥ 1 for 0 ≤ j ≤ 2 is the number of storage locations at Level-l . As we are assuming a multilevel inclusion policy, the total number of storage words up to and including Level -l is also σ l . Zero-level pebbles can slide from a predecessor to a successor vertex, which corresponds to using a register as both the source and target of an operation.
Computational Inequalities
Definition 1. The S-span of a DAG G, ρ (S, G) , is the maximum number of vertices of G that can be pebbled in a zero-level pebble game starting with any initial placement of S red pebbles.
Once we have the S-span of a dag, we apply the following results of MHG developed first in Savage [1995] and strengthened in Savage and Zubair [2009] to derive lower bounds on T l (σ , G).
THEOREM 4.1. Consider a pebbling of the DAG G with n input and m output vertices in an L-level memory hierarchy game under the multilevel inclusion policy. Let ρ(S, G) be the S-span of G and |V * | be the number of vertices in G other than the inputs. Assume that ρ(S, G)/S is a nondecreasing function of S.
Then 
Below we derive new upper bounds on the S-span of the graphs G (n)
biop and G (n) triop which, with the above result, provides new lower bounds on T
and |V * | = n 2 for the two graphs, respectively. PROOF. Initially all vertices are unpebbled and all paths from inputs to the output are pebble-free. There is some last path p n+1 1 from an input to the output that is free of pebbles. This path has n + 1 vertices. When a pebble is placed on the input to p n+1 1 , the graph already had pebbles on each of the paths leading to each of the n other vertices on p n+1 1 . Thus, when the input to p n+1 1 is pebbled, the graph has at least n + 1 pebbles on it.
Lower Bounds for G
To show that the graph can be pebbled completely without repebbling any vertices, place all n + 1 pebbles on the inputs, slide the leftmost pebble up one level, and then proceed to slide the remaining pebbles up one level to pebble the leaves of the subgraph G (n−1) biop with n leaves. The rest follows by induction.
PROOF. Let V be the set of vertices that are pebbled from an initial placement of S pebbles. These are vertices that are pebbled without using any of the caches. V consists of one or more disjoint sets of vertices. Let V 1 , V 2 , . . . , V m be these sets and let s 1 , s 2 , . . . , s m be the number of pebbles placed on the vertices of these sets initially. Then, S = i s i .
Consider one of these sets, say V i . Let k i be the number of vertices on the longest directed path p * in this set. (Edges are directed from leaves toward the output.) Let v L be the last vertex on p * . It follows that V i cannot be larger than the subgraph G
biop that contains p * . Using the argument of Lemma 4.1, there must be at least k i pebbles in V i when the last path to v L is closed. Thus, s i ≥ k i . Furthermore, it is possible to pebble all vertices in G
biop with k i pebbles. The number of vertices that are pebbled in V i other than the vertices carrying pebbles initially is at most
The total number of vertices that can be pebbled, τ , satisfies
Applying Theorem 4.2 we have the following result. 
Lower Bounds for G (n) triop
In the case of trinomial option pricing, the memory traffic is governed by the trinomial graph shown in Figure 2 . PROOF. Initially all vertices are unpebbled and all paths from inputs to the output are pebble-free. There is some last path p n+1 1 from an input to the output that is free of pebbles. This path has n internal vertices plus the leaf vertex. When a pebble is placed on the input to p n+1 1 , the graph already had pebbles on each of the paths leading to each of the n other vertices on p n+1 1 . Since there are two such paths per vertex, when the input to p n+1 1 is pebbled, the graph has at least 2n + 1 pebbles on it.
The graph can also be pebbled with 2n + 1 pebbles without repebbling vertices by sliding the pebbles up one level starting with the leftmost pebble and proceeding to the right. A total of (n + 1) 2 steps is needed.
PROOF. The proof of this result is similar to that of Theorem 4.2. The only difference is that the number of vertices that can be pebbled in G (k) triop with 2k +1 leaves from a starting position in which 2k + 1 pebbles reside on the leaves is
2 , as we now show. If the ith set, V i , which has s i pebbles on it initially, has a longest path of length k i (it has k i + 1 vertices), then s i ≥ 2k i + 1. A total of τ vertices can be pebbled where
Again applying Theorem 4.1, we have the following result. 
OPTIMAL ALGORITHMS
We now develop cache-blocking algorithms that greatly reduce the traffic between levels in a memory hierarchy. They are based on the recursive partition of the computation into smaller blocks, where blocks at each level of recursions fit into the corresponding level of memory hierarchy. Partitioning at the first recursion results in blocks that fit in the Level-2 cache; partitioning at the next recursion results in blocks that fit in the Level-1 cache, and so on. This partitioning ensures that once we bring the required data for a block into a faster memory in hierarchy, we reuse the data a sufficient number of times before bringing in the data for the next block. This reduces the traffic between different levels of the memory hierarchy. The algorithm for binomial option pricing is similar to the one proposed in Zubair and Mukkamala [2008] for a singlelevel cache for European option pricing. However, the algorithm we propose in this article for binomial option pricing works for multiple levels of the memory hierarchy.
• 7:11
Binomial Option Pricing
We present a multilevel algorithm (Algorithm 1), the outermost loop of which is shown below. This algorithm assumes that data in G (n) biop is recursively partitioned into blocks at different levels of the memory hierarchy and that the computation is done from the leftmost leaf. We could equally well have chosen to compute from the rightmost leaf.
The first level of partitioning is for the Level-2 cache and is illustrated in Figure 3 . G 
has α(α+1)/2 blocks, all of which are rhombuses except for the α triangles on the left boundary. Replacing these triangles by rhombuses increases the number of operations by α(m 2 /2), a fraction of 1/(α + 1) of the total.) Replacing triangles by rhombuses has no effect on the number of I/O operations.
The algorithm processes the blocks in the order b Hence, the total number of values needed to process a block 2 of size m is 2m, which is also the storage requirement for the Level-2 cache, σ 2 . Algorithm 1 is a high-level description of the outermost algorithm recursion. We ignore the special handling of the first row of blocks as their first row does not require any computation.
To handle data movement between the Level-1 and Level-2 caches, each m × m rhombus is decomposed into (m/q) 2 q × q rhombuses. By the reasoning given above, the storage requirement for Level-1 cache is σ 1 = 2q. Finally, each q × q rhombus is decomposed into (q/r) 2 r × r rhombuses. The storage requirement for the Level-1 cache is σ 0 = 2r. Here we assume that r 2 Strictly speaking, the number of q j i s needed to process a block is one less than the required number of c j i s.
7:12
• J. E. Savage and M. Zubair divides m and q. A high-level description of the second algorithm is given in Algorithm 2.
5.1.1 Memory Traffic. We estimate the memory traffic between main memory (Level-3 cache) and the Level-2 cache by observing that, while processing a typical block, we replace m values in the Level-2 cache, resulting in a total I/O traffic of 2m values. Note that the total I/O traffic for the first block in the second nested loop of Algorithm 1 is 4m, as the number of values that need to be replaced for processing this block is 2m. However, the memory traffic is dominated by the processing of typical blocks with I/O traffic of 2m values per block. To get the total traffic, we need to multiply the average memory traffic for processing a block by the number of blocks of size m. It follows that the total memory traffic between the Level-3 (main memory) cache and Level-2 cache, T 3 , is given by
Using the same reasoning, we can get an estimate for T 2 and T 1 :
.
From Theorem 4.6 we observe that T 3 , T 2 , and T 1 for the proposed algorithm are optimal to within a constant factor of 8. 
Trinomial Option Pricing
Without loss of generality, the trinomial option pricing algorithm is evaluated from the leftmost vertex. The blocking algorithm for G (n) triop is similar to the blocking algorithm of G biop , we treat an incomplete rhombus as a complete one and assume that m divides 2n + 1. Using similar reasoning as for the binomial, we find the total number of values needed to process a block of size m is σ 2 = 3m.
The order in which the rhombuses are visited makes a difference for this case although it has no effect for the binomial option pricing model. If the rhombuses are visited on diagonals slanting up and to the left, 4m I/O operations are required for each rhombus. However, if the rhombuses are visited by rows, only 2m I/O operations are required. We have implemented both methods. A highlevel description of the outermost algorithm recursion for the first ordering is given in Algorithm 3. We ignore the special handling of the first row of blocks. 5.2.1 Memory Traffic. We estimate the memory traffic between main memory (Level-3 cache) and the Level-2 cache for the two orderings. For the first ordering, while processing a typical block we replace 2m values in the Level-2 cache, resulting in a total I/O traffic of 4m values. For the second ordering, the I/O traffic per block consists of 2m values. Note that the total I/O traffic for blocks on the left boundary is 6m because the number of values that need to be replaced to process them is 3m. However, the memory traffic is dominated by processing of typical blocks with I/O traffic of 4m values per block. To get the total traffic, we need to multiply the average memory traffic for processing a block by the number of blocks of size m. It follows that the total memory traffic between the Level-3 (main memory) cache and the Level-2 cache for the two orderings, T Using the same reasoning, we get estimates for T (i) 2 and T
1 , i ∈ {1, 2}: From Theorem 4.9 we observe that T
3 , T
2 , and T
(1) 1 for the proposed algorithm are optimal to within a factor of 12 whereas T 1 are optimal to within a factor of 6. This difference is due to the observation made earlier that the horizontal ordering results in memory traffic is a factor of 2 less compared to the diagonal ordering. 
IMPLEMENTATION AND EXPERIMENTAL RESULTS
We implemented the proposed algorithms on a Sun Workstation with Solaris 10 OS and UltraSPARC IIIi processor operating at 1050 MHz. The UltraSPARC IIIi processor has a 64-kB Level-1 data cache organized as 4-way set associative and has a line size of 32 bytes. The data cache can hold up to 8K double-precision floating-point data. The processor also has a Level-2 cache of 8 MB which is shared between instruction and data. As we considered problem sizes up to a maximum of 64K leaf nodes for both models, we could accommodate all the required data in the Level-2 cache. Thus for our experimentation, we ignored partitioning for Level-2. The UltraSparc IIIi processor executes two floating-point instructions in one cycle, so the peak performance of the processor is 2.1 GFLOPS. To evaluate the performance of various algorithms, we used wall clock execution time. To evaluate how well a given algorithm matches the underlying architecture, we also computed algorithm performance as the percentage of the theoretical peak performance for the target machine. For example, when we got 1.05 GFLOPS on the Sun Workstation, our code was running at 50% of the peak. All our algorithms were implemented using Fortran 90/95. We compiled all our code including vanilla code with the "-fast" option, which combines various complementary optimizations for the target processor [Garg and Sharapov 2001] . Where specified, for some results we also used the "-nodepend" option to turn off compiler-based cache blocking. As mentioned earlier, to keep our implementation simple, we treated incomplete rhombuses as complete rhombuses. We handled this in the implementation by padding the arrays, storing asset and option price with zeros. This did not have a significant impact on the performance numbers reported in this article.
The performance results for the binomial optimal algorithm when rhombuses are visited on diagonals slanting up and to the left are summarized in Table II . We also include the results for the binomial optimal algorithm when rhombuses are visited by rows in Table III . As expected, the results are similar for both the algorithms. For these results, we compiled the code using the "-fast" and "nodepend" options of the Sun compiler. The first column indicates problem size (we use the number of leaves in the binomial model as the problem size). Observe that we obtained around 70% of the peak performance for this algorithm. One major reason we did not do better than 70% of the peak was due to the the use of the MAX function inside the nested loop (Equation (3) in Section 3). The use of this function creates a bottleneck for the pipeline as it results in a branch instruction inside the nested loop [Hennessy and Patterson 2007] . The results reported in the Table II are for a block size of 512 for Level-1 cache blocking. This value was chosen as a result of an experiment. The maximum size of the Level-1 block was determined by the size of data cache on the processor. Figure 5 plots execution times for various block sizes for n = 65336. From the figure, it is clear that once we went beyond a block size of 4096, the performance dropped. We selected the block size of 512, which is in the acceptable range.
For Level-0 blocking, we selected a block of size 4. The maximum size of Level-0 blocking is determined by the number of registers available on the processor. A register tiling improves the ratio of the number of floatingpoint operations to the number of loads/stores. To understand this, consider Algorithm 5 for a 2 × 2 register tiling based on our optimal algorithm. Note that lines 3 to 9 corresponds to load data from the Level-1 cache into the registers; and lines 22 to 27 correspond to stores from the register into the cache. The rest of lines 10-21 correspond to computations. Observe that a 2 × 2 tiling improves the ratio of the number of floating-point operations to the number of loads/stores from 6/5 to 24/13 (a MAX operation is counted as two floating-point operation A); and it also increases the number of instructions inside the main body giving flexibility in scheduling to minimize 
← y 3 28 end 29 end data dependency. To see the number of floating-point and load/stores operations for an algorithm without tiling, please refer to Equations (1)-(3). We have three floating-point operations due to (1). Note we do not count multiplication with an exponential as it is done outside the inner loop. There is one floating-point operation for (2), and two for (3), resulting in a total of six floating-point operations. There are three loads for c , for a total of five loads/stores. The count for 2 × 2 tiling can be seen from Algorithm 5. A 4 × 4 tiling increases the ratio further to 96/29. In general, for a r × r tiling this ratio is given by 6r 2 /(8r − 3). However, as we increase tiling we increase the required register count. Once we increase tiling beyond the number of available registers on the processor, we start observing spilling in the compiled code that offsets the advantage of tiling. Spilling occurs when the compiler transfers some variables from registers to cache, resulting in slower access. In other words, the number of available registers in the processor limits the largest amount of tiling. For our implementation, we found 4 × 4 tiling to be optimal. It should be mentioned here that modern compilers are capable of loop unrolling and in general are efficient. However, for nested loops with varying bounds like in our case, a user with a knowledge of runtime constraints can do a better job of unrolling.
For comparison, we implemented a vanilla algorithm, which refers to a straightforward implementation of binomial option pricing without any explicit cache blocking for Level-0 and Level-1. The performance results for the vanilla algorithm are summarized in Table IV . For these results, we compiled the vanilla code with "-fast" and "-nodepend" options of the Sun compiler. The first column indicates problem size (note that we used the number of leaves in the binomial model as the problem size). Observe that the vanilla algorithm's performance varied from 8% to 9% of the peak performance as compared to the proposed optimal algorithm that achieved approximately 70% of the peak performance. We also compiled the vanilla code with the "-fast" option and without the "-nodepend" option, thus letting the Sun compiler do cache blocking and unrolling. The results of this experiment are summarized in Table V . Observe, that the compiler-based cached blocking did improve the performance but was still a factor of 2 lower than the optimal algorithm.
The performance results for the two variations of the optimal algorithm for the trinomial model are summarized in Tables VI and VII. Observe that there was a slight improvement for the case when rhombuses was visited by rows, but it was not significant. We suspect that this was due to the interplay between various parameters such as set-associativity, block size, and replacement policy. For comparison we include the results for the vanilla algorithm in Tables VIII  and IX. The optimal algorithm performance results shown in Tables VI and  VII are for a block size of 255 for Level-1 blocking. As in the binomial implementation, we treat all blocks as complete rhombuses by padding the arrays with zeroes. For Level-0 blocking, we used a block size of 5, resulting in a 5 × 5 register tiling.
CACHE-OBLIVIOUS IMPLEMENTATION
A cache-oblivious algorithm [Frigo et al. 1999 ] is one that does not depend on cache parameters such as cache size. Observe that our proposed algorithms partition the computation into blocks of sizes that fit in cache. In other words, we need to know the cache size to implement our optimal algorithms. The number of levels in the memory hierarchy determines the number of recursions in our algorithms. We show that our algorithms can be made independent of these parameters. We also prove that the adapted algorithms are optimal for the cache-oblivious model. The cache-oblivious model consists of a processor with an one-level ideal cache of size Z (it holds Z words) and a line size of L. The performance of a cache-oblivious algorithm is measured by the number of cache misses it experiences. When a cache miss occurs, a line of L words is retrieved from secondary memory. Each cache miss requires that data be retrieved from secondary memory. Note also that, to create space in the cache, a like amount of data generally has to be moved to secondary storage, a fact that may double the amount of I/O. For more details, see Frigo et al. [1999] .
A Cache-Oblivious Algorithm for the Binomial Model
For simplicity we describe the algorithm for a complete rhombus that contains the binomial pyramid of N leaves, 3 where N is a power of 2. The algorithm, similar to the cache-aware algorithms proposed in this article, recursively partitions the computation into smaller blocks. Here we partition a rhombus into four rhombuses of equal size. That is, a rhombus of size N at the start of recursion is partitioned into four rhombuses,
We continue the recursion until we reach a small size determined by the call overheads. We illustrate the recursive kernel in Algorithm 6, where we stop the recursion when we have a rhombus of size 1. Note that the first call to the recursive algorithm is with l = 0, and the last call is with l = log 2 N − 1.
Compute option price for single node using binomial computation (Equations 1-3) 10 end 7.1.1 Analysis. Let P (N ) and M (N ) be estimates for the number of cache misses for a pyramid of size N and the containing rhombus. We derive a bound on M (N ) and estimate P (N ) by M (N )/2 because the rhombus contains about twice as many vertices as the pyramid. We now obtain an approximate bound to M (N ) when the cache has size Z . Clearly, M (N ) ≤ 4M (N /2). At some point in the recursion, l , we reach a stage where we partition the rhombus of size Z /2 into four rhombuses R 1 i, j of size Z /4 each, for 1 ≤ i, j ≤ 2. Observe that data required to process a rhombus of size Z /4 fits in the cache of size Z .
We observe that the four rhombuses can be processed in the order R 
. To see how close this is to optimal, we compute the lower bound on the number of I/O operations using the results of Theorem 4.3. Since this bound takes into account both inputs to and outputs from the cache in terms of number of values (where a value consists of two words), we obtain a lower bound of N 2 /(4Z ) on the number of input operations in number of words. This translates into N 2 /(4Z L) misses when each miss results in retrieving L relevant quantities, an assumption we make. Thus, the proposed cache-oblivious algorithm is optimal within a factor of 24. Observe that the cache-aware algorithm proposed in Section 5.1 incurs 2N
2 /Z L misses on the cache-oblivious model, which is a factor of 3 improvement over the cache-oblivious algorithm. Below we show that this bound can be improved by subdividing each rhombus into more rhombuses.
For this reason and others as explained shortly, we believe the performance of the cache-oblivious algorithm will not be as good as that of the algorithm implemented in this article. The other major reason is the lack of appropriate register tiling in cache-oblivious algorithms as this would require the awareness of Level-0 size (number of registers). For example, if we stop the recursion for the cache-oblivious algorithm at some stage to enable register tiling by unrolling the code, we may perform register tiling that cannot be supported by the number of registers available in the processor. As a result there will be a number of spills resulting in performance degradation. On the other hand, if the recursion stops at a stage resulting in a register tiling of smaller size compared to what can be supported by the number of registers, we still have poor performance. In our experiments, we have observed that register tiling of the right size (that is blocking for Level-0 memory) is critical to the overall performance.
Experimental Results.
We implemented the cache-oblivious algorithm for binomial computing on the Sun workstation. The performance results are summarized in Table X. Note that these results correspond to a terminal rhombus (rhombus when the recursion stops) of size 1. We compiled the code using the "-fast" option of the Sun compiler. For comparison, we also implemented the cache aware version of our algorithm, discussed in Section 5.1, for the complete rhombus. These results are summarized in Table XI . Even with full compiler optimization, the performance of the cache-oblivious algorithm is quite low. One major reason is the call overheads of the recursion. We can reduce this by stopping the recursion earlier, that is, when terminal rhombus is of size greater than 1. We conducted an experiment for a problem size of 65536, where we measured the performance of the cache-oblivious algorithm for different terminal rhombus sizes. We plot these results in Figure 6 . Note that for these results the code was compiled using the "-fast" option. These results indicate that the performance improved with increasing size of terminal rhombus.
We observe a maximum performance of 38% of the theoretical peak for terminal size of 128. Not all of this gain in the performance was due to the reduction of call overheads of the recursion. As the terminal size increased, the compiler was able to optimize the terminal rhombus computation by performing deep unrolling of loops and register tiling. This is obvious when we compare Figure 7 and Figure 8 corresponding to code compiled with "-fast -nodepend" and with no optimization options, respectively. As mentioned earlier, "-nodepend" turns off compiler-based blocking for registers and caches. Looking at these two figures, one can conclude that the initial performance gain in Figure 6 up to a terminal size of 8 is mainly due to the reduction in call overheads of the recursion. The performance gain beyond a terminal rhombus of size 8 is due to the compiler optimization of the terminal rhombus computation. In summary, the question of what is a good terminal rhombus size for the recursion to stop cannot be fully answered by ignoring the underlying processor architecture. This can limit the performance gain of a true cache-oblivious algorithm.
A Cache-Oblivious Algorithm for the Trinomial Model
We cannot use the approach outlined for binomial computation to derive the cache-oblivious algorithm for trinomial computation. Observe that the number of leaves in a trinomial pyramid is not a power of 2 and is given by N = 2n + 1, and there is no complete rhombus as in the case of binomial computation that contains the trinomial pyramid of N leaves. However, we can partition the parallelogram of size, n × 2n + 1, containing the trinomial pyramid of N 0.5% Terminal Rhombus Size % Peak Fig. 8 . Performance of the cache-oblivious algorithm for various terminal rhombus sizes for n = 65536. These results correspond to code compiled using no compiler optimization option.
leaves into two rhombuses of size n × n each with one edge of nodes shared by both rhombuses. For the sake of keeping our implementation simple, we can restrict n to be a power of 2, and thus transform the trinomial computation into a processing of two rhombuses of sizes that are a power of two. We can recursively partition these rhombuses as in the case of binomial computation to derive a cache-oblivious algorithm for trinomial computation. We now outline the recursion for trinomial computation for one of these rhombuses. We start the recursion by partitioning the rhombus of size n into four rhombuses, R 1 i, j of size n/2 each, for 1 ≤ i, j ≤ 2. We continue the recursion until we reach a rhombus of size 1. Note that the first call to the recursive algorithm is with l = 0, and the last call with l = log 2 n − 1.
Algorithm 7. RECURSIVE PROCESSRHOMBUS(R
Compute option price for single node using trinomial computation (Equations 4-6) 10 end 7.2.1 Analysis. For the trinomial case let T (N ) be the number of cache misses on a trinomial pyramid with N leaves. Observe that T (N ) is also the estimate for the number of cache misses in processing one of the two initial rhombuses of size n = (N − 1)/2. We now obtain an approximate bound to T (N ) when the cache has size Z . Clearly, T (N ) ≤ 4T (N /2). At some point in the recursion, l , we reach a stage where we partition the rhombus of size Z /3 into four rhombuses R 1 i, j of size Z /6 each, for 1 ≤ i, j ≤ 2. Observe that data required to process a rhombus of size Z /6 fits in the cache of size Z .
We observe that the four rhombuses can be processed in the order R
Note that in Algorithm 7 we use the first ordering. For this ordering, 3Z /2 load operations are needed on the first and third rhombuses and Z /2 load operations on the other two. This is an average of Z load operations. Thus, we estimate T (Z /6) by Z /L because the Z load operations require a minimum of Z /L cache misses, which is achieved when each line that is retrieved contains relevant data. The recurrence for T (N ) follows:
The solution to this recurrence is
, which is the number of cache misses for trinomial computation.
To see how close this is to optimal, we compute the lower bound on the number of I/O operations using the results of Theorem 4.5. Since this bound takes into account both inputs to and outputs from the cache in terms of number of values (where a value consists of two words), we obtain a lower bound of N 2 /(Z ) on the number of input operations in number of words. This translates into N 2 /(Z L) misses when each miss results in retrieving L relevant quantities, an assumption we make. Thus, the proposed cache-oblivious algorithm is optimal within a factor of 36. Observe that cache-aware algorithm proposed in Section 5.2 incurs 1.5N 2 /Z L misses (horizontal ordering) on the cache-oblivious model, which is a factor of 24 improvement over the cache-oblivious algorithm.
Experimental Results.
We implemented the cache-oblivious algorithm for trinomial computing on the Sun workstation. The performance results are summarized in Table XII. Note that these results correspond to a terminal rhombus of size 1. We compiled the code using the "-fast" option of the Sun compiler. For comparison, we also implemented the cache-aware version of our algorithm for trinomial computing, discussed in Section 5.2, for the complete parallelogram. These results are summarized in Table XIII . Similar to binomial Terminal Block Size % Peak Fig. 9 . Performance of the cache-oblivious algorithm for various terminal rhombus sizes for n = 65536. These results correspond to code compiled using the "-fast" compiler option.
computing, cache-oblivious algorithm for trinomial computing performance is quite low. We also conducted experiments to see the impact of terminal rhombus size and the compiler optimization on the performance of cache oblivious algorithm; see Figures 9-11. These results indicate that the performance improves with increasing size of the terminal rhombus. We observe the maximum performance of 45% of the theoretical peak for terminal size of 128. As explained in the case of binomial computing, not all of this gain in the performance is due to the reduction of call overheads of the recursion. The performance gain beyond a terminal rhombus of size of 8 is due to compiler optimization of terminal rhombus computation.
Improving Cache-Oblivious Algorithms
It is possible to improve the miss complexity of cache-oblivious algorithms and make it closer to the one of cache-aware algorithms. We explain this for the binomial model. We subdivide a rhombus into 16 = 4 × 4 rhombuses of equal size. The reason this helps to decrease the average number of misses is that the fraction of the rhombuses that need half of the maximum I/Os increases. As we increase the number of rhombuses into which a rhombus is subdivided, the Terminal Block Size % Peak Fig. 10 . Performance of the cache-oblivious algorithm for various terminal rhombus sizes for n = 65536. These results correspond to code compiled using the "-fast -nodepend" compiler option. Terminal Block Size % Peak Fig. 11 . Performance of the cache-oblivious algorithm for various terminal rhombus sizes for n = 65536. These results correspond to code compiled using no compiler optimization option fraction of the rhombuses that need half of the maximum I/Os approaches 1. In this case, the cache-aware and cache-oblivious algorithms can give the same performance. However, when the rhombuses are small, the control over the order with which vertices are pebbled falls into the hands of the compiler. If the compiler is not good at data ordering, the overall performance of the algorithm will suffer, as our experimental results showed.
CONCLUSIONS
We have studied the cache-efficient implementation of an important computational problem, namely, options pricing using binomial and trinomial algorithms. We have modeled them as a pebbling problem on two graphs, the binomial and trinomial pyramid graphs. We have exploited an existing framework for the study of memory hierarchies to derive lower bounds on the amount of memory traffic that is needed to pebble these graphs. This has required deriving new bounds on the S-span of these graphs.
We have also given cache-aware memory blocking algorithms to implement the option pricing computation for general memory hierarchies in which cache sizes vary by level. These blocking algorithms give memory traffic which is within a constant factor of optimal. When they are specialized to the four cache levels of the UltraSparc IIIi processor, we have show-that the performance improved by a factor of up to 5 and the code operated at about 70% of the peak performance. It is possible to improve the performance further by algorithmic prefetching [Agarwal et al. 1994b ] and avoiding zeroes computations [Higham 2002 ]. We also suspect that it is possible to partition the nodes of binomial and trinomial trees, where K − q We have also proposed cache-oblivious versions of our partitioning algorithm and implemented them on the Sun workstation. The performance of the cacheoblivious algorithm is a factor of 16 lower compared to the cache-aware algorithm. We demonstrated that it is possible to improve the performance of the cache-oblivious algorithm significantly if we relax the constraint on the algorithm being truly oblivious. In particular, we showed that if we determine the terminal size by experimenting on the target hardware, it is possible to get a performance of a factor of 8. However, even with this improvement the cacheoblivious algorithm is a factor of 2 away from the cache-aware algorithm.
This exercise in algorithm-specific memory blocking has applications to a broader class of options pricing models.
