I. INTRODUCTION
Cache memories play an important role in achieving higher performance in modern uni- and multiprocessors. When a high percentage of reads and writes are made to the cache, the effective bandwidth of the memory is that of the cache. Many prior studies have focused on read caching [7]. Here, we focus on write caching. Write buffers [7], write allocate [4], and write through [1], [5] do not address the removal of unnecessary traffic. To prevent unnecessary reads, many systems provide software control of cache write updating [6]. Word validate has been used by [1], [4], and [5], and write allocate has been used by [4].
C. M. Wittenbrink, in [9], used trace analysis to investigate the effect of directly updating the line when it was known in advance that the line is to be written. In this paper, we further investigate the cache write technique called cache write generate. Cache write generate directly updates the cache on write misses, without reading from memory. We show that for a class of applications, the overall performance improvement is significant. We performed the analysis using hardware description language (HDL) simulations and performance measurements of each cache write technique.
Cache write generate (CWG) is defined as cache write validation on a write miss. The cache line is updated with the write, and the cache line tag is modified to the address of the write. Writes that benefit from CWG are computed or initialized by the processor. Examples include dynamically allocated memories, stack segments, static memory segments, and temporary buffers. In image processing and vision applications [3], [8], these memory areas are easy to identify through explicit declaration or by the compiler. CWG is done only on memory areas denoted as generate, and a cache line in a generate memory area may lose its CWG ability to ensure memory consistency. We have developed several schemes to provide self-consistency, but do not discuss them due to space constraints. See our paper [10] for details.
II. SIMULATION MODELS AND HARDWARE SYSTEM

A. Cache Modes, Sizes, and Memory Timings
We compare the relative efficiency of the cache write generate policy to existing write caching controls, using single- and multiprocessor systems. We developed a detailed model of the Intel i860 RISC processor, a custom external cache, and a main memory using the register transfer level simulation language ISP-prime, used with the N.2 simulator. The simulation models were developed for architectural investigation while designing the UW-Proteus system [2], [3], [8]. The UW-Proteus cluster hardware (currently 32 i860's) was used for verification and performance measurement. The simulation models are shared memory multiprocessors, where we varied memory timings, cache sizes, cache control, number of processors, and external or no external caches.
For performance comparisons, we consider the following three scenarios for caching. Case N, normal mode, is where the application is run with normal write back caching, with write around on miss. Read misses are cached. Case A, allocate mode, is write caching on write misses in addition to read caching. Cache lines are first fetched from main memory and then updated in the cache. Case G, generate mode, is write caching without fetches on write misses for generate areas. For nongenerate areas, we use the normal mode (write allocate can also be used).
To investigate the added complexities of performance with thrashing, we simulated two cache sizes, 64k and 256k bytes. We also simulated systems with no external caches, where the caching behavior of the i860 on-chip cache was modified to include cache generate. For the first level cache study, we use the same size cache as the i860, an 8k byte data cache.
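The three write-miss policies can be illustrated with a minimal sketch. It assumes a direct-mapped write-back cache with 32-byte lines and treats a generated line as fully valid after the first write; the class name, line size, and counters are illustrative assumptions and are not taken from the simulated i860 model, and the sketch ignores the consistency schemes mentioned in the Introduction.

```python
from dataclasses import dataclass, field

LINE_BYTES = 32  # assumed line size, not taken from the paper


@dataclass
class WriteBackCache:
    """Direct-mapped write-back cache that counts main-memory transfers."""
    num_lines: int
    mode: str                                            # "N", "A", or "G"
    generate_areas: list = field(default_factory=list)   # [(start, end), ...]
    tags: dict = field(default_factory=dict)             # index -> tag
    dirty: set = field(default_factory=set)              # dirty line indices
    line_reads: int = 0    # line fills from main memory
    line_writes: int = 0   # dirty-line write-backs
    word_writes: int = 0   # writes sent around the cache

    def _locate(self, addr):
        line = addr // LINE_BYTES
        return line % self.num_lines, line // self.num_lines

    def _evict(self, index):
        if index in self.dirty:
            self.line_writes += 1
            self.dirty.discard(index)

    def _in_generate_area(self, addr):
        return any(lo <= addr < hi for lo, hi in self.generate_areas)

    def read(self, addr):
        index, tag = self._locate(addr)
        if self.tags.get(index) != tag:      # read miss: always fill the line
            self._evict(index)
            self.line_reads += 1
            self.tags[index] = tag

    def write(self, addr):
        index, tag = self._locate(addr)
        if self.tags.get(index) == tag:      # write hit: update in place
            self.dirty.add(index)
            return
        generate = self.mode == "G" and self._in_generate_area(addr)
        if self.mode == "N" or (self.mode == "G" and not generate):
            self.word_writes += 1            # write around the cache
            return
        self._evict(index)
        if not generate:
            self.line_reads += 1             # allocate: fetch before updating
        self.tags[index] = tag               # generate: retag without a fetch
        self.dirty.add(index)


def fill_buffer(mode, nbytes=2048):
    """Write a fresh buffer word by word; return (line fills, write-arounds)."""
    cache = WriteBackCache(num_lines=256, mode=mode,
                           generate_areas=[(0, nbytes)])
    for addr in range(0, nbytes, 4):
        cache.write(addr)
    return cache.line_reads, cache.word_writes
```

In this toy model, initializing a 2048-byte buffer sends 512 word writes around the cache in mode N, fetches 64 lines in mode A, and causes no main-memory reads at all in mode G, which is the traffic that generate is meant to remove.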
To investigate the effect of write posting (a nonblocking operation) and replacement policies, we have also simulated two fundamentally different external caches, cache x and cache y. Cache x uses no write posting, no wraparound fills, and no posted replacement or other enhancements. The simplest control is used to see how these devices may have influenced the relative performance of CWG. Cache y (used in the implemented digital hardware of the UW-Proteus system) uses posted writes, wraparound fills, and posted replacements.
In all systems, the on-chip cache, secondary caches, and main memory operate at progressively slower speeds. Let $t_p$ be the processor clock time. The secondary cache time is $k_c t_p$, and the main memory time is $k_m t_p$. In our simulations, for the secondary cache we have $k_c = 2$. For the main memory we have three models: fast, $k_m = 2(4)$; medium; and slow.
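To see why slower memory magnifies any saving in main-memory accesses, the timing relation above can be put into a simple average-access-time sketch. This is a textbook serial-lookup model, not the authors' simulator; the names `t_p`, `k_c`, and `k_m` are labels chosen here for the processor clock time and the two multipliers, and the hit ratios in the usage note are hypothetical placeholders.

```python
def average_access_time(t_p, k_c, k_m, h_onchip, h_secondary):
    """Mean time per access for a two-level hierarchy (serial lookup model)."""
    t_cache = k_c * t_p        # secondary cache access time
    t_memory = k_m * t_p       # main memory access time
    miss_onchip = 1.0 - h_onchip
    miss_secondary = 1.0 - h_secondary
    # Every access pays the on-chip time; misses add the lower levels.
    return t_p + miss_onchip * (t_cache + miss_secondary * t_memory)
```

For example, with `t_p = 1`, `k_c = 2`, and placeholder hit ratios of 0.8 and 0.5, raising `k_m` from 4 to 8 raises the mean access time from 1.8 to 2.2 clock periods, so each avoided main-memory access is worth more as the memory slows down.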
B. Workload
We benchmarked our cache variations with image processing applications using a mathematical morphology algorithm for bright feature detection, shown in Fig. 1. I is the input image operated on by the structuring element SE( ). Memory is used most efficiently by using temporary images a and b, and processing in the eight-step program shown along the left side of Fig. 1. Additionally, we optimized the algorithm to use a minimal amount of memory, shown by the buffers labeled I, a, b, and R in Fig. 1. Buffer a is reused, which helps caching. Flushes are necessary only for the parallel version SPSD, below.
To execute the task graph of Fig. 1, it is partitioned using two types of parallelism. In SPMD (the nd variant), single program multiple data, the data for each task are strictly partitioned. In SPSD (the os variant), single program single data, each function is computed by all of the processors, which uses finer grained sharing. Data are split for processing, with each part given to a separate processor; for four processors, each processor works on 1/4 of the job. In the UW-Proteus system, we use 1M cache and 256 x 256 images, so for 128 x 128 images a 256K cache was the chosen scaling.
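The data split and the cache scaling described above can be sketched as follows. The row-band partitioning is an assumption made for illustration (the text does not say how the image is divided), and both function names are hypothetical.

```python
def partition_rows(height, num_procs):
    """Split image rows into contiguous, nearly equal bands, one per processor."""
    base, extra = divmod(height, num_procs)
    bands, start = [], 0
    for p in range(num_procs):
        stop = start + base + (1 if p < extra else 0)
        bands.append((start, stop))   # half-open row range [start, stop)
        start = stop
    return bands


def scaled_cache_bytes(ref_cache_bytes, ref_side, side):
    """Scale cache size in proportion to the pixel count of a square image."""
    return ref_cache_bytes * (side * side) // (ref_side * ref_side)
```

With four processors, `partition_rows(128, 4)` gives each processor 32 of the 128 rows, and `scaled_cache_bytes(1 << 20, 256, 128)` returns 262144 bytes, the 256K cache used for the 128 x 128 simulations.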
III. SIMULATION AND RESULTS
We have grouped the results into three different cases: i) simulation results when the processor's on-chip cache model remains the same but the secondary cache uses different modes; ii) simulation results when only one level (on-chip) cache is used; and iii) the UW-Proteus system measurements where the secondary cache is programmed to use generate, allocate, or normal write caching.
A. Secondary Cache Results
With the mix of instructions given in Table I, the on-chip cache behavior of the i860 is the same regardless of secondary cache modes. For the external cache, allocate and generate give exactly the same hit ratios for reads and writes (see Table II). Allocate and generate are differentiated by read and write miss penalties. This affects program performance, which can be seen through the number of bus cycles they use, the number of load stalls, and the run time.
B. Bus Cycles
The external cache uses generate to reduce the number of bus cycles to main memory. To illustrate, we present all of the cycles in the system for this program. The on-chip cache loads, cache stores, and instruction cache misses create read, write, line fill, and line flush requests on the bus outside of the processor. These requests are serviced by the external cache. Since the on-chip (i860) behavior is the same for all modes in the secondary cache, the number of external requests is the same for all three modes; these are summarized in Table III for the os program. Table IV gives the total number of external reads and writes to main memory by one processor. In a multiprocessor system with n processors, there will be n times as many bus cycles (external memory read and write cycles). Generate has fewer bus cycles than allocate or normal.
C. Load Stalls
In varying our external cache model, the number of stalled loads varies. A load stall is a memory load operation that cannot be satisfied in a single processor cycle because the data are not available on chip and/or there is bus contention. The 64k caches in all modes have 19852 stalled loads, or 18.38% of all loads. The 256k caches are large enough for no replacements and are more efficient than the 64k caches. The number of stalled loads for the 256k caches is 18424 for N and 17777 for A and G. Because the generate mode writes are more efficient, fewer loads are stalled for that cache. Increasing the cache size reduces stalled loads by 7.19%. Increasing the cache size and using A or G reduces stalled loads by 11.67%, an improvement of 3.64% over 64k cache N mode.
D. Performance
Figs. 3-5 (and Tables V-VII) show the speedups with respect to the memory speed (f, m, or s) for different cache sizes (256K or 64K), different cache implementations (x or y), and different programs (os or nd). These figures show that generate improves performance by a greater amount when the memory is slower. This is as expected, because we are incrementally improving a small percentage of the program: the writes. Generate yields a significant speedup over mode A for all systems with slow memories in the single-processor case. For the more sophisticated y cache model, we achieve a greater speedup.
[Figure: nd program speedup, generate versus normal and allocate, fast memory.]
Allocate and generate perform the same using up to four processors. Beyond that, generate outperforms allocate due to fewer bus cycles and, hence, less contention on the bus. This shows that generate is an effective technique for increasing the number of processors on a shared bus. Generate performs better in comparison to the normal caching mode even when only four processors are sharing the bus.
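The load-stall percentages quoted under Load Stalls can be checked directly from the reported counts. The short calculation below reproduces the three figures; the choice of denominator for each one is inferred from the arithmetic rather than stated in the text, so it should be read as an assumption. In particular, the 3.64% figure matches the gap between the N and A/G counts for the 256k cache.

```python
def pct(delta, base):
    """Percentage of `base` represented by `delta`, rounded to two decimals."""
    return round(100.0 * delta / base, 2)


stalls_64k = 19852        # all modes, 64k cache
stalls_256k_n = 18424     # normal mode, 256k cache
stalls_256k_ag = 17777    # allocate and generate modes, 256k cache

# Larger cache only, relative to the 64k count.
size_only = pct(stalls_64k - stalls_256k_n, stalls_64k)
# Larger cache plus A or G, relative to the resulting (smaller) count.
size_and_mode = pct(stalls_64k - stalls_256k_ag, stalls_256k_ag)
# A or G versus N at 256k, relative to the A/G count.
mode_only = pct(stalls_256k_n - stalls_256k_ag, stalls_256k_ag)

print(size_only, size_and_mode, mode_only)
```

Under these assumed bases the script prints 7.19, 11.67, and 3.64, matching the quoted values.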
E. Single-Level Cache Memory
We also performed simulations using only a one-level on-chip cache by modifying the i860 cache to support generate. We did not change the size of the on-chip cache (it remained 8K for the data cache and 4K for the instruction cache), simulated two memory speeds, fast (f) and medium (m), and varied the number of processors sharing a single bus from one to eight. The speedups for the nd program are shown in Fig. 6 and Table VIII, and the run times are shown in Fig. 7. Generate has a lower number of load stalls and external bus cycles than allocate. It is interesting to note that the normal mode performs best with one processor, but is overtaken by generate with four processors and by both generate and allocate with eight processors. Generate achieves a 17% speedup for the fast memory and a 32% speedup for the medium-speed memory over normal caching when four processors are sharing the bus. The corresponding speedups over allocate are 22% and 27%. With eight processors sharing the bus, the speedups achieved by generate are even better. These results, again, show how generate improves performance on a shared bus.
F. UW-Proteus Performance Results
Lastly, we ran the nd program for 256 x 256 32-bit integer images and an optimized matrix multiplication assembly program to multiply two 256 x 256 floating point matrices on UW-Proteus. We used normal, generate, and allocate caching modes for the secondary cache with one and four processors. Measured speedups of generate over normal were 18% for the nd program. The speedup was only 0.8% for the matrix multiplication. The write frequency is less than 0.4% for the matrix multiplication, so little speedup is expected for one processor with the fast memory system used to implement UW-Proteus. Note that we implemented CWG in hardware and validated our simulation models.
