Introduction
The CICA approach has been defined as a method to minimize the probability and the negative effects of a read miss in SM and DSM systems. This paper compares existing solutions (which can be synthesized from features supported by present-day microprocessors) and variations of the proposed solution (which can be synthesized only with the newly proposed cache injection mechanism). The general idea was first introduced in [1]; however, it has been considerably improved through the research presented here [2].
Cache injection/cofetch strategies
The proposed cache injection/cofetch architecture includes three scenarios, in the context where one processor node is a data producer and one or more processor nodes are potential data consumers: (a) one or more consumers express a potential need for certain data (by executing appropriate code generated at compile time), and injection is initiated by the producer (on write-back); (b) one or more consumers express a potential need for certain data, and injection is initiated by a consumer (on first consumer read); (c) no consumer is responsible for expressing the need for certain data; however, a producer (based on its code generated at compile time) forcefully injects the potentially needed data into the cache memory of one or more potential consumers. The term cofetch stresses the fact that the injections described above can be done concurrently at various processor nodes in the system. In this paper, the first two scenarios are elaborated further; the third one is the subject of follow-up work. Figure 1 includes all necessary descriptions and explanations for the existing solutions: (a) classical (processors do not include data prefetching), and (b) prefetching (PF). Figure 2 includes all necessary descriptions and explanations for the cases: (a) cache injection on producer write-back, and (b) cache injection on first consumer read. In the first case, the data producer initiates the write-back command to force the block back into the memory; at the same moment, all processors that have estimated that the related block will be needed inject the block into their respective caches (the IAT table contains the addresses of data estimated to be needed). In theory, the estimation can be done either by the compiler or by the programmer. If the injection destination is the cache itself, there may be implementation problems. Consequently, an intermediate buffer is needed.
Later, data can be automatically transferred from this buffer into the cache, if the estimation proves to be correct and the data are really needed. In the second case, the first of the consumers will not find the needed data in its own cache; however, the other consumers will, due to the above-described action of the first consumer. An important issue here is the state of the arriving block (it must arrive as a read-only block). This is relevant for the first case as well. If several consumers with a right to write are involved, only one is to obtain some type of exclusive rights, which is the subject of follow-up research (the others have to stop injecting in that case, in a predefined way).
0-8186-7758-9/97 $10.00 © 1997 IEEE
Figure 1. Existing solutions: (a) classical, (b) prefetching.
Description: ... reads location X, and third P3 reads location X.
Explanation: (a) The P1 store instruction causes a read-exclusive bus operation, to retrieve the exclusive copy of the cache line. The P2 load instruction initiates a read bus operation; P1 snoops the line and sees that it has the block dirty, and P1 flags this using the dirty line on the bus, which disables the memory and enables P1 to drive the data. P1 then transitions to a shared state along with P2, which reads the data. The memory controller also triggers a memory write in order to update main memory. The P3 load instruction is translated into a read bus operation. Processors P1 and P2 perform a snoop, but they have no dirty copy of the data; they do not interact with this read; (b) The sequence of instructions considered above has been extended with prefetch instructions; P2 and P3 insert prefetch instructions to prefetch data which is expected to be used.
Implication: (a) P1_wait_time=Tm, P2_wait_time=Tm, P3_wait_time=T~~; Bus operations: bus_read_exclusive + 2*(bus_read); (b) P1_wait_time=Tm, P2_wait_time=Tm, P3_wait_time=Tm; Bus operations: bus_read_exclusive + 2*(bus_read).
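The bus-operation counts stated for the classical, prefetching, and two injection cases can be checked with a small model. The following is a minimal Python sketch, assuming a single block X, an idealized snooping bus, and the access sequence in which P1 writes X and then P2 and P3 read X. The function name `simulate`, the strategy labels, and the set-based model of the IAT are illustrative assumptions and are not taken from the paper; the sketch counts bus traffic only and does not model wait times.

```python
from collections import Counter

STRATEGIES = ("classical", "prefetch", "inject_on_writeback", "inject_on_first_read")

def simulate(strategy):
    """Count bus operations for: P1 writes X, then P2 reads X, then P3 reads X.

    Hypothetical toy model: one block, three processors, no timing.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")

    bus = Counter()
    # Processors whose IAT lists block X (assumed: both consumers announce it).
    injecting = strategy in ("inject_on_writeback", "inject_on_first_read")
    iat = {"P2", "P3"} if injecting else set()

    # P1 store: a read-exclusive operation obtains the exclusive copy.
    bus["bus_read_exclusive"] += 1
    cached = {"P1"}

    if strategy == "inject_on_writeback":
        # Producer write-back: every processor with X in its IAT catches the block.
        bus["bus_write"] += 1
        cached |= iat

    for consumer in ("P2", "P3"):
        if consumer in cached:
            continue  # block already injected, so no bus operation is needed
        # Classical load or prefetch: both appear as one read on the bus.
        bus["bus_read"] += 1
        cached.add(consumer)
        if strategy == "inject_on_first_read":
            # Other consumers with X in their IAT catch the block from this read.
            cached |= iat

    return dict(bus)

if __name__ == "__main__":
    for name in STRATEGIES:
        print(name, simulate(name))
```

Under these assumptions the model gives one read-exclusive plus two reads for the classical and prefetching cases, one read-exclusive plus one write for injection on write-back, and one read-exclusive plus one read for injection on first consumer read.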
Initial performance evaluation
Initial performance evaluation is performed using synthetic address traces for a bus-based multiprocessor. The major goal of the simulation analysis is to compare the performance of the existing (Classical, PF) and proposed (CI, CI+PF) solutions. Execution time and bus traffic are used as performance measures. The application of interest is described through a set of workload parameters. Details of the workload, software, system architecture, conditions, assumptions, and related references for each solution are given in [2].
Figure 2. Cache injection: (a) injection on write-back, (b) injection on first consumer read.
Description: Instead of a prefetch instruction, P2 and P3 issue the Drat instruction, which puts the address of the data that is expected to be used in the IAT.
Explanation: (a) When processor P1 issues a write-back instruction, processors P2 and P3 catch the data and put it into their caches; (b) In this case, processor P1 (producer) does not issue the write-back instruction. When processor P2 (first consumer) reads data, processor P3 (second consumer) will catch the data, if the address of that data is in the IAT of processor P3.
Implication: (a) P1_wait_time=TwM, P2_wait_time=T~~, P3_wait_time=T~; Bus operations: bus_read_exclusive + bus_write; (b) P1_wait_time=Tw, P2_wait_time=Tm, P3_wait_time=T~; Bus operations: bus_read_exclusive + bus_read.
The simulation results show that each suggested solution can improve performance, but the right combination of the prefetch and injection mechanisms is the winning solution for a number of different applications of interest. The performance benefit for the solution which combines prefetch and injection approaches is between
