Guest Editors' Introduction
Cache Memory and Related Problems: Enhancing and Exploiting the Locality
Veljko Milutinovic, Senior Member, IEEE, and Mateo Valero, Member, IEEE
The concept of cache memory has emerged as a solution for the ever-increasing time domain gap between processor technology and memory technology. Since the very early works of Wilkes [13], the concept has evolved into a sophisticated system of hardware-implemented and software-implemented solutions. Actually, the best performance/complexity ratio is obtained through a synergistic interaction of hardware-based and software-based solutions.
The efficiency of the caching system is achieved through appropriate exploitation of the principles of temporal and spatial locality. Traditionally, temporal locality means that a data item or instruction that has just been used is relatively likely to be reused in the near future. Spatial locality means that the next data item or instruction to be used is relatively likely to be, in some way, a neighbor of the one used previously.
In traditional systems, temporal locality is exploited by keeping some of the most recently used data/instructions in the cache memory and by incorporating a cache hierarchy. Spatial locality is exploited by using larger cache blocks and by incorporating prefetching mechanisms into the caching system. As technology becomes more sophisticated, it has become obvious that much better performance can be achieved by incorporating more sophisticated solutions for enhancing and exploiting the locality present in the code or data.
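The effect of these two traditional mechanisms can be illustrated with a small simulation. The following is a minimal sketch, not a model of any particular machine: it assumes a fully associative cache with LRU replacement, and the function name `hit_rate`, the trace names, and all sizes are hypothetical choices made only for illustration.

```python
from collections import OrderedDict

def hit_rate(addresses, num_blocks, block_size):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()  # block number -> None, ordered from LRU to MRU
    hits = 0
    for addr in addresses:
        block = addr // block_size          # block that contains this address
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            if len(cache) >= num_blocks:
                cache.popitem(last=False)   # evict the least recently used block
            cache[block] = None
    return hits / len(addresses)

# Spatial locality only: one sequential sweep over 1024 word addresses.
sequential = list(range(1024))
# Temporal locality only: 8 widely spaced addresses, each reused 128 times.
reused = [i * 4096 for i in range(8)] * 128

# Both configurations hold 128 words in total.
print(hit_rate(sequential, num_blocks=128, block_size=1))  # one-word blocks
print(hit_rate(sequential, num_blocks=16, block_size=8))   # eight-word blocks
print(hit_rate(reused, num_blocks=16, block_size=8))
```

Under these assumptions, the sequential sweep never hits with one-word blocks but hits on seven of every eight accesses with eight-word blocks, while the reused addresses hit on every access after the first eight regardless of block size, because recently used blocks are retained.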
As microprocessors become more complex, cache design and performance are increasingly affected by the solutions utilized in other domains, like superpipelining, superscaling, multithreading, prediction, parallelization, etc. Implementation issues in modern microprocessor systems are taking on new dimensions. The issues of most interest for cache designers are treated in "Implementation Issues in Modern Cache Memories" by Jih-Kwon Peir, Windsor Hsu, and Alan J. Smith, while the impacts of multithreading on cache performance are treated in "Effects of Multithreading on Cache Performance" by Hantak Kwak, Ben Lee, Hurson Ali, Suk-Han Yoon, and Woo-Jong Han.
Optimal local memory performance is investigated in "Investigating Optimal Memory Performance" by Olivier Temam. It is important for the designers to know the theoretical limits before they can concentrate on their own ideas.
As indicated above, it has become obvious that more sophisticated approaches to locality exploitation are needed. Two early attempts involve approaches in which the temporal and the spatial localities are handled by separate cache systems [2], [5]; this is in contrast to the traditional approaches in which the temporal and the spatial localities are treated using unified resources. The so-called split temporal/spatial cache approach can be implemented predominantly in the hardware domain, predominantly in the software domain, or in some combination of the two. Separate cache memories are maintained for data with a predominantly spatial locality and for data with a predominantly temporal locality. In its simplest form, the hardware design parameters of the two subsystems are tuned to the type of locality to be exploited, and the compiler helps with data classification. In its more sophisticated forms, only the temporal part includes the hierarchy and only the spatial part includes forms of prefetching, with data being able to migrate between the spatial and the temporal parts, with or without the assistance of the system software. More recent approaches explore an even wider range of possibilities [3], [8], [10], [12].
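The simplest form of the split approach can be sketched in a few lines. This is a hedged illustration under strong assumptions, not the design of [2] or [5]: both subsystems are fully associative LRU caches, the per-access hint stands in for the compiler's data classification, and the class names, trace, and sizes are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with LRU replacement."""
    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()  # block number -> None, LRU first

    def access(self, addr):
        """Return True on a hit; on a miss, load the block, evicting the LRU one."""
        block = addr // self.block_size
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.num_blocks:
            self.blocks.popitem(last=False)
        self.blocks[block] = None
        return False

class SplitCache:
    """Routes each access to the subsystem named by its locality hint."""
    def __init__(self, temporal, spatial):
        self.parts = {"temporal": temporal, "spatial": spatial}

    def access(self, addr, hint):
        return self.parts[hint].access(addr)

def count_hits(access, trace):
    """Total hits when every (address, hint) pair is passed to `access`."""
    return sum(access(addr, hint) for addr, hint in trace)

# A streamed array (spatial locality) interleaved with four hot scalars
# (temporal locality); the hints play the role of compiler classification.
trace = []
for i in range(256):
    trace.append((10000 + i, "spatial"))
    trace.append(((i % 4) * 1000, "temporal"))

# Both organizations hold 32 words in total.
unified = LRUCache(num_blocks=4, block_size=8)
split = SplitCache(temporal=LRUCache(num_blocks=8, block_size=1),
                   spatial=LRUCache(num_blocks=3, block_size=8))

unified_hits = count_hits(lambda addr, hint: unified.access(addr), trace)
split_hits = count_hits(split.access, trace)
print(unified_hits, split_hits, len(trace))
```

On this particular trace the unified cache loses every access to the hot scalars, because the stream keeps displacing them, whereas the split organization keeps them resident in the small-block temporal part. The outcome depends entirely on the chosen trace and sizes; it shows the mechanism, not a general performance claim.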
Systems with unified treatment of different locality types still prevail and can be classified into a number of correlated categories. Some of them focus on the final goal through appropriate cache architecture and design innovations, with no or no major compiler modifications (trace caching, victim caching, and randomized caching represent important new contributions).
In multiprocessor systems of the SMP (shared-memory multiprocessors or symmetric multiprocessors) and the DSM (distributed shared memory) types, in addition to the traditionally defined locality types (temporal and spatial), a number of additional locality types are present and can/should be exploited (processor locality, cache consistency maintenance locality types, memory consistency modeling locality types, etc.). Appropriate mechanisms are incorporated in order to maintain the cache and memory consistency. More background information on issues of importance can be found in [7], [11]. All of these mechanisms represent a necessary system overhead, but also a source of system-level information which can be utilized for potential performance improvements. In SMP and DSM systems, exploitation of sophisticated locality types is more implicit than explicit.
State-of-the-art research in scalable shared memory multiprocessor systems concentrates on two major research avenues: 1) performance evaluation and sophisticated verification aimed at better understanding of the potentials and ways in which different multiprocessor level localities can be exploited, and 2) architecture innovations aimed at better exploitation of different multiprocessor level localities.
Early multiprocessing-oriented research tried to exploit traditional locality types in the new multiprocessor context. For example, it has been found that entry memory consistency models may work better where the temporal locality prevails in the code, while the lazy release memory consistency models may work better where the spatial locality prevails [1]. Various locality types are also treated in [6].
More recent multiprocessing-oriented research tries to exploit more directly the locality types that are inherent in multiprocessing environments (and nonexistent in uniprocessor environments). For example, data-forwarding, remote-write, or cache-injection try to place data local to what is expected to be the next data processing site; for this purpose, appropriate hardware, software, or combined techniques can be used [4], [9].
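The idea of placing data at the expected next processing site can be sketched with a toy producer/consumer model. This is a deliberately simplified illustration, not the mechanism of [4] or [9]: it assumes write-invalidate coherence, an unbounded consumer cache, and a perfect prediction of the consumer; the function name and parameters are hypothetical.

```python
def consumer_read_misses(rounds, buffer_size, forward):
    """Count consumer read misses on a buffer that a producer rewrites each round."""
    consumer_cache = set()  # addresses with a valid copy at the consumer
    misses = 0
    for _ in range(rounds):
        # Producer phase: every write invalidates the consumer's stale copy.
        for addr in range(buffer_size):
            consumer_cache.discard(addr)
            if forward:
                # Forwarding: push the fresh data to the predicted consumer.
                consumer_cache.add(addr)
        # Consumer phase: read the whole buffer.
        for addr in range(buffer_size):
            if addr not in consumer_cache:
                misses += 1               # data must be fetched on demand
                consumer_cache.add(addr)
    return misses

print(consumer_read_misses(rounds=10, buffer_size=64, forward=False))
print(consumer_read_misses(rounds=10, buffer_size=64, forward=True))
```

Without forwarding, every consumer read in this model misses because the preceding write invalidated the local copy; with forwarding, the data is already local when the consumer needs it. Real systems pay for this with extra traffic and with mispredicted destinations, neither of which is modeled here.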
As already indicated, evaluation of performance potentials is an important ongoing research avenue. Popular protocols are analyzed in various environments and for various applications in "A Quantitative Analysis of the Performance and Scalability of Cache Coherence Protocols" by Mark Heinrich, Vijayaraghavan Soundararajan, Anoop Gupta, and John Hennessy. It has been found that the achieved performance and the optimal protocol change from one application to another; protocol and architecture tuning to the specific locality types typical of the application is absolutely necessary. Popular models are analyzed in a context that permits the additional optimizations possible in ILP (Instruction Level Parallelism) systems in "The Impact of Instruction-Level Parallelism on the Memory System Performance" by Vijay S. Pai, Parthasarathy Ranganathan, Hazim Abdel-Shafi, and Sarita V. Adve. It has been shown that the additional optimizations narrow the performance differences among the various memory consistency models; this is because ILP optimizations equalize the locality patterns in typical code. An important prerequisite for further research is the existence of formal verification tools such as the one in "An Executable Specification and Verifier for Relaxed Memory Order" by Seungjoon Park and David L. Dill. With such tools in hand, one can experiment with different models and observe how they behave in specific applications characterized by specific locality types.
Recent architectural research is characterized by numerous ideas; this fact indicates the prosperity of the field. The approaches which deserve special attention are given in "Excel-NUMA: Toward Programmability, Simplicity, and High Performance" by Zheng Zhang, Marcelo Cintra, and Josep Torrellas, "Coherence Controller Architectures for Scalable Shared Memory Multiprocessors" by Maged Michael, Ashwini K. Nanda, and Beng-Hong Lim, and "Exploiting the Benefits of Multiple-Path Network in DSM Systems: Architectural Alternatives and Performance Evaluation" by Donglai Dai and Dhabaleswar K. Panda. The paper by Zhang et al. introduces the Excel-NUMA approach, which enhances programmability by utilizing the fact that, after a local memory line is written by a processor, the memory location containing the line remains unused and can be used for temporary storage of remote data displaced from local caches. The paper by Michael et al. analyzes various coherence controller architectures and suggests solutions based on the proper characterization of communication patterns and statistics. The paper by Dai and Panda tries to exploit the benefits of multiple-path networks for the best performance and performance/complexity ratio in DSM systems, and proposes the novel block correlated FIFO channels approach to detect and prevent all potential coherence-sensitive race conditions.
In conclusion, this overview tries to shed more light on the ongoing cache research in both the uniprocessing and multiprocessing arenas by pointing to a common new thread which is aimed at intensifying and exploiting different locality types present explicitly or implicitly in the application code or data [2], [5]. It is strongly believed that efficient treatment of locality issues can help achieve a significant improvement in the performance and performance/complexity ratio domains [5].
ACKNOWLEDGMENTS
The special issue editors are thankful to Professor Roger Espasa of the Universidad Politecnica de Catalunya for processing the volume of papers submitted to this special issue and for handling most e-mail correspondence with the special issue authors, and to Dr. Aleksandar Milenkovic of ETF for his helpful suggestions. The total number of papers received was 63, and the total number of evaluations generated by 160 reviewers was 233. Under such conditions, due to the limited space, a number of excellent papers had to be rejected.
