A major architectural change is coming that will greatly influence the work of computer architects and system software developers, and will consume silicon area in unexpected places. When I describe this change briefly to computer architecture researchers, they often get an incorrect impression of what I'm saying. I am not saying that system tools for translating object code among architectures will cause a proliferation of very different architectures (as we had until the last decade). Although that might happen, I am saying something different:
Instruction set architectures will evolve into families of what we would today call incompatible processors. Members of these families will be highly customized to specific application areas, but will maintain compatibility, in the sense that they'll run, with high performance, all the software distributed for their family.
These customized CPUs will offer large amounts of instruction-level parallelism with little hardware for performance or compatibility scheduling. They will contain specialized, architecturally visible memories. By "specialized," I mean they will be fitted to the data structures they typically store. By "architecturally visible," I mean they are part of the visible instruction set architecture and thus must be addressed directly by the programmer or compiler. Members of the same ISA family will not have large semantic gaps between them (as truly different ISAs have); rather, they will differ in the size, shape, and extent of their architectural elements. The transistor cost will be high, however, because they must support their specialized memories, hardware for a high degree of instruction-level parallelism, and rapid chip design.
A major enabler of these architectures will be what I call walk-time techniques, software translation techniques that change a program after the code has been distributed, typically while the code is being loaded or running. These techniques are static in nature, because a single change in the code may have its effect very many times (like changes made during compiling, assembling, or linking). "Walk-time" suggests that, like runtime, things are moving, but not quite as fast.
CROSSING THE ISA BARRIER
This prediction seems counter to the current line of thinking, which holds that ISAs aren't going to change. This viewpoint is not without foundation. Ask yourself what computer company last met all three of these conditions:
• It was a new company or new to the computer business.
• Its computers were not compatible with any existing ISA.
• It made a significant dent in the general-purpose computer market.
I believe the answer is MIPS or Next, and the one still in business, Silicon Graphics, dates back to 1984. In taking an informal survey of companies that did cross the ISA barrier, I found that about three companies per year got through it at its peak in the early 1960s. From [...] Kaufmann, 1990) told me in 1994 (paraphrased): "When Dave and I were revising our architecture book, people suggested we leave out all the stuff about ISAs. There aren't going to be any new ones."
AUTOMATED VARIETY
The current lack of ISA variety stems partially from automation and the advantages of interoperability. We've gone from one ISA per manufacturer to a form of standardization. The clothing industry is a good analogy. Once clothes were custom made; now there are N sizes of a given item, and you have to try to find one that fits. Unfortunately, a clothing manufacturer cannot decide to have 500 sizes instead of 50. The cost (manufacturing, inventory, distribution, and so on) is too high to accommodate such a change.
But paradoxically, automation can also add variety in the clothing industry. I see that Levi Strauss & Co. is now offering custom-fit jeans. You go to the store, they fit you up (using a computer), and they ship the jeans to your home. The inventory and display barriers are being crossed. We may soon reenter the world of tailor-made clothes.
In a similar way, walk-time techniques can offer variety within an automated framework. For example, family members might differ according to
• their visible memories, which could be built to match data structures in code;
• the number, word length, dimensions, latencies, repertoires, and interconnections of their functional units, registers, and memories; and
• their instruction encodings and instruction cache placement and structure.
These changes, which are mostly architecturally visible, would today cause architectural incompatibility. Unfortunately, current walk-time techniques are not mature enough to offer solutions. We do use other techniques to handle the small amounts of incompatible customization we see today. One example is multimedia extensions to instruction sets, which first appeared in Hewlett-Packard's PA-RISC [1] but are now in most modern architectures. Despite their incompatibility with the base architecture, their performance has been too great to ignore. Most systems handle the incompatibility with traps, branches to different code versions, or linked libraries. Someday a better system will still carry out these inefficient incompatibility solutions for immediate correctness, but then use a walk-time utility afterward. The utility will gradually replace the code with straight-line instructions that fit the processor. This will yield the full performance advantage of multimedia extensions.
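The trap-first, rewrite-later idea can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not a description of any real system: the opcodes (`add`, `mul`, `madd`), the scratch register `tmp`, and the functions `run` and `walk_time_rewrite` are all hypothetical. A family member that lacks `madd` first executes it correctly through a slow trap path; a walk-time pass then replaces it with a straight-line sequence that the member runs natively.

```python
def execute(op, args, regs):
    """Execute one instruction directly (no trap)."""
    dest, *srcs = args
    vals = [regs[s] for s in srcs]
    if op == "add":
        regs[dest] = vals[0] + vals[1]
    elif op == "mul":
        regs[dest] = vals[0] * vals[1]
    elif op == "madd":                      # multiply-add: dest = a * b + c
        regs[dest] = vals[0] * vals[1] + vals[2]
    else:
        raise ValueError(f"unknown opcode {op}")


def expand(op, args):
    """Equivalent base-opcode sequence for an opcode a member may lack.

    Assumes a reserved scratch register named "tmp" (hypothetical)."""
    if op == "madd":
        dest, a, b, c = args
        return [("mul", "tmp", a, b), ("add", dest, "tmp", c)]
    raise ValueError(f"no expansion for {op}")


def run(program, regs, native):
    """Run the program on a member whose native opcodes are `native`.

    Non-native opcodes are emulated (the slow trap path).
    Returns the number of traps taken."""
    traps = 0
    for op, *args in program:
        if op in native:
            execute(op, args, regs)
        else:
            traps += 1
            for sub_op, *sub_args in expand(op, args):
                execute(sub_op, sub_args, regs)
    return traps


def walk_time_rewrite(program, native):
    """Replace every non-native instruction with its straight-line expansion."""
    rewritten = []
    for op, *args in program:
        if op in native:
            rewritten.append((op, *args))
        else:
            rewritten.extend(expand(op, args))
    return rewritten


program = [("madd", "r0", "r1", "r2", "r0"), ("add", "r3", "r0", "r1")]
member = {"add", "mul"}                     # this member has no madd
first = {"r0": 1, "r1": 2, "r2": 3, "r3": 0}
traps_before = run(program, first, member)
second = {"r0": 1, "r1": 2, "r2": 3, "r3": 0}
traps_after = run(walk_time_rewrite(program, member), second, member)
```

In this toy, the first run takes one trap and the rewritten program takes none, while both leave the registers in the same state. A real utility would also have to handle branch targets, code size changes, and rescheduling, none of which is modeled here.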
The same idea can be applied to far greater levels of customization. Suppose a company wanted to build large volumes of a microprocessor that was specialized for a particular class of media applications, but that could still run normally distributed code, a common desire that is becoming more so. These applications might use, in their most computationally intensive parts, 8-bit instructions, a 200-Kbyte table of known values, and no floating point. Such applications are likely to have abundant instruction-level parallelism, which is typical of multimedia applications.
In such an application, tailored memories and multiple architecturally visible register banks might combine to deliver four times the bandwidth and half the latency. In practice, these might yield enough instruction-level parallelism to go four times faster than an otherwise more capable, general-purpose microprocessor with a large data cache. It will be hard for the microprocessor market to ignore this advantage. Mature walk-time techniques will be able to spot a data structure that can be mapped into a special memory and put it there, locate 8-bit instructions to pack into multimedia instructions, and do all the required scheduling.
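To make "packing 8-bit instructions into multimedia instructions" concrete, here is a minimal sketch of what one packed operation computes. It emulates, with ordinary Python integers, a single 32-bit add that treats its operands as four independent 8-bit lanes. The lane count, the helper names (`pack`, `unpack`, `packed_add`, `scalar_add`), and the lane ordering are assumptions chosen for illustration; they do not correspond to any particular instruction set.

```python
LANES = 4                 # assumed: four 8-bit lanes per 32-bit word
HIGH_BITS = 0x80808080    # top bit of every lane
LOW_BITS = 0x7F7F7F7F     # low seven bits of every lane


def pack(values):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def packed_add(a, b):
    """Add four 8-bit lanes in one word-wide operation, each wrapping mod 256."""
    # Adding only the low seven bits of each lane can never carry into a
    # neighboring lane; the carry lands in that lane's own top bit.
    partial = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold in the operands' top bits with XOR, discarding any carry out.
    return partial ^ ((a ^ b) & HIGH_BITS)


def scalar_add(xs, ys):
    """The four separate 8-bit adds that the packed form replaces."""
    return [(x + y) & 0xFF for x, y in zip(xs, ys)]


xs = [200, 17, 255, 128]
ys = [100, 240, 1, 128]
packed_result = unpack(packed_add(pack(xs), pack(ys)))   # [44, 1, 0, 0]
```

The packed form produces the same lane values as the four scalar adds. A walk-time tool would need to prove that the scalar operations are independent and that their operands can be laid out adjacently before making this substitution.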
EMERGING TECHNIQUES
A few years ago, there was little research on and almost no practical application of these techniques, but this research is becoming increasingly popular. Recently, several types of walk-time techniques have emerged, although commercial applications are still scarce.
• Object code translation [2]. In commercial tools, this mostly involves the semantic gap between new and legacy ISAs. It's not very relevant for what I'm predicting here because the ISAs would be semantically similar.
• Dynamic compiling, optimization, scheduling, and register allocation. These are the most relevant walk-time techniques. This software deals with differing numbers of registers, differing instruction-level parallelism schedules, mapping data structures into special memory elements, optimizations carried out at runtime when more information becomes available, and so on. These haven't yet been commercialized, but there is much academic and industrial work under way. Object-oriented programming and Web-oriented, just-in-time compiling help motivate work in this area. Also garnering attention is statistical profiling during runtime, which is aimed at dynamic recompiling.
While not restricted to walk-time techniques, another important class of problems is the building of compiler back ends flexible enough to generate very high-performance code for an entire family of different processors. The known algorithms for generating code with a lot of instruction-level parallelism are very intricate. Making these algorithms work when the underlying architecture semantics can change dramatically is very hard, and progress has been slow.
The previously unpopular view that you would have to do "heroic" compiling for ambitious superscalars is now conventional wisdom. Similarly, I believe that to keep growing single-processor performance, we'll have to add and subtract functional units, special memories, registers, and customization in a way that the current compatibility model can't live up to, and in a way that compiler technology can't yet handle. Compilers will continue to be critical components in improving performance in any case, as the sidebar "Where Does Compiler Technology Fit In?" describes.
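One small piece of the flexible back-end problem, rescheduling the same code for family members with different issue widths, can be sketched with a simple greedy list scheduler. This is a deliberately simplified illustration rather than one of the intricate production algorithms mentioned above. It assumes a dependence graph with no cycles, latencies of at least one cycle, identical functional units, and program order as the only priority; the operation names and the function `list_schedule` are hypothetical.

```python
def list_schedule(deps, latency, width):
    """Greedy cycle-by-cycle list scheduling.

    deps:    dict mapping each op to the ops whose results it needs,
             listed in program order (used as the priority order)
    latency: dict mapping each op to the cycles before its result is ready
    width:   how many ops this family member can issue per cycle
    Returns a dict mapping each op to its issue cycle."""
    issue = {}
    remaining = list(deps)
    cycle = 0
    while remaining:
        issued = 0
        for op in list(remaining):
            if issued == width:
                break
            # An op is ready once every producer has issued and finished.
            ready = all(
                p in issue and issue[p] + latency[p] <= cycle
                for p in deps[op]
            )
            if ready:
                issue[op] = cycle
                remaining.remove(op)
                issued += 1
        cycle += 1
    return issue


# (a + b) * (c + d), with loads taking 2 cycles, adds 1, and the multiply 3.
deps = {
    "ld_a": [], "ld_b": [], "ld_c": [], "ld_d": [],
    "add1": ["ld_a", "ld_b"],
    "add2": ["ld_c", "ld_d"],
    "mul": ["add1", "add2"],
}
latency = {"ld_a": 2, "ld_b": 2, "ld_c": 2, "ld_d": 2,
           "add1": 1, "add2": 1, "mul": 3}

narrow = list_schedule(deps, latency, width=1)   # mul issues in cycle 6
medium = list_schedule(deps, latency, width=2)   # mul issues in cycle 4
wide = list_schedule(deps, latency, width=4)     # mul issues in cycle 3
```

The same dependence graph yields a different schedule for each member, which is the kind of adjustment a walk-time scheduler would make when code moves between processors in a family. Differences in register counts, memory shapes, and instruction encodings are outside what this sketch covers.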
Predictions of customization apply to embedded processors even more than to general-purpose ones. Processors built for narrow, high-volume uses will be customized exactly to the single program running on them. Flexible code generation methods will be important both for tool set development, and because users will want to rely on interoperability within a processor family as they move from one customized chip to another.
Some people believe that, except in embedded processors, there isn't enough instruction-level parallelism to make customization and the effort to develop walk-time techniques worthwhile. But I doubt they can defend that position to any greater degree than I can defend my belief that most computation-intensive applications potentially have a lot of instruction-level parallelism. One thing is clear: Even in general-purpose computing, the current focus on multimedia yields more instruction-level parallelism than we know what to do with. And it seems unlikely that this trend will reverse any time soon.
Where Does Compiler Technology Fit In?
The flexible back ends and walk-time techniques described in the main text underline a major change in the last 10 to 15 years: Compiling will clearly continue to be central in high-performance computing.
At first, compilers played a small role in riding the performance curve offered by silicon speed and mainstream architecture. But now performance that results from exploiting parallelism is necessary, and parallelism, at any level, seems to require compilers that expose and use that parallelism. During the 1980s, vectorizing compilers matured, and experimentation was just beginning with what have now become the current major techniques for finding instruction-level parallelism. Most research now targets processors that provide both fine- and coarse-grained parallelism. Thread-level parallelism is popular, but the relevant compiler techniques are still immature, and it is not clear that practical applications possess significant thread-level parallelism, or that compilers will ever be capable of finding it when they do.
For both thread-level and multiprocessor parallelism in general, restructuring transformations that expose parallelism have been hard to find. Most compiler progress has been in changing the characteristics that have slowed easily parallelizable code, for example, by optimizing cache behavior, removing false sharing and cache conflicts, and blocking. These are picking away slowly at the problem, but for a large body of code, these techniques will produce smaller speedups than desired, no matter what we do.
