Tera-scale computing stresses the platform architecture, and memory bandwidth is a likely bottleneck to processor performance. This presents unique challenges to CPU packaging. This paper describes the evolution in packaging technology with each processor generation to meet increasing memory bandwidth needs and the revolution in package technology required for tera-scale computing needs. The scope and focus of the paper are primarily design and electrical performance challenges. We discuss a potential roadmap of transitions in package architecture and technology that evolves from today's off-package memory scenario to increasingly complex on-package integrated memory architectures. An overall treatment of memory hierarchy, including off-die memory approaches, is not within the scope of this paper, although it is relevant to the overall challenge of enabling higher bandwidth. The focus of this paper is on the CPU package itself. In this context, we discuss the memory bandwidth limitations, technology challenges, and tradeoffs of each package architecture.
INTRODUCTION
With a potential transition to tera-scale computing with multi- and many-core microprocessors and integrated memory controllers on the CPU, memory bandwidth becomes a bottleneck to processor performance [1]. This presents unique challenges to CPU packaging. Previous memory bandwidth requirements have scaled steadily, but fairly slowly, from one microprocessor generation to the next. This has driven a steady but slow increase in pin count for chipset packages, which have traditionally provided the link between the microprocessor and the system memory modules. With a transition to multi- and many-core architectures, however, there is a large increase in the memory bandwidth requirement. This transition occurs at the same time as a shift to an integrated memory controller architecture for the CPU. These nearly simultaneous architecture transitions place a tremendous burden on CPU packaging, driving pin count growth and driving up routing density because of the large increase in interconnects that must be routed from the CPU through the package to off-package memory modules.
In this paper we describe the evolution in packaging technology with each processor generation to meet increasing memory bandwidth needs. We focus on the revolution in package technology required for tera-scale computing needs. The scope and focus of this paper are primarily design and electrical performance challenges. We propose a roadmap of transitions in package architecture and technology that evolves from today's off-package memory to increasingly complex on-package integrated memory architectures. We discuss the memory bandwidth limitations, technology challenges, and tradeoffs of each package architecture.
In the first section of this paper we look at memory bandwidth fundamentals. Next, we review the past trends in memory bandwidth requirements and the package technology impact. We follow this with sections describing the memory bandwidth needs for tera-scale computing and the resulting package technology impact and response.
MEMORY BANDWIDTH FUNDAMENTALS
It is useful to review several fundamental concepts as an introduction to the topic of memory bandwidth. First, it is important to understand the definition of memory bandwidth, the key elements related to bandwidth, and the role that the package interconnect plays. Very basically, memory bandwidth is defined as the product of the memory bus width and the memory bus speed. Besides the actual memory bandwidth, other key elements of memory bandwidth are latency and capacity. Latency is the round-trip time that it takes to receive a response after a request has been sent. Latency is typically measured in nanoseconds (ns). Capacity refers to the size of the memory and is typically measured in megabytes (MB).
The memory subsystem hierarchy of a computer architecture consists of many levels. Memory can be located at the chip level, the package level, the board level, and in separate devices off the board (such as the hard disk). There is a tradeoff among the types and the key elements of memory (bandwidth, latency, and capacity) depending upon the location in the memory subsystem hierarchy. Very simply, faster, lower-capacity memory is typically located on-chip, while slower, higher-capacity memory is located off-chip. On-chip memory usually uses Static Random Access Memory (SRAM) technology, which is fast but expensive, and it is low-density compared to other memory technologies. On-chip memory usually serves as a cache and can be further divided into levels of cache, e.g., L1 cache, L2 cache, etc. [2]. Off-chip memory typically uses Dynamic Random Access Memory (DRAM) technology, which is slower but cheaper, and it is higher-density than SRAM. Off-chip memory located on the system board serves as the main memory for the computer system.
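As a rough illustration of the width-times-speed relationship, the following minimal sketch computes peak bandwidth from a bus width and a per-pin data rate. The function name and the example numbers (a 64-bit bus at 1.6 Gb/s per pin) are assumptions chosen for illustration, not values taken from this paper.

```python
def memory_bandwidth_gbytes_per_s(bus_width_bits: int,
                                  data_rate_gbits_per_s: float) -> float:
    """Peak bandwidth in GB/s: bus width (bits) times per-pin data rate (Gb/s).

    Dividing by 8 converts gigabits to gigabytes. This is a peak figure;
    protocol overhead and bus utilization are ignored.
    """
    if bus_width_bits <= 0 or data_rate_gbits_per_s <= 0:
        raise ValueError("bus width and data rate must be positive")
    return bus_width_bits * data_rate_gbits_per_s / 8.0


if __name__ == "__main__":
    # Hypothetical bus: 64 data bits, each pin running at 1.6 Gb/s.
    print(memory_bandwidth_gbytes_per_s(64, 1.6), "GB/s")
```

The same expression shows why bandwidth can be raised either by widening the bus (more package pins) or by raising the per-pin rate (tighter electrical requirements on the package interconnect).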
Today's typical computer architecture consists of the microprocessor (CPU), the chipset, and the main memory. Busses connect the various components of the system. Figure 1 illustrates a typical system architecture consisting of a microprocessor connected to a chipset through the system bus. The chipset in this example is divided into a Memory Controller Hub (MCH) and a separate Graphics Processing Unit (GPU). Each has a memory bus connecting to on-board memory. The CPU reaches the on-board main system memory through the system bus and the MCH. In this example, there are potential bottlenecks at each interconnect transition with respect to providing memory bandwidth to the CPU. Specifically, there is a transition from the CPU to the MCH through the system bus, and there is a transition from the MCH to the system memory through the main memory bus. The challenge to memory bandwidth in this traditional architecture has been to scale the capabilities of both the system bus and the main memory bus to keep up with the steadily increasing memory demand of the CPU with each new generation. Figure 2 illustrates the historical trend of the system bus bandwidth capability vs. the memory bandwidth of the system. The two bandwidths have needed to scale together for optimum system performance. Scaling of bus capability has usually involved increasing the bus width while simultaneously increasing the bus speed.
The ability of packaging technology to serve as an intermediary interconnect, bridging the gap between the chip and the much coarser pitch and low-density features of the system board, has been critical to enabling increasing system memory bandwidth in the past. Packaging technology will continue to play a critical and increasingly important enabling role as we transition to tera-scale computing architectures.
REVIEW OF PACKAGE TECHNOLOGY EVOLUTION VS. MEMORY BANDWIDTH REQUIREMENTS
In the traditional system architecture of Figure 1, the packaging challenges associated with ever-increasing memory bandwidth impacted both chipset and CPU packaging. The chipset package has usually absorbed the need for increasing numbers of connections to system memory, while the CPU packaging dealt primarily with the need for an interconnect that could support increases in the system bus speed. Two almost simultaneous system architecture transitions are collapsing the focus of the memory bandwidth packaging challenges to primarily CPU packaging.
The first transition is the move to an Integrated Memory Controller (IMC) in the CPU. Figure 3 illustrates the shift in the system architecture from that of Figure 1. The system memory now interfaces directly to the CPU through a system memory bus. The entire burden of pin count and interconnect speed to sustain increases in memory bandwidth requirements now falls on the CPU package alone. The second system architecture transition is the potential move to multi- and many-core CPU architectures. With the increase in cores comes a dramatic increase in the memory bandwidth needed to "feed" the cores, particularly for the class of parallel applications envisioned. Since the CPU package is now the primary interconnect to the system memory, the CPU package bears the burden of the increase in memory bandwidth. Figures 4 and 5 illustrate the transition in memory bandwidth requirements and the projected CPU package pin count growth, respectively, as the transition to multi- and many-core CPUs with integrated memory controllers occurs. One solution to address the bandwidth challenge and to stave off the continued increase in package pin count is to incorporate memory into a CPU + Memory Multichip Package (MCP). This changes the paradigm of packaging and the role of packaging in the memory hierarchy. Whereas packaging has previously been an enabler of higher bandwidths, now packaging would become a crucial sub-level in the overall memory system hierarchy.
Moving to on-package memory, however, is not a trivial solution to implement. In the final main section of this paper, we discuss the challenges and benefits/capabilities of an MCP configuration and various MCP architectures.
First, it is beneficial to discuss tera-scale computing challenges in the area of memory bandwidth.
TERA-SCALE COMPUTING MEMORY BANDWIDTH CHALLENGES FOR PACKAGE TECHNOLOGY
Figure 6 shows the historical trend for memory bandwidth demand [3]. Today's bandwidth demand is in the 10-20GB/s range. Figure 4 suggests that the move to multi- and many-core computing will drive a need for bandwidth in the hundreds of GB/s in the not so distant future. Extrapolating from this, a target of 1TB/s of memory bandwidth for tera-scale computing architectures is not unreasonable [3] [4] [5]. Re-architecting a system capable of delivering this level of bandwidth is a challenge, given that the traditional methods are already reaching realistic limits. On many microprocessors, SRAM already occupies approximately half of the die real estate [4]. Increasing the amount of on-die memory becomes prohibitive from a cost and die size growth perspective. Increasing the bandwidth of board-level memory becomes prohibitive because of the continued increase in pin count, and in power with interconnect speed, required to sustain bandwidth increases. On-package memory becomes a potentially attractive intermediary level in the memory hierarchy that can work with chip-level and board-level memory to provide the bandwidth for tera-scale computing applications.
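To give a feel for why these targets strain pin count, the sketch below estimates how many data signals a memory interface would need to reach a target bandwidth at a fixed per-pin rate. The per-pin rate of 2 Gb/s is a hypothetical value chosen for illustration, and the count covers data signals only: power, ground, address, command, and clock pins are deliberately left out.

```python
import math


def data_signals_required(target_gbytes_per_s: float,
                          per_pin_gbits_per_s: float) -> int:
    """Minimum number of data signals needed to reach a target bandwidth.

    Converts the target from GB/s to Gb/s and divides by the per-pin rate,
    rounding up because a fraction of a signal cannot be routed.
    """
    if target_gbytes_per_s <= 0 or per_pin_gbits_per_s <= 0:
        raise ValueError("target bandwidth and per-pin rate must be positive")
    return math.ceil(target_gbytes_per_s * 8.0 / per_pin_gbits_per_s)


if __name__ == "__main__":
    assumed_rate = 2.0  # Gb/s per pin, hypothetical
    for target in (20.0, 200.0, 1000.0):  # GB/s: today, near term, tera-scale
        signals = data_signals_required(target, assumed_rate)
        print(f"{target:7.0f} GB/s -> {signals} data signals")
```

Under this assumption, moving from 20GB/s to 1TB/s multiplies the data signal count by fifty unless the per-pin rate rises as well, which is the pressure that motivates on-package memory.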

PACKAGE ARCHITECTURES TO MEET THE MEMORY BANDWIDTH NEEDS OF TERA-SCALE COMPUTING
To meet the memory bandwidth needs of tera-scale computing there needs to be an evolutionary transition in package architectures and technologies. In this section we discuss three architectures in detail: the CPU + memory 2D planar MCP, a package substrate embedded memory + CPU MCP architecture, and a 3D CPU + memory stacked die MCP. Each of these package architectures has benefits and challenges associated with the technologies. The memory + CPU 2D planar MCP is the most straightforward to implement and can be implemented with today's packaging technologies. There are capability limits on the amount of additional memory bandwidth this architecture can provide, however. Embedding memory in the package is the next evolutionary step in enabling higher memory bandwidth at the package level. This potentially enables higher bandwidths than the memory + CPU 2D planar MCP but requires technology development and comes with integration challenges. The final architecture, 3D stacked die memory + CPU MCP, potentially enables bandwidth capability surpassing the previous two architectures, but requires much technology development and has significant integration challenges. Given these tradeoffs among the architecture challenges and their bandwidth capabilities, a transitional package technology roadmap makes sense and is proposed in Figure 8 .
For the remainder of this section we provide details on the capabilities and challenges of each of these on-package memory architectures. For the purposes of this discussion, we assume a memory technology that is capable of delivering the bandwidth targets in terms of data rate and connection density that will be discussed. A discussion of the memory technology details is outside the scope of this paper. Also, the focus of our discussion is on memory bandwidth. Memory capacity and latency also play a key role in CPU system performance, but are outside the scope of this paper.
The first transition to on-package memory, and the one that can be implemented in the most evolutionary manner with respect to today's package technology, is the memory + CPU 2D planar MCP. Intel has used 2D planar MCP packaging in the past and continues to use MCP package technology today for many of its multi-core processors, so this package technology is not a revolutionary technology. There are, however, unique challenges associated with a CPU + memory 2D planar MCP that realistically limit its bandwidth capabilities.
The key challenges with a CPU + memory 2D planar MCP are that heterogeneous die are being assembled onto a single-package substrate with a requirement of optimizing performance to achieve a very fast, dense memory bus interconnection scheme. There are both design and electrical performance challenges associated with this architecture. Key design challenges are form factor fit, die placement, and routability.
In general, bump pitch between the CPU die and the memory or DRAM die will likely not match. This leads to routing issues that do not enable the design to take full advantage of line/space density capabilities of the package technology. Figure 9 illustrates a typical scenario when trying to route 200 I/Os between two die on an MCP substrate. Single-layer routing becomes impossible because of cut-off of the routing lanes. The solution is to use two-layer routing. This results in an increase in package cost and challenges in performance resulting from the division of the memory bus into two layers of routing.
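The routing-lane cut-off described above can be approximated with a first-order geometric estimate of how many traces fit between two adjacent bump pads. This is a minimal sketch under stated assumptions: the function name, the uniform line/space rule, and all dimensions (in micrometers) are hypothetical and are not taken from the study in this paper.

```python
def traces_between_bumps(bump_pitch_um: int, pad_diameter_um: int,
                         line_um: int, space_um: int) -> int:
    """Number of traces that fit in the gap between two adjacent bump pads.

    n traces need n line widths plus (n + 1) spaces (one space on each
    side of every trace), so n = floor((gap - space) / (line + space)).
    """
    gap_um = bump_pitch_um - pad_diameter_um
    if gap_um < line_um + 2 * space_um:
        return 0  # not even one trace fits: the routing lane is blocked
    return (gap_um - space_um) // (line_um + space_um)


if __name__ == "__main__":
    # Two hypothetical die with different bump pitches on the same substrate,
    # both with 90 um pads and a 15 um line / 15 um space design rule.
    coarse = traces_between_bumps(200, 90, 15, 15)
    fine = traces_between_bumps(150, 90, 15, 15)
    print("traces per gap, coarse-pitch die:", coarse)
    print("traces per gap, fine-pitch die:", fine)
    # The finer-pitch die limits how many signals escape per layer.
    print("limiting value:", min(coarse, fine))
```

In this simplified picture the die with the tighter bump pitch sets the escape capacity per layer, which is one way to see why mismatched pitches can force a second routing layer.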
Figure 9: Design and routing challenges with on-package memory 2D planar MCP. The panels compare single-layer routing (routing channel is cut off, or blocked) with two-layer routing (routing channel is clear); a second pair of panels contrasts mismatched and matched trace lengths for the two routing options.
In addition to routing challenges, escaping large numbers of I/Os from each die is difficult because of bump pitch constraints. Table 1 summarizes results for 200 and 400 I/Os. While routing a memory bus with 200 I/Os is fairly scalable under reasonable assumptions about package technology and bump pitch routing capability scaling, increasing this to 400 I/Os becomes challenging.
Signal integrity issues in an MCP configuration also lead to performance challenges. Because a substantial length of trace is still routed on the package between chips, and these traces are routed very densely, crosstalk limits performance. For a single-ended configuration, the upper limit is ~6-7Gb/s. Results of a signal integrity sensitivity study are summarized in Figure 10.
Combining the limits introduced by routing, fit, and signal integrity challenges, an estimate of the maximum sustainable bandwidth of a 2D planar MCP configuration is 100GB/s-200GB/s, depending upon the number of memory chips placed on the MCP. This also assumes a transition in memory technology to enable the types of connection densities and memory speeds used in this study. While this is a substantial amount of memory bandwidth capability, it is still not sufficient to meet the ultimate targets for tera-scale computing in the long term.
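The 100GB/s-200GB/s estimate can be sanity-checked with simple arithmetic on the I/O counts and bit rates quoted above. The sketch below is only a back-of-the-envelope check: it assumes every I/O carries data at the full single-ended rate, which overstates what a real bus with address, command, and clock signals would deliver, and the function name is hypothetical.

```python
def bus_bandwidth_gbytes_per_s(io_count: int,
                               gbits_per_s_per_io: float) -> float:
    """Peak bandwidth in GB/s if every I/O carries data at the given rate."""
    if io_count <= 0 or gbits_per_s_per_io <= 0:
        raise ValueError("I/O count and bit rate must be positive")
    return io_count * gbits_per_s_per_io / 8.0


if __name__ == "__main__":
    # I/O counts from the routing study; 6-7 Gb/s is the quoted
    # single-ended crosstalk limit, 4 Gb/s a more conservative rate.
    for io_count in (200, 400):
        for rate in (4.0, 6.0, 7.0):
            bw = bus_bandwidth_gbytes_per_s(io_count, rate)
            print(f"{io_count} I/Os at {rate:.0f} Gb/s -> {bw:.0f} GB/s")
```

With 200 I/Os the result stays between 100GB/s and 175GB/s across these rates, in line with the stated range; the 400 I/O cases exceed it on paper, but that is the configuration identified as challenging to route.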
The next transition in package technology that can enable higher memory bandwidth than the CPU + memory MCP is a package embedded memory architecture. Figure 12 shows a schematic of this package architecture. This type of package architecture eliminates one level of transition between the CPU die and the DRAM die, i.e., the routing, which is responsible for the crosstalk that limits the ultimate I/O speed of the CPU + memory MCP architecture. This conceivably enables higher bandwidth by providing a very short and direct CPU-to-DRAM interconnect. Signal integrity simulations for a typical CPU-to-DRAM interconnect using a substrate embedded DRAM revealed very minimal impact due to crosstalk since the die-to-die connections are separated by an appreciable distance, equal to at least the bump pitch. The model used for the signal integrity studies is shown in Figure 13. At 4Gb/s, the substrate embedded memory configuration resulted in at least 30-40% more margin than the CPU + memory 2D planar MCP configuration results shown in Figure 10. Extrapolating from these results, it is conceivable that the substrate embedded memory architecture can achieve a bit rate of at least 10Gb/s. Given that direct connections between the CPU and memory can be made at the same pitch as the die bump pitch, hundreds of connections can be enabled in a small area. Consequently, this architecture can enable a memory bandwidth in the 200GB/s-1TB/s range.
One challenge with this type of architecture is the thermal performance with an embedded die. Preliminary thermal modeling and historical data suggest that a limit of approximately ten watts or less for the embedded device should be maintained to avoid excessive refresh rates and increased power penalties. There are also substantial integration challenges with this architecture. This is, however, an intriguing architecture for enabling bandwidths in the 200GB/s-1TB/s range.
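The claim that hundreds of connections in a small area suffice for this range can be illustrated with a short calculation. The sketch below assumes the 10Gb/s bit rate mentioned above, data-only connections, and a hypothetical 150 micrometer bump pitch with one connection per pitch-by-pitch cell; the pitch and both function names are illustrative assumptions, not values from the signal integrity study.

```python
import math


def connections_required(target_gbytes_per_s: float,
                         gbits_per_s_per_connection: float) -> int:
    """Die-to-die data connections needed for a target bandwidth."""
    if target_gbytes_per_s <= 0 or gbits_per_s_per_connection <= 0:
        raise ValueError("inputs must be positive")
    return math.ceil(target_gbytes_per_s * 8.0 / gbits_per_s_per_connection)


def bump_field_area_mm2(connection_count: int, bump_pitch_um: int) -> float:
    """Area of a bump field with one connection per pitch-by-pitch cell."""
    area_um2 = connection_count * bump_pitch_um * bump_pitch_um
    return area_um2 / 1_000_000.0  # square micrometers to square millimeters


if __name__ == "__main__":
    rate = 10.0   # Gb/s per connection, the extrapolated bit rate
    pitch = 150   # micrometers, hypothetical bump pitch
    for target in (200.0, 1000.0):  # GB/s, ends of the quoted range
        n = connections_required(target, rate)
        area = bump_field_area_mm2(n, pitch)
        print(f"{target:6.0f} GB/s -> {n} connections, about {area:.1f} mm^2")
```

Under these assumptions the low end of the range needs 160 connections and the high end 800, occupying a few to a few tens of square millimeters of bump field, before any power and ground bumps are added.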
To enable memory bandwidths beyond 1TB/s, the 3D CPU + memory stacked die MCP architecture becomes interesting. Because this architecture provides the shortest possible interconnect between the CPU and memory die, the bit rate is expected to far exceed 10Gb/s. In addition, the interconnect density is expected to scale to enable thousands of die-to-die interconnects. Intel recently demonstrated a single-chip teraflop processor, Polaris, with 80 cores. Polaris contains hooks for stacking a separate SRAM chip, Freya, in a 3D configuration [6] [7].
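The bandwidth ranges quoted for the three architectures can be collected into a small lookup that returns the least complex architecture able to reach a given target. This is a deliberately simplified sketch: it treats the upper estimates (200GB/s for the 2D planar MCP and 1TB/s for the embedded memory architecture) as hard thresholds, which is an assumption made for illustration, and it ignores cost, thermal, and integration constraints.

```python
def candidate_architecture(target_gbytes_per_s: float) -> str:
    """Least complex package architecture whose estimated range covers the target."""
    if target_gbytes_per_s <= 0:
        raise ValueError("target bandwidth must be positive")
    if target_gbytes_per_s <= 200.0:
        # Upper end of the 100-200 GB/s estimate for the planar MCP.
        return "CPU + memory 2D planar MCP"
    if target_gbytes_per_s <= 1000.0:
        # Embedded memory is estimated to cover 200 GB/s to 1 TB/s.
        return "substrate embedded memory + CPU MCP"
    # Beyond 1 TB/s the stacked die architecture is the remaining option.
    return "3D CPU + memory stacked die MCP"


if __name__ == "__main__":
    for target in (150.0, 500.0, 2000.0):
        print(f"{target:6.0f} GB/s -> {candidate_architecture(target)}")
```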
SUMMARY AND CONCLUSIONS
Memory bandwidth is one of the key challenges associated with tera-scale computing. Package technology will play a key role in answering that challenge. In this paper, we reviewed the historical trends in memory bandwidth and their impact on package technology. Key challenges in the past have been a continuing pin count growth that leads to package body size growth as well as design challenges. In addition, there are fundamental limits to the off-package memory bus speed that can be supported. A transition to on-package memory is necessary for supporting tera-scale computing needs. We reviewed a transitional, evolutionary roadmap for package technology to implement on-package memory architectures capable of meeting the needs of tera-scale computing. Table 2 summarizes the technology and architecture options, features, capabilities, and limits of each. 
