This paper describes a recent system-level trend toward the use of massive on-chip parallelism combined with efficient hardware accelerators and integrated networking to enable new classes of applications and computing-systems functionality. This system transition is driven by semiconductor physics and emerging network-application requirements. In contrast to general-purpose approaches, workload and network-optimized computing provides significant cost, performance, and power advantages relative to historical frequency-scaling approaches in a serial computational model. We highlight the advantages of on-chip network optimization that enables efficient computation and new services at the network edge of the data center. Software and application development challenges are presented, and a service-oriented architecture application example is shown that characterizes the power and performance advantages for these systems. We also discuss a roadmap for next-generation systems that proportionally scale with future networking bandwidth growth rates and employ 3-D chip integration methods for design flexibility and modularity.
Introduction
Over the past 30 years, several important inflection points in the evolution of computing systems have occurred that were primarily driven by limitations set by semiconductor technology. At each major technology transition, unsustainable power dissipation was the primary motivator for change. The transition from bipolar to CMOS in the late 1980s, which fundamentally changed computing systems, was necessitated by the high power dissipation of bipolar technologies and the resulting chip-cooling challenges. From a packaging perspective, this transition reduced the module heat flux (transference of thermal power) by an order of magnitude and drove fundamental changes in system design. At the same time, this technology transition was accompanied by a reduction in single-thread compute performance that was subsequently improved by orders of magnitude through CMOS scaling in successive technology generations.
As CMOS generations scaled with respect to feature size, power dissipation for high-performance single-threaded systems increased to a level that necessitated the transition to multicore architectures [1]. Multicore architectures explicitly trade off high single-thread performance at higher clock frequencies for parallelism achieved through the use of many lower single-thread performance cores at lower clock frequencies and reduced power. This fundamental shift from increasing single-thread performance to increasing degrees of parallelism adds significant challenges to software development, as applications can no longer rely on simple frequency scaling without modification. We anticipate that the next inflection point in the evolution of computing systems will again be driven by power dissipation limits, with 3-D integration packaging methods [2] becoming core technologies to provide systems integration and design modularity while preserving computational density (e.g., the number of computations performed per unit time and per unit area) and performance.
For a variety of application spaces, multicore designs provide a distinct advantage in terms of their flexibility to address power dissipation, performance, and computational throughput. Applications with low dependence on single-thread performance have the highest potential for improved performance through parallelism. Applications with lower memory and I/O requirements will be advantaged in terms of price and performance, as memory costs are a dominant component of overall system costs. Building on this theme of application characteristics and the ability for applications to make use of multicore architectures leads to the notion of workload-optimized systems.
Workload-optimized systems are systems with tailored hardware capabilities and acceleration methods, combined with application and systems software optimization for a desired power, performance, and throughput tradeoff. Such systems explicitly cooptimize hardware and software for desired applications. For these systems, in addition to the advantages of multicore designs and the ability to utilize parallelism at the chip level, there is a compelling advantage for targeted hardware accelerators. Such accelerators are invoked to efficiently optimize throughput and power, utilizing dedicated hardware rather than software running on general-purpose microprocessor cores. Examples are cryptography engines for security, compression, and decompression processing for data reduction, Extensible Markup Language (XML) processing for service-oriented architecture (SOA) [3] applications, regular expression pattern matching, and Transmission Control Protocol/Internet Protocol assists for network operations.
Another fundamental concern associated with workload-optimized systems is the impact of networking on performance and throughput. There are broad classes of applications for which higher level components at the transport and application networking stack layers directly impact overall performance. At the same time, the lower physical layers are "unaware" of the system specifics and application performance. The ability to consolidate networking functionality for performance, in tight integration with multiple cores and targeted accelerators, forms a powerful basis for next-generation network-optimized systems. Work by various companies and universities in the area of network-optimized computing is described in the literature [4] [5] [6].
The remainder of this paper is organized as follows: We first explain the motivation behind the IBM wire-speed processor (WSP) and describe its high-level architecture that uses massive multithreading (MMT; the term many cores is also common) in conjunction with hardware accelerators for power efficiency and high throughput. Next, we present use cases for network-optimized computing that utilizes WSP in a data center or cloud computing scenario, and we highlight the software development issues introduced by the heterogeneous combination of MMT and hardware acceleration. We then demonstrate the advantages of MMT and accelerators through a detailed example representing SOA-based applications. We conclude with a projected roadmap of the 3-D chip technology that may be used to ensure the scaling and design modularity of future network-optimized systems.
WSP for network-optimized computing
Historically, there has been a distinct separation between the networking and server domains. This partitioning has been reflected in the optimization of network processors for the lower layers (layers 2-4) of the Open Systems Interconnection (OSI) networking stack and in the use of hardware acceleration to maintain line-rate processing. At the other end of the spectrum are general-purpose server processors, which have a rich instruction-set architecture and programming and tools environment, but which require higher power and carry the overhead necessary for providing a general-purpose processing infrastructure.
Network-optimized computing enables application processing at networking speeds by unifying networking and server functionalities. Network-optimized applications are constrained by ingress and egress data rates, latency and throughput requirements, and the temporal or streaming nature of the data. As the performance of network interconnects increases from current 10-Gb/s rates and approaches 40 and 100 Gb/s in the future, network optimization requires a new generation of processor that can meet the latency and bandwidth needs while optimizing overall power dissipation. Important attributes of these converged systems include MMT, integrated networking support, and hardware accelerators in a balanced system configuration with a robust instruction-set architecture and programming environment.
The IBM WSP [7] consists of 16 embedded 64-bit PowerPC* cores running at up to 2.3 GHz. Each core supports four simultaneous hardware threads that feed a single-issue, in-order pipeline. Each group of four cores shares a 2-MB L2 cache, for a total of 8 MB of L2 cache interconnected over an internal coherency bus. The WSP chip integrates four 10-Gb/s and two 1-Gb/s Ethernet interfaces, known collectively as the host Ethernet accelerator (HEA). The HEA provides for multiqueue network traffic classification and OSI packet layer 2-4 parameter extraction. Four WSP chips can be joined together to form a 64-core coherent system. Each chip has a maximum power consumption of approximately 85 W and an average power consumption of approximately 70 W.
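The resource totals quoted above, and the per-packet processing budget they imply, can be checked with a short calculation. The following is a minimal sketch: the chip parameters come from the text, whereas the worst-case traffic mix (minimum-size 64-byte Ethernet frames, each with 8 bytes of preamble and a 12-byte interframe gap on the wire) is an assumption introduced here for illustration.

```python
# Back-of-the-envelope totals for one WSP chip and a worst-case packet budget.
# Chip parameters are taken from the text; the traffic mix is an assumption.

CORES_PER_CHIP = 16
THREADS_PER_CORE = 4
CORES_PER_L2_GROUP = 4
L2_MB_PER_GROUP = 2
CLOCK_HZ = 2.3e9
PORTS_10G = 4
PORTS_1G = 2
MAX_CHIPS = 4

hw_threads = CORES_PER_CHIP * THREADS_PER_CORE                          # 64
l2_total_mb = (CORES_PER_CHIP // CORES_PER_L2_GROUP) * L2_MB_PER_GROUP  # 8
system_cores = CORES_PER_CHIP * MAX_CHIPS                               # 64
system_threads = hw_threads * MAX_CHIPS                                 # 256
ethernet_gbps = PORTS_10G * 10 + PORTS_1G * 1                           # 42

# Assumption: every frame is a minimum-size 64-byte frame, which occupies
# 64 + 8 (preamble) + 12 (interframe gap) bytes of link time.
WIRE_BITS_PER_MIN_FRAME = (64 + 8 + 12) * 8


def packets_per_second(link_bits_per_second):
    """Worst-case packet arrival rate for a link carrying minimum-size frames."""
    return link_bits_per_second / WIRE_BITS_PER_MIN_FRAME


pps_per_10g_port = packets_per_second(10e9)
pps_all_10g_ports = PORTS_10G * pps_per_10g_port

# Core clock cycles available per packet if all 16 cores share the load evenly.
cycles_per_packet = CORES_PER_CHIP * CLOCK_HZ / pps_all_10g_ports

print(f"hardware threads per chip: {hw_threads}")
print(f"worst-case packets/s per 10-Gb/s port: {pps_per_10g_port:,.0f}")
print(f"core cycles per packet at full load: {cycles_per_packet:.0f}")
```

Under these assumptions, the chip has only a few hundred core cycles for each minimum-size packet when all four 10-Gb/s ports are saturated, which is one way to see why functions that are expensive in software are candidates for dedicated hardware.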
WSP is optimized for throughput and power consumption. When compared with server processors, these requirements manifest themselves through a reduced clock frequency, simpler in-order pipelines, and high thread counts per core. However, many network applications perform inline functions and are constrained by a required response time or QoS. Therefore, it is important that reduction in single-thread performance and focus on throughput does not degrade application performance in the presence of network service latency. WSP integrates on-chip acceleration units for cryptography, compression and decompression, regular expression pattern matching, and XML processing. These accelerators are attached to the WSP internal coherency bus.
Accelerators provide significant power and performance advantages by directly implementing in hardware functions that are expensive to execute on the MMT cores. As detailed in [8] , the efficient processing of networking applications requires computing nodes that together optimize server and networking functionality with hardware accelerators. Application layers and network functionality are combined with highly tuned software stacks for performance and latency. Such network-speed nodes form the basis for low-latency and high-throughput compute opportunities within the data center, with deep-packet inspection being one such prototypical function.
Network-optimized computing scenarios
As networking bandwidth is increasing, and with bandwidths of 40 and 100 Gb/s on the horizon, there exists a trend to apply distributed (or clustered) computing models to the data center to exploit the inherent scalability and high-availability attributes. As a result, there will be an increasing need for network-optimized systems to dynamically capture the attributes of networking traffic and more tightly couple to the computing via workload management, load distribution, network-oriented services, and application acceleration. Another trend is the emergence of network-based applications such as Voice over Internet Protocol, Internet Protocol television, and video streaming. These applications are dependent on high-throughput rates and/or networking QoS. Network-optimized systems are ideal for fast processing of these types of network traffic.
Generally, at the data-center level, the goal has been to deploy system resources in an optimized manner to reduce total cost of ownership and simplify system management. For example, to overcome management and scalability issues with traditional 1U rack servers, blade systems were developed. Blades balance server, network, and storage technologies to reduce space and power while continuing to track future technology advancements [9] . Today, other models are being explored to provide higher aggregation levels (e.g., the rack level) that are driving the need for network-oriented services collocated within clustered servers [10] .
Network-optimized systems are beneficial for processing at various points within the data center to direct and partition network flows accordingly. Depending on the size of a cluster domain and the services provided, the processing of traffic in and out of the domain may be compute intensive in itself. Network-optimized systems can provide these services, with examples including security, XML acceleration, and workload distribution. An advantage of this approach is the ease of providing scaled networking services to match computational capacity while maintaining flexibility of options for the compute hardware.
As virtualization is deployed on servers (e.g., hypervisors) to improve compute utilization and on networks to provide convergence (e.g., local area network, storage, and clustering) onto a common fabric, the traffic on the network will dramatically increase. This trend further drives the need for network-oriented services and accelerators to offload computationally intensive network flows. Mapping network flows to MMT and dedicated accelerator technologies is an effective method for scaling performance in an efficient low-power manner.
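One common way to map network flows onto many hardware threads is to hash each packet's flow identifier so that all packets of a flow reach the same thread. The following sketch illustrates that idea only. The `FlowKey` fields, the CRC-32 hash, and the function names are assumptions chosen for illustration and do not describe the classification performed by the HEA.

```python
import zlib
from collections import namedtuple

# Hypothetical flow identifier (the classic 5-tuple); not an HEA data structure.
FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port protocol")

NUM_HW_THREADS = 64  # 16 cores x 4 threads on one WSP chip, per the text


def select_thread(flow, num_threads=NUM_HW_THREADS):
    """Map a flow to a thread index; the same flow always maps to the same index."""
    key = "|".join(str(field) for field in flow).encode("ascii")
    return zlib.crc32(key) % num_threads


def distribute(packets, num_threads=NUM_HW_THREADS):
    """Group (flow, payload) pairs into per-thread queues, preserving arrival order."""
    queues = {}
    for flow, payload in packets:
        queues.setdefault(select_thread(flow, num_threads), []).append(payload)
    return queues


if __name__ == "__main__":
    web = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
    voip = FlowKey("10.0.0.3", "10.0.0.4", 5060, 5060, "udp")
    for thread, queue in distribute([(web, "w1"), (voip, "v1"), (web, "w2")]).items():
        print(thread, queue)
```

Keeping a flow on one thread preserves packet order within the flow without locking, at the cost of uneven load when a few flows dominate the traffic.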
Network optimization has advantages for services delivery and cloud computing as well. MMT and virtualization techniques can allow compute clouds to dynamically accept workloads and load balance appropriately to achieve processing at high network speeds. This is particularly true for purpose-built clouds, which may use workload-optimized systems to dynamically tailor the systems to the needs of specific users [11, 12] . The recently proposed wireless network cloud [11] is an example of the vision that suggests that telecommunication central offices will become mini-data centers, become part of a cloud, and provide on-demand wireless services. Network-optimized systems, taking full advantage of MMT and accelerator technologies, can provide the necessary consolidation, performance, and management.
Expanding beyond the data center and cloud computing, it is expected that new and emerging classes of applications will be possible as networking and compute functionality are simultaneously considered. The IBM Smarter Planet initiative [13] is one such example in which vast amounts of data will be created, analyzed, and disseminated worldwide at networking speeds. In such a scenario, network-speed components would process transient data collected from a variety of data sources, including sensors, in real time. Such a system is composed of network-interconnected and workload-optimized components, each of which is tailored for the data rates and formats at their respective inputs.
Software development and programming models
As discussed above, the combination of MMT and hardware acceleration technologies enables new levels of power, performance, and cost efficiencies. However, the resulting heterogeneous processors and hybrid systems, namely systems that exhibit heterogeneity at the system level, pose challenges in the design of software. An example of a hybrid system is a system composed of a main processor board or blade with an accelerator or an offload card in the PCI Express** (PCIe**) form factor.
The tremendous rate of software development in the past few decades has been possible because of common programming models for homogeneous processors and/or uniform systems with a low degree of parallelism. Higher level programming languages and rich application frameworks lowered the barrier for creating the software that dominates current commercial systems and applications. The challenge for the programming model of MMT and accelerator-based hybrid systems involves both restructuring existing applications and developing new applications that exploit novel hardware features while sustaining programming productivity. We consider MMT and hardware acceleration issues separately.
MMT
As shown in Listing 1, the exploitation of parallelism offered in MMT-based processors needs to be addressed at all levels of the software stack, from the operating system to runtime environments and middleware, to programming languages and applications. Even an open-source development platform such as Eclipse [14] has made the transition toward supporting parallel application development. While parallel programming has been a mainstream activity in the high-performance computing domain for several decades, it is only in the last few years that the proliferation of multicore processors has brought a focus to thread-level parallelism for developers of commercial applications. For instance, Intel Corporation has been providing a library for C++ programmers in its Threading Building Blocks commercial product [15], which has recently been incorporated into another product known as Parallel Composer. In addition to the tools, since most application parallelization efforts follow one or more of a number of well-known parallel computation patterns, some of the challenges and solutions in this area can be addressed by judicious use of patterns [16, 17].
A leading effort at IBM to address parallel programming is the open-source project Amino, also known as Concurrent Building Blocks [18], which began in mid-2007 with ambitious goals to provide a comprehensive set of tools for both C++ and Java programmers. Figure 1 shows the extent of Amino concurrent libraries or building blocks that are currently being constructed at all layers of the software stack. The initial set of patterns provided by Amino to build parallel applications will include, but is not limited to, MapReduce [19], master-worker, divide and conquer, and pipeline [17].
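To make the notion of a parallel pattern concrete, the following sketch expresses a word count in the MapReduce style, with a master that hands chunks to a pool of workers. It uses the Python standard library rather than the Amino API, and the function names are chosen here for illustration. Because CPython threads do not execute Python bytecode in parallel, the sketch shows the structure of the pattern rather than a speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_phase(chunk):
    """Map step: count the words in one independent chunk of text."""
    return Counter(chunk.split())


def reduce_phase(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(chunks, workers=4):
    """Master-worker driver: farm out the map step, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_phase, chunks))
    return reduce(reduce_phase, partials, Counter())


if __name__ == "__main__":
    print(word_count(["to be or not", "to be", "that is the question"]))
```

The application code supplies only the map and reduce functions, and the pattern owns thread creation and result collection. This is the kind of productivity gain that pattern libraries aim to provide.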
Listing 1 Opportunities to exploit parallelism at all levels of the software stack. Multithreading enablement will occur at multiple levels. The phrase network mashups simply refers to Web pages or applications that combine data or functionality from two or more external sources to create a new service. (PHP Hypertext Preprocessor is a popular scripting language used in developing dynamic Web pages.)
Figure 1 Software stack for Amino libraries, from the operating system to the application layers. Examples are shown of typical components for each layer: MapReduce, introduced by Google, is an example of a parallel pattern framework, and Minimum Spanning Tree (MST) is an example of a component that will be parallelized.
Another significant direction in exploitation of MMT-based processors involves compiler innovations. Examples of compilers with innovative features include dynamic runtime compilers such as just-in-time (JIT) Java compilers and compilers that parallelize the code with directives in annotated programs. Figure 2 illustrates the range of programming intrusiveness, from code that requires no change in exploiting multithreads to code that requires a complete rewrite using a parallel language. Moving from left to right in Figure 2, we indicate the following.
An annotated single-thread program: a program written using an open-multiprocessing (OpenMP) application programming interface that consists of a set of compiler directives, library routines, and environment variables that influence runtime behavior.
Explicit threads: programs are written using p-thread-type constructs, libraries, and runtime environments. This method does not require a special parallelizing compiler, since all the parallelism is explicitly stated in the sequential style through the use of p-threads or similar support.
Parallel languages: programs are written in a language such as X10 [20] or Unified Parallel C [21].
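As an illustration of the explicit-threads style, the following sketch partitions a summation across threads that the programmer creates, starts, and joins by hand. Python's `threading` module stands in here for a p-thread-type library. The function name and the strided partitioning are choices made for this example.

```python
import threading


def parallel_sum(values, num_threads=4):
    """Sum a sequence using explicitly created and joined threads."""
    partial = [0] * num_threads  # one slot per thread, so no lock is needed

    def worker(index):
        # Each thread handles every num_threads-th element, starting at its own index.
        partial[index] = sum(values[index::num_threads])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for all partial sums before combining them
    return sum(partial)


if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))
```

All decisions about work partitioning and synchronization are visible in the source, which is why this style needs no parallelizing compiler but places the full burden of correctness on the programmer.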
Parallelizing compilers, either static or dynamic, extract parallel threads from a single-thread or annotated single-thread programs. The parallel threads can be functional threads contained within the program or can be a collection of functional threads and other threads such as assist threads (e.g., prefetching threads). The parallelizing compiler can also generate speculative threads that can correspond to work expected to be useful (e.g., depending on a condition) or actual work that will be needed much later and that is executed ahead of time.
Hardware accelerators
Whether hardware accelerators are used at the system level, in hybrid systems, or at the processor level, they create new challenges for software development beyond those of the homogeneous MMT-based systems. Examples of the challenges include the potential use by the accelerators of instruction sets and memory models that differ from those of the general-purpose processors (GPPs), and the overhead of moving data between the GPP and the accelerators when work is offloaded from the GPP. Considering the various layers of the software stack, it is generally desirable to exploit the accelerators at the lower layers of the software stack, such as the operating system layer, to hide them from the developers of the upper layers of the stack, i.e., at the application layer. However, this is not always possible. Figure 3 shows some basic forms of exploiting accelerators by software and provides examples at each layer. In the following section, we show detailed experimental results on the benefits of an XML accelerator to emphasize hardware acceleration advantages.
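The data-movement overhead mentioned above can be captured in a simple break-even model: offloading pays off only when the setup cost, the transfer time, and the accelerator execution time together are shorter than the software path. The sketch below is illustrative only. The function names and all numeric values are hypothetical and are not measurements of any system discussed in this paper, and the result is assumed to be the same size as the input.

```python
def offload_seconds(payload_bytes, accel_seconds, link_bytes_per_s, setup_seconds):
    """Time to process one request on an accelerator, including data movement."""
    # Assumption: the result is as large as the input, so the payload moves twice.
    transfer_seconds = 2 * payload_bytes / link_bytes_per_s
    return setup_seconds + transfer_seconds + accel_seconds


def worth_offloading(software_seconds, payload_bytes, accel_seconds,
                     link_bytes_per_s, setup_seconds):
    """True when the accelerator path is faster than running on the GPP."""
    total = offload_seconds(payload_bytes, accel_seconds, link_bytes_per_s, setup_seconds)
    return total < software_seconds


if __name__ == "__main__":
    # Hypothetical 10-KB message: the fixed setup cost is amortized over real work.
    print(worth_offloading(200e-6, 10 * 1024, 20e-6, 1e9, 5e-6))
    # Hypothetical 64-byte message: the fixed setup cost dominates.
    print(worth_offloading(0.5e-6, 64, 0.05e-6, 1e9, 5e-6))
```

The model suggests why tight on-chip attachment of accelerators matters: lowering the setup and transfer terms widens the range of request sizes for which offloading is beneficial.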
While it is not always possible to exploit accelerators through interfaces that are independent of the application domain, a large class of computational accelerators, such as graphics accelerators and the synergistic processor units (SPUs) in the Cell Broadband Engine** (Cell/B.E.**) processor [22] , lend themselves to generic interfaces for broad applicability such as the open computing language (OpenCL**) [23] standard. These standards facilitate the ubiquitous programmability of accelerators for a larger class of application developers. OpenCL, which is an open standard for parallel programming of heterogeneous systems, along with a working group composed of major industry leaders, is a step in the right direction for programming simplification of computational accelerators. This is achieved by providing a unified programming model for CPUs, graphics processing units, Cell/B.E. SPUs, and other processors in a system. In contrast with computational accelerators, the interfaces to data accelerators, namely, accelerators dedicated to fast processing of a particular form of data (such as XML accelerators), are mainly domain specific at this point in time. In the next section, we show that, aside from unique programming challenges, data accelerators can be highly effective in increasing the overall performance of a system.
SOABench example for MMT and accelerator effectiveness
The software system in data centers is becoming ever more complex as the number of integrated services collocated within clustered servers is increasing. SOA was proposed to address this complexity and provides a software architectural style that simplifies integration of disparate software systems to better align IT and business goals. The concept of mediation is the most critical aspect of connecting services for this integration. The enterprise service bus (ESB) [24] was proposed for easier implementation of mediation. The ESB acts as a software bus, allowing a service to exchange messages with other services by routing, transforming, and adapting the message. In an ESB, XML is generally used as the message format because of its flexibility. The transformation of an XML message can be implemented with XML stylesheet language transformations (XSLT) in the ESB. Since XSLT is compute intensive, it is a good candidate for performance improvement by accelerators.
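The transformation step of mediation can be illustrated with a small schema-mapping function. The Python standard library has no XSLT processor, so this sketch uses `xml.etree.ElementTree` as a stand-in for an XSLT stylesheet. The element names and the `FIELD_MAP` table are hypothetical and are not taken from the benchmark.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the requester's schema to the provider's schema.
FIELD_MAP = {"custName": "customerName", "ordId": "orderId"}


def mediate(request_xml):
    """Parse a message, rename its fields, and serialize it in the target schema."""
    source = ET.fromstring(request_xml)  # raises ET.ParseError if not well formed
    target = ET.Element("OrderRequestV2")
    for child in source:
        mapped = ET.SubElement(target, FIELD_MAP.get(child.tag, child.tag))
        mapped.text = child.text
    return ET.tostring(target, encoding="unicode")


if __name__ == "__main__":
    message = "<OrderRequest><custName>Ada</custName><ordId>42</ordId></OrderRequest>"
    print(mediate(message))
```

Even in this toy form, every message is parsed, traversed, and reserialized, which indicates why XML parsing and transformation can dominate the per-message cost in an ESB.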
In this section, we show how MMT and accelerators help in achieving a workload-optimized system for SOA-based applications by using a benchmark named SOABench [25]. MMT is used to increase the overall throughput of the system, specifically measured as the number of XML messages handled, while the XML accelerator can improve the XSLT performance. We use performance-per-watt as the metric of interest instead of performance alone, considering the significance of power constraints in data centers.
Benchmark methodology
We measured the performance of mediation in an ESB using a benchmark program on three machines. This benchmark consists of three tiers: a client, a mediation server, and a Web-service server, as shown in Figure 4 . The client submits and receives a message written in XML. The throughput is measured in messages-per-second. The mediation server transforms request and response messages from one schema to another schema using XSLT. The mediation server sends the transformed messages to a Web-service server. Next, it receives each message from the Web-service server and sends the transformed message back to the client.
To compare mediation performance, we conducted experiments with three different systems in the role of the mediation server: two general-purpose servers and one dedicated appliance. For the mediation server, we used an IBM BladeCenter* HS21 (Intel Xeon** E5345, 2.33 GHz, two-socket, four-core, eight-thread per system with 16 GB of memory) [26] as the current multicore system, and a Sun SPARC** Enterprise T5220 (Sun UltraSPARC** T2, 1.2 GHz, one-socket eight-core, 64-thread per system with 32 GB of memory) [27] as the MMT system. These are general-purpose machines. We also used an IBM WebSphere DataPower* Integration Appliance XI50 (GPP with an XML accelerator) [28] as the purpose-optimized system. The DataPower XI50 is an SOA appliance with two types of XML accelerators. One accelerator is a hardware accelerator that very quickly performs XML parsing and well-formedness checking. The other accelerator, which is different from the hardware accelerators we have discussed in this paper, is an XML compiler. The combination of the hardware accelerator and the XML compiler improves the performance of XSLT compared with the industry-standard Java and C implementations. For the client and Web-service server, we used an IBM BladeCenter HS21 (Intel Xeon E5345 2.33 GHz, two-socket, four-core, four-thread processor with 8 GB of memory).
We used Linux 2.6 on the client, on the Web-service server, and on the mediation server for the HS21 blade, and we used Solaris 10 for the T5220 server. The benchmark was implemented using Java 2 Enterprise Edition (J2EE) software. The benchmark was deployed in an IBM WebSphere Enterprise Service Bus (WESB) 6.1 [24] on the HS21 blade and the T5220 server. For these two machines, we used the same software parameters, such as the number of Web container pools (100), the 64-bit Java 5 release of the Java virtual machine (the IBM JVM and the Sun JVM**, respectively), the garbage collection (GC) policy (generational GC), and the heap size (8.5 GB). The benchmark was manually deployed in the DataPower appliance using the same stylesheet. IBM WebSphere 6.1 [29] was used on the Web-service server. On the client, 100 threads were used to send and receive messages. The warm-up time and steady-state time were 300 seconds each. The user thinking time (i.e., a simulation of real-user idle time before submitting requests) was 0 seconds. The size of each XML message was 10 KB. The messages were sent using Simple Object Access Protocol/Hypertext Transfer Protocol (SOAP/HTTP).
Table 1 shows the throughput and the throughput-per-processor-watt on the three machines. Throughput-per-processor power consumption is calculated by dividing the throughput by the processor power consumption in watts, as shown in Table 1. The processor power consumption values of the HS21 blade and T5220 servers are 80 and 95 W per socket, as shown in [30] and [31], respectively. The processor power consumption of the DataPower appliance is between 160 and 230 W (total for both the GPP and the XML accelerator). As a future step to our current analysis, the power consumption at the system level should be measured.
Experimental results
The combination of XML acceleration and MMT provides substantial performance and power advantages for SOA-based applications. This statement is supported by the following results.
1. The T5220 server achieves 1.7 times higher throughput-per-processor power consumption compared with the HS21 blade while achieving the same absolute performance using the same WESB 6.1.
2. The DataPower with the accelerator achieves 4.5 times higher throughput-per-processor power consumption compared with the HS21 blade without the accelerator.
The first result shows the advantage of an MMT system compared with a current multicore system. In the MMT system with physically small cores, while the number of threads is higher than in the current multicore system, the single-thread performance is lower. In a network-optimized system, it is important that the larger number of threads handles a large number of incoming requests without degrading latency. While the overall throughput is the same for the two systems, the processor power consumption of the MMT system is lower since the small cores consume less power [32] .
The second result shows the advantage of having an XML accelerator compared with a general-purpose system. In the benchmark, we found that more than 50% of the execution time in processing a message is spent in XML parsing and XSLT. Since the accelerator can shorten the latency by improving the performance associated with XML parsing and XSLT, the total throughput is higher. As shown in Table 1 , the system with the XML accelerator achieves 6.4 times higher throughput than the system without the accelerator. By taking the processor power into account, the overall advantage in throughput-per-watt is reduced to 4.5 times.
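The throughput-per-watt ratios can be reproduced from the figures given in the text. The sketch below normalizes throughput to the HS21 blade, because the absolute values in Table 1 are not restated in the prose. It also assumes that processor power is the per-socket value times the socket count, and that the DataPower appliance draws the upper end (230 W) of its quoted range. Both are assumptions of this sketch.

```python
# Processor power in watts: per-socket figures from the text times socket count.
HS21_POWER_W = 2 * 80      # two sockets at 80 W each
T5220_POWER_W = 1 * 95     # one socket at 95 W
DATAPOWER_POWER_W = 230    # assumption: upper end of the quoted 160-230 W range

# Throughput normalized to the HS21 blade (relative values stated in the text).
HS21_THROUGHPUT = 1.0
T5220_THROUGHPUT = 1.0     # "the same absolute performance"
DATAPOWER_THROUGHPUT = 6.4


def per_watt_advantage(throughput, power_w,
                       base_throughput=HS21_THROUGHPUT, base_power_w=HS21_POWER_W):
    """Ratio of throughput-per-watt relative to the baseline system."""
    return (throughput / power_w) / (base_throughput / base_power_w)


t5220_advantage = per_watt_advantage(T5220_THROUGHPUT, T5220_POWER_W)
datapower_advantage = per_watt_advantage(DATAPOWER_THROUGHPUT, DATAPOWER_POWER_W)

print(f"T5220 vs HS21: {t5220_advantage:.2f}x")
print(f"DataPower vs HS21: {datapower_advantage:.2f}x")
```

Under these assumptions the ratios round to 1.7 and 4.5, matching the reported results. At the lower end of the appliance's power range (160 W), the same calculation would give the full 6.4 times advantage.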
In the benchmark, the techniques to improve the throughput using the MMT system and the accelerator are complementary. While the MMT system can handle the large number of incoming requests, the accelerator can shorten the latency of an incoming request. Thus, we conclude that the combination of XML acceleration and MMT provides substantial performance and power advantages for SOA-based applications.
Future 3-D roadmap for systems scaling and modularity
Future network-optimized systems will be challenged by the degree to which application performance scales with the anticipated networking bandwidth growth rates. Maintaining system balance, throughput, and latency at high networking rates, for systems that utilize parallel cores and accelerators at high I/O bandwidths, poses significant technology tradeoffs beyond traditional semiconductor scaling.
Present semiconductor lithography roadmaps show continuous physical scaling all the way down to 11 nm [33] . To maintain a constant power density for cooling purposes, the scaling attributes of future lithography generations will require that actual implementations run at similar frequencies to the original design. For example, the power density of a 2-GHz core at 45 nm will be similar to the power density of a 2-GHz core at 11 nm. Cooling requirements will make relatively constant frequency a primary attribute of designs that scale beyond 45 nm. Therefore, CMOS scaling beyond 45 nm will primarily improve density and lower the absolute power as the frequency and power density of a given design remain almost constant.
At the system level, interconnect standards such as Ethernet will continue to scale at more than linear rates. Today, Ethernet connectivity at the network edge is at 1-Gb/s per link and is in transition to 10 Gb/s. Industry activities currently underway will soon provide 40-Gb/s and 100-Gb/s networking standards for next-generation systems [34] . The throughput requirements for these future Ethernet standards are likely to grow even faster than the Ethernet link speeds as developers continue to expand functions with respect to the data and protocols transmitted across the links.
The combination of a fixed-frequency design due to semiconductor-scaling properties, and nonlinear Ethernet scaling requirements, causes a fundamental implementation problem. Semiconductor density improvements due to scaling will be off at least by a factor of 2 in meeting the performance requirements at constant frequency. Semiconductor lithography typically advances in two-year increments, delivering up to a twofold density improvement per generation. However, Ethernet appears to require near-term density scaling of almost threefold per lithography generation (more than fourfold over three years and more than tenfold over six years). There are significant challenges to meeting the performance requirements driven by Ethernet scaling, given semiconductor-scaling limitations and bandwidth limitations of traditional chip-to-chip implementations.
Three-dimensional technology allows for the equivalent of twofold to fourfold density improvements beyond normal semiconductor density scaling [35]. This is accomplished by vertically stacking multiple layers of the latest semiconductor technology and interconnecting them with through-silicon vias (TSVs), which are vertical electrical connections between silicon layers. Array chips are proposed to be stacked up to eight high, whereas logic chips are proposed to be stacked up to four high. While the effective density will improve by the desired factor of 2-4, the power density will become worse by a factor of 2-4. Only low-power or power-managed designs will be able to make effective use of the density improvements of 3-D.
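The density and power-density tradeoff described above can be sketched numerically. The Python fragment below is a simplified model, not a thermal analysis: it assumes that every layer occupies the same footprint and dissipates the same power, so that power density per unit footprint grows linearly with the number of layers. The function names and the normalized units are hypothetical.

```python
def stacked_power_density(layers, per_layer_power_density=1.0):
    """Power density per unit footprint for a stack of identical layers.

    Assumes each layer dissipates the same power over the same area.
    """
    return layers * per_layer_power_density


def per_layer_budget(layers, footprint_limit=1.0):
    """Per-layer power density that keeps the stack at the 2-D limit."""
    return footprint_limit / layers


for layers in (2, 4):
    print(
        f"{layers} layers: density x{layers}, "
        f"power density x{stacked_power_density(layers):.1f}, "
        f"per-layer budget {per_layer_budget(layers):.2f} of the 2-D limit"
    )
```

In this model, a four-high logic stack stays within the original cooling limit only if each layer runs at one quarter of the 2-D power density, which is consistent with the need for low-power or power-managed designs.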
Architecting for 3-D interconnects will be integral to making use of all of the technology capabilities; simply mapping an existing 2-D design into 3-D will not capture the potential of this new technology. In addition to power-density limitations, there are limitations that must be considered with respect to the number of interconnections allowed between the layers. Initial interconnect schemes have roughly the same density as off-chip I/Os today. Subsequent generations of 3-D technologies allow for 4- and 50-fold improvements in the amount of interconnect available between the layers. Using 3-D technology capable of a fourfold TSV density compared with current I/O allows the existing on-chip bus interconnect to be used as the layer-to-layer interconnect. This is advantageous, as all of the existing design intellectual property from the original 2-D chip can be reused. Additionally, 3-D technology supports the integrated approach required by networking applications, combining network I/O and computing on a single effective die, to overcome the traditional chip I/O bandwidth limitations.
Three-dimensional-layer definitions will be critical in defining the modularity of the design. There are many choices of how to partition the logic on each of the up-to-four layers in the stack. We have chosen to investigate a set of partitions that maximizes modularity. As shown in Figure 5, this definition subdivides the overall function into three unique functional layers, namely, the I/O layer, the compute layer, and the accelerator layer. The I/O layer contains all of the functionality needed for communication from the stack and its package, plus the central coherency arbitration functions and all pervasive functions. The compute layer contains all of the microprocessor cores and their associated caches. The accelerator layer contains all of the unique hardware accelerator functions.
The modular-layer approach to 3-D technology allows for a multitude of development and application scenarios. The layers can each be in different semiconductor lithography generations, since only a common interface is required between the layers. The three key reasons to move a layer to a new lithography generation are density, voltage compatibility, and unique lithography attributes. These decisions can now be made on a layer-by-layer basis. Historically, in contrast, if any circuit required the change, all of the circuitry had to be changed.
In the modular design scenario, only after determining that the applications of interest can benefit from two- to fourfold additional functions, such as cores and accelerators on a layer, would the change be implemented in a revised lithography generation. For voltage compatibility of each lithography generation, it may make sense to leave I/Os in an older technology. Only for unique lithography attributes, for which unique process steps for embedded dynamic random access memory or field-programmable gate arrays are necessary, will the layers actually be changed. It is a significant advantage, due to reduced silicon complexity requirements, to determine each of these lithography tradeoffs on a layer-by-layer basis.
The interconnect definition requires that the I/O layer always be in the stack. However, as shown in Figure 6, the remaining up-to-three layers can be any combination of compute and accelerator layers. We envision total stack heights of two, three, and four layers. The one to three layers above the I/O layer can be one to three compute layers, one to three accelerator layers, or any combination of compute and accelerator layers up to three. It is also possible to have applications for which only the I/O layer is required, to be used as a simple I/O interface chip. These modular layers can independently be developed or migrated through technologies.
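The configuration space described above is small enough to enumerate directly. The following Python sketch lists the possible stacks under two assumptions that are ours rather than part of the design definition: configurations are distinguished only by how many compute and accelerator layers they contain (the vertical ordering of those layers is ignored), and the I/O-only case is counted as a stack of height one. The function name `stack_configurations` is hypothetical.

```python
def stack_configurations(max_upper_layers=3):
    """Return (compute, accelerator) layer counts above the I/O layer.

    The I/O layer is mandatory and is not included in the counts.
    Layer ordering within the stack is ignored (an assumption).
    """
    configs = []
    for total in range(0, max_upper_layers + 1):
        for compute in range(total, -1, -1):
            configs.append((compute, total - compute))
    return configs


for compute, accel in stack_configurations():
    height = 1 + compute + accel  # the I/O layer is always present
    print(f"height {height}: I/O + {compute} compute + {accel} accelerator")
```

With these assumptions, there are ten distinct configurations: the I/O-only chip plus nine stacks of height two to four.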
Conclusion
In this paper, we have described the transition from general-purpose server systems and application domains to workload-optimized systems driven by networking growth rates. We provided the rationale for WSP, which is a new heterogeneous system that is cooptimized with, and is built around, a chip architecture capturing the attributes of both network and traditional server functions to support application processing at network line rates. Key WSP innovations are the balanced incorporation of MMT, hardware acceleration, and networking functionality. Continued trends in network growth will have an impact on future data center and cloud computing environments, in which purpose-built solutions leveraging network-optimized functionality will be employed. Software and application development remain a challenge for MMT and accelerators, for which considerable industry and university investment in parallel programming models and tools is being made. Significant performance-per-watt advantages of MMT, combined with XML acceleration, were shown through an SOA-based application-benchmark experiment. Future system scaling was addressed through the use of 3-D technologies to maintain application performance as networking line rates grow exponentially.
Figure 6
Configuration possibilities of three modular layers in a four-high stack.
