This paper introduces a novel architecture for next-generation adaptive computing systems, which we term 3D-SoftChip. The 3D-SoftChip is a 3-dimensional (3D) vertically integrated adaptive computing system combining state-of-the-art processing and 3D interconnection technology. It comprises the vertical integration of two chips (a configurable array processor and an intelligent configurable switch) through an indium bump interconnection array (IBIA). The configurable array processor (CAP) is an array of heterogeneous processing elements (PEs), while the intelligent configurable switch (ICS) comprises a switch block, a 32-bit dedicated RISC processor for control, on-chip program/data memory, and a data frame buffer, along with a direct memory access (DMA) controller. This paper introduces the novel 3D-SoftChip architecture for real-time communication and multimedia signal processing as a next-generation computing system. The paper further describes the advanced HW/SW codesign and verification methodology, including high-level system modeling of the 3D-SoftChip using SystemC, which is being used to determine the optimum hardware specification in the early design stage.
INTRODUCTION
System design is becoming increasingly challenging as the complexity of integrated circuits and the time-to-market pressures relentlessly increase. Adaptive computing is a critical technology to develop for future computing systems in order to resolve most of the problems that system designers are now faced with due in no small part to its potential for wide applicability. Up until now, however, this concept has not been fully realized because of many technology constraints such as chip real-estate limitations and the software complexity. With the coupled advancement of semiconductor processing technology and software technology, however, adaptive computing is now facing a turning point. For instance, the reconfigurable computing concept has more recently started to receive considerable research attention [1] [2] [3] and this concept is now starting to move and expand into the realm of adaptive computing. Software defined virtual hardware [4] and "do-it-all" devices [5] are good examples that demonstrate this development direction for computing systems. The major forthcoming impact from the deployment of adaptive computing is do-it-all devices. For example, a small handheld PDA size device could assume the functionality of about 10 standard devices simply depending on the context programs included such as a cellular phone, a GPS receiver, an MP3 player, an e-book reader, a digital camera, a portable television, a satellite radio, a handheld gaming platform, and so forth. This concept also becomes increasingly important as there is a growing need for a single product to support multiple (and evolving) standards without reengineering work.
Another growing problem in advanced computation systems, particularly for real-time communication or video processing applications, is the data bandwidth necessary to satisfy the processing requirements. The interconnection wire requirements in standard planar technology are increasing almost exponentially as feature sizes continue to shrink. Novel 3D integration systems such as the 3D system-on-chip (SoC) [6] and the 3D-SoftChip [7, 8], which satisfy the severe demand for more computation throughput by effectively manipulating the functionality of hardware primitives through the vertical integration of two 2D chips, are another concept proposed for next-generation computing systems. This paper proposes the novel 3D-SoftChip architecture as a forthcoming giga-scaled integrated circuit computing system and shows an implemented example of a single PE using SystemC. Figure 1 illustrates the physical architecture of the 3D-SoftChip comprising the vertical integration of two 2D chips. The upper chip is the intelligent configurable switch (ICS). The lower chip is the configurable array processor (CAP). Interconnection between the two 2D chips is achieved via an array of indium bump interconnections. A 2D planar architecture of the 3D-SoftChip can be seen in Figure 2. The rest of the paper is organized as follows. Section 2 introduces an overview of the 3D adaptive computing system. Section 3 provides overall explanations of the proposed 3D-SoftChip architecture and its distinctive features. Sections 4 and 5 introduce the detailed architecture of the CAP and ICS chips, respectively. The interconnection network structure is described in Section 6. Section 7 describes a suggested HW/SW codesign and verification methodology for the 3D-SoftChip and shows an implemented example of a single PE using SystemC. Finally, conclusions are provided in Section 8.
3D ADAPTIVE COMPUTING SYSTEM

3D vertically integrated systems overview
During the past few years, there has been significant research into 3D vertically integrated systems. This is due to the ever increasing wiring requirements, which are fast becoming the major bottleneck for future gigascale integrated systems [6, 9]. In very deep submicron silicon geometry the standard planar technology has many drawbacks with regard to performance, reliability, and so forth, caused entirely by limitations in the wiring. Moreover, the data bandwidth requirements for the next-generation computing systems are becoming ever larger. To overcome these problems, the concept of the 3D-SoC and the 3D-SoftChip has been developed, which exploits the vertical integration of 2D planar chips to effectively manipulate computation throughput. Previous work has shown that the 3D integration of systems has a number of benefits [10]. As described by Joyner et al. [10], 3D system integration offers a 3.9 times increase in wire-limited clock frequency, an 84% decrease in wire-limited area, or a 25% decrease in the number of metal levels required per stratum. There are three feasible 3D integration methods: a stacking of packages, a stacking of ICs, and a vertical system integration as was introduced by IMEC [9]. In this research, however, the focus is on the use of indium bump interconnection technology, as indium has good adhesion, a low contact resistance, and can be readily utilized to achieve an interconnect array with a pitch as low as 10 µm. The development of 3D integrated systems will allow improvements in packaging costs, performance, reliability, and a reduction in the size of the chips.
Adaptive computing system
A reconfigurable system is one that has reconfigurable hardware resources that can be adapted to the application currently under execution, thus providing the possibility to customize across multiple standards and/or applications. In most of the previous research in this area the concepts of reconfigurable and adaptive computing have been described interchangeably. In this paper, however, these two concepts will be more specifically described and differentiated. Adaptive computing will be treated as a more extended and advanced concept of reconfigurable computing. Adaptive computing will include more advanced software technology to effectively manipulate more advanced reconfigurable hardware resources in order to support fast and seamless execution across many applications. Table 1 shows the differences between reconfigurable computing and adaptive computing.
Previous work
Adaptive computing systems are mainly classified in terms of granularity, programmability, reconfigurability, computational methods, and target applications. Table 2 summarizes recent research work in this area according to these classifications. The table shows that early research and development focused on single linear array-type reconfigurable systems with single and static configuration, and that this has since evolved towards large adaptive SoCs with heterogeneous types of reconfigurable hardware resources and with multiple and dynamic configurability.
As illustrated in Table 2, the 3D-SoftChip architecture has several advantages over conventional reconfigurable/adaptive computing systems, resulting from the 3D vertical interconnections and the use of state-of-the-art adaptive computing technology (as will be described in the following sections). This makes it highly suitable for the next generation of adaptive computing systems.
3D-SOFTCHIP ARCHITECTURE
Overall architecture of 3D-SoftChip
Figure 3 shows the overall architecture of the 3D-SoftChip. As can be seen, it comprises four unit chips. By including four separate unit chips in the architecture, sufficient flexibility is provided to allow multiple optimized task threads to be processed simultaneously. Given the primary target applications of multimedia processing and communications, four unit chips should be sufficient for all such requirements. Each unit chip has a PE array, a dedicated control processor, and a high-bandwidth data interface unit. According to a given application program, the PE array processes large amounts of data in parallel while the ICS controls the overall system and directs the PE array execution, data, and address transfers within the system.
Features of 3D-SoftChip
The 3D-SoftChip has four distinctive features: various computation models, adaptive word-length configuration computation [7], an optimized system architecture for communication and multimedia signal processing, and dynamic reconfigurability for adaptive computing.
Computation algorithm: various computation models
As described before, one 32-bit RISC controller can supply control, data, and instruction addresses to 16 sets of PEs through the completely freely controllable switch block, so various computation models such as SISD, SIMD, MISD, and MIMD can be achieved as required. Enough flexibility is thus achieved for an adaptive computing system. In particular, within the single instruction multiple data (SIMD) computation model, three different SIMD computational models can be realized: massively parallel, multithreaded, and pipelined [19]. In the massively parallel SIMD computation model, each unit chip operates with the same global program memory. Every computation is processed in parallel, maximizing computational throughput. In the multithreaded SIMD computation model, the program instructions executed in each unit chip can differ from those in the others, so multithreaded programs can be executed. The final one is the pipelined SIMD computation model. In this case each unit chip executes a different pipeline stage. Because of these SIMD computation characteristics, the 3D-SoftChip can adaptively maximize its computational throughput according to various application requirements. These three computational models are illustrated in Figure 4.
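The difference between the three SIMD models is easiest to see as a difference in how programs and data are distributed over the unit chips. The following Python fragment is a minimal behavioural sketch under stated assumptions: the function names, the round-robin data distribution, and the use of plain Python callables as "programs" are all hypothetical and are not taken from the 3D-SoftChip design. It models data flow only, not cycle timing.

```python
from typing import Callable, List, Sequence

NUM_UNIT_CHIPS = 4  # the architecture described in the text has four unit chips

Kernel = Callable[[int], int]  # hypothetical stand-in for a program run by a unit chip


def massively_parallel(kernel: Kernel, data: Sequence[int]) -> List[int]:
    """All unit chips run the same program, each on its own share of the data."""
    out = [0] * len(data)
    for chip in range(NUM_UNIT_CHIPS):
        # Assumption: data items are dealt out round-robin to the unit chips.
        share = data[chip::NUM_UNIT_CHIPS]
        out[chip::NUM_UNIT_CHIPS] = [kernel(x) for x in share]
    return out


def multithreaded(kernels: Sequence[Kernel],
                  streams: Sequence[Sequence[int]]) -> List[List[int]]:
    """Each unit chip runs its own program on its own data stream."""
    return [[kernel(x) for x in stream] for kernel, stream in zip(kernels, streams)]


def pipelined(stages: Sequence[Kernel], data: Sequence[int]) -> List[int]:
    """Each unit chip implements one stage; every item passes through all stages in order."""
    out = []
    for x in data:
        for stage in stages:
            x = stage(x)
        out.append(x)
    return out
```

In this sketch the massively parallel model maximizes throughput for one program, the multithreaded model trades that for independent programs, and the pipelined model chains the unit chips so that each one specializes in a single stage.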
Word-length configuration
This is a key characteristic in order to classify the 3D-SoftChip as an adaptive computing system. Each PE's basic processing word-length is 4 bits. This can, however, be configured up to 32 bits according to the application in the program memory. Figure 5 illustrates the proposed word-length [7, 20] and the completely freely controllable switch block architecture in the ICS chip.
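To make the idea of word-length configuration concrete, the sketch below chains 4-bit slices to form a wider adder. It is only an illustration under assumptions: the ripple-carry chaining, the restriction to multiples of 4 bits, and the function names are hypothetical, and the actual scalable arithmetic primitives are those described in [7, 20].

```python
SLICE_BITS = 4        # basic PE word length stated in the text
MAX_WORD_BITS = 32    # maximum configured word length stated in the text


def add_4bit(a: int, b: int, carry_in: int):
    """Add two 4-bit operands plus a carry; return (4-bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xF, total >> 4


def configurable_add(a: int, b: int, word_bits: int) -> int:
    """Add two operands by chaining 4-bit slices (assumed ripple-carry scheme)."""
    if word_bits % SLICE_BITS != 0 or not SLICE_BITS <= word_bits <= MAX_WORD_BITS:
        raise ValueError("word_bits must be a multiple of 4 between 4 and 32")
    result, carry = 0, 0
    for i in range(word_bits // SLICE_BITS):
        shift = i * SLICE_BITS
        nibble_sum, carry = add_4bit((a >> shift) & 0xF, (b >> shift) & 0xF, carry)
        result |= nibble_sum << shift
    return result  # the final carry is dropped, so the sum wraps modulo 2**word_bits
```

For example, `configurable_add(0x1234, 0x0FFF, 16)` uses four slices, while the same call with `word_bits=32` would use eight.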
Optimized system architecture for communication and multimedia signal processing
There are many similarities between communications and multimedia signal processing, such as data parallelism, low-precision data, and high computation rates. Communication signal processing differs mainly in requiring more data reorganization, such as matrix transposition, and potentially more bit-level computation. To fulfill these signal processing demands, each unit chip contains two types of PE. One is a standard PE for generic ALU functions, which is optimized for bit-level computation. The other is a processing accelerator PE for DSP. In addition, special addressing modes to leverage the localized memory, along with 16 sets of loop buffers in the ICS, add to the specialized characteristics for optimized communication and multimedia signal processing.
Dynamic reconfigurability for adaptive computing
Every PE contains a small quantity of local embedded SRAM memory and additionally the ICS chip has an abundant memory capacity directly addressable from the PEs via the indium bump interconnect array. Multiple sets of program memory, the abundant memory capacity, and the very high-bandwidth data interface unit make it possible to switch programs easily and seamlessly, even at runtime.
ARCHITECTURE OF CAP CHIP
The basic architecture of CAP chip is a linear array of heterogeneous PEs. Figure 6 shows three possible architecture choices for the PEs. The architecture in Figure 6 (b) is suggested as the most feasible architecture for the PE in the 3D-SoftChip because it has the optimum tradeoff between application-specific performance and flexibility. Examples of type A can be seen in [1, 2, 12, 14] , type B in [17] , and type C in [18] . The CAP chip has the basic role of the processing engine for the 3D-SoftChip. It manipulates large amounts of data at a high-computational rate using any of the three different SIMD computation models previously described. Figure 7 illustrates the two types of PE architecture chosen to optimize multimedia signal processing and communication type applications.
Two types of PEs
Standard PE
The S-PE is for standard ALU functions and is also optimized for bit-level operation for communication signal processing. It comprises 4 sets of 19-bit registers for S-PE instruction decoding, two multiplexers to select input operands from the data bus, adjacent PEs, or internal registers; a standard ALU with a bit-serial multiplier, adder, subtracter, and comparator, an embedded local SRAM and 4 sets of registers. The arithmetic primitives are scalable so as to make it possible to reconfigure the word-length for specific tasks. The scalable arithmetic primitive's architecture is presented in [7, 20] . Moreover it can execute single-clock-cycle absolute value computation and comparison. Table 3 shows the functions of S-PE. It is suitable for bit-wise manipulation and generic ALU functions.
Processing accelerator PE
The PA-PE is dedicated specifically to digital signal processing (DSP) operations. It consists of 4 sets of 19-bit registers for PA-PE instruction decoding, two multiplexers to select input operands from the data bus, adjacent PEs or internal registers, a signed 4-bit scalable parallel/parallel multiplier, an accumulator/subtracter modified to enable MAC
PE instruction format and operation modes
The PE instruction format consists of a 19-bit instruction word. The 2 most significant bits (WS_en/RS_en, WR_en/RR_en) are used as the read/write enable bits of the embedded SRAM and registers. Bits 16 to 10 are used for SRAM and register selection (addressing). Bit 9 is used for the data output register enable signal and bits 8 to 6 are used to specify the PE operation. Finally, bits 5 to 0 are used to control the input multiplexers for input operand selection. This format is illustrated in Figure 8. Figure 9 illustrates the 3 types of PE operation modes that can be realized on the PE array: horizontal mode, vertical mode, and circular mode. These allow for even greater flexibility and help to maximize computational throughput according to the target application.
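The bit layout described above can be expressed as a small encoder/decoder. The sketch below follows the stated bit positions, but the field names are hypothetical, and the assignment of bit 18 to the SRAM enable and bit 17 to the register enable is an assumption based on the order in which the two enables are listed.

```python
from typing import NamedTuple

INSTRUCTION_BITS = 19


class PEInstruction(NamedTuple):
    sram_en: int     # bit 18: SRAM read/write enable (assumed to be the first MSB)
    reg_en: int      # bit 17: register read/write enable (assumed to be the second MSB)
    select: int      # bits 16..10: SRAM and register selection (addressing)
    out_reg_en: int  # bit 9: data output register enable
    opcode: int      # bits 8..6: PE operation
    mux_ctrl: int    # bits 5..0: input multiplexer control


def decode(word: int) -> PEInstruction:
    """Split a 19-bit instruction word into its fields."""
    if not 0 <= word < (1 << INSTRUCTION_BITS):
        raise ValueError("instruction word must fit in 19 bits")
    return PEInstruction(
        sram_en=(word >> 18) & 0x1,
        reg_en=(word >> 17) & 0x1,
        select=(word >> 10) & 0x7F,
        out_reg_en=(word >> 9) & 0x1,
        opcode=(word >> 6) & 0x7,
        mux_ctrl=word & 0x3F,
    )


def encode(instr: PEInstruction) -> int:
    """Pack the fields back into a 19-bit instruction word."""
    return ((instr.sram_en & 0x1) << 18 | (instr.reg_en & 0x1) << 17
            | (instr.select & 0x7F) << 10 | (instr.out_reg_en & 0x1) << 9
            | (instr.opcode & 0x7) << 6 | (instr.mux_ctrl & 0x3F))
```

The field widths sum to 2 + 7 + 1 + 3 + 6 = 19 bits, which matches the stated instruction word length.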
Embedded local SRAM
Each PE has a local embedded SRAM. The effective memory bandwidth is, therefore, increased dramatically by as much as the total number of PEs, which will result in an increase in effective processing speed in many applications and allows for rapid dynamic context switching. Bus traffic can also be reduced because many data transmission operations can be contained within a PE. Consequently, power dissipation will also be minimized.
Quad-PE
As previously described, one quad-PE consists of two pairs of PEs (two S-PEs and two PA-PEs). The quad-PE is controlled and configured by the switch block according to the control and address data from the ICS transmitted through the IBIA. Figure 10 shows the architecture of a single quad-PE.
ARCHITECTURE OF ICS CHIP
The ICS chip comprises the switch blocks, ICS RISC, program memory, data memory, data frame buffers, and DMA controller as illustrated in Figure 11 . The ICS chip is a control processor which controls the CAP chip via the IBIA as well as the overall system. The ICS RISC provides control and address signals and data to the system as a whole. The switch blocks configure each PE based on the current program instruction. The high-bandwidth data interface unit enables efficient transmission of data and instructions within the system.
Switch block
The switch block provides data from/to each PE and also provides instruction data to each PE. Three types of switch blocks, 6-sided, 7-sided, and 8-sided, provide optimized interconnections within the ICS chip. A pass-transistor design is used to optimize performance and minimize area allowing a completely free configuration for each PE.
ICS RISC
The ICS RISC is a 32-bit dedicated RISC control processor. The ICS RISC controls the execution of the PE array and provides control and address signals to program/data memory, the data frame buffers, and the DMA controller. It has a 3-stage pipelined architecture that includes instruction fetch (F), decode (D), and execute (E). To cope with the iterative nature of DSP arithmetic, it also has 16 sets of loop buffers, which supply instructions directly to the decode stage instead of fetching them from program memory in each iteration. This significantly reduces bus utilization, allowing for improved performance and lower power dissipation. Moreover, 32 general purpose registers and specialized addressing modes are provided for optimized communication and multimedia signal processing.
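The benefit of the loop buffers for bus utilization can be estimated with a simple fetch-count model. The sketch below is illustrative only and rests on assumptions not stated in the text: it treats the "16 sets of loop buffers" as 16 instruction slots, assumes a loop body that fits is fetched once and then replayed from the buffer, and assumes a loop body that does not fit is always fetched from program memory. The function name and parameters are hypothetical.

```python
LOOP_BUFFER_ENTRIES = 16  # assumption: the 16 loop buffers hold one instruction each


def program_memory_fetches(loop_len: int, iterations: int, use_loop_buffer: bool) -> int:
    """Count program-memory fetches needed to execute a loop body repeatedly."""
    if loop_len < 0 or iterations < 0:
        raise ValueError("loop_len and iterations must be non-negative")
    if iterations == 0:
        return 0
    if use_loop_buffer and loop_len <= LOOP_BUFFER_ENTRIES:
        # The body is fetched once on the first pass and replayed from the buffer afterwards.
        return loop_len
    # Without a loop buffer (or when the body does not fit) every pass fetches every instruction.
    return loop_len * iterations
```

Under these assumptions, an 8-instruction inner loop executed 100 times needs 800 program-memory fetches without the loop buffer and 8 with it.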
High-bandwidth data interface unit
The high-bandwidth data interface unit allows the efficient transfer of data within the 3D-SoftChip. Two sets of data frame buffers and the DMA controller make it easy to transfer large amounts of data. Multiple sets of program memory support runtime program switching and, because of this dynamic reconfiguration feature, adaptive computing is possible. The data memory has a variable word width so it can easily be combined to build wider/deeper memories and thus increase flexibility for different application programs and multiple word-length computations.
INTERCONNECTION NETWORK
The interconnection network of the 3D-SoftChip can be broken down into three hierarchical levels. The Inter-PE bus between PEs in the CAP chip is the first level. This local interconnection network has a 2D-mesh architecture providing nearest-neighbour interconnects between the PEs. The second level of the interconnection network is the switch block array interconnection. This supports longer interconnections on the ICS chip but also has a basic 2D-mesh architecture. The last hierarchical level of interconnection is the indium bump interconnect array (IBIA). With the progression of technology to ever decreasing semiconductor geometry scales, the prediction of interconnection delay as well as its impact on total system delay are crucial factors, introducing a major limiting factor in overall system performance. To overcome these problems, 3D interconnection technology using an array of indium bumps becomes very attractive because it supports a very high bandwidth coupled with a very low inductance/capacitance (and thus low-power dissipation) [8] . However, any other equivalent 3D-interconnection technology could also be applied to realize this interconnection level within the 3D-SoftChip architecture.
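The nearest-neighbour connectivity of the first-level 2D-mesh inter-PE bus can be sketched as follows. The 4 x 4 array size used in the usage note is an assumption (16 PEs arranged as a square), as are the function name and the choice of a mesh without wrap-around links.

```python
from typing import List, Tuple


def mesh_neighbours(row: int, col: int, rows: int, cols: int) -> List[Tuple[int, int]]:
    """Return the nearest neighbours (up, down, left, right) of a PE in a 2D mesh."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("PE coordinates lie outside the array")
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    # Assumption: edge PEs simply have fewer neighbours (no wrap-around links).
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]
```

In an assumed 4 x 4 array, a corner PE has two neighbours, an edge PE has three, and an interior PE has four; longer connections would go through the switch block array or the IBIA rather than this local bus.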
Indium bump interconnection
Indium is an excellent interconnect material due to its excellent adhesion to most metals, including aluminum, which is the metallization for the pads used in most VLSI technologies. Indium has a low melting point, which implies a low work-hardening coefficient, allowing for direct bonding on processed VLSI wafers. Additionally, it provides excellent mechanical as well as electrical connectivity (contact resistance < 1 mΩ per bump). Reflow techniques can be used for flexibility and to increase the bump height to width ratio as needed. Such techniques can also be used to incorporate self-alignment features into the bonding process. Figure 12 (a) illustrates a cut-away view of the flip-chip indium bump interconnection; a micrograph of a single indium bump after reflow can be seen in Figure 12 (b).
HW/SW CODESIGN AND VERIFICATION METHODOLOGY
Figure 13 shows the HW/SW codesign methodology for the 3D-SoftChip. HW/SW partitioning is being executed to determine which functions should be implemented in hardware and which in software. The HW is currently being modeled at a system level using SystemC [21, 22] to verify functionality of the operation and to explore various architecture configurations while concurrently modeling the software in C. After that, a cosimulation and verification process will be implemented to verify the operation and performance of the 3D-SoftChip architecture and to decide on an optimal HW/SW architecture. More specifically, the SW will be a modified GNU C Compiler and Assembler. After the compiler and assembler for ICS RISC have been finalized, a program for the implementation of the MPEG4 motion estimation algorithm will be developed and compiled using them. After that, object code can be produced, which can be directly used as the input stimulus for an instruction set simulator and system level simulation. The HW/SW verification process can be achieved through the comparison between the results from instruction-level simulation and system-level simulation. From this point on, the rest of the procedure can be processed using any conventional HW design methodology, such as full and semicustom design.
System level modeling of single PE
Figure 14 shows the single Standard PE block diagram, the file structure of the SystemC modeling, and the output waveform of the system-level modeled Standard PE.
Figure 14: System level modeling of single PE: (a) standard PE block diagram, (b) file structure of standard PE, and (c) the output waveform of system-level modeled standard PE.
CONCLUSIONS
A novel 3D vertically integrated adaptive computing system architecture for communication and multimedia signal processing has been presented along with a system-level modeling example of a single PE. The described system leverages the very high-bandwidth connection between two chips, realizable through the indium bump interconnect array, to combine high-level ICS and low-level CAP processing engines to create a next-generation adaptive computing system. The described system architecture of the 3D-SoftChip is currently being fully modeled in SystemC in order to determine the optimal hardware architecture. The SW design is being concurrently finalized so that the novel concept of an adaptive system-on-chip computing system can be realized.
