Abstract-In modern-day VLSI systems, performance and manufacturing costs are being driven by on-chip wiring needs because of the continuous increase in the number of transistors. This paper proposes a low-overhead wave-pipelined multiplexed (WPM) routing technique that harnesses the inherent intraclock-period interconnect idleness to implement wire sharing throughout the various hierarchical levels of design. It is illustrated that the WPM network can be readily incorporated into future gigascale integration (GSI) systems to reduce the number of interconnect routing channels in an attempt to contain escalating manufacturing costs. Both a system-level analysis and a circuit-level verification of WPM routing are presented. A multilevel interconnect network design simulator (MINDS) that uses system level interconnect prediction (SLIP) techniques and HSPICE circuit simulations for optimizing the interconnect dimensions has been used to assess the opportunities for application of WPM wire circuits in high-performance digital designs. A custom routing example highlights the ease with which the WPM routing technique can be incorporated into existing VLSI systems. In addition, for a 40 million transistor system case study, the system-level analysis reveals that the use of a WPM network could result in an almost 20% decrease in the number of metal layers for a less than 4% increase in dynamic power with no loss of communication throughput performance. The key virtues of WPM routing are its flexibility, robustness, implementation simplicity, and low overhead requirements.
I. INTRODUCTION

BECAUSE of the continuing advances in semiconductor technology, it is possible to include close to a billion transistors in a single system. The resulting digital system requires a large number of interconnects for data transmission between and within a myriad of logic and memory macrocells. This global and semiglobal interconnect complexity has resulted in systems whose performance is being increasingly restricted by interconnect performance and signal integrity [1], [2]. In addition, the increase in the number of interconnects has resulted in an increase in the number of metal layers for every new technology generation, which introduces a nontrivial increase in the manufacturing cost of the system [3]. It is, therefore, imperative to investigate VLSI interconnect design and implementation methodologies that most efficiently utilize the available wiring tracks in a multilevel wiring network. Various techniques have been proposed in an attempt to reduce the effect of interconnect design on overall system performance. For example, [4]-[6] discuss the various aspects of the network-on-chip (NoC) paradigm, where the data is exchanged between various source-sink pairs using packet-switched or circuit-switched networks. However, most of these complex implementations are focused on sharing a connected network between intellectual property (IP) cores in a system-on-chip (SoC) configuration. Due to the complexity involved in the design and operation of these techniques, it is difficult to apply them both between and within these SoC cores. In addition, the authors in [7] use an 8-slot time-division multiplexed (TDM) technique for transmitting the data between the network interface unit and the switch unit of the interconnect network. A 64-bit data word is time multiplexed and sent over 8 lines. Moreover, [8] describes the use of TDM communication between the different cores of a SoC over a shared bus. Each core is given access to the shared bus in an interleaved manner using a two-level arbitration protocol. In the cases mentioned above, even where TDM methodologies are used, a significant amount of microarchitectural change to the system and overhead circuitry is necessary.
In contrast to these techniques (TDMA, lottery-based, etc.) that apply only to on-chip buses, the authors of this paper proposed in [9] a unique approach to interconnect sharing that uses a 2-slot time division multiplexed (2-TDM) routing technique that can be easily applied to both intercore and intracore interconnects in any SoC design. This routing technique takes advantage of the inherent interconnect idleness within a clock period of noncritical path wires and sends multiple data signals over a given range of shared interconnects in that single clock period. By incorporating the 2-TDM circuits and timing strategies into the physical design of a gigascale integration (GSI) system, the signal routing needs and excessive routing congestion could be significantly reduced. This paper investigates the system level impact and circuit implementation of the wave-pipelined multiplexed (WPM) wire routing that is incorporated in an n-tier multilevel interconnect network. To understand the system level impact of this WPM routing, a multilevel interconnect network design simulator similar to [10] is used to design an interconnect network for a digital system. The simulator based on [10] uses compact expressions for evaluating various design points; however, in order to increase the accuracy of the interconnect network design, the multilevel interconnect network design simulator (MINDS) has been altered to interface with HSPICE and RAPHAEL to model the interconnect transients more accurately. In addition, WPM routing circuits have been seamlessly incorporated into the enhanced version of MINDS, which is referred to as HR-MINDS, to explore the opportunities of the WPM wire routing.
Section II gives a detailed description of the WPM routing technique, its circuit level implementation, and a custom routing example to illustrate the ease of application of WPM to existing systems. A brief discussion of interfacing MINDS with HSPICE and RAPHAEL is presented in Section III. The detailed description of the integration of MINDS, HSPICE, and RAPHAEL can be found in the Appendix. A case study exhibiting the advantages of this wire sharing technique at the system level is also highlighted in Section III. The results of this case study illustrate an opportunity whereby close to a 20% reduction in the number of metal levels can be obtained for only a 4% increase in the dynamic power with no loss of throughput performance.
II. WPM ROUTING CIRCUITS
A. Theory
It is assumed that all interconnects on a tier have approximately the same wiring pitch and that this pitch is proportional to the length of the longest interconnect on that tier [10]. A tier in this paper is defined as a pair of orthogonal routing levels with the same pitch. As a result, the shorter interconnects on any particular tier require less than the allotted time period for transmitting the signal. Hence, the shorter interconnects on the semiglobal or global tiers, which are not in the critical path, remain idle during part of the clock period. The WPM technique takes advantage of this wire idleness and sends one additional data signal during the idle period in a wave-pipelined manner. In fact, to fully utilize WPM routing, this work recommends that most semiglobal and global wires be designed at the register transfer level (RTL) stage such that they have pipeline stages. This would be the most significant constraint on a microarchitecture that uses WPM routing extensively.
1) Interconnect Idleness Distributions:
To illustrate the amount of wire idleness that is present in a current system, a system level simulator similar to [10] is used to simulate a 40 million transistor logic core that is implemented in 100-nm technology with a 1.3-GHz clock and a 1.2-cm$^2$ core area. Fig. 1 shows interconnect delay normalized to the clock period for all interconnect lengths on different wire tiers of this simulated logic core and the stochastic interconnect demand function for this system as a function of interconnect length [11]. The interconnect demand function gives a cumulative distribution of the interconnect lengths in a macrocell and uses average gate pitch as its units. For our case study, gate pitch $= \sqrt{A_c/N_g}$, where $A_c$ is the area of the core and $N_g$ is the number of gates in the core. It can be observed from Fig. 1 that the multilevel interconnect network has been designed such that the longest interconnect on each tier requires a maximum of 80% of the clock period for data transfer from source to sink. The extra 20% of the clock period accounts for clock skew and provides the necessary guardband to ensure a timely transfer of data from source to sink. It can be calculated from Fig. 1 that 67% of wires with length greater than 0.1 mm require less than 60% of the available clock period for data transmission. The WPM routing takes advantage of the resulting idle time and sends a second signal during the idle portion of the clock cycle in a wave-pipelined fashion. A simple wave-pipelining technique similar to [12] is adopted for sending multiple signals in a clock period. The expression for calculating the minimum sustainable pulse width ($T_{pw}$) that can travel down a repeater interconnect circuit without any loss of signal integrity is given in [12]. In WPM routing, two signals are transmitted in one clock period. The first signal is scheduled at the beginning of the clock period and the second signal is scheduled after $T_{pw}$ seconds.
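As a quick numerical check of the case-study parameters above, the following Python sketch computes the average gate pitch and the 80% delay budget. It assumes the usual definition of gate pitch as the square root of core area divided by gate count, and it assumes six transistors per average gate (the three-input NAND that the Section III case study uses as the representative gate). All variable and function names are illustrative only.

```python
import math

# Values quoted in the text for the simulated logic core.
CORE_AREA_CM2 = 1.2        # core area in cm^2
N_TRANSISTORS = 40e6       # logic transistors
F_CLK_HZ = 1.3e9           # clock frequency

# Assumption: six transistors per average gate (three-input NAND).
TRANSISTORS_PER_GATE = 6


def gate_pitch_um(area_cm2, n_gates):
    """Average gate pitch sqrt(area / gates), converted from cm to micrometres."""
    return math.sqrt(area_cm2 / n_gates) * 1e4


n_gates = N_TRANSISTORS / TRANSISTORS_PER_GATE
pitch_um = gate_pitch_um(CORE_AREA_CM2, n_gates)

t_clk_ns = 1e9 / F_CLK_HZ          # clock period in ns
delay_budget_ns = 0.8 * t_clk_ns   # 80% of the period; 20% left for skew and guardband

print(f"gate pitch   = {pitch_um:.2f} um")
print(f"clock period = {t_clk_ns:.3f} ns")
print(f"80% budget   = {delay_budget_ns:.3f} ns")
```

Under these assumptions the gate pitch comes out at roughly 4.2 μm, the clock period at roughly 0.769 ns, and the delay budget at roughly 0.615 ns.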
Both signals will arrive at the respective sinks within a single clock period as long as

$$\tau + T_{pw} \le 0.8\,T_{clk} \qquad (1)$$

where $\tau$ is the 50% latency of the wire channel, $T_{pw}$ is the minimum sustainable pulse width, and $T_{clk}$ is the clock period. The condition in (1) ensures that the second signal reaches the appropriate sink before the end of the current clock period. The interconnects that satisfy delay constraint (1) can be classified as Type-I interconnects. Fig. 2 shows the plot of the left-hand side of (1) for different interconnect lengths in this 40 million transistor logic core. In addition, the corresponding stochastic interconnect demand function [11] for this system is also plotted as a function of wire length. The shaded regions in Fig. 2 illustrate the range of interconnects to which the WPM routing technique can be applied without any loss of latency or performance. In the case of the longer interconnects that do not satisfy the delay constraint given by (1), the WPM technique is further modified. Even in this case, the first signal is sampled and transmitted at the beginning of the clock cycle ($t = 0$) and the second signal is sampled and transmitted at $t = T_{pw}$. However, the two signals do not reach the appropriate sinks before the end of the current clock cycle and hence, they are available to the receiver side circuitry only after $t = T_{clk}$ (i.e., during the second clock cycle). Since we have assumed that all the circuits of our system sample data at the beginning of the clock period, the data sent at $t = 0$ and $t = T_{pw}$ can be used only at $t = 2T_{clk}$. As a result, there is an increase in the signal latency.
However, even if the first set of signals does not reach the respective sinks at $t = T_{clk}$ (the end of the first clock period), the second set of signals can be scheduled at $t = T_{clk}$ without losing signal integrity. The second set of signals will reach the respective sinks in the third clock cycle, and by that time the first set of signals would have already been used by the receiver side circuitry. The second set of signals can be used at $t = 3T_{clk}$. Therefore, signals can be transmitted at the source side in every clock cycle and be sampled at the sink side in every clock cycle, and the overall throughput performance of the system is maintained.
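The scheduling argument above can be sketched in a few lines of Python. This is a toy timing model, not the authors' simulator: the function name `wpm_schedule` and the normalized values are hypothetical. The only facts taken from the text are that each set of two signals is launched at the start of a clock cycle and one minimum pulse width later, and that a set is consumed two clock periods after the start of the cycle in which it was launched.

```python
def wpm_schedule(n_sets, t_clk, t_pw, latency_cycles=2):
    """List (launch_p0, launch_p1, use) times for successive data sets.

    Set k is launched in clock cycle k and consumed at the clock edge
    latency_cycles periods after the start of that cycle.
    """
    rows = []
    for k in range(n_sets):
        launch_p0 = k * t_clk               # first signal: start of cycle k
        launch_p1 = k * t_clk + t_pw        # second signal: one pulse width later
        use = (k + latency_cycles) * t_clk  # edge at which the receiver samples
        rows.append((launch_p0, launch_p1, use))
    return rows


# Normalized example: clock period 1.0, pulse width 0.25 (illustrative values).
schedule = wpm_schedule(n_sets=3, t_clk=1.0, t_pw=0.25)
for launch_p0, launch_p1, use in schedule:
    print(f"launch P0 at {launch_p0:.2f}, P1 at {launch_p1:.2f}, use at {use:.2f}")

# One set is launched per cycle and one set is consumed per cycle,
# so throughput is unchanged even though latency is two cycles.
use_times = [row[2] for row in schedule]
intervals = [b - a for a, b in zip(use_times, use_times[1:])]
print("interval between consecutive uses:", intervals)
```

The spacing between consecutive use times equals one clock period, which is the sense in which throughput is maintained while latency doubles.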
Since the latency is two clock cycles, the shared interconnect, for this case, can have a total delay of $1.8\,T_{clk}$ (the remaining $0.2\,T_{clk}$ being reserved for clock skew and guardband). Hence, the timing constraint in (1) can be relaxed and both the signals would safely reach the appropriate sinks as long as

$$\tau + T_{pw} \le 1.8\,T_{clk} \qquad (2)$$

where, as in (1), $\tau$ is the 50% latency of the wire channel, $T_{pw}$ is the minimum sustainable pulse width, and $T_{clk}$ is the clock period. Initially, the interconnects were designed such that they would have a maximum delay of $0.8\,T_{clk}$. However, under the new constraint in (2), $\tau$ and $T_{pw}$ can be larger. This provides an opportunity whereby it might be possible to reduce the silicon area and the wire area by application of the WPM routing technique. The interconnects designed to satisfy delay constraint (2) can be classified as Type-II interconnects.
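The Type-I and Type-II classification can be illustrated with a short Python sketch. It is a minimal model under stated assumptions: the allowance for clock skew and guardband is taken as 20% of one clock period in both the one-cycle and the two-cycle case, the function name `classify_interconnect` is hypothetical, and the third outcome ("not shareable") is added here for completeness and is not a category named in the text.

```python
def classify_interconnect(tau, t_pw, t_clk, guard_frac=0.2):
    """Classify a shared-wire candidate by its timing.

    tau        : 50% latency of the wire channel
    t_pw       : minimum sustainable pulse width
    t_clk      : clock period
    guard_frac : fraction of one clock period reserved for skew and
                 guardband (assumed to be 20% here)
    """
    total = tau + t_pw
    if total <= (1 - guard_frac) * t_clk:
        return "Type-I"       # both signals arrive within one clock period
    if total <= (2 - guard_frac) * t_clk:
        return "Type-II"      # two-cycle latency, throughput maintained
    return "not shareable"    # violates both constraints under this model


# Normalized examples (clock period = 1.0, illustrative values only).
examples = {
    "short wire": classify_interconnect(tau=0.45, t_pw=0.15, t_clk=1.0),
    "long wire": classify_interconnect(tau=0.90, t_pw=0.15, t_clk=1.0),
    "very long wire": classify_interconnect(tau=1.80, t_pw=0.15, t_clk=1.0),
}
for name, kind in examples.items():
    print(f"{name}: {kind}")
```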
It is the premise of this work that the WPM routing technique can be readily applied to both the intracore and the intercore communication in a SoC. Fig. 3 shows two wiring nets. Assuming that there exists some intraclock period idleness, the signals that were being sent on dedicated interconnects earlier will now be sent over a shared interconnect as shown in Fig. 4. A 2:1 multiplexer and a 1:2 demultiplexer are required for correct scheduling and routing of the input and output signals over the shared resource. Some buffers are also used to ensure there is no loss of data.
2) Source-Sink and Run Length Proximity: The possibility of using shared wire resources instead of dedicated interconnects is also determined by the physical placement of a given pair of source and sink, or by the routing of the interconnects such that a given pair of interconnects have shared run length. A wide variety of routing configurations of the WPM technique could be used in both regular and irregular routing. Fig. 5 shows two interconnects, A and B, that have their sources and sinks close to each other. For application of the WPM technique, the sources should be within a threshold distance of each other, and the sinks should likewise be within that threshold distance of each other. Here the threshold distance is proportional to the average of the two interconnect lengths and is chosen such that the deviation in routing will have minimal impact on delay. If any source-sink pairs satisfy this physical constraint then, depending upon the existence of wire idleness, the two dedicated interconnects can be replaced by a single shared wiring resource with the insertion of WPM overhead circuitry.
On the other hand, Fig. 6 shows two interconnects A and B that have shared run length and are separated by less than a threshold distance. Here, the threshold distance is proportional to the length of the longer interconnect. In this case, one can replace the shorter interconnect A and part of the longer interconnect B by a shared wire. The data that was transmitted over the dedicated wires earlier will now be transmitted partially over the shared wire and partially over the dedicated wire. For this second physical constraint, the interconnects can be of equal lengths too. As long as they have some shared run length, one can replace part or the whole of the two interconnects by a single shared interconnect.
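The source-sink proximity test described above can be sketched as follows. This is only an illustration of the first physical constraint, not of the run-length constraint. The proportionality constant `k`, the use of Manhattan distance, the `Wire` class, and the coordinates are all assumptions introduced here; the text states only that the threshold is proportional to the average of the two interconnect lengths.

```python
from dataclasses import dataclass


@dataclass
class Wire:
    source: tuple   # (x, y) of the driver, in mm
    sink: tuple     # (x, y) of the receiver, in mm
    length: float   # routed length, in mm


def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def source_sink_proximate(a, b, k=0.05):
    """True if both sources and both sinks lie within the threshold distance.

    The threshold is k times the average of the two wire lengths;
    k = 0.05 is a placeholder, not a value from the text.
    """
    threshold = k * (a.length + b.length) / 2.0
    return (manhattan(a.source, b.source) < threshold
            and manhattan(a.sink, b.sink) < threshold)


wire_a = Wire(source=(0.0, 0.0), sink=(5.0, 0.0), length=5.0)
wire_b = Wire(source=(0.1, 0.0), sink=(5.0, 0.1), length=5.2)
wire_c = Wire(source=(2.0, 2.0), sink=(7.0, 2.0), length=5.0)

near = source_sink_proximate(wire_a, wire_b)   # endpoints 0.1 mm apart
far = source_sink_proximate(wire_a, wire_c)    # sources 4 mm apart
print("A and B can be considered for sharing:", near)
print("A and C can be considered for sharing:", far)
```

A pair that passes this geometric test would still need to satisfy the timing constraints before the two dedicated wires are replaced by one shared wire.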
B. WPM Circuit Design
Fig. 7(a) and (b) show the schematic diagrams of the circuitry required for conventional routing and WPM routing, respectively. Pipeline registers are used at the source and sink sides in both routing techniques for data storage. Simple low-overhead circuitry can be used to implement wire sharing and wave-pipelining similar to [12].
For conventional routing, a driver, a receiver and a suboptimal number [10] and suboptimal size [13] of repeaters are used. A suboptimal number of repeaters are inserted as [10] shows that inserting 50% of the optimal repeaters imposes only 10% performance penalty. Repeater sizing is assumed to be suboptimal because Bakoglu's expression [14] for optimal sizing of the repeaters overestimates the required transistor size [13] . Each repeater consists of an inverter pair. For WPM routing a 2:1 multiplexer and a 1:2 demultiplexer are placed at the input and output, respectively, of the shared wire. Buffers are used at both the outputs of the demultiplexer to maintain signal integrity, and to hold the received value dynamically.
The signals P0 and P1 from the two different sources are applied to the two input lines of a 2:1 multiplexer. A select signal, which has a cycle period equal to the global clock cycle and remains at logic 1 only for the minimum pulse width calculated using [12], is applied to the select line of the multiplexer. When the select signal is high (beginning of the clock cycle), P0 is sampled by transmission gate A and transmitted over the shared interconnect, while when the select signal is low, P1 is sampled by transmission gate B and transmitted.
At the receiver end, the select signal is delayed to give two delayed select signals, and these delayed signals are used for sampling the data received on the shared wire.
The two delayed select signals are applied to the nFETs of transmission gates C and D, respectively, while Line_out is the signal that is transmitted over the shared interconnect and is given as input to the demultiplexer on the receiver side. Fig. 7(b) shows two delay circuitries at the receiver side. Delay circuitry 1 delays the select signal to give the first delayed select signal such that signal P0 gets sampled by transmission gate C as soon as it reaches the input (Line_out) of the demultiplexer. The second signal P1 follows P0 on the shared wire with a time difference equal to the minimum pulse width. Hence, delay circuitry 2 adds further delay to give the second delayed select signal such that transmission gate D samples signal P1 at the appropriate time. It should be noted that only one of the two transmission gates C and D is ON during sampling of signals received on the shared interconnect. These delay circuitries, 1 and 2, could be shared among multiple shared interconnects to distribute the resulting overhead.
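The multiplexing and demultiplexing behavior described above can be mimicked with a purely logical Python model. This sketch ignores all analog effects (wire delay, pulse width, skew) and simply treats each clock cycle as two time slots on the shared line; the function name `wpm_link` is hypothetical. The two bit streams are the ones used for P0 and P1 in the HSPICE example of Fig. 8.

```python
def wpm_link(p0_bits, p1_bits):
    """Send two bit streams over one shared line, two slots per clock cycle.

    Returns the slot-by-slot contents of the shared line and the two
    demultiplexed output streams.
    """
    line = []
    for b0, b1 in zip(p0_bits, p1_bits):
        line.append(b0)   # select high: gate A passes P0
        line.append(b1)   # select low: gate B passes P1

    op0 = line[0::2]      # gate C samples the first slot of each cycle
    op1 = line[1::2]      # gate D samples the second slot of each cycle
    return line, op0, op1


p0 = [0, 1, 1, 0, 0, 1, 0]   # bit stream 0110010
p1 = [0, 1, 1, 0, 1, 1, 0]   # bit stream 0110110

line, op0, op1 = wpm_link(p0, p1)
print("shared line:", line)
print("OP0 matches P0:", op0 == p0)
print("OP1 matches P1:", op1 == p1)
```

In the real circuit the slot boundaries are set by the select signal and its delayed copies rather than by list indices, so this model only captures the ordering of the data, not the timing margins.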
C. Noise Issues
In order to ascertain the limits of WPM routing performance, it has been assumed in the subsequent analysis that the dynamic delay effects due to interwire coupling noise are kept at a minimum. This performance limit can be achieved by using existing low-noise design techniques, such as co-planar ground wire insertion, staggered repeater design [15], and/or careful deviation in interconnect routing [16]. This type of low-noise design can be used to reach the limits of the minimum pulsewidth and to maximize the number of Type-I interconnects in a design.
However, if the aforementioned techniques cannot be used, then the delay of noisy wires becomes dependent on switching patterns. In this case the minimum pulsewidth will have to be appropriately increased so that the receiver circuits will sample the data signal correctly. A larger minimum pulsewidth will reduce the number of Type-I wire channels and result in a commensurate increase in the number of Type-II channels in a design. The ultimate impact of exceedingly noisy interconnect channels could be an increase in the number of microarchitectural changes needed to account for a larger number of Type-II wire channels.
Fig. 8 shows the timing waveforms, generated using HSPICE, for the two data signals sent over a 0.5-cm-long shared interconnect in a single clock cycle. A pitch of 1.05 μm is used for this interconnect. The pitch value is selected based on the interconnect network design obtained for the 40 million transistor logic core described in Section II. Signal P0 sends bit stream 0110010 while signal P1 sends bit stream 0110110. When the select signal goes high, transmission gate A samples and transmits the signal P0 over the shared interconnect. When the select signal goes low, the input signal P1 is sampled and transmitted by transmission gate B. At the receiver side, whenever the first delayed select signal is high, transmission gate C samples the data at the input of the demultiplexer (Line_out) and gives it as output OP0. At this time transmission gate D is OFF. When the second delayed select signal goes high, transmission gate D samples and transmits data on the shared wire. This corresponds to signal OP1. It can be observed from Fig. 8 that both input signals, P0 and P1, reach the appropriate sinks within one clock cycle and are read in correctly by the positive edge triggered pipeline registers. HSPICE simulation shows that the delay of the multiplexer and demultiplexer is small. The delay due to the 2:1 multiplexer and 1:2 demultiplexer for the 0.5-cm interconnect is 33.1 and 50 ps, respectively. Compared to the allowable wire delay of 550 ps this is quite small.
Fig. 9. HSPICE generated timing waveforms of wave-pipelined multiplexed circuit for two interconnects designed using the delay constraint (2).
D. Timing Issues in WPM Circuit
For interconnects that do not satisfy the delay constraint in (1) but exhibit source-sink or run length proximity the same circuit in Fig. 7(b) is used. As explained in Section II, the latency of the signals will be two clock cycles and the constraint in (2) is used.
For example, if the signal P0 is sampled at the beginning of a clock cycle, then signal P1 will be sampled one minimum pulsewidth later by the multiplexer. Assuming the first datum of P0 reaches Line_out during the second clock cycle (accounting for any clock skew and guardband), the first datum of P1 will reach Line_out one minimum pulsewidth after it. These signals will be used by the appropriate receiver side circuits two clock periods after the first datum was launched. Meanwhile, the second datum on P0 and P1 will be sampled and transmitted one clock period after the first, and will be used by the receiver side circuits one clock period after the first datum is used. Here, the delay circuitries will have to be suitably designed so that the first delayed select signal goes high when P0 arrives at Line_out and the second delayed select signal goes high when P1 arrives. Fig. 9 shows the HSPICE waveforms of two interconnects, each of length 0.7 cm. These interconnects do not satisfy the timing constraint given in (1) when the same pitch as that for the 0.5-cm interconnects in the earlier case is used. Hence, as described in the earlier section, these interconnects are redesigned so as to satisfy timing constraint (2). The new pitch is 0.586 μm. This helps in reduction of wire and silicon area. P0 and P1 are sampled and transmitted by the multiplexer when the select signal goes high and low, respectively. As can be seen from Fig. 9, both signals require more than one clock cycle to reach the input of the demultiplexer (Line_out). When the first delayed select signal is high, data at Line_out is sampled and transmitted to give OP0. On the other hand, when the second delayed select signal is high, data at Line_out is sampled and transmitted to give OP1. This data, OP0 and OP1, is used two clock cycles after it is transmitted at the source. The second set of data is scheduled one clock period after the first and is used by the receiver side circuitry two clock periods after that. The delay due to the 2:1 multiplexer and 1:2 demultiplexer is 14.3 and 30 ps, respectively, which is again very small compared to the allowable interconnect delay. Thus, though the overall latency of the system increases, the total communication throughput performance of the system is maintained.
E. Custom WPM Routing Example
A description of the 1.3-GHz fifth-generation SPARC64 microprocessor design is given in [17]. Using the die micrograph in [17], the approximate lengths of the interconnects between the floating point (FP) macrocell and the load/store (LS) macrocell, and between the fixed point (FX) macrocell and the LS macrocell, are estimated to be 1.023 and 0.75 cm, respectively. It is assumed that the interconnects travel from the center of one macrocell to the center of the other macrocell.
Given that it is a 64-bit microprocessor and it has 2 FP units, one can assume that there will be four read ports (therefore, 4 × 64 interconnects) and two write ports (therefore, 2 × 64 interconnects) on the FP macrocell that sends/receives data from the LS unit. In addition to these data lines, there will be additional control lines to send and receive various handshaking signals between the two macrocells; however, these control lines have been ignored for this case study. Thus, there will be a total of 384 interconnects (set A) between the two macrocells. Similarly, one can assume that there will be 384 interconnects (set B) between the FX macrocell and the LS macrocell.
In order to determine any existence of wire idleness, the interconnects in set A and set B are modeled using Level 49 HSPICE parameters for 130-nm technology [18] . The interconnect pitch and thickness values for the processor design are obtained from [17] and are shown in Table I . A suboptimal number of repeaters [10] , having suboptimal size [13] , are inserted on the wires.
The processor design in [17] has a die size of 1.81 × 1.599 cm and, hence, the interconnects of length 1.023 cm and 0.75 cm are assumed to be global interconnects that are routed on metal level 7 or 8. Hence, the interconnect width is considered to be 900 nm [17]. Table II shows interconnect delay for the two interconnect lengths obtained using HSPICE. Delay for the interconnect of length 0.75 cm is just 0.427 ns, i.e., 0.55 times the clock period, and from [12] the minimum pulsewidth evaluates to 0.131 ns. The sum of interconnect delay and minimum pulsewidth is 0.557 ns, which is less than 0.8 times the clock period. Thus, delay constraint (1) is satisfied. On the other hand, the interconnect of length 1.023 cm has a delay of 0.585 ns, which is 0.76 times the clock period. The minimum sustainable pulsewidth evaluates to 0.129 ns using [12] for this case. Hence, it does not satisfy the delay constraint in (1).
Fig. 10 shows the abstraction of the floor plan of the microprocessor described in [17]. The WPM routing can be applied to all interconnects in set B if they satisfy the proximity constraints. One can then reduce the number of routing channels by 50% without any loss of throughput performance while the latency constraint of the signal is maintained. For interconnects in set A, though the constraint in (1) is not satisfied, the WPM routing can still be applied to all interconnects and the routing channels can be reduced by 50% as long as the physical layout constraints are satisfied. Just as a traditional 2-stage pipeline increases the latency to twice the clock period, here the latency would increase to twice the clock period but the throughput performance would be maintained. Interconnects of set A could require some redesign at the RTL stage to account for this data latency change. Once the system is verified to work using shared interconnects for set B, then WPM could be seamlessly incorporated at the logic and circuit levels of design.
Thus, depending on the physical layout of the macrocell, there are various opportunities for incorporating the WPM wire sharing technique. The maximum advantage of the WPM routing can be obtained by incorporating the WPM wire sharing design approach in the CAD layout algorithms.
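The arithmetic of this custom routing example can be reproduced with a short Python sketch. It uses only the delays, pulsewidths, clock frequency, and port counts quoted in the example; the function and variable names are illustrative, and the 80% bound is the one the example applies when checking delay constraint (1).

```python
F_CLK_GHZ = 1.3
T_CLK_NS = 1.0 / F_CLK_GHZ          # about 0.769 ns
BUDGET_NS = 0.8 * T_CLK_NS          # about 0.615 ns


def satisfies_constraint_1(delay_ns, t_pw_ns):
    """Delay constraint (1): wire delay plus minimum pulsewidth within 80% of the period."""
    return delay_ns + t_pw_ns <= BUDGET_NS


# (delay, minimum pulsewidth) in ns, as reported for the two wire sets.
set_a = (0.585, 0.129)   # FP <-> LS, 1.023 cm
set_b = (0.427, 0.131)   # FX <-> LS, 0.75 cm

set_a_ok = satisfies_constraint_1(*set_a)
set_b_ok = satisfies_constraint_1(*set_b)

# Data lines per set: four read ports and two write ports, 64 bits each.
wires_per_set = (4 + 2) * 64
channels_after_sharing = wires_per_set // 2   # two signals per shared wire

print(f"budget = {BUDGET_NS:.3f} ns")
print("set A satisfies (1):", set_a_ok)
print("set B satisfies (1):", set_b_ok)
print(f"channels per set: {wires_per_set} -> {channels_after_sharing}")
```

Set B passes the one-cycle check and set A does not, which matches the classification in the example; in both cases full 2:1 sharing would halve the 384 data channels to 192, provided the proximity constraints hold.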
III. SYSTEM LEVEL ANALYSIS
It is shown previously that with the application of the WPM routing, it is possible to reduce the number of interconnects. However, to understand the impact of this WPM routing throughout the system, one needs to model a system having a multilevel interconnect network.
A. HSPICE RAPHAEL MINDS (HR-MINDS) Simulator
An optimal n-tier multilevel interconnect architecture design methodology for GSI has been proposed in [10] . This design methodology optimizes the interconnect cross-sectional dimensions on each tier and computes the logic macrocell area, cycle time, power dissipation and number of metal layers based on the given digital system design parameters. The proposed predictive design methodology also helps define the process technology parameters for future generations of microprocessors and ASICs. This design methodology uses compact models to generate the system design. A MINDS is developed based on the design methodology described in [10] .
However, preliminary comparison between the data obtained from compact models and HSPICE models shows some discrepancy. The compact models used in the design methodology underestimate the interconnect pitch for the various interconnect lengths resulting in an inaccurate system design. Hence, in order to make the design methodology more accurate and robust, HSPICE and RAPHAEL are interfaced with MINDS to increase the accuracy of device and circuit models. HSPICE is primarily used to determine the optimal pitch value for the different interconnect lengths based on the performance constraints of the GSI system while RAPHAEL is used to extract inductance and capacitance values for the different interconnects. While designing the n-tier multilevel interconnect network, an interconnect is assumed to have two neighboring ground lines and two ground planes in the case study. A more detailed description of the methodology used in HR-MINDS for designing the multilevel interconnect architecture can be found in the Appendix.
B. WPM Routing Case Study
In order to estimate the advantages of the WPM technique, HR-MINDS is modified to incorporate the WPM design in the overall interconnect design methodology. The algorithm for WPM design follows more or less the same set of steps as for the non-WPM design.
A digital system consisting of 40 million logic transistors is designed using 0.1-μm technology parameters with a three-input (six transistor) NAND gate chosen to represent the average standard gate. Copper and a low-k dielectric material are used to design the multilevel interconnect architecture. The system is assumed to have a die area of 1.2 cm$^2$ and is operated at 1.3 GHz. A suboptimal number [10] and a suboptimal size [13] of repeaters are inserted on the interconnects. Repeaters are inserted beginning from the topmost tier and are successively inserted on lower tiers based on the availability of free silicon area. The overhead circuitry for the shared wires is accounted for while assigning wires to different tiers. Signal integrity analysis using HSPICE simulations demonstrates that the transistor sizing for the multiplexer and demultiplexer pair can be smaller than that of the driver and receiver. Hence, the increase in silicon area due to the application of WPM routing is minimal.
The two factors that determine the application of the WPM technique, source-sink proximity and run length proximity, are highly dependent on the physical layout of the system. Hence, to model the impact of WPM circuits on a system, a wire sharing efficiency factor is considered. This factor quantifies the fraction of all the wires to which the WPM technique can be applied. For our case study the wire sharing efficiency factor is varied over a range of 20%-100%. Thus, various wire routing patterns and/or source-sink pair placement patterns can be considered while determining the system level impact of the WPM technique.
The fraction of the total number of interconnects that satisfies the physical and delay constraints could be fairly large. However, it is not beneficial to apply the WPM wire routing technique to all the potential interconnects. In order to generate an optimal design, a cutoff length is considered. The WPM routing technique is only applied to those interconnects that have lengths greater than this cutoff length and satisfy the physical layout constraints. Fig. 11 shows the demand function [11] curves for the various cutoff lengths, for a 40 million transistor system. A 60% wire sharing efficiency has been used while plotting these demand curves. As expected, for lower cutoff lengths the demand curve saturates at lower demand function values. This saturation of demand curves at lower values indicates that fewer interconnects need to be routed, resulting in a proportional decrease in the number of metal levels. Table III shows the number of metal levels that would be required for designing the 40 million transistor GSI system for various cutoff lengths. For the system under consideration the longest interconnect length is 2.19 cm (twice the die edge) and the number of metal levels required is 9.3 (≈10) for the conventional case. Table IV shows the total power dissipation for the various cutoff lengths. As expected, there is an increase in the power dissipation due to the increased overhead circuitry as one goes toward lower cutoff lengths. The conventional design dissipates a power of 19.51 W at 1.3 GHz. Fig. 12 shows the variation in percent reduction in the required wire area with cutoff length for various wire sharing efficiencies. As can be seen from the plot, a higher reduction in the required wire area can be obtained for a higher wire sharing efficiency, as the WPM technique can be applied to a larger number of interconnects. In addition, the percent reduction is higher for lower cutoff lengths. Close to a 20% reduction in the total number of metal levels can be obtained for a wire sharing efficiency of 60% and a cutoff length of 0.85 mm. The percent increase in the dynamic power of the system because of the application of the wire sharing technique, for various wire sharing efficiencies, is shown in Fig. 13. The increase in the dynamic power is primarily due to the overhead circuitry required for implementing wire sharing. The elimination of the dedicated interconnects and the repeaters on those interconnects does not reduce power, because the activity factor of the shared resources that replace the dedicated interconnects increases proportionally. For a wire sharing efficiency of 60%, a 4% increase in the dynamic power of the system for a cutoff length of 0.85 mm is observed. Both Figs. 12 and 13 also plot a trend where the WPM technique can be applied to all the interconnects, i.e., 100% wire sharing efficiency. One can get more than a 30% reduction in the required wire area for around an 8% increase in the dynamic power.
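The roles of the cutoff length and the wire sharing efficiency can be illustrated with a toy counting model in Python. This is not the HR-MINDS algorithm: it counts routing channels only and does not model metal levels, wire area, tier assignment, or power. The function name `routing_channels`, the pairing rule, and the list of wire lengths are hypothetical and are not derived from the demand function in [11].

```python
def routing_channels(lengths_mm, cutoff_mm, efficiency_pct):
    """Count routing channels after 2:1 sharing of eligible wires.

    Wires longer than cutoff_mm are eligible; efficiency_pct percent of
    them are assumed to meet the proximity constraints and are paired.
    """
    eligible = sum(1 for length in lengths_mm if length > cutoff_mm)
    shareable = eligible * efficiency_pct // 100   # wires that can be paired
    pairs = shareable // 2                         # each pair needs one channel
    return len(lengths_mm) - pairs


# Hypothetical wire lengths in mm (illustrative only).
lengths = [0.2, 0.5, 0.9, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

baseline = routing_channels(lengths, cutoff_mm=0.85, efficiency_pct=0)
at_60 = routing_channels(lengths, cutoff_mm=0.85, efficiency_pct=60)
at_100 = routing_channels(lengths, cutoff_mm=0.85, efficiency_pct=100)
high_cutoff = routing_channels(lengths, cutoff_mm=4.5, efficiency_pct=100)

print("no sharing:            ", baseline)
print("60% efficiency:        ", at_60)
print("100% efficiency:       ", at_100)
print("100%, cutoff at 4.5 mm:", high_cutoff)
```

Even in this simplified form the two trends from the case study are visible: a higher sharing efficiency and a lower cutoff length both reduce the number of channels that must be routed.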
C. Comparison of Application of WPM Routing to Type-I versus Type-I and Type-II Interconnects
Though the benefits of using the WPM routing technique are obvious, the interconnects designed using delay constraint (2) may require some redesign at the RTL stage to account for the increase in interconnect latency. Depending on the design approach, these changes may or may not be trivial. If the changes are going to be nontrivial, then it may be advantageous to apply WPM only to those interconnects that satisfy the delay constraint (1). Fig. 14 shows the reduction in wire area when WPM routing is applied to Type-I interconnects only, and when it is applied to both Type-I and Type-II interconnects. Here, the WPM routing technique is applied to the interconnects routed on tier 2 and above. Since not all interconnects may satisfy the proximity constraints, the wire sharing efficiency factor is varied from 100% to 20%. As expected, there is a larger reduction in wire area and a larger increase in power when WPM routing is applied to both Type-I and Type-II interconnects. A significant fraction of interconnects on a tier satisfy the delay constraint (1). Close to a 15% reduction in the required wire area is obtained for a wire sharing efficiency of 60% when WPM routing is applied to only Type-I interconnects. The increase in power due to the application of the WPM routing technique is shown in Fig. 15. For a wire sharing efficiency of 60%, there is more than a 4% increase in power when Type-I interconnects are designed using WPM routing. As can be seen from Figs. 14 and 15, the two cases differ only slightly in both the reduction in wire area and the increase in power.
Future GSI designs are expected to have multiple cores on a single chip. The communication protocol between any two cores is expected to be latency insensitive [19] , [20] . The WPM routing technique can be used in combination with NoC configuration for this latency insensitive intercore communication. Any increase in the interconnect latency after application of WPM routing will be easily absorbed by the latency insensitive communication protocol.
In addition, it is highly improbable that the wire sharing technique can be applied to all the interconnects, because of the physical layout constraints. If the WPM design approach is incorporated in the CAD layout algorithms, then it might be possible to gain the maximum reduction in the number of metal layers for a small increase in the dynamic power.
IV. CONCLUSION
A unique WPM routing technique that takes advantage of the intraclock period wire idleness is proposed. Using this wire routing technique it is possible to send multiple signals over a single interconnect in one clock cycle. Due to its simplicity and robustness of application, this WPM technique could be easily incorporated in the new GSI systems without any architectural changes. This technique has the potential to become a ubiquitous routing technique that can be easily applied to both intercore and intracore interconnects in any SoC or microprocessor design.
The custom routing example illustrates the opportunities whereby the WPM technique can be incorporated into the existing system design and the number of routing channels can be reduced by 50% with no loss in the throughput performance. The system level impact of WPM routing is described in detail.
Using a new and more accurate system simulator (HR-MINDS) to assess the advantages of the WPM wire network, close to a 20% reduction in the number of metal layers, with only a 4% increase in dynamic power and virtually no loss in communication performance, is observed when this WPM technique is applied to future GSI systems.
APPENDIX HR-MINDS
The optimal n-tier multilevel interconnect architecture design methodology for GSI that has been proposed in [10] uses compact models for evaluating various design variables. However, preliminary comparison between the data obtained from compact models and HSPICE models shows that one cannot directly port the system design to the circuit level as the compact models underestimate the interconnect pitch for various interconnect lengths. This results in an inaccurate system design. Hence, HSPICE and RAPHAEL are integrated with MINDS to make the design methodology more rigorous. HSPICE is primarily used to determine the optimal pitch value for the different interconnect lengths based on the performance constraints of the GSI system while RAPHAEL is used to extract inductance and capacitance values for the different interconnects. The revised design methodology for generating the interconnect network design is presented here.
The stochastic wire length distribution [11] is used to obtain an a priori estimate of interconnect lengths in a logical block while designing the n-tier interconnect network. The interconnect density function that describes the macrocell wiring given in [11] is

Region I ($1 \le l < \sqrt{N}$): $i(l) = \dfrac{\alpha k}{2}\,\Gamma \left( \dfrac{l^{3}}{3} - 2\sqrt{N}\,l^{2} + 2Nl \right) l^{2p-4}$

Region II ($\sqrt{N} \le l < 2\sqrt{N}$): $i(l) = \dfrac{\alpha k}{6}\,\Gamma \left( 2\sqrt{N} - l \right)^{3} l^{2p-4}$ (3)

where $l$ is the interconnect length in gate pitches; $N$ is the number of logic gates; $p$ is Rent's exponent; $k$ is Rent's coefficient; $\alpha$ is the fraction of sink terminals in the macrocell; and $\Gamma$ is the normalizing factor. The shortest wires are routed on the lowest tier (a collection of levels with the same wiring pitch) and have the smallest pitch value, and successively longer wires go on upper tiers with progressively larger pitch values. Interconnects on adjacent metal layers are routed orthogonally and have the same wiring pitch. An interconnect tier is formed by grouping pairs of metal layers that have the same wiring pitch.
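As an illustration of how the two-region density can be evaluated, the following is a minimal Python sketch. It assumes the commonly cited form of the distribution in [11], with Region I covering lengths from 1 up to the square root of the gate count and Region II covering lengths from there up to twice that value. The function names, the default normalizing factor of 1, and the helper `fraction_at_least` (which estimates the share of wires at or above a cutoff length by summing over integer lengths) are hypothetical and are not part of MINDS.

```python
import math


def interconnect_density(l, n_gates, p, k, alpha, gamma=1.0):
    """Interconnect density i(l) for a wire of length l (in gate pitches).

    Assumed two-region form of the stochastic wire length distribution;
    gamma is the normalizing factor (left at 1.0 here as an assumption).
    """
    root_n = math.sqrt(n_gates)
    if l < 1 or l >= 2 * root_n:
        return 0.0  # outside the supported length range
    if l < root_n:  # Region I
        shape = l**3 / 3 - 2 * root_n * l**2 + 2 * n_gates * l
        return alpha * k / 2 * gamma * shape * l ** (2 * p - 4)
    # Region II
    return alpha * k / 6 * gamma * (2 * root_n - l) ** 3 * l ** (2 * p - 4)


def fraction_at_least(cutoff, n_gates, p, k, alpha):
    """Fraction of wires with length >= cutoff (hypothetical helper).

    The normalizing factor cancels in the ratio, so it is not needed.
    """
    l_max = int(2 * math.sqrt(n_gates))
    lengths = range(1, l_max + 1)
    dens = [interconnect_density(l, n_gates, p, k, alpha) for l in lengths]
    total = sum(dens)
    return sum(d for l, d in zip(lengths, dens) if l >= cutoff) / total
```

A helper such as `fraction_at_least` gives a rough feel for how quickly the population of candidate wires shrinks as the cutoff length grows; it ignores the physical layout constraints discussed in the paper.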
The key step in this design methodology is to determine an optimal pitch value for the interconnect of the longest length on a tier. In the new design approach, HSPICE is used to calculate the optimal pitch value for an interconnect length and RC2 solver in RAPHAEL is used to extract the inductance and capacitance values of the interconnect under consideration. An RLC model of the interconnect is generated based on the design parameters, and the interconnect delay is evaluated using HSPICE for various pitch values. For the selected optimal pitch value, the interconnect delay is equal to an acceptable fraction of the cycle time.
Similar to the design methodology in [10], a suboptimal number of repeaters is inserted on each interconnect. In addition, based on [13], Bakoglu's expression for optimal repeater size [14] overestimates the required transistor size, resulting in an unrealistic design point. Therefore, a suboptimal number of repeaters, each having a suboptimal size, is inserted onto the interconnects.
The suboptimal value for the number of inserted repeaters is calculated as ( ) times Bakoglu's expression [14] for the optimal number of repeaters. Thus, the suboptimal number of inserted repeaters is calculated as follows:

$k_{\mathrm{sub}} \propto \sqrt{\dfrac{0.4\,R_{\mathrm{int}} C_{\mathrm{int}}}{0.7\,R_{0} C_{0}}}$ (4)

where $C_{\mathrm{int}}$ is the interconnect capacitance; $R_{\mathrm{int}}$ is the interconnect resistance; $R_{0}$ is the output resistance of a minimum sized inverter; and $C_{0}$ is the input capacitance of a minimum sized inverter. Suboptimal sizes for the driver, receiver, and repeaters are calculated as ( ) times Bakoglu's expression [14] for optimal repeater size. Thus, the suboptimal repeater size is calculated as follows:

$h_{\mathrm{sub}} \propto \sqrt{\dfrac{R_{0} C_{\mathrm{int}}}{R_{\mathrm{int}} C_{0}}}$ (5)

While calculating the interconnect pitch, the design methodology assumes a unity aspect ratio, i.e., $W = T = S = H = p/2$, where $W$ is the metal width, $T$ is the metal thickness, $S$ is the spacing between the interconnects, $H$ is the height of the interlevel dielectric, and $p$ is the tier pitch. The wiring efficiency factor [21], [22] is assumed to be constant at 40% for all the levels in the case study. This factor accounts for via blockage, power and ground lines, and the routing efficiency of the CAD tools.
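The repeater sizing step can be sketched in a few lines of Python. The sketch below assumes the standard textbook form of Bakoglu's expressions for the optimal repeater count and size; the two scaling factors (`count_scale` and `size_scale`) are hypothetical parameters standing in for the multipliers used by the methodology, whose values are not reproduced here, and the function names are illustrative only.

```python
import math


def bakoglu_optimal(r_int, c_int, r0, c0):
    """Bakoglu's optimal repeater count and size (assumed standard form).

    r_int, c_int: total interconnect resistance and capacitance.
    r0, c0: output resistance and input capacitance of a minimum sized inverter.
    """
    k_opt = math.sqrt(0.4 * r_int * c_int / (0.7 * r0 * c0))
    h_opt = math.sqrt(r0 * c_int / (r_int * c0))
    return k_opt, h_opt


def suboptimal_repeaters(r_int, c_int, r0, c0, count_scale, size_scale):
    """Scale the optimal values down; both scale factors are assumptions."""
    k_opt, h_opt = bakoglu_optimal(r_int, c_int, r0, c0)
    return count_scale * k_opt, size_scale * h_opt
```

In practice the scaled repeater count would be rounded to an integer before insertion; that rounding policy is left out because the text does not specify it.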
The following assumptions are made while interfacing HSPICE and RAPHAEL with MINDS.
1) Given that a longer execution time will be required due to the integration of HSPICE and RAPHAEL with MINDS, a dynamic lookup table method is adopted to reduce the total execution time. The lookup table stores an interconnect length and its corresponding optimal wiring pitch that satisfies the performance constraints, as and when the pitch value is calculated during the simulation. If the length of a new interconnect under consideration is within 50 gate pitches of an entry in the lookup table, then the pitch value is read from that entry.
2) The total time delay of the interconnect is the time required for the input signal to reach 50% of the supply voltage ($V_{dd}$) at the input of the receiver.
3) The longest interconnect in the system is assumed to be $2\sqrt{N}$ gate pitches, where $N$ is the number of gates in the system. Though the probability of having an interconnect of length $2\sqrt{N}$ gate pitches is very small, the methodology considers a worst case scenario where the system design has a global bus traveling from one corner of the system to its diagonally opposite corner.
4) The interconnect time delay expressed as a fraction of the cycle time, i.e., beta ($\beta$), is assumed to be 0.25 for shorter interconnects, as they would most likely constitute the critical path. For the longer interconnects, $\beta$ is assumed to be 0.8, as these longer interconnects would primarily be used for cross chip communication.
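The dynamic lookup table in assumption 1) can be sketched as a small cache keyed by interconnect length. The following Python sketch is one possible reading, assuming that a stored entry is reused whenever it lies within 50 gate pitches of the requested length and that the nearest such entry wins. The class name, the `solve_pitch` callback (a stand-in for the HSPICE based pitch calculation), and the nearest-entry tie-breaking are assumptions, not details taken from HR-MINDS.

```python
import bisect


class PitchLookupTable:
    """Cache of (interconnect length -> optimal pitch), kept sorted by length."""

    def __init__(self, tolerance=50):
        self.tolerance = tolerance  # reuse window, in gate pitches
        self._lengths = []
        self._pitches = []

    def get(self, length):
        """Return the pitch of the nearest entry within tolerance, else None."""
        i = bisect.bisect_left(self._lengths, length)
        best = None
        for j in (i - 1, i):  # only the two neighbours can be nearest
            if 0 <= j < len(self._lengths):
                dist = abs(self._lengths[j] - length)
                if dist <= self.tolerance and (best is None or dist < best[0]):
                    best = (dist, self._pitches[j])
        return None if best is None else best[1]

    def put(self, length, pitch):
        """Store a newly computed pitch, keeping the table sorted."""
        i = bisect.bisect_left(self._lengths, length)
        self._lengths.insert(i, length)
        self._pitches.insert(i, pitch)


def pitch_for(length, table, solve_pitch):
    """Use the cached pitch if available; otherwise run the expensive solver."""
    cached = table.get(length)
    if cached is not None:
        return cached
    pitch = solve_pitch(length)  # hypothetical stand-in for the HSPICE run
    table.put(length, pitch)
    return pitch
```

The table fills in as the simulation proceeds, so the expensive solver runs only for lengths that are not close to one already seen.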
The revised design algorithm uses a bisection method for evaluating the pitch for a given interconnect length to further reduce the execution time. Starting with an upper limit and a lower limit for the pitch (both specified in cm), the midpoint of this range is considered as the first estimate of the pitch for the interconnect under consideration. This wiring pitch is selected for the interconnect under consideration if the HSPICE model of the interconnect satisfies the timing constraint, i.e., the absolute value of the difference between the total time delay of the interconnect and ($\beta \times$ clock period) is less than 2.5%. If the timing constraint is not satisfied, then depending on whether the interconnect delay is less than or greater than ($\beta \times$ clock period), the range of pitch values to be considered for the next iteration is appropriately reduced. The midpoint of this new range is considered as the estimate of the pitch value for the next iteration. Thus, the range of the pitch values is reduced for each new iteration until the midpoint of the range satisfies the timing constraint. In the case of shorter interconnects, the wiring pitch that satisfies the timing constraint evaluates to less than twice the feature size. For these interconnects, the wiring pitch is set to a default minimum value of twice the feature size.
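The bisection search described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the delay is taken to decrease monotonically as the pitch grows, the 2.5% tolerance is interpreted relative to the target delay (beta times the clock period), and `toy_delay` with its coefficient is a purely hypothetical stand-in for the HSPICE evaluation of the RLC interconnect model. The function and parameter names are illustrative and are not taken from HR-MINDS.

```python
TOY_COEFF = 1.9e-18  # s*cm^2, hypothetical constant for the toy model below


def toy_delay(pitch_cm):
    """Hypothetical delay model: delay falls as the wire gets wider."""
    return TOY_COEFF / pitch_cm**2


def bisect_pitch(delay_of_pitch, target_delay, p_low, p_high, p_min,
                 rel_tol=0.025, max_iter=60):
    """Find a pitch whose delay is within rel_tol of target_delay.

    Assumes delay_of_pitch is monotonically decreasing in pitch and that
    the solution lies in [p_low, p_high]. The result is clamped to p_min
    (for example, twice the feature size).
    """
    for _ in range(max_iter):
        mid = 0.5 * (p_low + p_high)
        delay = delay_of_pitch(mid)
        if abs(delay - target_delay) <= rel_tol * target_delay:
            return max(mid, p_min)
        if delay > target_delay:
            p_low = mid   # too slow: search larger pitches
        else:
            p_high = mid  # faster than needed: search smaller pitches
    raise RuntimeError("bisection did not converge within max_iter")
```

For example, with a 1.3 GHz clock and beta of 0.25, the target delay is `0.25 / 1.3e9` seconds, and `bisect_pitch(toy_delay, 0.25 / 1.3e9, 1e-5, 1e-3, p_min=1e-6)` returns a pitch whose toy delay meets the 2.5% tolerance. Each iteration costs one delay evaluation, which is why halving the range is attractive when every evaluation is a circuit simulation.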
The new interconnect network design methodology in HR-MINDS follows the same set of steps as given by [10]. The only difference is that MINDS uses compact models to determine the pitch value of the interconnect, while HR-MINDS uses HSPICE and RAPHAEL.
