Abstract-In this brief, a strategy to design high-fan-in multiplexers with minimum delay is proposed. The work extends the optimization proposed in Lin, 2000, to the case of switches with driving capability, which exhibit better noise immunity and are suitable for voltage scaling, properties that are becoming increasingly important in today's CMOS technologies. Moreover, the design strategy explicitly accounts for wiring parasitics in the design equations.
I. INTRODUCTION
Many applications, such as column decoders in memories and resistor chains in digital-to-analog converters (DACs), require the implementation of multiplexers with a high fan-in, N [1]. Two approaches have traditionally been used to implement high fan-in multiplexers: the uniform approach [2], in which N CMOS switches are interconnected with the output node in common, and the binary-tree structure [3], based on a tree-like multistage structure. Recently, a more general architecture for fast high fan-in multiplexers, called the heterogeneous tree, was proposed [1]; it represents a generalization of the uniform and binary-tree structures. For a high fan-in, this architecture achieves better speed performance than the traditional approaches, provided that a suitable optimization technique is applied.
The heterogeneous-tree architecture, shown in Fig. 1, consists of successive stages whose switches are appropriately grouped. The switches of each stage multiplex the outputs of the previous stage, according to the control signals generated by the external decoder. The multiplexer's speed performance strongly depends on the number of stages and on the number of switches per group in each stage. Therefore, these parameters must be carefully tuned to maximize speed, whereas the multiplexer speed is unaffected by the decoder.
In [1], an optimization strategy to evaluate the optimum number of stages and number of switches per group was proposed, assuming switches without driving capability implemented with transmission gates or pass transistors. However, these classes of circuits are surpassed by static CMOS logic in many respects, including robustness, signal integrity, noise immunity, suitability for voltage scaling (which is mainly due to its input-to-output isolation), driving capability, and signal restoration [4], [5]. Therefore, especially for low-voltage circuits, the static logic approach is preferable to pass-transistor and transmission-gate implementations.
To implement a multiplexer architecture, a suitable static gate is the tri-state buffer in Fig. 2, where the selection signal, sel, enables the input, IN, to be transmitted to the output, OUT [3]. The circuit in Fig. 2 is used in place of the switches in Fig. 1. Since it is an inverting gate, for multiplexer architectures with an odd number of series switches an inverter can be added at the output if needed. Although the heterogeneous-tree concept can be applied generally, the design criteria in [1] are derived only for switches without driving capability. In this brief, heterogeneous-tree multiplexers are analyzed and optimized to achieve high-speed performance, assuming that the switches are replaced by the tri-state buffer (Fig. 2). The multiplexer delay and the design criteria to minimize it are derived analytically, and the expressions obtained are general and independent of the technology used. Moreover, the effect of the interconnect parasitics is taken into account analytically using a simple model.
Due to its simplicity, the design strategy found can profitably be used in designing high fan-in multiplexers. Moreover, a preliminary estimation of the achievable performance has also been provided to check whether a delay constraint can be satisfied for a given process and fan-in before optimization.
II. DELAY OF HETEROGENEOUS-TREE MULTIPLEXER
Let us consider the heterogeneous-tree multiplexer with k stages and N inputs depicted in Fig. 1, where the switches are implemented by tri-state buffers (Fig. 2) and each group of the jth stage contains $S_j$ switches.
It is worth noting that $S_j$ has to be chosen as a power of two, to simplify the decoder circuit that drives the tri-state buffers [1]. The unique path connecting a multiplexer input to the output can be represented as cascaded tri-state buffers in the ON state (i.e., each of which gives the inverted input value at its output). In particular, the tri-state buffer associated with the jth stage drives the output capacitance, $C_{out}$, of the other $(S_j - 1)$ tri-state buffers in the OFF state (i.e., in the tri-state condition) that belong to the same group, as well as the input capacitance, $C_{in}$, of the tri-state buffer of the following stage. Therefore, assuming without loss of generality a negligible load capacitance at the multiplexer output node, the critical path can be reduced to the equivalent circuit in Fig. 3, which for each stage reports only the tri-state buffer in the ON state that lies in the path (represented by a box).
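To evaluate the path delay, we assume a model of the tri-state buffer delay, $\tau_{switch}$, given by [3], [4] $\tau_{switch} = \tau_{int} + \gamma C_L$
where $\tau_{int}$ is the intrinsic delay, which accounts for internal parasitic capacitances, and $\gamma$ is a factor that accounts for the linear delay increase due to the external load capacitance, $C_L$, at the output. Hence, by inspection of Fig. 3, the path delay, PD, is the sum of the delays of the k cascaded tri-state buffers in the ON state, each loaded as described above.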
Substituting (4) into the delay expression (3) gives delay (5), which can be minimized with respect to k by differentiating PD with respect to k and equating the result to zero. Then, following some calculations and using relationship (4), we find that the minimum delay is achieved for the value of the group size S that satisfies relationship (6), where the term $\tau_{int} + \gamma C_{in}$ represents the tri-state buffer delay with a unity fan-out. The ratio $(\tau_{int} + \gamma C_{in})/(\gamma C_{out})$ typically ranges from three to four; for instance, for a 0.35-μm CMOS process and assuming the minimum-sized tri-state buffer, it is equal to 3.2.
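The minimization step can be sketched as follows. This is a reconstruction under stated assumptions rather than the brief's own equations: all groups are taken to have the same size $S$, so that $N = S^k$; the load factor of the delay model is written $\gamma$ (an assumed symbol); and the absence of a $C_{in}$ load on the last stage is neglected. Differentiating with respect to $S$ is equivalent to differentiating with respect to $k$, since $S = N^{1/k}$.

```latex
\begin{align*}
PD &\approx k\,\bigl[\tau_{int} + \gamma C_{in} + \gamma (S-1)\,C_{out}\bigr],
   \qquad k = \frac{\ln N}{\ln S},\\[4pt]
\frac{\partial PD}{\partial S} = 0
   \;&\Longrightarrow\;
   \gamma C_{out}\, S \ln S = \tau_{int} + \gamma C_{in} + \gamma (S-1)\,C_{out}\\[4pt]
   \;&\Longrightarrow\;
   S(\ln S - 1) + 1 = \frac{\tau_{int} + \gamma C_{in}}{\gamma C_{out}}.
\end{align*}
```

Under these assumptions, the optimum group size depends only on the ratio on the right-hand side, which is consistent with the role that this ratio plays in the text.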
From (6), it is possible to derive the optimum value $S_{opt}$ of parameter S for a given value of the ratio $(\tau_{int} + \gamma C_{in})/(\gamma C_{out})$, which depends on the technology adopted but is almost independent of the transistor sizing [3]. To solve the nonlinear equation (6), we evaluated its solution numerically and plotted it in Fig. 4. By inspection of Fig. 4, it is apparent that, for practical values of $(\tau_{int} + \gamma C_{in})/(\gamma C_{out})$, $S_{opt}$ is almost equal to four, which is a power of two as desired. This is in contrast with the results obtained using transmission-gate switches [1], in which the optimum values of $S_j$ may be far from the nearest power of two.
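A numerical solution of this kind can be sketched in a few lines of Python. The sketch assumes that the optimality condition has the form $S(\ln S - 1) + 1 = r$, where $r$ is the ratio of the unity-fan-out delay to the delay added by one OFF buffer; this form follows from a delay model with equal group sizes and is an assumption here, not a quotation of (6). The function names are hypothetical.

```python
import math

def optimum_group_size(ratio, lo=1.0 + 1e-9, hi=64.0, tol=1e-9):
    """Solve S*(ln S - 1) + 1 = ratio for S > 1 by bisection.

    The left-hand side increases monotonically for S > 1, so a single
    root exists inside the bracket [lo, hi] when the ratio is in range.
    """
    def f(s):
        return s * (math.log(s) - 1.0) + 1.0 - ratio

    if f(lo) > 0.0 or f(hi) < 0.0:
        raise ValueError("ratio outside the bracketed range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def nearest_power_of_two(s):
    """Round S to the nearest power of two on a logarithmic scale."""
    return 2 ** round(math.log2(s))

# Ratio quoted in the text for a minimum-sized buffer in a 0.35-um process.
s_real = optimum_group_size(3.2)
s_practical = nearest_power_of_two(s_real)
```

Under the assumed condition, a ratio of 3.2 gives a root between 4.4 and 4.5, and ratios from three to four all round to a group size of four.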
The optimum number of stages, $k_{opt}$, can now be obtained by substituting the optimum value $S_{opt} = 4$ into (4), which leads to $k_{opt} = \ln(N)/\ln(4) \approx 0.72 \ln(N)$ (7)
Substituting the optimum number of stages (7) and the optimum group size $S_{opt} = 4$ into (3), we obtain the minimum delay achievable for a given fan-in N: $PD_{opt} = [0.72(\tau_{int} + \gamma C_{in}) + 2.16 \gamma C_{out}] \ln(N)$ (8) where the term $\gamma C_{in}$ has been neglected with respect to the other terms, since they are multiplied by the large term $\ln(N)$.
From (8), the minimum delay achievable with a heterogeneous-tree multiplexer based on tri-state buffers, $PD_{opt}$, is proportional to $\ln(N)$ through a coefficient that depends on the delay $(\tau_{int} + \gamma C_{in})$ of a tri-state buffer with unity fan-out and on the term $\gamma C_{out}$. Both of the latter are roughly independent of the gate sizing [3], and depend only on the process used. Hence, relationship (8) can be used by the designer to predict whether a given speed constraint can be satisfied using an assigned process before actually carrying out the design. In Table I, the results of (7) and $\ln(N)$ are reported for typical values of fan-in. It is worth noting that, for odd powers of two, the number of stages is not an integer, and its fractional part is equal to 0.5. This means that the last stage should be made with half the optimum number of tri-state buffers, i.e., two instead of four. For instance, for N = 128 we obtain a four-stage optimum design, where the last stage includes only two tri-state buffers instead of four.
The number of tri-state buffers, $m_s$, required for the implementation of the multiplexer can be taken as a figure of merit to measure circuit complexity and area cost. For a fan-in equal to an even power of two, the number of tri-state buffers can be evaluated considering that in the first stage there are N tri-state buffers, in the second there are N/4, in the third N/16, and so on. Therefore, the total number of tri-state buffers is approximately given by $m_s \approx N(1 + 1/4 + 1/16 + \cdots) \approx 4N/3$ (9)
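As a rough illustration of the stage-count rule and of the buffer count, the following Python sketch lists the group size of each stage and counts the tri-state buffers for a power-of-two fan-in. The function names are hypothetical, and the sketch simply assumes groups of four with a final two-input group for odd powers of two, as described in the text.

```python
import math

def k_opt(n_inputs):
    """Real-valued optimum number of stages for groups of four, log4(N)."""
    return math.log(n_inputs) / math.log(4.0)

def stage_plan(n_inputs, group=4):
    """Group size of each stage for a fan-in that is a power of two.

    `group` is assumed to be a power of two as well (four by default).
    """
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("fan-in must be a power of two >= 2")
    sizes = []
    remaining = n_inputs
    while remaining >= group:
        sizes.append(group)
        remaining //= group
    if remaining > 1:
        sizes.append(remaining)  # smaller last group, e.g. 2 for odd powers of two
    return sizes

def buffer_count(n_inputs, sizes):
    """Each stage needs one buffer per signal entering it."""
    total, signals = 0, n_inputs
    for s in sizes:
        total += signals
        signals //= s
    return total

plan_256 = stage_plan(256)
plan_128 = stage_plan(128)
buffers_256 = buffer_count(256, plan_256)
```

For N = 256 the sketch gives four stages of four-input groups and 340 buffers, slightly below the 4N/3 approximation; for N = 128 it gives three stages of four-input groups followed by one two-input stage.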
IV. EFFECT OF INTERCONNECT
Due to the high number of tri-state buffers required for each stage, the interconnects between consecutive stages are long, and thus significantly affect performance. More specifically, the capacitance associated with the wiring lines must be taken into account when modeling and optimizing the multiplexer (for practical fan-in and, hence, wire length, modeling the interconnect with simple capacitances is adequate).
Evaluation of the wiring capacitances requires a floorplan to be fixed for the multiplexer. For example, the floorplan for N = 16, assuming $S_{opt} = 4$, is shown in Fig. 5, where tri-state buffers are again represented by boxes. Defining X as the height of the cell implementing a tri-state buffer, the tri-state buffers belonging to the first stage in Fig. 5 are grouped in an array with a height equal to $N \cdot X$. Moreover, the critical path that defines the delay starts from the first input, IN1, and reaches the output, OUT, by crossing points A-E, since the interconnect here is longer than in the other paths. In the critical path of Fig. 5, the capacitance $C_1$ associated with the wire at the output of the first stage is equal to the wire length normalized to X multiplied by the capacitance C of a wire with length X (i.e., X is used as the unit length). More specifically, from Fig. 5, after neglecting the horizontal wire sections (which are quite short and do not increase with fan-in), capacitance $C_1$ is associated with a line having a length (in X units) equal to AB + CD = 3 + 3 = 6; hence, $C_1 = 6C$. Analogously, the wire capacitance at the output of the second stage is equal to $C_2 = 3C$.
This procedure can be simply applied to cases with higher fan-in, leading to the values of capacitances $C_j$ reported in Table II. By inspection of Table II, we can easily generalize the results. In particular, expressions (10) and (11) for $C_j$ are obtained for N equal to an even and an odd power of two, respectively, and in both cases the wire capacitances lying in the critical path can be summed. The result shows that the multiplexer delay is equal to the sum of delay (3), evaluated without interconnect parasitics, and the wire capacitance contribution $\gamma N C/2$. The latter, for an assigned tri-state buffer and technology (which define $\gamma$ and C, respectively), does not depend on the number of stages, k, but only on the fan-in, N. As a consequence, the optimization performed in the previous section with respect to k is not affected by wiring parasitics. We therefore still have $S_{opt} = 4$, with the optimum number of stages being given by (7), leading to an optimum delay equal to $PD_{opt} = [0.72(\tau_{int} + \gamma C_{in}) + 2.16 \gamma C_{out}] \ln(N) + \gamma N C/2$ (14)
As an example, we designed a multiplexer with a fan-in of 256, which could represent the typical case of a memory bank of about 64 Kb. Hence, the optimum number of stages from (7) is four, and the numbers of four-input groups in the successive stages are 64, 16, 4, and 1, respectively.
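The closed-form estimate with wiring lends itself to a quick feasibility check before any design work. The Python sketch below assumes the form described in the text, namely the wire-free optimum delay for groups of four plus a wire term equal to the load factor times N·C/2. The function name, the symbol `gamma` for the load factor, the delay target, and all buffer parameters are hypothetical placeholders; only C = 0.54 fF is taken from the design example, and the buffer values are not those of Table IV.

```python
import math

def optimum_delay(n_inputs, tau_int, gamma, c_in, c_out, c_wire=0.0):
    """Delay estimate for groups of four: wire-free term plus gamma*N*C/2.

    Any consistent units work as long as gamma * capacitance is a time
    (here: ps, ps/fF, and fF).
    """
    a = 1.0 / math.log(4.0)                 # about 0.72
    unity_fanout = tau_int + gamma * c_in   # buffer delay with unity fan-out
    no_wire = (a * unity_fanout + 3.0 * a * gamma * c_out) * math.log(n_inputs)
    wire = gamma * n_inputs * c_wire / 2.0  # does not depend on the stage count
    return no_wire + wire

# Hypothetical buffer parameters (ps, ps/fF, fF, fF).
params = dict(tau_int=60.0, gamma=20.0, c_in=3.0, c_out=2.0)
target_ps = 4000.0                          # hypothetical delay constraint

estimate_ps = optimum_delay(256, c_wire=0.54, **params)
meets_target = estimate_ps <= target_ps
```

Because the wire term is independent of the number of stages, it shifts the estimate without moving the optimum, which is why the same group size of four is kept.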
The circuit was simulated using a 0.35-μm CMOS process, whose main parameters are reported in Table III, assuming minimum-sized tri-state buffers, whose layout, shown in Fig. 6, has a height of X = 6 μm. From simulations that account for layout parasitics and assuming a power supply voltage of 3.3 V, the parameters describing the tri-state buffer timing behavior have the numerical values reported in Table IV. Moreover, when interconnecting the tri-state buffers in the Metal 2 layer, the resulting wire capacitance for a length X is equal to C = 0.54 fF.
The delay predicted by relationship (14) is equal to 3.92 ns, and agrees well with the simulated value of 3.99 ns. Hence, the delay dependence on the input rise time can be neglected since each gate has a relatively high fan out. The number of tri-state buffers is 340.
The effect of wire capacitances on multiplexer delay is significant. Indeed, the delay predicted without accounting for wire capacitances is 2.3 ns, which underestimates the actual delay by about 40%. It is worth noting that, even though wire parasitics significantly affect the overall delay, a further optimization of the aspect ratio of the tri-state buffer transistors does not lead to an actual speed improvement. For the process used, assuming minimum-sized transistors for the first stage and progressively increasing the transistor aspect ratio in the successive stages (to compensate for the increase in their load capacitance due to the increasing wire capacitance) leads to a speed increase that is always less than 15%, and the power-delay product can be decreased by only a few percentage points. This is because, at each node between two consecutive tri-state buffers, the wire capacitance does not dominate over the transistor parasitics. Hence, enlarging the transistors of a stage to decrease its delay increases its loading effect on the previous stage, increasing the delay of the latter. This trend can be confirmed even when process scaling is applied, since transistor capacitances and wire capacitances for local interconnect scale approximately in the same way as the minimum channel length [8]-[10].
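The weight of the wire term can be checked with simple arithmetic on the figures quoted in the text. The Python sketch below uses only the reported values (3.92 ns with wires, 2.3 ns without, N = 256, C = 0.54 fF). The implied load factor is an inference that assumes the wire contribution equals the load factor times N·C/2; it is not a value reported in the brief, and the variable names are hypothetical.

```python
pd_with_wires_ns = 3.92   # predicted delay including wire capacitance (from the text)
pd_no_wires_ns = 2.3      # predicted delay neglecting wire capacitance (from the text)
n_inputs = 256            # fan-in of the design example
c_wire_fF = 0.54          # wire capacitance per cell height X (from the text)

# Delay attributed to wiring, and the fraction by which it is underestimated
# when wiring is ignored.
wire_term_ns = pd_with_wires_ns - pd_no_wires_ns
underestimate = wire_term_ns / pd_with_wires_ns

# Load factor implied by the assumed wire term gamma * N * C / 2 (inferred).
gamma_ns_per_fF = wire_term_ns / (n_inputs * c_wire_fF / 2.0)
```

The wire term comes out at about 1.6 ns, or roughly 41% of the predicted delay, in line with the figure of about 40% quoted above.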
To evaluate the optimized heterogeneous-tree circuit considered, it is useful to compare its performance to that of the traditional uniform and binary-tree multiplexers, whose floorplans are presented in Figs. 6 and 7, respectively. Both approaches can be thought of as special cases of heterogeneous circuits, obtained by setting k = 1 (i.e., all the switches belong to the same group) for the uniform case and $k = \log_2(N)$ for the binary-tree case [1]. Their speed performance is worse than that of the optimized heterogeneous multiplexer, since the number of stages k differs from the optimum value (7) that minimizes the overall delay. As an example, for the design case considered before with a fan-in of 256, the delays for the uniform and binary-tree cases are equal to 21.2 and 6.4 ns, respectively (i.e., 425% and 60% slower than the optimized heterogeneous multiplexer).
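The view of the uniform and binary-tree structures as special cases can be made concrete with a small Python sketch that treats all three architectures as trees with a single group size per stage. The function name is hypothetical, and the count assumes one switch per signal entering each stage.

```python
def tree_summary(n_inputs, group):
    """Return (stages, switches) for a tree whose stages all use `group`-input groups."""
    stages, switches, signals = 0, 0, n_inputs
    while signals > 1:
        if signals % group:
            raise ValueError("fan-in must be a power of the group size")
        switches += signals   # one switch per signal entering the stage
        signals //= group     # each group delivers one signal to the next stage
        stages += 1
    return stages, switches

N = 256
uniform = tree_summary(N, N)   # single stage, all switches in one group
binary = tree_summary(N, 2)    # log2(N) stages of two-input groups
hetero = tree_summary(N, 4)    # groups of four
```

For N = 256 this gives one stage for the uniform case, eight for the binary tree, and four for groups of four, so the latter sits between the two extremes in the number of stages.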
As far as area and circuit complexity are concerned, as widely discussed in [1], the heterogeneous tree is made up of a number of fundamental blocks (i.e., tri-state buffers) close to that of uniform multiplexers, and significantly lower than that of binary-tree circuits. In particular, in the previous design example the uniform and binary-tree approaches lead to 256 and 510 fundamental blocks
VI. CONCLUSIONS
In this brief, a strategy to optimally design high-fan-in heterogeneous-tree multiplexers has been proposed. This class of multiplexers, which provides a significant speed improvement over traditional approaches, is made up of cascaded stages composed of switches combined into groups. After analytical evaluation of the delay, simple and general criteria were derived to choose the optimum number of stages and to properly group the switches in each stage to minimize delay.
This work extends the results in [1] to multiplexers implemented with tri-state buffers, which are more suitable for present implementations thanks to their better noise immunity and suitability for voltage scaling. The analysis shows that tri-state buffers must be combined into four-element groups in each stage, regardless of fan-in, gate sizing, and process used.
Unlike the paper in [1], this work also explicitly accounts for wiring parasitics in the timing analysis and therefore in the design equations. Moreover, a closed-form expression of the delay achievable with a given fan-in and process is provided to check whether a certain speed constraint can be met before carrying out the design.
The proposed procedure was validated and applied to a design example of a 256-input multiplexer implemented with a 0.35-μm CMOS process. The circuit was simulated taking into account parasitics extracted from layout, and the predicted delay is in good agreement with the simulated data.
Image Parameter Modeling of Analog Traveling-Wave Phase Shifters
Giancarlo Bartolucci
Abstract-The aim of this brief is to present a modeling procedure for the traveling-wave analog phase shifter. The proposed approach is based on the image representation of two-port networks. Design considerations to optimize the electric performance of the component while also reducing the size of the structure are discussed. A commercial software package is used for the final electric simulation of the actual phase shifters to check the validity of the analytic model presented.
Index Terms-Image parameters, microwave, phase shifter.
I. INTRODUCTION
Traveling-wave components are widely used at microwave and millimeter-wave frequencies [1]-[3]. These structures are composed of transmission-line sections periodically loaded by semiconductor devices. Among the many possible applications, one of the most significant is the realization of analog phase shifters [3], [4]. In this case, Schottky diodes are used as the semiconductor elements. Under reverse bias, each diode can be replaced with its depletion capacitance, thus obtaining the structure in Fig. 1(a). By changing the bias voltage $V_b$, it is possible to change the capacitor values, and therefore also the phase of the output signal. The elementary cell of the traveling-wave phase shifter is shown in Fig. 1(b). Because the component is realized by the cascade connection of identical symmetric cells, the image parameter representation for two-port networks will be used to obtain a simple and rigorous analytic model for the whole structure. It is worth noting that the approach will be applied directly to the semi-distributed elementary cell of Fig. 1(b), without any lumped-element approximation for the transmission-line sections. Design considerations and implications arising from the analytic results will be presented and discussed, to suggest the best choice for the electric parameters characterizing each basic cell. In comparison with purely numerical techniques, the proposed method not only allows the design problem to be solved in analytic form, but also gives better insight into the physical behavior of the structure.
II. THE ANALYTIC MODEL
Assuming a central operating frequency $f_0$, each basic cell, and therefore the whole phase shifter, is completely characterized by three electric parameters: the electric length $\theta_0$, the capacitance $C(V_b)$, and the characteristic impedance $Z_c$. In the following, $\theta_0$ and $C(V_b)$ will be replaced by two new variables, $P_0$ and $B_0$: $P_0 = Z_c \tan\theta_0$, $B_0 = \omega_0 C(V_b)$,
where $\omega_0 = 2\pi f_0$.
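The change of variables can be illustrated with a short Python sketch. The function names and all numerical values are hypothetical and chosen only for illustration; the electric length is taken in radians, so $P_0$ comes out in ohms and $B_0$ in siemens.

```python
import math

def cell_variables(f0_hz, theta0_rad, c_farad, zc_ohm):
    """Return (P0, B0) for one basic cell at the center frequency f0."""
    omega0 = 2.0 * math.pi * f0_hz        # angular frequency
    p0 = zc_ohm * math.tan(theta0_rad)    # P0 = Zc * tan(theta0), in ohms
    b0 = omega0 * c_farad                 # B0 = omega0 * C(Vb), in siemens
    return p0, b0

def electric_length(p0, zc_ohm):
    """Invert P0 = Zc * tan(theta0) on the branch 0 < theta0 < pi/2 (P0 > 0)."""
    if p0 <= 0.0 or zc_ohm <= 0.0:
        raise ValueError("this branch requires P0 > 0 and Zc > 0")
    return math.atan(p0 / zc_ohm)

# Hypothetical cell: 10 GHz, 30 degrees, 0.1 pF, 100 ohm.
p0, b0 = cell_variables(10e9, math.radians(30.0), 0.1e-12, 100.0)
theta_back = electric_length(p0, 100.0)
```

An electric length above π/2 makes the tangent, and hence $P_0$, negative, which is why the sign of $P_0$ can be used to label the two ranges of electric length.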
The elementary cell, shown in Fig. 1(b), is a lossless, reciprocal, and symmetrical network; therefore, only two image parameters are needed for its narrow-band representation: the image impedance $Z_{ic0}$ and the image phase $\phi_{c0}$ [5]. These quantities are easily computed as 
Because of the particular topology of the component, the image impedance of the phase shifter ($Z_{if0}$) is the same as that of each basic cell ($Z_{ic0}$). Therefore, the input matching condition can be directly imposed on $Z_{ic0}$
$Z_{ic0}(Z_c, P_0, B_{0m}) = Z_0$
where $Z_0 = 50\ \Omega$. For fixed $Z_c$ and $P_0$ values, the matching susceptance $B_{0m}$ is the unknown. After some algebra, (1) and (
From the condition of physical realizability for $B_{0m}$: $B_{0m} > 0$. (5)
Two different solutions must be considered for equation (4). These are as follows.
• $\theta_0 > \pi/2$ (that is, $P_0 < 0$). In this case, to fulfill condition (5), we have $Z_c < Z_0$.
• $\theta_0 < \pi/2$ (that is, $P_0 > 0$). This is by far the most common case, and, from (5), we obtain $Z_c > Z_0$.
In the following, only the solution with $\theta_0 < \pi/2$, which allows for a rather short component, will be considered. It is worth noting that, for a given $P_0$, a remarkable size reduction for the phase shifter can be achieved by choosing a very high characteristic impedance $Z_c$, thus obtaining a low value for the electric length $\theta_0$. For this reason, it will be assumed that $Z_c = 100\ \Omega$, which is very close to the highest value manufacturable in microstrip or coplanar technology. Since the phase shifter is composed of cascade-connected identical basic cells, the image phase for the whole structure, $\phi_{f0}$, is $N \phi_{c0}$, N being the total number of cascaded cells. A variation of the susceptance $B_0$ from $B_{0m}$ allows the phase of the output signal to be changed, thus producing the required phase-shifting effect, but it also causes a mismatch. In order to evaluate the new phase value, and also the insertion losses related to the mismatch, the following procedure is proposed.
