Abstract-Recent advances in three-dimensional integrated circuits (3D-ICs) offer a new dimension of design exploration beyond the traditional physical architecture of datapath components. The emerging monolithic inter-tier vias (MIVs) provide advantages over through-silicon vias (TSVs) in terms of higher integration density and lower design overhead. In this work, we develop a performance-driven framework that uses simulated annealing to produce gate-level 3D placement layouts for rotation shifter and right arithmetic shifter designs. Compared to the optimal 2D layout, the critical path of our solution is much shorter, with limited overhead on total wirelength. Our work indicates that, with gate-level 3D-IC integration, the new physical dimension can be leveraged to improve both the performance and the power of shifter designs.
I. INTRODUCTION
SHIFTERS are indispensable datapath components in MPUs and ASICs, e.g., in floating-point units and encryption units, owing to their efficient bitstream operation and rotation. In contrast to the many physical optimization methods proposed for adders, multipliers, and dividers, relatively little prior research focuses on layout optimization of shifters, mainly because of their simple logic functionality and regular structure. Nevertheless, as a basic datapath component in digital logic design, shifters have a broad spectrum of applications and can affect system performance on a larger scale. Moreover, the wiring inside each shifter module is quite dense. To resolve this bottleneck on overall design performance, improving the timing and power behavior of shifters becomes an important subject.
During the past decade, the development of 3D-ICs has offered practical solutions [1]-[3] to many existing problems in current IC designs. Inserting vertical connections, such as the through-silicon vias (TSVs) proposed in the early years, greatly shortens the distance between connected modules, thereby reducing total wirelength and dynamic power and improving routability and timing behavior. However, the large dimensions of TSVs also induce area overhead [4]. Recently, advances in monolithic inter-tier vias (MIVs, Fig. 1) have reduced the physical dimensions of the vertical connections to those of ordinary metal vias. This high-density integration makes monolithic 3D-ICs a promising solution [5], [6] for interconnect-limited 2D-ICs, where most problems are essentially caused by high interconnect density at the gate level.
Previous research mainly focuses on optimizing either the logic architecture or the physical layout of shifters to improve performance and power. Hillebrand et al. [7] proposed an approach to halve the wirelength of a barrel shifter by permuting cell positions. Zhu et al. [8] discussed opportunities for architecture enhancement through logic synthesis and integer linear programming (ILP) based synthesis, where both the wire load on the critical path and the switching probability at inter-stage wires are reduced to improve timing and power. All the above approaches are based on the traditional 2D physical structure, where the design space is narrow, constrained mostly by dense wires and a long critical path.
In this paper, we propose to optimize the timing and power behavior of shifter layouts under a monolithic 3D-IC physical structure. Similar to [9], our 3D stochastic approach is based on a simulated-annealing framework, which efficiently generates placement solutions of quality comparable to that of the optimal solution. We develop a stochastic software tool for placing the modules of rotators and right arithmetic shifters. We validate our approach with experiments that place shifter circuits of different scales. Moreover, we provide timing and power predictions based on the propagation path length of each shifter testcase as well as a well-defined RC model for monolithic 3D-ICs at the 65nm technology node. We also validate our optimized shifter designs in SPICE using the 45nm NANGATE open source library.
Our work shows the feasibility and benefits of improving the physical layouts of shifter architectures using monolithic 3D-ICs, and it further indicates potential improvements for other, larger datapath components through stochastic optimization. The rest of the paper is organized as follows. In Section II, we introduce the background knowledge and related works in the literature regarding optimization of shifter layouts in 2D and 3D. In Section III, we present our approach to stochastic placement of rotation and arithmetic shifters. Timing and power analysis of the produced layouts is discussed in Section IV. We validate the algorithm in Section V with promising experimental results. Finally, we conclude and discuss future work in Section VI.
II. PROBLEM STATEMENT

A. Rotator and Arithmetic Shifter
We focus on rotation shifter (rotator) and right arithmetic shifter to demonstrate our method. 
In the arithmetic shifter (Fig. 4, which contains some dummy cells for regularity) with a linear order design, there are no long wrap-around wires of the kind shown in Fig. 2. The functionality here is right arithmetic shifting with extension of the most significant bit (MSB).
B. 2D ICs, Monolithic 3D ICs and Shifter Folding
In 2D ICs, the wires of a shifter must cross many cell widths to reach the MUX cells at the corresponding position in the following logic level. With the help of MIVs in 3D ICs, the wires can instead use vertical "short cuts" to reach the target cell. Fig. 5 shows how the shifter is folded into 3D: the bits in each level are cut into equal groups of N/L, where L is the number of layers, and the groups are assigned to different layers. This bit-slice folding strategy is similar to that of Figure 3(c) in [10].
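The folding of one logic level can be sketched in a few lines of Python. The function name `fold_position` and the plain block ordering (bits 0 to N/L − 1 on layer 0, the next N/L bits on layer 1, and so on) are assumptions made for illustration; the ordering used in Fig. 5 may differ, for example by reversing alternate layers.

```python
def fold_position(bit, n_bits, n_layers):
    """Map a bit index of a linear-order row to (x, z) after folding.

    Assumption: simple block folding, N/L consecutive bits per layer.
    """
    if n_bits % n_layers != 0:
        raise ValueError("n_bits must be divisible by n_layers")
    per_layer = n_bits // n_layers   # N / L cells in each layer
    z = bit // per_layer             # layer index of this bit
    x = bit % per_layer              # column index inside the layer
    return x, z


# Example: a 32-bit level folded onto 2 layers puts bit 16 at column 0, layer 1.
print(fold_position(16, 32, 2))
```

Under this assumed mapping, two bits that are N/L columns apart in 2D end up vertically aligned, so the wire between them can be replaced by MIV crossings.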
C. Optimization via Cell Permutations and ILP
Zhu et al. [8] showed that cell permutations can reduce the long wire length in a rotator. The method for 2D IC cell permutation is detailed in [8]; it permutes the MUX cells within the same level, for levels y = 0 → n − 2. We extend the ILP from [8] to the 3D scenario and use the Gurobi package [11] to solve the ILP problem.
1) Extension of ILP [8] to M3D ICs: First, we model MIVs as vertical interconnects. $W_{MUX}$ is the unit width of a MUX cell. $\alpha W_{MUX}$ is the wire length of an MIV connecting two adjacent layers, where $\alpha$ is the ratio of the MIV length to the width of a MUX cell.
The formulation is described as follows. $s^{l}_{i,j,k} \in \{0, 1\}$ equals 1 if and only if logic cell $i$ is placed at physical position $x = j$, $z = k$ at logic level $y = l$. Each physical cell is occupied by one logic cell, so the following constraints should be met,
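One common way to write such one-to-one assignment constraints is sketched below. This is an assumed form for illustration; it presumes that every logic level offers exactly as many physical positions $(x = j, z = k)$ as it has logic cells.

```latex
\sum_{i} s^{l}_{i,j,k} = 1 \quad \forall\, j, k, l,
\qquad
\sum_{j} \sum_{k} s^{l}_{i,j,k} = 1 \quad \forall\, i, l
```

The first sum states that each physical position holds one logic cell; the second states that each logic cell is placed at one position.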
The span length of the path from $(i_1, l)$ to $(i_2, l + 1)$, where $l$ is the logic level index in the shifter, is calculated as follows,
where cell $i_2$ at logic level $l + 1$ is one fan-out node of cell $i_1$ at logic level $l$. $d_x$ is the number of MUX cells a wire crosses in the x-direction. $d_z$ is the number of MUX cells (scaled by $\alpha$) a wire crosses in the z-direction. Then, $d$ is the number of MUX cells a wire crosses in the 3D layout from $(i_1, l)$ to $(i_2, l + 1)$.
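A minimal Python sketch of this span computation follows. It assumes that $d$ is the plain sum of $d_x$ and $d_z$, and that $d_z$ equals $\alpha$ times the number of layers crossed; the function name `span_length` and the `(x, z)` tuple layout are hypothetical.

```python
def span_length(pos1, pos2, alpha):
    """Span, in MUX cell widths, of a wire between two placed cells.

    pos1, pos2: (x, z) positions of the driver and the fan-out cell.
    alpha: assumed ratio of one MIV length to the MUX cell width.
    """
    x1, z1 = pos1
    x2, z2 = pos2
    dx = abs(x1 - x2)             # MUX cells crossed in the x-direction
    dz = alpha * abs(z1 - z2)     # layers crossed, scaled by alpha
    return dx + dz                # assumed form: d = dx + dz


# A wire that crosses three columns and one layer with alpha = 0.05.
print(span_length((0, 0), (3, 1), 0.05))
```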
Along a path from input bit $a$ at level 0 to output bit $b$ at level $n − 1$, we sum up the span lengths $d$ of its edges and obtain $T^{path}_{a \to b}$. $T^{path}$ is the set that contains $T^{path}_{a \to b}$ for all combinations $(a, b)$ that form paths in a shifter. The goal of the ILP is to minimize the maximum value in $T^{path}$, i.e., $\min \max_{(a,b)} T^{path}_{a \to b}$.
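For a given placement, the largest $T^{path}_{a \to b}$ can be evaluated with a level-by-level sweep rather than by enumerating all paths. The sketch below is illustrative only: it assumes a logarithmic rotator in which cell $i$ at level $l$ drives cells $i$ and $(i + 2^l) \bmod N$ at level $l + 1$, and it assumes $d = d_x + \alpha \cdot (\text{layers crossed})$. The name `longest_path_span` and the data layout are hypothetical.

```python
def longest_path_span(placements, alpha):
    """Return the maximum accumulated span over all input-to-output paths.

    placements[l][i] is the (x, z) position of logic cell i at level l.
    Assumed connectivity: cell i at level l drives cells i and
    (i + 2**l) % N at level l + 1 (logarithmic rotator).
    """
    n_cells = len(placements[0])
    best = [0.0] * n_cells                    # longest span ending at each cell
    for l in range(len(placements) - 1):
        nxt = [0.0] * n_cells
        for i in range(n_cells):
            x1, z1 = placements[l][i]
            for j in (i, (i + 2 ** l) % n_cells):
                x2, z2 = placements[l + 1][j]
                d = abs(x1 - x2) + alpha * abs(z1 - z2)
                nxt[j] = max(nxt[j], best[i] + d)
        best = nxt
    return max(best)


# 4-bit rotator, three levels, linear order in 2D versus folded onto 2 layers.
flat = [[(i, 0) for i in range(4)] for _ in range(3)]
folded = [[(i % 2, i // 2) for i in range(4)] for _ in range(3)]
print(longest_path_span(flat, 0.05), longest_path_span(folded, 0.05))
```

In this toy case the wrap-around wire dominates the 2D result, while the folded placement replaces most horizontal spans with short vertical ones.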
2) Timing Driven Placement and Simulated Annealing Methods: Compared to ILP, the simulated annealing method is more scalable. To deal with large placement problems, Sechen et al. introduced Timberwolf [9], which is based on this stochastic method. In addition, several timing-driven, simulated annealing-based placement methods have been investigated for the purpose of timing improvement, e.g., [12], [13]. In Section III, we provide details of our scalable simulated annealing-based method.
III. OUR APPROACH
In this section, we propose an approach for efficient shifter placement. Our objective is to reduce a weighted mix of timing and total wire length, with a weighting factor γ. Marquardt et al. [12] and Eguro et al. [14] use an auto-normalizing strategy to calculate the cost of a move during simulated annealing iterations. Before incorporating this strategy into our shifter placement, we first introduce our timing cost function and total wire length cost function.
A. Timing Cost Function
The shifter has a simple structure, so timing analysis is straightforward. We run static timing analysis (STA) during the SA iterations and obtain the slack of each edge $e$. The weights of the edges that belong to one net $n_i$ are then summed to form the weight of $n_i$. Because a timing-driven design emphasizes the critical paths and nets, the exponential parameter θ is set to a number larger than 1. Then, the weight of edge $e$ is
where slack(e) in the shifter is updated via STA; the procedure is similar to that in [12]. The weight of $n_i$ is therefore the sum of the weights of its edges,
where $e$ is in the set of edges of $n_i$. For the whole net set of the shifter, the total timing cost function is obtained by summing the weights of all nets,
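The three-step construction (edge weight, net weight, total timing cost) can be sketched as follows. The edge-weight form $(1 - \mathrm{slack}(e)/D)^{\theta}$ is an assumption borrowed from the criticality definition commonly used in timing-driven placers such as [12]; the function names and the dictionary layout are hypothetical.

```python
def edge_weight(slack, d_max, theta):
    """Assumed edge weight: criticality (1 - slack / d_max) raised to theta."""
    crit = 1.0 - slack / d_max
    crit = min(max(crit, 0.0), 1.0)   # clamp criticality to [0, 1]
    return crit ** theta


def net_weight(edge_slacks, d_max, theta):
    """Weight of one net: sum of the weights of its edges."""
    return sum(edge_weight(s, d_max, theta) for s in edge_slacks)


def timing_cost(nets, d_max, theta):
    """Total timing cost: sum of the weights of all nets.

    nets maps a net name to the list of slacks of its edges.
    """
    return sum(net_weight(slacks, d_max, theta) for slacks in nets.values())


# Two nets; the zero-slack edge dominates the cost when theta > 1.
print(timing_cost({"n1": [0.0, 5.0], "n2": [10.0]}, d_max=10.0, theta=2))
```

With θ larger than 1, edges with small slack keep a weight close to 1 while edges with large slack shrink quickly, which is how the cost emphasizes critical nets.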
B. Total Wire Length Cost Function
Suppose $L(n_i)$ is the length of net $n_i$. The cost function of total wire length is the summation of $L(n_i)$ over all nets.
C. Auto-Normalizing Cost Function
where $W^{*}_{prev}$ denotes the value obtained from the previous iteration, which is used to normalize the timing and total wire length weights.
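A sketch of an auto-normalized move cost in the style of [12] is given below. The exact expression is an assumption: each change in cost is divided by the corresponding value from the previous iteration, and the two terms are blended by γ. Which term receives γ and which receives 1 − γ is also assumed, and all names are hypothetical.

```python
def delta_cost(wt_new, wt_old, wl_new, wl_old, wt_prev, wl_prev, gamma):
    """Normalized cost change of one move.

    wt_*: timing cost after the move, before the move, and from the
          previous iteration (used only for normalization).
    wl_*: the same three values for total wire length.
    gamma: weighting factor between timing and wire length (assumed to
           multiply the timing term).
    """
    d_timing = (wt_new - wt_old) / wt_prev
    d_wire = (wl_new - wl_old) / wl_prev
    return gamma * d_timing + (1.0 - gamma) * d_wire


# Timing improves by 10% while wire length worsens by 5%, with gamma = 0.6.
print(delta_cost(90.0, 100.0, 210.0, 200.0, 100.0, 200.0, 0.6))
```

Normalizing by the previous values makes both terms dimensionless, so γ keeps the same meaning as the absolute costs shrink during annealing.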
D. Stochastic Optimization Algorithms
We integrate the simulated annealing (SA) framework to form our 3D shifter placer. The pseudo code of our algorithm is given in Algorithm 1. Using Eq. (9), we can determine the cost of a move. In SA, we greedily accept every improving move (Δcost ≤ 0). A "bad" move (Δcost > 0) is accepted with a probability that depends on the current temperature T (line 16 of Algorithm 1). In addition, the differences between the weights after and before one inner loop serve as a metric for determining the frozen status of the solution. Tstart is the initial temperature; it is set to a high value and then reduced slowly to mimic the annealing process. T is updated in the outer while loop. The algorithm terminates when OuterLoopExit == TRUE, i.e., when the solution approaches the "frozen" state and no further improvement is obtained from the current status (lines 27 to 34). During the inner while loop, we permute the cells as many times as possible; its complexity grows with the number of cells, O(n log(n)). When "trials" or "changes" becomes too large, the moves at the current temperature are terminated, which means the temperature is either too cold or too hot for the solution (lines 20-22). Here, trials is the number of swap attempts, and changes is incremented after a tentative swap is accepted. If the ratio changes/trials is too small (e.g., 1%), or the weights before and after the inner loop differ only slightly (by less than a threshold t), we regard the solution as approaching the frozen status and increment the frozen counter (line 27). When the frozen counter exceeds Threshold, we terminate the algorithm.
Algorithm 1 Stochastic Optimization of Shifter Placement
while OuterLoopExit ≠ TRUE do
while InnerLoopExit ≠ TRUE do
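The loop structure described in this section can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' Algorithm 1: a move swaps two cells of one level, a worse move is accepted with probability exp(−Δcost/T), the frozen counter resets when progress resumes, and a lower temperature bound `t_min` is added to guarantee termination. All parameter names and default values other than Tstart = 400 and β = 0.999 are hypothetical.

```python
import math
import random


def anneal(order, cost, t_start=400.0, beta=0.999, max_trials=50,
           max_changes=20, min_ratio=0.01, eps=1e-9, frozen_limit=5,
           t_min=1e-6, seed=0):
    """Swap-based simulated annealing; returns (best order, best cost)."""
    rng = random.Random(seed)
    cur = list(order)
    cur_cost = cost(cur)
    best, best_cost = list(cur), cur_cost
    temp, frozen = t_start, 0
    while frozen < frozen_limit and temp > t_min:      # outer loop
        start_cost, trials, changes = cur_cost, 0, 0
        while trials < max_trials and changes < max_changes:   # inner loop
            i, j = rng.sample(range(len(cur)), 2)
            cur[i], cur[j] = cur[j], cur[i]            # tentative swap
            new_cost = cost(cur)
            delta = new_cost - cur_cost
            trials += 1
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cur_cost = new_cost                    # accept the move
                changes += 1
                if cur_cost < best_cost:
                    best, best_cost = list(cur), cur_cost
            else:
                cur[i], cur[j] = cur[j], cur[i]        # reject: undo the swap
        # Frozen test: few accepted moves, or almost no change in cost.
        if changes / trials < min_ratio or abs(cur_cost - start_cost) < eps:
            frozen += 1
        else:
            frozen = 0
        temp *= beta                                   # cool down
    return best, best_cost


def adjacency_cost(p):
    """Toy cost: total distance between consecutive entries."""
    return sum(abs(a - b) for a, b in zip(p, p[1:]))


print(anneal([3, 0, 4, 1, 5, 2], adjacency_cost, t_start=10.0, beta=0.9))
```

In a placer, `cost` would be replaced by the normalized timing and wire length cost, and the swap would be restricted to MUX cells of the same logic level.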
IV. DELAY AND POWER ANALYSIS
To enable high-level evaluation, we create our 3D IC model based on [8]. MIVs in monolithic 3D ICs are treated as special vertical interconnects.
A. Layout Model
We use the logical effort method [15] for delay/power estimation of the MUX-based shifter. Logical effort measures the delay of a single gate as $d = gh + p$, where $g$ is the logical effort of the gate, i.e., the ratio of its input gate capacitance to the input capacitance of an inverter with the same unit effective resistance; $h$ is the electrical effort of the gate, i.e., the ratio of load capacitance to input capacitance; and $p$ is the intrinsic (parasitic) delay of the gate. For a static CMOS NAND gate, $g = 4/3$ and $p = 2$. The electrical effort of a NAND gate is $h = h_{fanout} + h_w l_w$, where $h_{fanout}$ is the number of load gates, $l_w$ is the length of the driven net normalized to the width of a MUX cell, and $h_w$ is the electrical effort contributed by the wire per MUX cell width spanned. We assume that the MUX cell width $W_{mux}$ is 80λ and that all NAND gates in the shifter are uniformly 2X the minimum size. In addition, based on the wire model of [16], $C_w = (C_a W_w + C_f + C_x) L_w$, where $W_w$ and $L_w$ are the width and length of the wire, respectively. Following this model, we assume a wire width of 4λ, where λ is defined as half of the technology feature size, which gives
$h_w = \frac{(4\lambda C_a + C_f + C_x) \times 80\lambda}{\frac{4}{3} C_g \times 2}$.
We treat an MIV like a normal interconnect of length $\alpha \times W_{mux}$. Therefore, an MIV of length 80αλ contributes α to the normalized net length $l_w$ in $h = h_{fanout} + h_w l_w$.
2) Dynamic Power: The dynamic power is calculated by summing all the switched capacitance within the shifter, which contains both wire capacitance and gate input capacitance. The resistive RC delay of wires is negligible [16]. The input gate capacitance of a NAND gate is $8C_g/3$, which is also taken as the capacitance of a wire spanning one column.
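The delay model can be exercised with a short Python sketch. It uses $g = 4/3$ and $p = 2$ from the text and assumes that each MIV crossing adds α to the normalized wire length; the value passed for `h_w` in the example is a placeholder, since it depends on the process capacitances $C_a$, $C_f$, $C_x$ and $C_g$. Function names and the stage tuple layout are hypothetical.

```python
G_NAND = 4.0 / 3.0   # logical effort of a static CMOS NAND gate
P_NAND = 2.0         # parasitic delay of the NAND gate


def gate_delay(fanout, wire_len, h_w, n_miv=0, alpha=0.05):
    """Delay d = g*h + p of one NAND stage, in units of tau.

    fanout: number of load gates (h_fanout).
    wire_len: driven net length in MUX cell widths (l_w).
    h_w: electrical effort per MUX cell width of wire (placeholder value).
    n_miv: MIV crossings on the net, each assumed to add alpha to l_w.
    """
    h = fanout + h_w * (wire_len + alpha * n_miv)
    return G_NAND * h + P_NAND


def path_delay(stages, h_w, alpha=0.05):
    """Sum of stage delays; stages is a list of (fanout, wire_len, n_miv)."""
    return sum(gate_delay(f, l, h_w, m, alpha) for f, l, m in stages)


# Two stages: a short local net, then a net spanning 3 columns and 1 MIV.
print(path_delay([(2, 0.0, 0), (1, 3.0, 1)], h_w=0.5))
```

Because α is small, replacing a long horizontal span with an MIV crossing reduces $h$, and hence the stage delay, almost in proportion to the wire length removed.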
V. EXPERIMENTS AND RESULTS
In this section, two comparisons are made. The first compares permutation-based optimization of 2D/3D rotator designs solved by simulated annealing (SA) and by integer linear programming (ILP). The second compares 2D/3D designs of the rotation and right arithmetic shifters against the corresponding 2D/3D-folded linear ordering (LO) shifter designs. The bit widths of the shifters are 32, 64 and 128 bits. The metrics are as follows.
"LPS (μm/80λ)" is the number of MUX cells that the longest path spans along the x- and z-directions. Based on the estimation model of Sec. IV, "Est. Delay (ps/τ)" is the maximum estimated path delay $D_{path}$. "Est. Power (fF)" is the dynamic power, measured by total switched capacitance. In the following analysis, we refer to "Est. Delay" and "Est. Power" as delay and power for short.
The solutions are validated by SPICE using the 45nm NANGATE open source library. "WD (ns)" is the critical path delay; "DP (ns × mW)" is the delay-power product, which demonstrates the energy efficiency after optimization using SA and M3D.
For the parameters of SA, D is half of the LPS of the initial shifter placement, θ = 5, and γ = 0.6. The temperature decrease rate is β = 0.999, and the initial temperature is Tstart = 400. R = 1, C = 0.01. The device and wire parameters are shown in Table I. The ratio α of MIV height to MUX cell width is 0.05. The algorithm is implemented in C++. We solve the ILP problem using the Gurobi package [11]. All the experiments are performed on a Linux workstation with an Intel Core i7-920 2.67GHz CPU and 12GB memory.
A. Permutation-based Optimizations using SA and ILP
Because ILP scales poorly, we compare the LPS solution quality only for a 16-bit rotator in 2D and in 2- and 4-layer M3D. The LPS of our proposed SA method approximately equals that of the ILP method. In this case, the runtime of ILP is acceptable; however, it does not scale well. For example, to optimize a 32-bit rotator in 2 layers with α = 0.05, ILP takes days to obtain the solution, while SA takes only minutes.
B. Analysis of LO Design and Permutation-based Optimization (SA)
Table III shows the results for 32-, 64- and 128-bit rotators with LO designs and with the solutions optimized by SA. Overall, permutation solved by SA vastly improves LPS in 2D: the LPS reductions are 40%, 31%, and 23% for the 32-, 64-, and 128-bit rotators, respectively. The corresponding delay improvements are 22%, 29%, and 49%. As the shifter bit width increases, we observe larger timing improvements because the total wire load along the critical paths is reduced more and more. However, the power improvement is limited: the permutation helps reduce the longest path length but lengthens other, previously short wires, since the logic connections of the shifter are fixed. Besides the LPS reduction from 3D folding with MIVs, our SA method improves timing as well as power. A further power reduction results from the 3D folding itself, which reduces the total wire length by providing cheap vertical interconnects. Compared to the optimized solutions of the 2D LO designs of the 32-, 64- and 128-bit rotators, 2-layer M3D improves dynamic power by 33%, 39%, and 44%, and 4-layer M3D by 45%, 53%, and 60%. Compared to simple 2D LO and 3D LO folding, our SA improves timing by 32% and reduces power by 5% on average.
Table IV shows the results for 32-, 64- and 128-bit right arithmetic shifters. We do not provide 2D SA results here because the timing and power improvements are marginal in the 2D cases. Note that there are no wrap-around wires as in rotators; hence the room for improvement is smaller.
Folding does help reduce LPS, but it worsens the timing. This is because some previously short wires in 2D are elongated when folded to another layer, and these wires are often loaded by MUX cells along a path. Naive LO 3D folding is therefore not a wise choice in this scenario. Our permutation-based optimization then plays an important role in compensating for this delay penalty while maintaining the power reduction. Compared to 2D LO, the power reductions by SA are 13%, 18%, and 22% using 2-layer M3D. The delay reductions by SA in 4 layers are around 18% across the 32-, 64- and 128-bit cases. Overall, our SA achieves a 23% timing improvement and a 5% power reduction on average compared to the 2D LO and 3D LO folding strategies. We also run SPICE simulations and verify the trend of improvement.
VI. CONCLUSIONS AND FUTURE WORKS
Monolithic 3D IC (M3D) technology offers VLSI design a new physical dimension, which can be leveraged for gate-level 3D circuit design. In this paper, we propose a performance-driven placer to produce 3D rotation and arithmetic shifters. They have better timing and power performance than 2D and simple 3D-folded linear order designs. We use the logical effort method and SPICE simulation to validate the performance of these optimized designs and to demonstrate the advantages of our method combined with M3D. In the future, we plan to architect other datapath components, such as adders and multipliers, in M3D.
VII. ACKNOWLEDGEMENTS
We acknowledge NSF CCF-1017864.
