We empirically study the implications of area-array I/O for placement methodology. Our work develops a three-axis testbed that examines (1) I/O regime (area-array vs. peripheral pad locations), (2) I/O and core placement methodology (variants of alternating vs. simultaneous I/O and core placement approaches), and (3) placement engine (hierarchical quadratic for both core and I/O cells vs. pure min-cut for core cells and assignment for I/O). Experimental data show that the area-array I/O regime is somewhat more "forgiving" of bad placement methodologies than the peripheral I/O regime. On the other hand, the wrong methodology can still entail substantial losses in solution quality and efficiency. Last, we hypothesize that reductions of on-chip wirelength from adopting the area-array I/O regime may be correlated with topological depth of circuits.
INTRODUCTION
IC packaging technologies with peripheral I/O pads have well-known shortcomings. Observed system Rent parameters suggest that ICs require asymptotically more pads than the die perimeter can provide [15]. Peripheral I/O pads also constrain clock/power distribution, and their inherently large parasitics cause coupling and power issues for off-chip signaling. Given these concerns, the area-array I/O regime is projected to eventually dominate IC implementation methodology, affording improved pad count and reliability, reduced noise coupling, and cost savings as the technology matures.
With respect to physical design, area-array I/O presents several critical new issues. Locating pads directly over the core layout region requires new tools for signal integrity analysis and simultaneous die-package simulation. The possibility of synergetic on-chip and on-package routing opens up many new problems, particularly in the distribution of clock and power. Finally, place-and-route tools will need to handle new layout constraints and objectives, e.g., for noise decoupling and ESD protection. Much effort remains in developing tools and methodology, and the precise impact of area-array I/O on such basic parameters as routability, required global interconnect resource for the design, etc. remains largely unknown.
(This work was supported by a grant from Cadence Design Systems, Inc.)
In this work, we focus on the implications of the area-array I/O regime on I/O and core placement methodology for row-based ICs. Specifically, we examine whether the shift to area-array I/O requires changes to methodologies (e.g., alternating I/O and core placement) and engines (e.g., top-down "quadratic placers" that integrate sparse-system solvers with min-cut partitioners) that are perceived to be successful in the peripheral I/O regime. Our specific contributions include:
We develop a three-axis testbed that allows examination of (1) I/O regime (area-array vs. peripheral pad locations), (2) I/O and core placement methodology (variants of alternating vs. simultaneous I/O and core placement approaches), and (3) placement engine (hierarchical quadratic for both core and I/O cells vs. pure min-cut for core cells and assignment for I/O).
Under modest assumptions, we experimentally show that area-array I/O leads to smaller wirelengths (suggesting better routability) than peripheral I/O.
Again experimentally, we show that the area-array I/O regime is somewhat more "forgiving" of bad I/O and core placement methodologies. On the other hand, the wrong methodology can still hurt solution quality and/or runtime.
The remainder of this paper is organized as follows. In Section 2, we review related previous work, while Section 3 gives a model of I/O regimes, along with a taxonomy of I/O and core placement methodologies and placement engines. Section 4 describes details of our experimental testbed, and Section 5 gives experimental studies of different placement methodologies in the peripheral and area-array I/O regimes. Section 6 concludes with a description of ongoing research efforts.
PREVIOUS WORK
For the intrinsic area-array regime, to which our studies apply, Tan et al. [17] pose the layout problem as: "Given a core portion of the chip which already contains the I/O buffers, place the possible uniformly spaced area-array pads on the top metal layer of the design ... and route these pads to the I/O ports of the chip...". (Intrinsic area-array ICs are those originally designed and laid out for area-array bonding; extrinsic area-array ICs are originally designed and laid out for peripheral bonding, and later converted to the area-array regime using a signal redistribution layer [17]. Maheshwari et al. [11] distinguish the same concepts as "true area-I/O" vs. "redistribution".) For a given fixed core placement, Tan et al. propose a pad placement and assignment methodology that searches for feasible pad locations until there are 10% more identified pad locations than I/O ports. Assignment is used to match I/O ports to pad locations. Farbarik et al. [5] report a suite of CAD tools for intrinsic area-array ICs, notably an area-pad floorplanner that drives the placement of blocks and associated I/O buffers so as to minimize routing cost to area pads. A corner-stitched maze router is used to perform the area-pad routing. Kiamilev et al. [9] propose three distinct layout methodologies for intrinsic area-array ICs. The second approach is similar to that of [5] but uses manual routing; the third approach separately optimizes the core and I/O buffer circuit placements on chip, then places area pads directly above corresponding I/O buffers using a two-dimensional array structure called a pad interface module.
An important work vis-a-vis our own is that of Maheshwari et al. [11], which addresses timing and wirelength implications of transitioning a field-programmable MCM implementation to the area-array I/O regime. In [11], analysis of a theoretical model based on random placement and Rent's rule suggests that area-array I/O "offer[s] a small improvement on the total wirelength, but not much", since the proportion of external nets decreases as the circuit grows large. On the other hand, experimental results with an FPGA layout tool indicate that area-array I/O can afford increased routability (around 13% average reduction in wiring resource usage) and smaller parasitics, particularly for combinational benchmarks.
Finally, I/O placement methodology has been extensively studied for row-based designs and the peripheral I/O regime. RITUAL [16] leaves I/Os floating on the core boundary (allowing infeasible positions) and legalizes them during detailed placement. PACT [13] notes the potentially weak I/O placement quality of RITUAL, and proposes linear assignment to compute an initial (i.e., before core cells are placed) I/O placement. Analysis of circuit structure and path timing constraints is used to determine assignment costs. The authors of PACT [13] observe, by way of motivation, that the common practice of alternating I/O and core placements starting from an arbitrary initial I/O placement can be time-consuming; furthermore, even if convergence is achieved, "the final solution is heavily influenced by the arbitrary initial pad assignment". For small instances of up to 589 gates and 301 I/Os, PACT achieves between 6.5 and 23.6 percent reduction in total wirelength versus random I/O placement, and between 1.0 and 9.7 percent improvement in path timing.
Chen and Marek-Sadowska [2] note possible weaknesses in PACT, e.g., that the chip boundary is represented by a linear array of locations. The authors of [2] construct an initial I/O placement by annealing, using circuit structure and path delay constraints in the objective. The path timing penalty of a bad I/O placement is estimated as being up to 10%; no wirelength penalty data is given. A significant improvement to PACT is given by [14], who separately process floating inputs in performing the initial I/O placement. When both methods are compared in the context of a wirelength-driven annealing placer, the method of [14] improves over PACT total wirelength by 42% for two of the four cases reported (with up to 412 cells and 59 pads). The work of Gao et al. [7] typifies top-down partitioning-based placers that deal with peripheral I/O pads and core cells simultaneously, i.e., separate balance factors are maintained for I/Os and core cells during the partitioning step.
A TAXONOMY OF I/O AND CORE PLACEMENT METHODOLOGY
3.1. I/O Regimes
We study the following abstract model of I/O regimes for row-based designs.
I/O cells must be placed exactly at pad locations, and any I/O cell can be placed at any pad location.
No two I/O cells can occupy the same location.
For a design with P I/O cells and a rectangular core layout region, we fix pad locations in one of three ways: with P locations uniformly spaced around the boundary of the core layout region (this models the peripheral I/O regime with generic locations); with P locations along the sides of the core layout region determined by "projecting" original peripheral pad locations to the layout boundary with respect to the center of the layout region (this models the peripheral I/O regime with user locations); or with an array of locations spaced uniformly within the core layout region (this models the area-array I/O regime).
Our model of the area-array regime ignores many practical constraints, as well as freedoms, that are associated with pad location and I/O assignment to pads. For example, we ignore I/O placement constraints (many of which have yet to be precisely characterized) that arise from package technology, decoupling and ESD requirements, power/ground distribution, etc. However, our model permits controlled study of netlist embeddability and embedding strategy, and is tractable even to industry layout tools that do not otherwise handle the area-array regime. Within this model, we investigate the following taxonomy of methodologies and engines for I/O and core placement.
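As an illustration only, the three ways of fixing pad locations can be sketched in Python as follows. The function names, the coordinate convention (origin at the lower-left corner of the core region), the half-pitch offset of the array, and the ray-based reading of "projecting" are all assumptions made for this sketch; the text does not specify them.

```python
def peripheral_pads(p, width, height):
    """p locations uniformly spaced along the boundary (generic peripheral)."""
    perimeter = 2.0 * (width + height)
    pads = []
    for k in range(p):
        d = k * perimeter / p  # distance walked counter-clockwise from (0, 0)
        if d < width:
            pads.append((d, 0.0))
        elif d < width + height:
            pads.append((width, d - width))
        elif d < 2 * width + height:
            pads.append((width - (d - width - height), height))
        else:
            pads.append((0.0, height - (d - 2 * width - height)))
    return pads


def project_to_boundary(x, y, width, height):
    """Assumed reading of 'projecting': follow the ray from the region
    center through (x, y) until it meets the region boundary."""
    cx, cy = width / 2.0, height / 2.0
    dx, dy = x - cx, y - cy
    if dx == 0 and dy == 0:
        raise ValueError("the center has no projection direction")
    scales = []
    if dx != 0:
        scales.append(cx / abs(dx))
    if dy != 0:
        scales.append(cy / abs(dy))
    t = min(scales)  # first boundary side reached along the ray
    return (cx + t * dx, cy + t * dy)


def area_array_pads(rows, cols, width, height):
    """rows * cols locations on a uniform grid inside the region (area-array)."""
    return [((c + 0.5) * width / cols, (r + 0.5) * height / rows)
            for r in range(rows) for c in range(cols)]
```

For example, `peripheral_pads(4, 2.0, 2.0)` returns the four corners of a 2 x 2 region, and `area_array_pads(2, 2, 4.0, 4.0)` returns the centers of its four quadrants.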
Alternating vs. Simultaneous Methodologies
Previous works use two basic methodologies, alternating and simultaneous, for I/O and core cell placement.
Alternating I/O and core placement. The alternating methodology iterates between two basic steps, (1) fixing core cells and re-placing I/Os, and (2) fixing I/Os and re-placing core cells. There are two distinct variants, alternating core-first and alternating I/O-first. In the core-first variant, we delete all I/O cells from the netlist, then place the core cells into legal core sites to minimize, e.g., a total wirelength objective. We then restore the original netlist, and begin the alternation with step (1). In the I/O-first variant, we determine an initial I/O placement and then begin the alternation with step (2). As noted above, I/O-first alternation typically uses a random initial I/O placement. In the peripheral I/O regime, we can also have a user-defined initial I/O placement.
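The control flow of the two alternating variants can be sketched as below. This is a minimal Python skeleton under stated assumptions: `alternate`, `place_core`, `place_ios` and `wirelength` are hypothetical names, and the toy one-dimensional placers (one two-pin net per I/O, sorted matching of cells to sites) exist only to make the sketch runnable. It is not the testbed described in this paper and is not meant to reproduce its results.

```python
def alternate(place_core, place_ios, wirelength, io_init=None, rounds=3):
    """Alternate core and I/O placement.

    io_init=None  -> core-first: the first core placement ignores I/Os,
                     then the alternation starts with step (1).
    io_init given -> I/O-first: the alternation starts with step (2).
    Returns (best wirelength, 1-based iteration of the best, history).
    """
    ios, history = io_init, []
    for _ in range(rounds):
        core = place_core(ios)            # step (2): I/Os fixed or absent
        if ios is not None:
            history.append(wirelength(core, ios))
        ios = place_ios(core)             # step (1): core cells fixed
        history.append(wirelength(core, ios))
    best_iter = min(range(len(history)), key=history.__getitem__)
    return history[best_iter], best_iter + 1, history


# Toy 1-D instance (assumption): core cell i and I/O i share one 2-pin net.
CORE_SITES = [1.0, 2.0, 3.0, 4.0]
PAD_SITES = [0.0, 5.0, 10.0, 15.0]


def _match_sorted(keys, sites):
    """Give the cell with the i-th smallest key the i-th smallest site."""
    order = sorted(range(len(keys)), key=keys.__getitem__)
    placed = [0.0] * len(keys)
    for site, idx in zip(sorted(sites), order):
        placed[idx] = site
    return placed


def place_core(ios):
    if ios is None:                       # I/Os detached from the netlist
        return list(CORE_SITES)
    return _match_sorted(ios, CORE_SITES)


def place_ios(core):
    return _match_sorted(core, PAD_SITES)


def wirelength(core, ios):
    return sum(abs(c - p) for c, p in zip(core, ios))


core_first = alternate(place_core, place_ios, wirelength)
io_first = alternate(place_core, place_ios, wirelength,
                     io_init=[15.0, 0.0, 10.0, 5.0])
```

Passing the placers in as callables keeps the alternation logic independent of the placement engine, which mirrors the separation between methodology and engine in the taxonomy.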
Simultaneous I/O and core placement. The simultaneous methodology determines placements of I/O and core cells at the same time. For example, [7] performs hierarchical top-down placement, and specifies that any physical partition at any level contains assignable I/Os in proportion to the number of available legal pad sites contained in the partition.
Hierarchical Placement Engines
Most placement tools, with the notable exception of TimberWolf [18], are reputed to incorporate the hierarchical top-down strategy as part of their approach. The two most common variants are based respectively on analytic placement and pure min-cut placement.
Analytic placement. As reviewed in [1], the analytic placement (popularly referred to as "quadratic placement") approach involves (1) solution of a sparse linear system to determine a continuous module placement that minimizes a squared-wirelength objective, interleaved with (2) heuristic means of "legalizing", or "spreading", the placement so that it satisfies discrete constraints (e.g., no module overlaps). The sparse linear systems can correspond to hierarchical subproblems in the top-down placement process. The legalization step is typically accomplished via partitioning or assignment/transportation formulations, along with highly sophisticated metaheuristics.
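To make step (1) concrete, the following Python sketch minimizes a squared-wirelength objective in one dimension for two-pin connections. It is a sketch under assumptions: `quadratic_place_1d` is a hypothetical name, and instead of a sparse linear solver it uses Gauss-Seidel sweeps, relying on the fact that at the optimum each movable module sits at the average of its neighbors. The legalization of step (2) is not shown.

```python
def quadratic_place_1d(edges, fixed, movable, sweeps=500):
    """Minimize the sum over edges of (x_u - x_v)^2, holding `fixed` constant.

    edges   : list of (u, v) two-pin connections
    fixed   : dict mapping a fixed module (e.g., an I/O) to its coordinate
    movable : list of modules whose coordinates are solved for
    """
    x = dict(fixed)
    x.update({m: 0.0 for m in movable})
    nbrs = {m: [] for m in movable}
    for u, v in edges:
        if u in nbrs:
            nbrs[u].append(v)
        if v in nbrs:
            nbrs[v].append(u)
    for _ in range(sweeps):
        for m in movable:
            if nbrs[m]:
                # Stationarity condition of the quadratic objective.
                x[m] = sum(x[n] for n in nbrs[m]) / len(nbrs[m])
    return {m: x[m] for m in movable}


# A chain of three modules between two fixed pads at 0 and 4.
chain = [("p0", "a"), ("a", "b"), ("b", "c"), ("c", "p1")]
solution = quadratic_place_1d(chain, {"p0": 0.0, "p1": 4.0}, ["a", "b", "c"])
```

In this example the three modules spread evenly between the two fixed pads. If `fixed` is empty, placing every module at one common coordinate already gives an objective of zero, so the continuous solution carries no spreading information.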
Pure min-cut placement. A pure min-cut strategy uses an iterative partitioner, e.g., some variant of the Fiduccia-Mattheyses heuristic [6], with a minimum net cut objective. Techniques such as terminal propagation, table-lookup treelength estimators, etc. are used to improve fidelity of the min-cut partitioning objective with respect to the actual placement objective, which is often based on the sum of net bounding box half-perimeters. Again, many sophisticated metaheuristics are used to achieve competitive results.
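The two objectives named above can be written down directly. The Python sketch below is illustrative: the names `net_cut` and `hpwl` and the data layout (a net as a list of cell names, positions as a dictionary) are assumptions, and no partitioning heuristic is implemented.

```python
def net_cut(nets, side):
    """Number of nets with cells on both sides of a bipartition.

    side maps each cell to 0 or 1.
    """
    return sum(1 for net in nets if len({side[c] for c in net}) > 1)


def hpwl(nets, pos):
    """Sum over nets of the half-perimeter of the net bounding box.

    pos maps each cell to an (x, y) coordinate.
    """
    total = 0.0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


nets = [["a", "b"], ["a", "b", "c"], ["c"]]
pos = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
side = {"a": 0, "b": 1, "c": 0}
```

Here `net_cut(nets, side)` counts the first two nets as cut, and `hpwl(nets, pos)` adds a half-perimeter of 7 for each of them and 0 for the single-pin net.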
EXPERIMENTAL QUESTIONS AND TESTBED
In most IC place-and-route flows today, I/O locations are assumed to be fixed in any input instance to the placement tool. For example, in hierarchical blocks the I/O locations may be fixed by a pin optimizer or by a chip-level route planner within the design planning tool. Even for flat, "full-chip" placement (e.g., traditional gate-array ASIC), CAD engineers often heuristically assign system I/Os to pad locations. The conventional wisdom is that fixing the I/O placement before placing core cells impacts the total wirelength of the layout by at most a few percent. This is plausible since the number of nets incident to I/Os is small relative to the total number of nets, and matches the conclusions of [11]. On the other hand, recall that in the peripheral I/O regime (i) a high-quality I/O placement can significantly improve wirelength and performance for small designs [2], and (ii) with alternating I/O and core placement methodologies the final solution is "heavily influenced" by the initial I/O placement [13]. Thus:
Experimental Question 1: Does an initial I/O placement have as "heavy" an influence with respect to alternating methodologies in the area-array I/O regime as it does in the peripheral I/O regime?
Next, we observe that within the hierarchical placement approach, the use of analytic placers ("quadratic placers") requires fixed I/O locations to "anchor" the result of the continuous placement. If there are no prescribed I/O locations, then the analytic placement collapses down to a single point and provides a vacuous start to the legalization step. We ask whether a "fix I/Os first" mindset with the analytic placement approach has somehow guided I/O and core placement methodology in the wrong direction.
Experimental Question 2: Can core-first alternation which a pure min-cut placement engine might achieve more naturally than a
Testbed
To test all placement strategies within the above taxonomy, we require four capabilities: (1) placement of core cells with I/Os detached from the netlist; (2) placement of core cells with fixed I/Os; (3) placement of I/Os onto discrete locations when core cells are fixed; and (4) simultaneous placement of I/O and core cells.
Our testbed includes both an industry placer and our own placer. The industry placer reputedly uses some form of the quadratic (analytic) placement approach; we run it in default mode. It can be coerced to provide the first three out of the four capabilities above, via a mix of site definitions, fixed-location constraints, I/O-core reclassification, and scripting. Our placer integrates a number of techniques, but we invoke only pure hierarchical min-cut placement for core cells and min-cost assignment for I/Os to achieve all four of the capabilities above.
Both the industry placer and our placer can read test cases from industry in the Cadence LEF/DEF format. Our experiments are based on two industry designs from which we removed all nets incident to more than 100 cells, as well as all spare cells and the 1-pin nets that result from removal of spare cells. Parameters of the test cases are shown in Table 1.
EXPERIMENTAL RESULTS

Experiments With the Industry Placer
The industry placer, to our knowledge, assumes that I/Os are placed and fixed. We have not yet been able to achieve a simultaneous methodology with this tool. To implement alternating I/O and core placement methodologies, we modify the netlist at each iteration by relabeling former movable core cells as fixed I/Os, and former fixed I/Os as movable core cells. The placer can also be run with all I/Os removed from the netlist to construct initial core placements. Notice that the placement of core cells with all I/Os removed from the netlist gives an empirical "lower bound" for wirelength. Such lower bounds are 246633 microns for Case 1 and 1602701 microns for Case 2. We place I/Os and core 10 times each, in alternation, using the industry placer; this yields 19 iterations with meaningful placement results. We report best wirelength values over all 19 iterations along with the iteration number (1-19) at which the best result occurred. We also report wirelength values obtained at the first iteration. Two wirelength values are given: summed over all nets, and summed over only nets that contain I/Os.
For the standard methodology of alternating I/O and core placement steps starting with a random I/O placement, surprisingly many iterations are required until the best placement is reached. Quite possibly, current practice does not invoke sufficiently many alternations. Often, wirelength does not improve monotonically over successive iterations, but instead oscillates. This is shown in Figure 1, which shows all results in the PG (peripheral, generic locations) and AA (area-array) regimes for both Case 1 and Case 2. The oscillating effect is very strong with peripheral I/O regimes, and relatively weak with area-array I/O. Furthermore, the I/O placement steps for peripheral I/O usually increase wirelength, while core placement steps usually decrease wirelength. One possible explanation is that the industry placer's algorithm, while highly successful for core placement, does not perform as well for I/Os. Thus, the I/O placement iterations are essentially "perturbations" or "kick moves" from which the core placement recovers. We also note that the best wirelengths in all I/Or experiments obtain from 22% to 46% reduction over the initial wirelengths. We conclude that an initial random I/O placement can be very harmful, and that the traditional methodology may require many alternations to recover.
We next observe that with every placement regime, placing core cells first with I/Os removed leads to overall smaller wirelength. Furthermore, in all but the (PU, I/Ou) case, far fewer iterations are required before essentially the best wirelength is achieved. From a methodology point of view, it appears that disconnecting the I/Os from the netlist and placing the core yields a superior start for alternation. Finally, total wirelength values show that the best I/O regime is area-array, followed by uniformly-spaced generic peripheral. The worst regime in terms of "achievable" wirelength was the PU regime (peripheral I/O with user locations): we speculate that the nonuniformity of the pad locations makes wirelength minimization more difficult for the various methodologies. Notice that in the PU regime, the I/O placement defined by the designer leads to overall smaller wirelength, again confirming that the initial I/O placement strongly influences the final placement and that a bad initial placement is harmful.
Experiments With Our Placer
With our placer, a simultaneous methodology is implemented by performing top-down hierarchical placement, with core cells recursively bipartitioned by a min-cut KLFM heuristic at each level, and I/O cells re-placed by min-cost assignment at each level. To set the assignment costs for the I/O cells, we assume that each core cell is located in the center of its current region within the top-down process. Our implementation of alternating methodologies uses top-down min-cut FM bipartitioning for each core placement iteration, and min-cost assignment for each I/O placement iteration. With our placer, all reported results are averages over 5 runs.
Table 3. Studies of peripheral and area-array I/O regimes.
Table 3 shows that as the total wirelength improves from the first iteration to the best iteration, the same reduction or even greater reduction occurs in the nets incident to I/Os. This phenomenon appears to be independent of I/O regime and placement methodology. As expected, using min-cost assignment for I/O placement results in much more efficient I/O placement iterations, and generally decreases wirelength from the previous iteration, in contrast to what we observed with the industry placer. As seen in Figure 2, wirelength decreases faster than with the industry placer. Furthermore, near-best wirelength results tend to be achieved in fewer iterations (sometimes almost immediately). We believe that this is due to use of assignment for I/O placement.
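The I/O placement step can be illustrated with a small Python sketch. Everything specific here is an assumption rather than the implementation used in our placer: the names `io_cost` and `assign_ios`, the choice of one net per I/O, the use of the net bounding-box half-perimeter (with core cells at their region centers) as the assignment cost, and above all the brute-force search over permutations, which stands in for a proper min-cost assignment solver (e.g., the Hungarian method) and is only usable on tiny instances.

```python
from itertools import permutations


def io_cost(pad, cell_centers):
    """Half-perimeter of the bounding box of one I/O net, with the I/O at
    `pad` and each core cell at the center of its current region."""
    xs = [pad[0]] + [c[0] for c in cell_centers]
    ys = [pad[1]] + [c[1] for c in cell_centers]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def assign_ios(io_nets, pads):
    """Assign each I/O to a distinct pad at minimum total cost.

    io_nets[i] : region centers of the core cells on the net of I/O i
    pads       : candidate pad locations (at least as many as I/Os)
    Returns (total cost, list giving the pad index chosen for each I/O).
    """
    best_cost, best_perm = None, None
    for perm in permutations(range(len(pads)), len(io_nets)):
        cost = sum(io_cost(pads[j], io_nets[i]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)


pads = [(0, 0), (10, 0), (0, 10)]
io_nets = [[(9, 1)], [(1, 9)], [(1, 1), (2, 2)]]
result = assign_ios(io_nets, pads)
```

In this example each I/O is sent to the pad nearest the cells on its net, for a total cost of 8. Treating the costs as independent per I/O is what makes the problem an assignment; that simplification is exact only when I/Os do not share nets.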
In general, alternating methodology experiments with our placer lead to similar qualitative conclusions as experiments with the industry placer. We emphasize that our wirelength values and CPU times are not comparable with those of the industry placer, since the industry placer performs many detailed optimizations to guarantee routability. We leave such optimizations, and the ability to perform direct comparisons between the two placers, to future work. Finally, we note that the results from the simultaneous methodology are fairly disappointing. We believe that variant implementations may be more successful, and our future work targets these.
Experiments with Netlist Structure
Our final experiments seek a link between topological structure of netlists and the ability of area-array I/O to reduce total wirelength. We conjecture that designs in which core cells are nearer I/Os will benefit greatly from area-array I/O, while designs in which core cells are further from I/Os will not benefit as much. The intuition is that area-array I/O offers more flexibility in the I/O placement, and that this can help a design whose core cells are tightly bound to I/Os. On the other hand, if the core cells are loosely bound to I/Os, then extra flexibility in the I/O placement may not be needed to achieve low wirelength.
To test this conjecture, we have produced a series of mutant netlists by (1) iteratively finding the core cells furthest from I/Os (using simple hop-count in the netlist hypergraph), and (2) transforming these cells into "I/Os". In this way, we may change the topological depth fairly quickly with minimal change to the netlist. Table 4 shows the topological profile, i.e., the number of cells at each hop-count level from the I/Os, for each original design and five mutants selected for their variety of topological depths. Each mutant is designated by the number of cells changed from the original (e.g., Case 1m40 has 40 original core cells changed into I/Os). For each mutant, we perform the same placement experiments as before, but apply only the most successful methodology, i.e., alternating core-first. These results are shown in Table 5.
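A possible reading of this mutation procedure is sketched below in Python. The function names, the netlist representation (each net a list of cell names), the tie-breaking by cell name, and the choice to relabel the k furthest cells in a single round are assumptions of the sketch; cells not reachable from any I/O are simply ignored.

```python
from collections import Counter, deque


def hop_counts(nets, ios):
    """Breadth-first hop count from the nearest I/O in the netlist
    hypergraph; two cells are one hop apart if they share a net."""
    cell_nets = {}
    for n, net in enumerate(nets):
        for c in net:
            cell_nets.setdefault(c, []).append(n)
    dist = {io: 0 for io in ios}
    queue = deque(dist)
    seen_nets = set()
    while queue:
        c = queue.popleft()
        for n in cell_nets.get(c, []):
            if n in seen_nets:
                continue  # a net is expanded once, from its nearest cell
            seen_nets.add(n)
            for d in nets[n]:
                if d not in dist:
                    dist[d] = dist[c] + 1
                    queue.append(d)
    return dist


def profile(nets, ios):
    """Topological profile: number of core cells at each hop-count level."""
    dist = hop_counts(nets, ios)
    return Counter(d for c, d in dist.items() if c not in ios)


def mutate(nets, ios, k):
    """Relabel the k core cells furthest from the I/Os as I/Os."""
    dist = hop_counts(nets, ios)
    core = sorted((c for c in dist if c not in ios),
                  key=lambda c: (-dist[c], c))
    return set(ios) | set(core[:k])


nets = [["p", "a"], ["a", "b"], ["b", "c"], ["c", "d"]]
ios = {"p"}
mutant_ios = mutate(nets, ios, 1)
```

On this chain, the deepest cell `d` becomes an I/O, and the maximum hop count drops from 4 to 2. Calling `mutate` repeatedly with small `k` recomputes the hop counts between rounds, which is closer to an iterative procedure.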
To assess the impact of the area-array regime, we observe the difference between best wirelengths achieved with area-array I/O and with peripheral I/O. For the smaller design, Case 1, this difference clearly increases as more cells in the design move closer to the I/Os. However, for Case 2 only minor changes in this difference can be seen. We believe that alternate characterizations of netlist topology may be needed, as well as more detailed studies.
CONCLUSIONS AND FUTURE WORK
In this paper, we have empirically studied the implications of area-array I/O for placement methodology using a testbed that allows us to vary the I/O regime, the I/O and core placement methodology, and the placement engine. Our main results are as follows.
We confirm that with alternating placement methodologies, which are often used in practice today, the number of iterations needed to achieve good solutions can be surprisingly large. Also, a bad (e.g., random) initial I/O placement can seriously handicap subsequent iterations. We also conclude that it is better to begin the alternation by placing a netlist of core cells with all I/Os disconnected, rather than by placing I/Os randomly as in traditional approaches. This has interesting implications vis-a-vis the use of quadratic placers, which can require fixed "anchors" in the placement instance.
Our experiments also show that the area-array I/O regime offers substantial wirelength improvements over the best methodology we found for the peripheral I/O regime. The wirelength reductions range from 12% to 27% with an industry placer, and average around 30% with our placer.
We show that the classical assignment problem may be a more appropriate framework for the I/O placement iteration. We observe that the assignment formulation accurately captures placement costs since I/Os rarely share nets, and that assignment algorithms seem more efficient than traditional placers on I/Os-only instances. Furthermore, traditional placers may not easily adapt to I/O placement instances, and may even increase the total wirelength in some iterations.
Table 5. Studies of peripheral and area-array I/O placement regimes with netlist mutants.
Our ongoing work addresses three issues. First, our experiments have shown that our combination of min-cut and assignment may eventually be competitive with the industry placer, at least in situations where both I/O and core cells must be placed. Again, recall that our approach does not require any fixed anchors, and that the first iteration appears very important. We hope to improve our placer to meet similar routability and legalization objectives as the industry tool, so that more direct comparisons will be possible. Second, we believe that the simultaneous I/O and core placement methodology still holds promise, and we hope to find a better implementation. Finally, we hope to better understand the relationship between netlist topology and wirelength reductions afforded by the area-array regime.
