ABSTRACT This paper recommends a practical way to insert buffer and flop repeaters on the global signals of a complex system-on-chip (SoC). With the advent of deep sub-micrometer technology and a new business environment, the market prefers highly integrated SoCs delivered with high design productivity and low development cost. We observed that many algorithms proposed in prior art, which use exact algorithms to optimize a design solely for performance, power, and/or area, are no longer practical. We therefore introduce a hybrid meta-heuristic based flow, which combines a meta-heuristic algorithm with different artificial intelligence algorithms, such as exact and heuristic algorithms, to search for a near-optimum buffer repeater insertion recipe, optimize the floor-plan pin placement, and then correctly insert flop repeaters into the design. Our experiments on 10- and 14-nm SoC products showed that the flow produces repeater designs of "good enough" quality with fast turn-around-time and less design effort.
I. INTRODUCTION
In deep submicron design, interconnect delay has become one of the dominant performance-limiting factors of a circuit. Thus, many interconnect optimization techniques have been introduced, including buffer and flop repeater insertion. For buffer repeater optimization, some studies [1], [2] introduced the concept of feasible regions, also known as buffer blocks. In previous work [2], the team also derived an analytical formula to compute the buffer blocks under any given delay constraint. However, the formula did not account for signal transition, which is crucial for high signal integrity. Then, researchers [3] presented a buffer block planning algorithm that meets both the delay and transition time constraints, but it only works on 2-pin nets. After that, researchers [4] extended the idea to cover nets with multiple pins given an existing buffer block plan. The abovementioned techniques provide good buffer repeater results based on a set of user-defined parameters such as wire resistance, intrinsic repeater delay, and repeater input and output capacitance. However, in recent technologies, each parameter can take many different values. For example, wire resistance varies depending on the wire layer, width and length, spacing, and repeater driving strengths. Thus, searching for the right parameters can itself be a challenge. Different designs with different reliability requirements and operating clock frequencies may require different repeater design recipes. Finding an optimum repeater recipe is a tedious task, as explained in [5] for 32-nm technology. For these reasons, we have introduced a hybrid meta-heuristic technique, which uses a machine algorithm to learn heuristically and then search automatically for a near-optimum recipe.
When the interconnect delay exceeds one clock cycle, flop repeaters are needed, as explained by researchers [6]-[9] in their work on buffer and flop repeater optimization for performance and power consumption. However, they used only simplified and ideal models in their analysis. In a real complex SoC design, there are more factors to be considered, for example, multiple clock domains [10], timing margin management [11], and Bit Error Rate (BER) [12]. In actual SoC product development, knowledge of circuit-level analysis alone is insufficient to achieve the optimum interconnect design. We have to analyze the product at a higher level, such as floor-plan and global route congestion as proposed by researchers [13], and even up to the architecture and computer aided design (CAD) tool development levels as explained by researchers [14]. However, the abovementioned techniques mainly use exact algorithms, which are guaranteed to find an optimal solution in bounded time for every finite-size instance of a combinatorial optimization problem [15], but are impractical for a complex SoC today. We are now in phase 3, in which a completely new ecosystem has emerged during the past decade, as reported by the International Technology Roadmap for Semiconductors (ITRS) 2015 [16]. With the emergence of fabless companies, the business is becoming more competitive, as companies have to respond to rapidly changing, complex business requirements and increased customer expectations for faster delivery of new and high-volume products [16]. Therefore, in the current environment, development cost and schedule have become key factors in the success of a SoC product.
In this paper, we present a repeater design methodology that is practical for complex SoC development today. It can produce results with good enough quality, fast turn-around-time, and reduced engineering resources. This is achieved by using simplified design models and hybrid meta-heuristic techniques to minimize human trial-and-error iterations. The rest of this paper is organized as follows. In Section II, we explain the challenges of repeater design in a complex SoC in the current business environment. In Section III, we explain the concept of hybrid meta-heuristics and how it is used in the proposed new repeater design flow. Section IV elaborates on the hybrid meta-heuristic technique used in searching for the optimum buffer repeater recipe. In Section V, we introduce the proposed new hybrid meta-heuristic based flop repeater insertion flow, and in Section VI we demonstrate the efficiency of the flows on real SoC products. Finally, Section VII concludes the paper with some directions for future work.
II. REPEATER DESIGN CHALLENGES
The commonly known challenges of repeater design are:
• Latency - Pipelined interconnects are designed so as to minimize the overall latency of the propagated signals or to satisfy latency constraints given at each driver-receiver pair [8].
• Power - The power consumed by interconnect, including repeaters and flip-flops, accounts for a growing share of the total system power [9]. Thus, power dissipation has become a primary design constraint.
• Signal reliability - In real circuits, there are many non-ideal behaviors of circuits and signals due to temporal and spatial variations of the clock signal (clock skew and jitter), wire delay uncertainty, and variations in the timing parameters of sequential elements, which greatly impact the reliability of a wire pipelining scheme [17].
• Area cost - It was projected that over 700k repeaters will be inserted in a single chip for the 70-nm technology. The insertion of that many repeaters will significantly change the floor-plan and placement of a design. Many designs are routing-limited; it may not be feasible to get signals into and out of repeaters due to limited routing resources. Therefore, it is important to consider routing feasibility during the global distribution of repeaters in the floor-planning step [18].
However, in the recent business environment, the following challenges have emerged as primary concerns:
• Design productivity and development cost - As listed in the challenges and possible solutions section of ITRS 2.0 2015 [16], the key long-term (> 3 years) system integration challenge is design productivity, where faster design turn-around-time and less design effort are beneficial.
• Design complexity - As mentioned by researchers [19], in the 2013 ITRS roadmap, the die area of a Consumer Portable SoC (SOC-CP) is about 140 mm² and the transistor count is about 2.4B. Recent trends suggest a 1.26× scaling of logic transistors per core with every process technology node.
In a complex SoC, we may need both buffer and flop repeaters. Buffer repeater insertion can be implemented and validated by the physical designer alone, but flop repeater insertion requires additional parties such as the micro-architecture designer, floor-planner, and timing and clock designers. On top of the design work, the amount of effort put into communication to converge a SoC cannot be overlooked. In this paper, we focus on developing a methodology that can address the commonly known challenges listed above with fast turn-around-time and less design and communication effort.
III. HYBRID META-HEURISTIC TECHNIQUES
As described in Section II, repeater design of a complex SoC in the recent business environment is a non-deterministic polynomial-time hard (NP-hard) optimization problem, for which no polynomial-time algorithm is known. The use of exact algorithms, such as mathematics-based approaches, to give exact or complete solutions is impractical. In approximation, or heuristic, methods, the guarantee of finding optimal solutions is sacrificed in order to get near-optimum solutions in reasonable and practical computational times [15]. In the repeater insertion context, an example of a heuristic method is to configure the constraints of a layout CAD tool and then rely on its auto place and route (APR) algorithm to build the solution. APR is a local search algorithm, which has no means of searching extensively and exploiting global optimum solutions that lie in distant neighborhoods of the solution space [20]. Therefore, a local search algorithm will simply get caught in local optima [15], and to solve this problem modern explorative meta-heuristic techniques have been proposed. According to researchers [21], a meta-heuristic is formally defined as an iterative generation process which guides a subordinate heuristic by combining intelligently different concepts for exploring and exploiting the search space. Learning strategies are used to structure information in order to efficiently find near-optimal solutions. Other researchers have attempted to propose a definition as well. In summary, researchers [22] outlined the fundamental properties that characterize meta-heuristics in the literature and proposed that meta-heuristics are high-level strategies for exploring the search space by using different methods.
In the last few years, the literature [20] reports that a large number of algorithms have not followed the conventional single meta-heuristic strategy but have combined different algorithmic ideas from both inside and outside the meta-heuristic field. Such an algorithm is called a hybrid meta-heuristic algorithm [23]. These algorithms combine exact, heuristic, or meta-heuristic strategies with other meta-heuristic strategies in an attempt to offset each other's weaknesses and merge their strengths.
The proposed repeater design can be divided into two categories: buffer repeater and flop repeater. In buffer repeater design, we use a Genetic Algorithm (GA), which is a type of population-based meta-heuristic algorithm, to search for a near-optimal design recipe, covering the first repeater size, middle repeater drive strength, repeater interval distance, metal layers, metal width, spacing, and shielding, that produces near-optimum latency with acceptable area, signal integrity, and power tradeoffs. Then, we deploy a heuristic method to statistically estimate the performance of the recipe during actual physical implementation. The buffer repeater hybrid meta-heuristic flow is elaborated in Section IV. The statistically estimated buffer repeater performance is used in floor-planning, global route congestion modeling, micro-architecture fine tuning, and the development of CAD flows such as placement in APR and flop repeater insertion.
The proposed flop repeater design flow is a hybrid meta-heuristic based flow and is elaborated in detail in Section V. At a high level, the flow uses a meta-heuristic approach to optimize pin placement and a heuristic approach to implement channel physical blocks, also known as partitions, which are larger versions of buffer blocks [1]-[4], and to design the feed-through pins. Based on the latency-per-distance data estimated by the hybrid meta-heuristic flow explained in Section IV, it uses an exact algorithm to calculate the pipeline stage count for each feed-through signal in the channel partitions. The output of the flow is eventually fed back to the micro-architecture designers to be modelled into the design. The key advantage of this flow is that it speeds up the turn-around-time and minimizes the engineering effort, as it allows flop repeater design to be performed by the floor-plan and micro-architecture teams, hence skipping the involvement of the timing, clock, and physical implementation teams.
IV. BUFFER REPEATER INSERTION METHODOLOGY
Buffer repeater recipe in this paper refers to a combination of input parameters such as the first buffer repeater size, buffer repeater drive strength, repeater interval distances, wire metal layers, wire width, spacing, and shielding. Among these parameters, the repeater interval distance is not combinatorial. In the prior works [2]-[4], [6]-[8], [11], [18], the resistance and capacitance of wires and cells, which depend on the buffer repeater recipe, have to be predefined by designers for the exact algorithm to find the feasible region for repeater insertion and the optimal interval distances that minimize latency and power consumption. However, there are many combinations of the repeater input parameters. With fixed values of these parameters, the exact algorithm can only produce a locally optimized result. In [5], a heuristic algorithm, which builds Resistance-Inductance-Capacitance (RLC) circuits that are then verified using Simulation Program with Integrated Circuit Emphasis (SPICE), is used to find an optimum repeater recipe for 32-nm technology. The algorithm was tested on circuits for three repeater interval distances only. The optimal buffer recipe is identified by plotting Pareto points on charts of propagation time versus buffer strength. In reality, the repeater interval distance is not a combinatorial value, thus the results of the analysis may not be global optima.
To identify a better buffer repeater recipe, we need to search a wider space, which means more combinations of input parameters. Therefore, we have deployed a GA, a type of population-based meta-heuristic algorithm. GAs are search methods based on the principles of natural selection and genetics. GAs encode the decision variables of a search problem into finite-length strings over alphabets of a certain cardinality. The strings, which are candidate solutions to the search problem, are referred to as chromosomes, the alphabets are referred to as genes, and the values of genes are called alleles. As described by researchers [24], GAs rely on a population of candidate solutions. The population size, which is usually a user-specified parameter, is one of the important factors affecting the scalability and performance of GAs. To evolve good solutions and to implement natural selection, we need a fitness measure for distinguishing good solutions from bad solutions. Once the problem is encoded in a chromosomal manner and a fitness measure has been chosen, we can start to evolve solutions to the search problem using bio-inspired operators such as reproduction, selection, mutation, and crossover.
In the proposed GA technique, we model a global signal implementation in three parts: driver, trunk, and receiver, as shown in Fig. 1. The recipe for each part is modeled as a Chromosomes string, which includes the genes that represent the first buffer, first distance, repeater cell(s) drive strength, interval distance, wire layer, wire width, wire spacing, and shielding. The fittest chromosome is the combination of the three sub-chromosome strings that, without violating the maximum transition (Trans) and maximum capacitance (Cap) requirements, produces the shortest latency and consumes reasonably low power.
The proposed hybrid meta-heuristic based buffer repeater insertion flow is shown in Fig. 2. Basically, a GA meta-heuristic algorithm is used to search for a near-optimal design recipe, followed by a heuristic SoC implementation flow to fine tune the recipe. The proposed initial population generation flow is modeled in Fig. 3. In the flow, we developed scripts to automatically mix and match the gene options to produce chromosomes. For each chromosome, a simple repeater layout, which consists of the buffer repeater design described by the chromosome, is automatically generated using an APR CAD tool. After that, we extract the parasitic resistance-capacitance (RC) of the design and perform static timing analysis (STA).
In the initial reproduction flow, we choose to operate on a 1000 um × 1000 um floorplan, which is an intentional trade-off of accuracy for turn-around-time. The loss of accuracy, which is due to the non-optimized receiver part, will be rectified using the heuristic flow. On top of that, to reduce the size of the initial population, we have preset some of the input values based on know-how learnt from previous SoC design experience. For example, we combine the first and last stage buffers into the same set and remove the big cells from the set to reduce the risk of placement congestion near the pins. We set the granularity of the trunk interval distance to 25 um to reduce the population size, and set the shielding to always ON to mimic the coupling capacitance effect and improve accuracy in STA.
To simplify the implementation flow development, we set the driver interval distance to half of the trunk interval distance and use the same metal layers for the wires in the driver, trunk, and receiver. With this know-how incorporated, the Chromosomes string is shortened into the format {a}.{c}.{i}.{l}.{w}.{s}.{c_r}, where the first/last cell strength $a \in A = \{\mathrm{bfnln06}, \ldots, \mathrm{bfnln32}\}$; trunk cell strength $c \in C = \{\mathrm{bfnln04}, \ldots, \mathrm{bfn2n72}\}$; interval distance $i \in I = \{100, 125, \ldots, 250\}$; wire layer pair $l \in L = \{(m4\ m5), \ldots, (m10\ m11)\}$; wire min width multiplier $w \in W = \{1, 3\}$; wire min spacing multiplier $s \in S = \{1, 2, 3\}$; and the N-1 cell strength $c_r \in C = \{\mathrm{bfnln04}, \ldots, \mathrm{bfn2n72}\}$. In the initial population flow, to simplify the implementation, we set $c_r = c$.
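The mix-and-match step that builds the initial population can be sketched as a Cartesian product over the gene option sets. This is only an illustrative sketch: the cell and layer-pair lists below contain just the endpoints quoted in the text (the intermediate options are not listed in the paper), and the way a layer pair is written inside the string (for example `m4m5`) is an assumption.

```python
from itertools import product

# Gene option sets. Only the interval, width and spacing sets are fully
# given in the text; the cell and layer-pair lists are placeholders that
# hold just the endpoints of each elided range.
FIRST_LAST_CELLS = ["bfnln06", "bfnln32"]       # a in A (endpoints only)
TRUNK_CELLS = ["bfnln04", "bfn2n72"]            # c in C (endpoints only)
INTERVALS_UM = list(range(100, 251, 25))        # i in I = {100, 125, ..., 250}
LAYER_PAIRS = [("m4", "m5"), ("m10", "m11")]    # l in L (endpoints only)
WIDTH_MULT = [1, 3]                             # w in W
SPACING_MULT = [1, 2, 3]                        # s in S


def initial_population():
    """Yield strings {a}.{c}.{i}.{l}.{w}.{s}.{c_r}, with c_r = c."""
    for a, c, i, l, w, s in product(FIRST_LAST_CELLS, TRUNK_CELLS,
                                    INTERVALS_UM, LAYER_PAIRS,
                                    WIDTH_MULT, SPACING_MULT):
        layer = "".join(l)  # assumed encoding of a layer pair, e.g. "m4m5"
        yield f"{a}.{c}.{i}.{layer}.{w}.{s}.{c}"


population = list(initial_population())
```

With the full option sets substituted in, each string would drive one automatically generated layout, so the population size directly sets the number of APR and STA runs.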
In the STA process, based on the location of the input-output port pair, we can identify the direction of each signal path as either vertical (V) or horizontal (H) and then differentiate the paths by layer. With a simple script, for each path $i$, we measure its Manhattan distance $M_i$ and latency $D_i$ from the driver pin to the input of its N-1 cell, skipping the receiver part, and then calculate the latency-per-distance $K_i = D_i / M_i$.
Upon completion of the STA process, we save the reports in a format that represents their respective Chromosomes strings, as shown by the samples in Table 1. Then, the table goes through a rank-based selection [25] and a fitness evaluation flow. First, we sort the Chromosomes by latency to identify those with the lowest latency. Then, in ascending order of latency, the candidates go through a Fitness Function, in which Chromosomes that fail the maximum transition (Trans) and maximum capacitance (Cap) checks are filtered out. The first candidate that passes the Fitness Function is the overall winner of the initial population, and its Chromosomes are used for the Mutation operation in the subsequent optimization. Different SoCs have different priorities on latency, power, signal reliability, and area. For example, in a low-power SoC, power consumption carries a relatively higher weightage in the rank-based selection. To solve this multi-objective problem, we can use a simple method that aggregates all the criteria into one criterion using a weighted summation [26], that is, cost $S_c = m_k K + m_p P + m_s T + m_a D$, where $K$ is the latency-per-distance; $P$ is the power consumption; $T$ is the worst signal transition; $D$ is the cell area; and $m_k$, $m_p$, $m_s$, and $m_a$ are user-defined weightages set according to the SoC priority, with all $m \ge 0$ and $m_k + m_p + m_s + m_a = 1$. For the SoCs used in this paper, we simplified the selection process by setting $m_p = m_s = m_a = 0$. This is because we have considered the maximum transition in the proposed fitness function and filtered out any high power consumption and large cells from the search space during the initialization of the population. Controlling the solution selection using maximum transition and applying side shielding to signal wires are the key techniques we use to address the signal integrity challenge.
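A minimal sketch of the rank-based selection with the weighted-sum cost is given below. The record layout and field names are hypothetical, and the latency-per-distance is assumed to be latency divided by Manhattan distance, expressed in ps/mm. The default weights reproduce the simplified setting in which only the latency term is kept.

```python
from dataclasses import dataclass


@dataclass
class StaResult:
    chromosome: str
    latency_ps: float     # driver pin to N-1 cell input
    distance_um: float    # Manhattan distance of the same path
    power: float          # P
    worst_trans: float    # T
    area: float           # cell area (D in the cost formula)
    trans_ok: bool        # passes the maximum transition check
    cap_ok: bool          # passes the maximum capacitance check

    @property
    def k_ps_per_mm(self) -> float:
        # Assumption: latency-per-distance = latency / distance.
        return self.latency_ps / (self.distance_um / 1000.0)


def cost(r, m_k=1.0, m_p=0.0, m_s=0.0, m_a=0.0):
    """Weighted-sum cost S_c = m_k*K + m_p*P + m_s*T + m_a*D."""
    if abs(m_k + m_p + m_s + m_a - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (m_k * r.k_ps_per_mm + m_p * r.power
            + m_s * r.worst_trans + m_a * r.area)


def select_winner(results, **weights):
    """Rank by ascending cost; return the first entry passing Trans and Cap."""
    for r in sorted(results, key=lambda r: cost(r, **weights)):
        if r.trans_ok and r.cap_ok:
            return r
    return None
```

If non-zero weights were used for power, transition, or area, the four terms would likely need to be normalized to comparable scales first, since they carry different units.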
In a complex SoC design that has many global signals, routing tracks can be one of the key contributors to die area growth. Thus, using mainly a single metal layer for all the global signals is not a practical solution for a cost-sensitive SoC. As there are global signals with different clock frequencies, we can optimize die area by using different metal layers, as long as the total latency is still within the margin. Therefore, besides the overall winner, we also identify winners for different categories, which are used for different purposes. First, we categorize the population by wire min width multiplier $w \in W$. The winner of category $w = 1$ is meant for default data signal routing, whereas category $w = 3$ is meant for critical clock signals that will be custom built. The category $w = 1$, which is meant for data signals, is further categorized by each wire layer $l \in L_m = \{m4, m5, \ldots, m10, m11\}$ and captured in the format shown by the examples in Table 2. This data is meant for floor-planning, similar to the objective discussed by researchers [13], and for the data signal pipeline designs elaborated in Section V. On top of that, the winners for each interval distance $i \in I$ are rank-based selected by wire layer $l \in L$ and captured in the format shown by the examples in Table 3. The table is used later in the Crossover operation to find the optimum solution for the receiver part of a global signal. In this selection process, we update the fitness function to ensure that the trunk cell strength $c \in C$ is equal to or smaller than the cell strength $c$ of the respective winner in Table 2. To search for the local optimum point, we can deploy the GA Mutation operation on the overall winner's Chromosomes. In the operation, only the repeater interval distance gene $i \in I$ of the Chromosome is mutated, both incrementally and decrementally, with a finer granularity of 5 um.
Based on the generated offspring's Chromosomes, the simple layouts are automatically generated, extracted, and analyzed through another round of the rank-based selection flow and Fitness Function to identify the winner, which is effectively the Pareto point of the latency versus repeater interval distance plot. This Mutation operation can be applied to the winners for different wire layers $l \in L$ to optimize the values in Table 2.
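The Mutation operation on the interval-distance gene can be sketched as follows. The 5 um step comes from the text; the search span of 20 um on either side of the winner (just inside the 25 um coarse grid) and the position of the interval gene as the third field of the string are assumptions made for illustration.

```python
def mutate_interval(chromosome, step_um=5, span_um=20):
    """Return offspring that differ from the parent only in the interval gene.

    The interval is assumed to be the third dot-separated field. Offspring
    are generated at +/- step_um increments up to +/- span_um (assumed span).
    """
    genes = chromosome.split(".")
    base = int(genes[2])
    offspring = []
    for delta in range(-span_um, span_um + 1, step_um):
        if delta == 0:
            continue  # skip the parent itself
        child = genes.copy()
        child[2] = str(base + delta)
        offspring.append(".".join(child))
    return offspring


children = mutate_interval("bfnln06.bfnln04.150.m4m5.1.1.bfnln04")
```

Each offspring would then go through the same layout generation, extraction, STA, and selection steps as the initial population.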
We note that the receiver interval distance $i_r$, which is the distance between the N-1 and N cells as shown in Fig. 1, depends on the total path distance $l_t$ from driver pin to receiver pin and on the trunk interval distance $i$ as $i_r = ((l_t - 0.5i) \bmod i) \times i$. In a SoC, different global signals can have different total path distances $l_t$ and thus different receiver interval distances $i_r$. Therefore, we can cross over the N-1 cell strength gene $c_r \in C$ of a chromosome string with the Chromosomes in Table 3 to further optimize the latency of the global signals. In practice, this crossover process takes place after the global signals have been implemented. By matching the wire layer pair $l \in L$ and approximating the receiver interval distance $i_r$ of each global signal to the trunk interval distance $i$ of the Chromosomes in Table 3, the N-1 cell strengths $c_r$ are crossed over through an engineering change order (ECO) process.
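The crossover lookup can be illustrated with a short sketch. It rests on two assumptions that are not confirmed by the text: the receiver interval is read as the length left over after removing the driver segment (0.5i) and all whole trunk intervals from the total path, and Table 3 is represented as a hypothetical nested dictionary from layer pair to interval distance to winning cell.

```python
def receiver_interval(l_t_um, i_um):
    """Leftover distance before the receiver (assumed reading of i_r)."""
    return (l_t_um - 0.5 * i_um) % i_um


def crossover_cell(i_r_um, layer_pair, table3):
    """Pick the N-1 cell whose Table 3 interval is closest to i_r.

    table3 is a hypothetical structure: {layer_pair: {interval_um: cell}}.
    """
    candidates = table3[layer_pair]
    nearest = min(candidates, key=lambda i: abs(i - i_r_um))
    return candidates[nearest]


# Placeholder table content, not taken from the paper.
example_table3 = {"m4m5": {100: "cell_for_100um", 125: "cell_for_125um"}}
i_r = receiver_interval(1000.0, 200.0)
cell = crossover_cell(i_r, "m4m5", example_table3)
```

The selected cell name would then be applied to the implemented design through the ECO step described above.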
For a global signal with multiple fan-outs, we preprocess the signal by splitting it into multiple single-ended signals using a script, as illustrated in Fig. 4. With that, the Chromosomes found for single-ended signals can be re-used on them. For a data signal that has very high fan-out, the script will build a multi-stage tree, but this may increase the latency significantly. Thus, a very high fan-out signal, which is usually a multi-cycle path that does not need to be optimized for latency, is implemented using another technique, which is not covered in this paper. The look up tables (LUTs) in Fig. 2 are essentially Table 2 and Table 3, which we can use to generate the default repeater recipes for a design using a heuristic SoC implementation flow. In this paper, the repeater recipes refer to the configuration of the CAD APR tool for buffer insertion and the reference for custom circuit design of critical clock signals. With the heuristic flow, we can rectify the latency-per-distance K (ps/mm) values of Table 2 to cover the inaccuracy of the receiver modeling mentioned above and any routing overhead added by the CAD tool. One of the key overheads is the usage of lower metal layers to resolve routing congestion. An example of the heuristic results is shown in Fig. 5, where 18 thousand data points are extracted from one of the channel partitions in a 10-nm SoC. The global signals in the channel partition have been implemented using Chromosomes with L = m4 in Table 2. From the data, we observe that the minimum latency-per-distance K (ps/mm) at 1000 um distance correlates with the 786.6 ps/mm value in Table 2. However, for actual implementation, we should use the average value of the latency-per-distance K instead. From the heuristic data, after filtering out the outliers caused by routing congestion, the latency-per-distance K (ps/mm) for Chromosomes with L = m4 is rectified to 1000 ps/mm.
FIGURE 5. An example of heuristic data extracted from 18 thousand buffer repeater paths implemented using Chromosomes with layer L = m4. It is used to rectify the latency-per-distance K (ps/mm) value in Table 2.
V. FLOP REPEATER INSERTION METHODOLOGY
An overview of the proposed new hybrid meta-heuristic based flop repeater design flow is shown in Fig. 6. The flow is meant for a complex SoC that has so many global signals to be flop repeated that manual handling is no longer practical. First, we read the LUTs, as shown in Table 2, into the full chip (FC) floor-planning process. In the process, for each block-to-block path group $i$, for example in Fig. 7, we estimate the average horizontal $h_i$ and vertical $v_i$ distances of its Manhattan route and then calculate the latency using the latency-per-distance values in Table 2. With that, we can fine tune the relative placement of the blocks to meet their respective timing requirements.
In this paper, we use only the exact and heuristic algorithms for this step, as our meta-heuristic technique based FC floor-plan flow, which can be an enhancement to work [27], is still a work in progress. After that, we calculate the width of each channel $W_i = K_p H_i$, where $K_p$ is a constant of routing-track-per-um for the particular process node after subtracting the tracks allocated for power straps, and $H_i$ is the signal count at the highest routing density point on the channel. For the example in Fig. 7,
Based on the value of $W_i$, we can adjust the channel width accordingly. An alternative way to optimize for area is to alter the shape of the blocks to produce rectilinear channels, but at the cost of engineering effort. After the floor-plan initialization with either a virtual flat, black-boxed, or reduced netlist, we use the CAD tool to heuristically perform global route, push down signals into partitions as feed-throughs, and create the signal pins at the partition edges. We have developed an automation script to extract the initial pin locations and model them into LUTs. With the LUTs, we can maintain similar pin locations across different design versions and even for different design input formats such as flattened, reduced, or black-boxed netlists. On top of that, we optimize the pin placement LUTs using mean shift clustering, which is also a meta-heuristic algorithm. Mean shift is a simple iterative procedure that shifts each data point to the average of the data points in its neighborhood, as described by researchers [28]. We use the mean shift algorithm to reduce the effort in visual inspection of the pin clustering results and manual editing of pin constraints.
The proposed mean shift algorithm based meta-heuristic FC pin optimization flow is shown in Fig. 8. Basically, the flow optimizes pin locations through iterations of constraint generation, fast heuristic placement, quality checks, and then rank-based selection [25] against the previous best-known constraints. From the ranking, the better constraints are logged and used for pin placement in the next iteration. If the new constraints fail to produce a better result, the flow restores the previous best-known constraints and runs a mathematical calculation based constraint optimization script to clear any potential deadlock. The script filters out outlier data points such as wrong-edge pins and aligns the pin constraints between blocks that face each other directly. Then, the flow proceeds with the new pin constraints and iterates until the user-defined loop count is met. Fig. 9 shows a model of two pin clusters going through three iterations of a simple mean shift flow and eventually converging into a better distribution. In this flow, the input sets are the pin clusters, which are clustered based on the timing path and bus name on a single axis. The data points of the pin clusters are finite numbers. Therefore, instead of finding the highest density point, we calculate the simple mean of every pin cluster at once. Then, based on the new mean values, we regenerate the pin constraints, which we use to heuristically regenerate new pin placements. Through multiple iterations of the mean shifting and constraint regeneration cycle, we can converge the pin placement, and the final pin placement is extracted into pin constraint LUTs, which are used in the actual heuristic SoC pin placement flow.
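The simplified mean step, in which the simple mean of every pin cluster is computed at once along a single axis, can be sketched as below. The input layout (pairs of cluster name and coordinate) is a hypothetical representation of the extracted pin locations.

```python
from collections import defaultdict
from statistics import fmean


def cluster_means(pins):
    """Compute the simple mean coordinate of each pin cluster.

    pins: iterable of (cluster_name, coordinate_um) along one axis,
    a hypothetical representation of the extracted pin locations.
    """
    groups = defaultdict(list)
    for name, coord in pins:
        groups[name].append(coord)
    return {name: fmean(coords) for name, coords in groups.items()}


example_pins = [("busA", 0.0), ("busA", 10.0), ("busB", 5.0)]
means = cluster_means(example_pins)
```

In the flow described above, these means would feed the regeneration of pin constraints before the next heuristic placement run.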
The pin clusters are labeled with a specific naming convention, which combines the signal driver partition name, the receiver partition name, and the bus name. In the proposed pin constraint LUTs, each pin cluster is modeled with two variables, which represent its partition edge and offset range. For the pin cluster example illustrated in Fig. 10, the variables will be:
set <driving_partition>_2_<receiving_partition>_<bus_name>_edge 4.0
set <driving_partition>_2_<receiving_partition>_<bus_name>_offset {Xmin, Xmax}
Here, Xmin and Xmax are derived from $n$, $X_{mean}$, and $K_p$, where $n$ is the total number of pins in the cluster, $X_{mean}$ is the simple mean of the axis coordinate (in this example, the x-coordinate) of all the pins in the cluster, and $K_p$ is a process node dependent constant of routing-track-per-um.
In the proposed pin quality checks step shown in Fig. 8, we use a formula derived from the standard deviation of a pin frequency histogram. For the pin distribution of a cluster as illustrated in Fig. 11, we can bin the pins for every 5 um from $x_{mean}$ and plot a histogram of pin-number-per-bin versus pin distribution distance as in Fig. 12. From the histogram, we can calculate the standard deviation σ as in (1). If a distribution is more spread out, then the deviation will be bigger.
FIGURE 11. The model used to explain the formula used in the proposed simplified density quality checking step.
However, apart from the spread of the pins, a pin cluster with a larger pin count will have a larger standard deviation σ than a smaller cluster, even if both of them have an optimized pin placement. Therefore, to assess the quality of a pin placement, we calculate the average distance (um) per pin. Based on the 68-95-99.7 three-sigma rule of the normal distribution [29], the probability $P(x_{mean} - 3\sigma \le x \le x_{mean} + 3\sigma) \approx 0.9973$. Thus, we can calculate the average distance-per-pin for cluster $j$ with (2):
In the proposed pin quality checks as in (3), any cluster with a distance-per-pin $A_j$ of more than double $1/K_p$, where $K_p$ is a constant of routing-track-per-um, is considered non-optimized and is logged into a file for deadlock debugging. Most of the deadlocks are due to large congestion hot spots. To clear the deadlocks, we have to pause the cycle and manually edit some of the LUT values before continuing.
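The quality check can be illustrated with a short sketch. The exact expressions for σ and for the distance-per-pin are not reproduced here, so the code assumes one plausible form suggested by the three-sigma argument: nearly all pins fall within a span of 6σ, giving an average distance-per-pin of 6σ divided by the pin count. It also computes σ directly from the pin coordinates rather than from a 5 um histogram, which is a simplification.

```python
from statistics import pstdev


def distance_per_pin(coords_um):
    """Assumed form A_j = 6 * sigma / n, from the three-sigma span."""
    sigma = pstdev(coords_um)  # population standard deviation of the pins
    return 6.0 * sigma / len(coords_um)


def is_non_optimized(coords_um, k_p_tracks_per_um):
    """Flag a cluster whose distance-per-pin exceeds double 1/K_p."""
    return distance_per_pin(coords_um) > 2.0 / k_p_tracks_per_um


# Placeholder coordinates and K_p value, not taken from the paper.
tight_cluster = [0.0, 1.0, 2.0, 3.0, 4.0]
spread_cluster = [0.0, 10.0, 20.0, 30.0, 40.0]
```

Clusters flagged by `is_non_optimized` correspond to the entries that would be logged for deadlock debugging.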
As shown in the meta-heuristic flow in Fig. 8, to decide whether to keep or discard the current constraints, we calculate the total distance-per-pin A_total of all the clusters as in (4) and rank A_total against the golden total distance-per-pin A_golden of the previous best-known constraints as in (5). The current constraints are considered better if A_total is smaller than A_golden. In that case, the A_total value is logged as the new A_golden, and the current LUTs are logged as the new golden LUTs.
\[
\text{Golden LUT} =
\begin{cases}
\text{Current LUT}, & A_{total} \le A_{golden} \\
\text{Golden LUT}, & A_{total} > A_{golden}
\end{cases}
\tag{5}
\]
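The keep-or-discard decision in (5) can be illustrated with a short sketch. It assumes that A_total is the plain sum of the per-cluster distance-per-pin values, and it represents a LUT as a Python dictionary; both choices, along with the function names, are ours for illustration only.

```python
def total_distance_per_pin(a_values):
    """Assumed form of A_total: the sum of A_j over all clusters."""
    return sum(a_values)


def update_golden(current_lut, a_total, golden_lut, a_golden):
    """Apply rule (5): keep the current LUT when A_total <= A_golden."""
    if a_total <= a_golden:
        return current_lut, a_total
    return golden_lut, a_golden


if __name__ == "__main__":
    golden_lut, a_golden = {"busA_edge": 4.0}, 0.60
    current_lut = {"busA_edge": 2.0}
    a_total = total_distance_per_pin([0.17, 0.20, 0.15])
    golden_lut, a_golden = update_golden(current_lut, a_total, golden_lut, a_golden)
    print(golden_lut, a_golden)
```

Because only the best-known LUT and its score are carried between iterations, the loop can be stopped after any user-defined number of cycles without losing the best result found so far.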
After the flow finishes the user-defined number of loops, the best-known LUTs logged are ready for deployment in the actual FC pin placement. The LUTs are reusable across different register transfer level (RTL) versions. Note that the proposed mean-shift-based pin optimization flow focuses on the optimization of bus pins. For the remaining pins, such as single-bit, source-synchronized, high-fan-out, tied-high/low, and floating pins, the placement constraints are handled using the conventional global-route-based SoC design planning methodology.
After pin placement optimization as shown in Fig. 6, we proceed with the initial flop repeater (pipeline) insertion flow. With a reasonably good bus pin grouping, we can reduce the logic error rate in the pipeline insertion and consequently minimize the engineering effort spent on analyzing and filtering the pipeline insertion results. As an overview, the initial pipeline insertion flow, as shown in Fig. 13, takes in three different LUTs, which will be explained later, as references to process the current FC floor-plan and eventually produces a LUT that contains the information needed for pipeline RTL modeling. Prior to the pipeline insertion step, the FC floor-plan should have channel partitions inserted, as shown in Fig. 14(b), with feed-through signals pushed down and pins properly grouped based on the pin constraints LUT. To enable a more accurate pipeline stage calculation, we need a pins-vs-frequency LUT from the RTL designer. Each line in the pins-vs-frequency LUT follows the format <path type={sync,async}>, <Bus name>, <Bit Count>, <Driving Pin Name>, <Receiving Pin Name>, <Driving Clock domain>, <Receiving Clock Domain>, <Driving Clock Freq.>, <Receiving Clock Freq.>, which models the timing and connectivity of a top-level signal. After a complete pins-vs-frequency LUT has been gathered, we use the algorithm in Fig. 15 to automatically design the pipeline. First, for each line in the LUT, the algorithm traces the connectivity to identify the physical blocks that the signal feeds through, including channel and functional blocks. Then, for each partition that the signal feeds through, it calculates the Manhattan distance, followed by the total signal latency, by referring to the latency-per-distance LUT as in Table 2, which is generated by the buffer repeater insertion flow in Section 4.
After that, depending on the <path type={sync,async}> info, the algorithm proceeds with the pipeline stage calculation if the signal is a sync (synchronized) path, or skips it if the signal is an async (not synchronized) path. During the pipeline stage calculation, the algorithm calculates the timing margin for each pipeline stage. With that, the algorithm can calculate the minimum number of pipeline stages needed for the signal in each partition it feeds through. After that, the flow writes the result into a pipeline stages LUT with the format <path type={sync,async}>, <Bus name>, <Bit Count>, <Driving Pin Name>, <Receiving Pin Name>, <Driving Clock domain>, <Receiving Clock Domain>, <Driving Clock Freq.>, <Receiving Clock Freq.> <b_1 N_b1> <b_2 N_b2> ... <b_i N_bi>, which is basically the combination of input line i from the pins-vs-frequency LUT with a string S_i that represents the feed-through sequence of the signal and the number of pipeline stages N_bi in each partition b_i ∈ B_i that the signal feeds through.
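The per-partition stage calculation can be sketched as below. This is only an illustration under our own assumptions, not the algorithm of Fig. 15: the latency-per-distance value is a single number in ps per um, the usable share of the clock period (the timing margin) is modeled by a hypothetical `usable_fraction` parameter, and the minimum stage count is taken as the smallest number of flops that keeps every wire segment within that budget.

```python
import math


def manhattan(p, q):
    """Manhattan distance in um between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def min_pipeline_stages(path_type, distance_um, latency_ps_per_um,
                        freq_mhz, usable_fraction=0.8):
    """Return the assumed minimum flop repeater count, or None for async.

    usable_fraction is a hypothetical timing-margin knob: the share of
    the clock period that wire latency may consume per stage.
    """
    if path_type == "async":
        return None  # async paths are skipped
    latency_ps = distance_um * latency_ps_per_um
    period_ps = 1e6 / freq_mhz
    budget_ps = period_ps * usable_fraction
    # k flops split the route into k + 1 segments; each must fit the budget.
    return max(0, math.ceil(latency_ps / budget_ps) - 1)


if __name__ == "__main__":
    d = manhattan((0, 0), (3000, 1000))  # entry and exit pins of a partition
    print(d, min_pipeline_stages("sync", d, 0.5, 1000.0))
```

In an actual flow, the latency-per-distance value would come from a LUT such as Table 2, and the frequency from the driving and receiving clock fields of the pins-vs-frequency LUT.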
Upon completion of all the signals listed in the pins-vs-frequency LUT, we use simple scripts to parse and edit the pipeline stages LUT to ensure that signals belonging to the same <Bus name> have the same pipeline stage count and feed-through path. As a sanity check, <Bit Count> is cross-checked against the number of lines in the output pipeline stages LUT for that <Bus name>. In addition, we check the output LUT against the special signal LUT, which is provided by the architect to highlight the critical signals that have stringent pipeline stage limits due to system performance requirements. The proposed script automatically scales down the number of pipeline stages of the critical signals to meet their respective design requirements and, at the same time, writes out special metal layer constraints for those signals to be loaded into APR for correct implementation. Any critical signals that cannot meet the architect's special requirements are flagged and logged into an error report file for debugging. Finally, the pipeline stages LUT is consumed by the flow shown in Fig. 16, in which the LUT is used to model the pipeline stages in RTL, which are then synthesized into a gate-level netlist and physically implemented using the conventional heuristic SoC convergence flow.
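The bus-level sanity checks described above can be sketched as follows. To avoid guessing the on-disk LUT syntax, the sketch assumes each LUT line has already been parsed into a hypothetical tuple of (bus name, bit count, feed-through stages), where the stages are a sequence of (partition, stage count) pairs; the function name and error labels are ours.

```python
from collections import defaultdict


def check_buses(records):
    """Return {bus_name: [problems]} for inconsistent buses.

    records: iterable of (bus_name, bit_count, stages), one per LUT line,
    where stages is a sequence of (partition, n_stages) pairs.
    """
    by_bus = defaultdict(list)
    for bus, bit_count, stages in records:
        by_bus[bus].append((bit_count, tuple(stages)))

    errors = {}
    for bus, rows in by_bus.items():
        problems = []
        # All bits of a bus must share one feed-through path and stage count.
        if len({stages for _, stages in rows}) > 1:
            problems.append("stage mismatch")
        # <Bit Count> must equal the number of lines found for the bus.
        if any(bit_count != len(rows) for bit_count, _ in rows):
            problems.append("bit count mismatch")
        if problems:
            errors[bus] = problems
    return errors


if __name__ == "__main__":
    lut = [
        ("dbus", 2, [("chanA", 1), ("blkB", 2)]),
        ("dbus", 2, [("chanA", 1), ("blkB", 2)]),
        ("cbus", 2, [("chanA", 1)]),
        ("cbus", 2, [("chanA", 2)]),
    ]
    print(check_buses(lut))
```

Buses reported by such a check would then be edited so that every bit follows the same path and stage count before the LUT is handed to RTL modeling.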
VI. RESULTS
To demonstrate the effectiveness of the proposed hybrid meta-heuristic buffer repeater insertion flow, we carried out an experiment on a taped-out 14nm SoC that had been implemented with only the conventional heuristic SoC flow. We applied the proposed GA technique to identify the Winner Chromosomes and prepared LUTs for the 14nm node. Based on the LUTs, we redid the buffer repeater insertion of its global signals using the same conventional heuristic SoC flow and then compared the new latency results with the original ones. The path-to-path latency comparison data is shown in Fig. 17, where we observe that the proposed technique further improved 43% of the total paths. Table 4 shows the effectiveness of the flow in stabilizing the pin locations across two RTL design versions: less than 9% of the pins moved beyond 250um, and most of the pins that moved more than 450um were moved intentionally.
We deployed the hybrid meta-heuristic based flop repeater insertion flow in Fig. 6 on a very complex 10nm SoC that has more than 100k global signals requiring flop repeaters. Table 5 shows the results of the flop and buffer repeater insertion after the default iteration of the APR flow, before involving engineers in debugging and optimization work. From this case study, we observed that the flow correctly inserted over 500 thousand flop repeaters, that less than 0.6% of the total paths needed further optimization, and that the worst negative slack (WNS) and total negative slack (TNS) were fixable with physical design optimization alone. In addition, the whole flop repeater insertion process, including flow development, was completed by a team of five engineers (two RTL designers and three physical designers) within two months, saving at least 30 man-months of effort compared to the previous project.
VII. CONCLUSION AND FUTURE WORK
In this paper, we presented a new buffer and flop repeater insertion methodology for a complex SoC based on a hybrid meta-heuristic technique. We used meta-heuristic algorithms on simplified design models to quickly find a near-optimum recipe, then used heuristic algorithms on realistic design samples to refine the recipe by statistically modeling the non-obvious overheads, and finally used exact plus heuristic algorithms to filter outliers and accurately apply the recipe to the actual designs. With this technique, we managed to insert repeaters into two SoCs in the 14nm and 10nm technology nodes with a ''good enough'' quality at low engineering cost and turn-around-time.
Besides repeater insertion, we are pursuing extensions of the hybrid meta-heuristic technique into other areas of complex SoC physical design, especially areas that require extensive engineering effort and time in searching for a converged solution, such as multi-hierarchical hard macro placement, full-chip clock distribution design, and multi-hierarchical floor-plan optimization for timing convergence.
