Evolutionary algorithms can outperform conventional simulated annealing placement as well as manual placement on metrics such as runtime, wirelength, pipelining cost, and clock frequency when mapping FPGA hard block intensive designs such as systolic arrays on Xilinx UltraScale+ FPGAs. Such designs can take advantage of the repeatable design organization of the arrays and the columnar arrangement of hard blocks such as DSPs and RAMs, coupled with cascade nearest-neighbor interconnect for systolic-friendly data movement. However, the commercial-grade Xilinx Vivado CAD tool is unable to provide a legal routing solution for such hard block intensive designs without tedious manual placement constraints. Instead, we formulate an automatic FPGA placement algorithm for these hard blocks as a multi-objective optimization problem that targets wirelength squared and maximum bounding box size metrics. We build an end-to-end placement and routing flow called RapidLayout using the Xilinx RapidWright framework. RapidLayout runs 5-6× faster than Vivado with manual constraints, and eliminates the weeks-long effort to manually generate placement constraints for the hard blocks. We also perform automated post-placement pipelining of the long wires inside each convolution block to target 650 MHz URAM-limited operation. SLR replication in RapidWright allows us to clone the placement and routing image for one SLR to the multi-SLR FPGA chip in a couple of minutes. When compared to a conventional simulated annealing placer, RapidLayout's evolutionary engine delivers up to 27% improvement in wirelength, 11% lower bounding box size, 4.2× reduction in runtime, and ≈7% reduction in pipeline registers for 650 MHz operation.
I. INTRODUCTION
Modern high-end FPGAs provide high compute density with a heterogeneous mixture of millions of classic lookup tables and a programmable routing network, along with tens of thousands of hard blocks. These hard blocks offer ASIC-like density and performance for functions such as DSP and on-chip SRAM access. For example, the Xilinx UltraScale+ VU11P is equipped with 960 UltraRAM blocks, 4032 Block RAM slices, and 9216 DSP48 blocks capable of operating at 650-891 MHz, frequencies that are typically unheard of with LUT-only designs. Furthermore, these hard blocks provide specialized nearest-neighbour interconnect for high-bandwidth, low-latency cascade data movement. These features make the device particularly attractive for building systolic neural network accelerators such as CLP [17], [18], Cascades [16], and Xilinx SuperTile [22], [23].
Exploiting the full capacity of FPGA resources, including hard blocks, at high clock frequency is challenging. The CLP designs presented in [17], [18] only operate at 100-170 MHz on Virtex-7 FPGAs and still leave DSPs unused. The Xilinx SuperTile [22], [23] designs run at 720 MHz, but leave half the DSPs unused and also waste URAM bandwidth by limiting access. The chip-spanning 650 MHz 1920×9 systolic array design for the VU11P FPGA [16] requires 95% or more of the hard block resources but fails routing in commercial-grade Xilinx Vivado run with high effort due to congestion. Manual placement constraints are necessary to enable successful bitstream generation, but this requires weeks of painful trial-and-error effort and visual cues in the Vivado floorplanner for correct setup. This effort is needed largely due to the irregularity and asymmetry of the columnar DSP and RAM fabric and the complex cascade constraints that must be obeyed for the systolic data movement architecture. Once the constraints are configured, Vivado still needs 5-6 hours of compilation time, making design iteration long and inefficient.
Furthermore, to ensure high-frequency operation, it becomes necessary to pipeline long wires in the design. Since timing analysis must be done post-implementation, we end up either suffering the long CAD iteration cycles, or overprovisioning unnecessary pipelining registers to avoid the long design times.
Given this state of affairs with the existing tools, we develop RapidLayout: an alternative, automated, fast placement approach for hard block designs. We compare the routing congestion maps produced by RapidLayout with Vivado in Figure 1 and note the ability to match low-congestion manual placement effort. It is important that such a toolflow address the shortcomings of the manual approach by (1) discovering correct placements quickly without the manual trial-and-error loop through slow Vivado invocations, (2) encoding the complex placement restrictions of the data movement within the systolic architecture in the automated algorithm, (3) providing fast wirelength estimation to permit rapid objective function evaluation of candidate solutions, and (4) exploiting design symmetry and overcoming irregularity of the columnar FPGA hard block architecture. Given this wish list, we use the Xilinx RapidWright framework for our tool. At its core, the toolflow is organized around a novel evolutionary algorithm formulation for hard block placement on the FPGA through multi-objective optimization of wirelength squared and bounding box metrics. Given the rapid progress in machine learning tools, there is an opportunity to revisit conventional CAD algorithms [1], including the one in this paper, and attack them with this new toolbox.
The key contributions of this work are listed as follows:
• We formulate a novel FPGA placement problem for tens of thousands of hard blocks as a multi-objective optimization using evolutionary techniques.
• We quantify the runtime, wirelength, maximum bounding box size, and clock frequency of the implemented design using state-of-the-art evolutionary algorithms such as NSGA-II and CMA-ES. We compare them against conventional simulated annealing (SA) as well as manual floorplanning.
• We develop an end-to-end RapidLayout toolflow using the open-source Xilinx RapidWright framework.
• We deliver 650 MHz+ URAM-limited operating frequency for the FPGA-optimized systolic array on Xilinx UltraScale+ VU11P and VU13P devices to show portability.
II. BACKGROUND
We first discuss the hard block intensive systolic array accelerator optimized for the Xilinx UltraScale+ FPGAs. Next, we discuss the Xilinx RapidWright framework for programming FPGAs through a non-RTL design flow. Finally, we review the classic NSGA-II algorithm and the state-of-the-art CMA-ES algorithm and discuss their suitability for our use case.
A. FPGA-optimized Systolic Array Accelerator
Systolic arrays [6], [7] are tailor-made for the convolution and matrix operations needed for neural network acceleration. They are constructed to support extensive data reuse through nearest-neighbor wiring between a simple 2D array of multiply-accumulate blocks. They are particularly amenable to implementation on the Xilinx UltraScale+ architecture with cascade nearest-neighbor connections between DSP, BRAM, and URAM hard blocks. We utilize the systolic convolutional neural network accelerator presented in [16] and illustrated in Figure 2. The key repeating computational block is a convolution engine optimized for the commonly-used 3×3 convolution operation. This is implemented across a chain of 9 DSP48 blocks by cascading the accumulators. Furthermore, row reuse is supported by cascading three BRAMs to supply data to a set of three DSP48s each. Finally, the URAMs are cascaded to exploit all-to-all reuse between the input and output channels in one neural network layer. Overall, when replicated to span the entire FPGA, this architecture uses 95-100% of the DSP, BRAM, and URAM resources of the high-end UltraScale+ VU37P device. When mapped directly using Vivado without any placement constraints, the router runs out of wiring capacity to fit the connections between these blocks. Since the exact same convolution block is replicated multiple times to generate the complete accelerator, it may appear that placement should be straightforward. However, due to the irregular interleaving of the hard block columns and the non-uniform distribution of resources, the placement required to fit the design is quite tedious and takes weeks of effort.
B. RapidWright
In this paper, we develop our tool based on the Xilinx RapidWright [8] open-source FPGA framework. It aims to improve FPGA designer productivity and design QoR (quality of result) by composing large FPGA designs through a pre-implemented, modular methodology. RapidWright provides high-level Java API access to low-level Xilinx device resources. It supports design generation, placement, and routing, and allows design checkpoint (DCP) integration for seamless inter-operability with the Xilinx Vivado CAD tool to support custom flows. RapidWright provides access to both the logical and physical netlist of designs, and offers a netlist editing facility that enables non-RTL design generation. It also provides access to device geometry information that enables the wirelength calculations crucial for tools that aim to optimize timing.
C. Evolutionary Algorithms
In this paper, we generate placements for the hard blocks automatically using evolutionary algorithms. These algorithms are a set of reliable gradient-free optimization methods that adopt the idea of Darwinian natural selection to solve problems that are otherwise hard to solve in polynomial time [14] .
They encode candidate solutions into genotypes, and guide the direction of evolution with a defined fitness function. At each iteration, parent candidates first generate offspring with crossover and mutation to explore the solution space. Then, the population is evaluated and selected by the fitness function. This process iterates until the fitness function converges and a desirable solution is found. We discuss the two algorithms explored in this paper:
1. NSGA-II: The Non-Dominated Sorting Genetic Algorithm (NSGA-II [2]) is a two-decade-old multi-objective evolutionary algorithm that has grown in popularity for Deep Reinforcement Learning [9] and Neural Architecture Search [11] applications. NSGA-II addresses multi-objective selection with non-dominated filtering and crowding distance sorting. Non-dominated, or Pareto-optimal, solutions are the candidates for which no other candidate is at least as good in every objective and strictly better in at least one [20]. Non-dominated sorting ranks solutions along the Pareto-optimal frontier using the crowding distance metric. NSGA-II alleviates the computational complexity of non-dominated sorting, and uses elitism to accelerate convergence and preserve good results. We see later in Section IV that NSGA-II consistently produces better wirelength for our placement problem.
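The non-dominated filtering step can be illustrated with a minimal sketch. It uses the standard Pareto-dominance test on two minimized objectives; the function names and the sample objective values are illustrative assumptions and are not taken from Opt4J or from RapidLayout.

```python
from typing import List, Tuple

# (wirelength_squared, max_bbox_size); both objectives are minimized.
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical objective values for four candidate placements.
    candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (11.0, 6.0)]
    print(pareto_front(candidates))
```

Here the candidate (11.0, 6.0) is dropped because (10.0, 5.0) is better in both objectives, while the other three trade one objective against the other and all stay on the front.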
2. CMA-ES: The Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) is a continuous-domain optimization algorithm for non-linear, ill-conditioned, black-box problems [5]. It is a state-of-the-art evolutionary algorithm that has been adopted in many applications, such as neural-network hyperparameter tuning [10]. CMA-ES neither uses gradients nor presumes their existence, which makes it feasible on noisy, non-smooth, or non-continuous problems where derivative-based methods fail. CMA-ES models candidate solutions as samples of an n-dimensional Gaussian variable with mean $\mu$ and covariance matrix $C_\sigma$. At each evolutionary iteration, the population is generated by sampling from $\mathbb{R}^n$ with the updated mean and covariance matrix. Here, crossover and mutation amount to adding Gaussian noise to the samples. Then, the candidates are evaluated and sorted according to fitness, and the top 25% are used to update the mean and covariance matrix for the next sampling. We use the high-dimension variant proposed in [15] for our placement challenge. It restricts the covariance matrix to its diagonal elements, which reduces space complexity from quadratic to linear and increases the learning rate. As we see later in Section IV, CMA-ES consistently delivers the fastest runtime and superior bounding box sizes.
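The sample, rank, and update cycle described above can be sketched with a deliberately simplified diagonal-Gaussian evolution strategy. This is not full CMA-ES: it omits the evolution paths and step-size control of the real algorithm and simply refits the mean and per-coordinate variance to the top 25% of each generation. The function name, the default parameters, and the sphere test function are assumptions made for illustration.

```python
import random
from typing import Callable, List


def diagonal_es(fitness: Callable[[List[float]], float], mean: List[float],
                sigma: float, population: int = 40, iterations: int = 50,
                seed: int = 0) -> List[float]:
    """Simplified diagonal-covariance evolution strategy (not full CMA-ES)."""
    rng = random.Random(seed)
    n = len(mean)
    mean = list(mean)
    std = [sigma] * n                 # one standard deviation per coordinate
    elite = max(2, population // 4)   # the top 25% drive the update
    best, best_fit = list(mean), fitness(mean)
    for _ in range(iterations):
        # Sample a population from the current diagonal Gaussian.
        samples = [[rng.gauss(mean[d], std[d]) for d in range(n)]
                   for _ in range(population)]
        samples.sort(key=fitness)     # lower fitness is better
        top = samples[:elite]
        top_fit = fitness(top[0])
        if top_fit < best_fit:
            best, best_fit = list(top[0]), top_fit
        # Refit the mean and the diagonal variance to the elite samples.
        mean = [sum(s[d] for s in top) / elite for d in range(n)]
        std = [max(1e-12, (sum((s[d] - mean[d]) ** 2 for s in top) / elite) ** 0.5)
               for d in range(n)]
    return best


if __name__ == "__main__":
    sphere = lambda v: sum(x * x for x in v)
    start = [3.0] * 5
    result = diagonal_es(sphere, start, sigma=1.0)
    print(sphere(start), sphere(result))
```

Keeping only a vector of per-coordinate deviations is what makes the storage linear in the number of dimensions, which is the property the diagonal variant relies on for large genotypes.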
III. RAPIDLAYOUT
The challenge in mapping FPGA-optimized systolic arrays to the Xilinx UltraScale+ device is the placement of hard blocks at their non-uniform, irregular, columnar locations on the fabric while obeying the cascade data movement constraints. We first present our problem formulation, and then discuss how to embed it into an evolutionary algorithm.
A. Problem Formulation
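To tackle the placement challenge, we formulate the coarse-grained placement of RAMs and DSP blocks as a constrained multi-objective optimization problem. The placement of the rest of the logic, i.e. lookup tables (LUTs) and flip-flops (FFs), is left to Vivado's placer. The multi-objective optimization goal is formalized as follows.
$$\min \sum_{i,j} \left( (\Delta x_{i,j} + \Delta y_{i,j}) \cdot w_{i,j} \right)^2 \tag{1}$$
$$\min \left( \max_{k} \, \mathrm{BBoxSize}(C_k) \right) \tag{2}$$
subject to:
$$0 \le x_i < \mathrm{XMAX}, \quad 0 \le y_i < \mathrm{YMAX} \tag{3}$$
$$(x_i, y_i) \ne (x_j, y_j) \quad \text{for } i \ne j \tag{4}$$
if i is cascaded after j in the same column:
$$x_i = x_j, \qquad y_i = \begin{cases} y_j + 1 & i, j \in \{\mathrm{DSP}, \mathrm{URAM}\} \\ y_j + 2 & i, j \in \{\mathrm{BRAM}\} \end{cases} \tag{5}$$
In the equations above: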
• $i \in \{\mathrm{DSP}, \mathrm{RAM}, \mathrm{URAM}\}$ denotes the physical hard block on which a logic block is mapped.
• $C_k$ denotes a convolution unit k that contains 2 URAMs, 18 DSPs, and 8 BRAMs.
• $\Delta x_{i,j} + \Delta y_{i,j}$ is the Manhattan distance between the two physical hard blocks i and j.
• $w_{i,j}$ is the weight for wirelength estimation; here we use the number of nets (connections) between hard blocks i and j.
• BBoxSize() is the largest bounding box rectangle containing the hard block resources of a convolution unit $C_k$.
• $x_i$ and $y_i$ denote the RPM absolute grid co-ordinates of hard block i that are needed to compute wirelength and bounding box sizes [24].
Understanding the Objective Function: We approximate routing congestion performance with squared wirelength (Equation 1) and critical path length with maximum bounding box size (Equation 2). These twin objectives try to reduce pipelining requirements while maximizing the clock frequency of operation. While this may seem an odd optimization target for our solver, we have observed cases where chasing wirelength² alone has misled the optimizer into generating large bounding boxes for a few stray convolution blocks. In contrast, optimizing for maximum bounding box alone was observed to be extremely unstable and caused convergence problems. Hence, we choose these two objective functions to restrict the spread of programmable fabric routing resources, and to reduce the length of the critical path between hard blocks and the associated control logic fanout.
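As a rough illustration of how the two objectives could be evaluated for one candidate placement, the sketch below computes both from a dictionary of block coordinates. Two points are assumptions rather than details given in the text: the square is applied to the weighted Manhattan distance of each connection, and the bounding box size is measured as width plus height. The block names and coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) grid coordinates of a placed hard block


def wirelength_squared(pos: Dict[str, Coord],
                       nets: Dict[Tuple[str, str], int]) -> int:
    """Sum of squared weighted Manhattan distances over connected block pairs."""
    total = 0
    for (i, j), weight in nets.items():
        dx = abs(pos[i][0] - pos[j][0])
        dy = abs(pos[i][1] - pos[j][1])
        total += ((dx + dy) * weight) ** 2   # assumed form of the squared term
    return total


def max_bbox_size(pos: Dict[str, Coord], units: List[List[str]]) -> int:
    """Largest bounding box over all convolution units (width + height assumed)."""
    sizes = []
    for blocks in units:
        xs = [pos[b][0] for b in blocks]
        ys = [pos[b][1] for b in blocks]
        sizes.append((max(xs) - min(xs)) + (max(ys) - min(ys)))
    return max(sizes)


if __name__ == "__main__":
    # One hypothetical unit with a URAM, a BRAM, and a DSP.
    pos = {"u0": (0, 0), "b0": (3, 0), "d0": (5, 2)}
    nets = {("u0", "b0"): 2, ("b0", "d0"): 1}
    print(wirelength_squared(pos, nets))      # (3*2)^2 + (4*1)^2 = 52
    print(max_bbox_size(pos, [["u0", "b0", "d0"]]))  # 5 + 2 = 7
```

Both functions only need coordinates and connectivity, which is why candidate placements can be scored without invoking a full place-and-route run.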
Understanding Constraints: The optimizer only needs to obey three constraints. The region constraint in Equation 3 restricts the set of legal locations for the hard blocks to a particular repeatable rectangular region of size XMAX×YMAX on the FPGA. The exclusivity constraint in Equation 4 prevents multiple hard blocks from being assigned to the same physical location. The cascade constraint in Equation 5 is the "uphill" connectivity restriction imposed by the nature of the Xilinx UltraScale+ DSP, BRAM, and URAM cascades. For DSPs and URAMs, it is sufficient to place connected blocks next to each other. For BRAMs, the adjacent block of the same type resides one block away from the current location. This is because RAMB180 and RAMB181, which are both RAMB18 blocks, are interleaved in the same column.
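The three constraints can be checked with a short sketch such as the one below. It assumes that coordinates start at zero, that y counts sites within a column, and that a cascaded BRAM sits two sites above its predecessor while DSPs and URAMs sit one site above. The data layout and names are hypothetical.

```python
from typing import Dict, List, Tuple

Block = Tuple[str, int, int]  # (kind, x, y); kind is "DSP", "BRAM", or "URAM"


def is_legal(placement: Dict[str, Block], cascades: List[Tuple[str, str]],
             xmax: int, ymax: int) -> bool:
    """Check region, exclusivity, and cascade constraints for a placement.

    cascades holds (upper, lower) pairs where upper is cascaded after lower.
    """
    # Region constraint: every block lies inside the XMAX x YMAX rectangle.
    for _, x, y in placement.values():
        if not (0 <= x < xmax and 0 <= y < ymax):
            return False
    # Exclusivity constraint: no two blocks share a physical location.
    sites = [(x, y) for _, x, y in placement.values()]
    if len(set(sites)) != len(sites):
        return False
    # Cascade constraint: same column, directly "uphill" of the predecessor.
    for upper, lower in cascades:
        kind, xu, yu = placement[upper]
        _, xl, yl = placement[lower]
        step = 2 if kind == "BRAM" else 1   # BRAM sites are interleaved
        if xu != xl or yu != yl + step:
            return False
    return True


if __name__ == "__main__":
    placement = {"d0": ("DSP", 4, 0), "d1": ("DSP", 4, 1),
                 "b0": ("BRAM", 7, 0), "b1": ("BRAM", 7, 2)}
    print(is_legal(placement, [("d1", "d0"), ("b1", "b0")], xmax=10, ymax=10))
```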
1) Genotype Design for Evolutionary Algorithms: For the particular problem formulation discussed, brute force solutions are intractable. Naive brute force exploration would need to consider placement combinations of the form DSP! × BRAM! × URAM! = 9K! × 4K! × 1K! for the VU11P FPGA, a clearly absurd number. While analytical quadratic-programming (QP) approaches [4] are popular for FPGA placement, they are time consuming and require expensive legalization steps. Instead, we explore classic Simulated Annealing (SA) and evolutionary algorithms in this paper. In evolutionary algorithms, we must choose the genotype defining the solution candidates with care. We decompose the genotype into three sub-problems.
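To get a feel for the size of the brute-force space, the product of factorials can be estimated in log scale. The sketch below uses the VU11P block counts quoted in the introduction (9216 DSP48, 4032 BRAM, 960 URAM) and the log-gamma function to avoid building the huge integers; the function name is illustrative.

```python
import math
from typing import Iterable


def log10_orderings(block_counts: Iterable[int]) -> float:
    """Return log10 of the product of n! over all block counts."""
    # lgamma(n + 1) equals ln(n!), so dividing by ln(10) gives log10(n!).
    return sum(math.lgamma(n + 1) / math.log(10) for n in block_counts)


if __name__ == "__main__":
    digits = log10_orderings([9216, 4032, 960])
    print(f"about 10^{digits:.0f} orderings")
```

The result is a number with tens of thousands of decimal digits, which is why exhaustive enumeration is not an option.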
1. Distribution: Since the systolic array accelerator does not perfectly match the hard block resource capacity, we must choose how to divide up the resources across the multiple columns of hard block resources on the Xilinx UltraScale+ columnar architecture. When decoding the distribution genotype, we first normalize the distribution and then multiply by the total number of hard block groups to get the exact number of groups to be placed in each column.
2. Location: Once we choose the number of convolution blocks to place on a given resource column, we must determine which column to use. This is necessary as the columnar arrangement is irregular and non-uniform across the FPGA fabric. This is a permutation genotype that optimizes the order of elements without changing their values.
3. Mapping: Finally, we make a fine-grained position selection within the chosen column. Every hard block group corresponds to a value between 0 and 1 in the location genotype denoting its relative position from the bottom to the top of its column. Legalization is performed to resolve out-of-bound and overlapped placements.
We use the Composite Genotype to encode the three coupled sub-problems. In Figure 3, we visualize the genotype design, which consists of the three parts just discussed. During evolution, each part of the genotype is updated and decoded independently, but all parts are evaluated together.
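Decoding the distribution genotype (normalize, scale by the number of groups, then fit to column capacity) might look like the sketch below. The rounding rule, the way leftover groups are handed to columns with spare capacity, and all names are assumptions made for illustration; the text only specifies normalization, scaling, quantization, and clipping.

```python
from typing import List


def decode_distribution(genes: List[float], total_groups: int,
                        capacity: List[int]) -> List[int]:
    """Turn real-valued genes into an integer group count per column."""
    if total_groups > sum(capacity):
        raise ValueError("columns cannot hold all hard block groups")
    weights = [max(g, 0.0) for g in genes]
    if sum(weights) == 0:
        weights = [1.0] * len(genes)          # fall back to a uniform split
    scale = total_groups / sum(weights)
    ideal = [w * scale for w in weights]      # normalized, real-valued counts
    # Quantize downwards and clip to what each column can hold.
    counts = [min(int(v), cap) for v, cap in zip(ideal, capacity)]
    # Hand out leftover groups to the non-full column with the largest shortfall.
    for _ in range(total_groups - sum(counts)):
        open_cols = [i for i in range(len(counts)) if counts[i] < capacity[i]]
        target = max(open_cols, key=lambda i: ideal[i] - counts[i])
        counts[target] += 1
    return counts


if __name__ == "__main__":
    # Three hypothetical columns that each hold at most 3 groups.
    print(decode_distribution([0.25, 0.5, 0.25], total_groups=8, capacity=[3, 3, 3]))
```

In this example the middle column is clipped from 4 to 3 groups, and the displaced group moves to a neighbouring column so that all 8 groups still receive a column.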
2) Solution Legalization: Since the off-the-shelf evolutionary algorithm libraries [12], [13] we use operate on real quantities for location and position information, we must legalize these into integers for use with actual FPGA placement. We need two solution legalization processes: one for distribution and one for location. For the distribution genotype, this requires quantizing and clipping the resources to the maximum available in each hard block column while ensuring all required resources have a column assigned to them. For the location genotype, each float value defines the relative position in the column for one group of cascaded hard blocks, where 0 → 1 corresponds to the south → north layout direction in the column. During quantization, we take care to avoid overlapping assignments.
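One possible way to legalize the location genotype is sketched below: each relative position is scaled to an integer start row, and groups are then packed from south to north so that none overlap and all fit in the column. The packing strategy, the group and column heights, and the function name are illustrative assumptions, not the procedure used in the tool.

```python
from typing import List


def legalize_locations(genes: List[float], group_height: int,
                       column_height: int) -> List[int]:
    """Map relative positions in [0, 1] to non-overlapping integer start rows."""
    n = len(genes)
    if n * group_height > column_height:
        raise ValueError("groups do not fit in the column")
    max_start = column_height - group_height
    order = sorted(range(n), key=lambda i: genes[i])   # south to north
    starts = [0] * n
    next_free = 0
    for rank, i in enumerate(order):
        clipped = min(max(genes[i], 0.0), 1.0)         # fix out-of-bound genes
        wanted = round(clipped * max_start)
        # Latest start that still leaves room for the groups placed above.
        latest = column_height - (n - rank) * group_height
        starts[i] = min(max(wanted, next_free), latest)
        next_free = starts[i] + group_height
    return starts


if __name__ == "__main__":
    # Four hypothetical 9-row groups in a 49-row column.
    print(legalize_locations([0.25, 0.25, 0.0, 1.0], group_height=9, column_height=49))
```

Two groups that request the same relative position end up stacked directly on top of each other, which keeps them close to where the optimizer wanted them while removing the overlap.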
B. RapidLayout Design Flow
We now describe the end-to-end RapidLayout design flow built around the RapidWright framework. In Figure 4, we illustrate the different stages of the flow, the approximate runtime of each stage, and its interaction with RapidWright and Vivado.
The different stages of the tool are described below:
• (A) Netlist Replication: RapidLayout starts with a synthesized netlist of the convolution block with direct instantiations of the FPGA hard blocks. This convolution block DCP is a repeating unit that must be replicated across the FPGA while obeying cascading constraints to generate the full systolic array. Rather than generating placement constraints for the full-chip design directly, we instead identify the smallest repeating rectangular region in the FPGA that can fit multiple convolution blocks. We then rely on RapidWright to replicate (copy-paste) the layout to produce the full FPGA design. The columnar arrangement guarantees identical resource views as we climb the chip vertically and permits easy replication in that dimension. For example, on the VU11P device, the minimum replicating area is a rectangle with a width of all columns and a height of two clock regions, as shown in Figure 7. We are forced to span the rectangle across the entire FPGA width due to the irregular interleaving and non-uniform distribution of hard block columns on the FPGA chip.
• (B) Evolutionary Hard Block Placement: RapidLayout uses evolutionary algorithms, including NSGA-II and CMA-ES, to generate hard block placement for the minimum repeating rectangular region. As the evolutionary algorithm searches over multiple candidate placement solutions, we need a quick way of estimating the objective functions without invoking the entire Vivado toolchain. RapidWright really shines here by permitting fast access to the device position information necessary for computing wirelength and bounding box sizes. In a nominal Vivado-based flow, this would need at best a reverse engineering of physical distances, or at worst a slow and painful placement run for each candidate.
• (C) Placement and Site Routing: Once we produce placement constraints for the hard blocks, we must embed this information in the DCP netlist for Vivado. To ensure compatibility with Vivado, RapidLayout first places the hard blocks on the physical blocks called "sites", followed by "site routing" of the intra-site nets to site pins. These steps are necessary to ensure that the placed DCP can be correctly imported back into Vivado. We also replicate the placement of hard blocks in the minimum repeating rectangle until it occupies the SLR height. Site routing must be executed again for the replicated placements.
• (D) Post-Placement Pipelining: After finalizing placement, we can compute the wirelength for each net in the design and determine the amount of pipelining required for high-frequency operation. This is done post-placement [3], [19], [21] to ensure the correct nets are pipelined and to the right extent. The objective of this step is to aim for 650 MHz URAM-limited operation as dictated by the architectural constraints of the systolic array [16]. This requires the URAM→BRAM links and BRAM→DSP links to be pipelined in a configurable manner.
• (E) SLR Placement and Routing: Once the hard blocks are placed and the pipeline registers are inserted into the netlist, we call Vivado to complete LUT/FF placement and overall design routing. We also generate a timing report at this stage to confirm the operating frequency.
• (F) SLR Replication: Once a single SLR has been placed and routed, we can copy it across the three SLRs using RapidWright APIs. This is possible as each SLR is identical to the others. The SLR replication step requires 80 GB of heap space to accommodate the large placed and routed netlist footprint.
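The post-placement pipelining stage decides how many registers each long wire needs from its placed length. A very simple way to picture this is a distance budget per register-to-register hop, as in the sketch below. The budget value, the net lengths, and the function name are purely hypothetical; the actual tool may use a different delay model.

```python
from typing import Dict


def pipeline_stages(length: int, max_hop: int) -> int:
    """Registers needed so that no hop along the wire exceeds max_hop."""
    if max_hop <= 0:
        raise ValueError("max_hop must be positive")
    hops = -(-length // max_hop)      # integer ceiling of length / max_hop
    return max(0, hops - 1)           # n hops need n - 1 registers in between


def plan_pipelining(net_lengths: Dict[str, int], max_hop: int) -> Dict[str, int]:
    """Map each net to the number of pipeline registers to insert."""
    return {net: pipeline_stages(length, max_hop)
            for net, length in net_lengths.items()}


if __name__ == "__main__":
    # Hypothetical Manhattan lengths for one URAM->BRAM and one BRAM->DSP link.
    lengths = {"uram_to_bram": 25, "bram_to_dsp": 10}
    print(plan_pipelining(lengths, max_hop=12))
```

Because the lengths come from the finished placement, short nets receive no registers at all, which avoids the over-provisioning that a pre-placement estimate would require.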
For the VU11P device, RapidLayout accelerates the end-to-end implementation flow by ≈5-6× when measuring FPGA CAD runtime alone (≈one hour vs. Vivado's 5-6 hours). This does not include the weeks of manual tuning effort that is avoided by automatically discovering the best placement for the design.
C. Example Walkthrough
To illustrate how the different steps help produce the full-chip layout, we walk through the intermediate stages of the flow. We inspect three stages, going from a single block layout, to a repeating rectangle layout, and then on to a full chip layout. We develop our own visualization tool in Matplotlib to represent hard block locations and spatial relations of convolution units, matching the FPGA device view in Vivado. This enables rapid debugging and visual inspection of our generated layouts without resorting to long compilations in Vivado just to observe the physical locations of the hard blocks.
Single Block Layout: The floorplan of a single convolution block in isolation is shown in Figure 5, where the hard block columns and chosen resources are highlighted. We also highlight the extent of routing requirements between the hard blocks in gray. The location of the URAM, BRAM, and DSP columns is irregular and forces a particular arrangement and selection of resources to minimize wirelength and bounding box size. It is clear that a simple copy-paste of a single block is not workable due to this irregularity.
Single Repeating Rectangle Layout: We explored various options for rectangle sizes that can be replicated across the FPGA. The size needed to be chosen to match the resource balance between DSPs, BRAMs, and URAMs in the convolution block with the resource balance of the DSPs, BRAMs, and URAMs on the FPGA die. The objective of this matching is to maximize the utilization of hard block resources and thereby accommodate a large number of convolution blocks for high compute density. We identify a rectangle with a height of two clock regions and the width of the entire FPGA chip. Due to the irregularity of the hard block columns, smaller rectangle sizes were not possible, although a smaller rectangle would be much easier to route and replicate. In Figure 6 [...] different blocks. The objective of the optimization tool is to reduce the overlaps between the routing regions of the different blocks to reduce routing congestion, and to contain routing within small rectangles for high-frequency operation. The placement optimization explores various location combinations of the different hard blocks and the shape and spatial relationships of all convolution units to achieve the desired objectives. The resource utilization within the rectangle is 100% of URAMs, 93.7% of DSP48s, and 95.2% of BRAMs, which also holds for the entire chip after replication.
Fig. 7. Full-chip layout for the systolic array accelerator generated from a repeating rectangle of size two clock regions high and the full chip wide. After one replication we span one SLR region. We place and route this with Vivado, export DCP, reimport into RapidWright to clone across SLRs.
Full-Chip Layout: The entire chip layout is generated by copying only the two clock region tall repeating rectangle at the bottom of Figure 7. This is done in two steps. (1) First, the placements in the repeating rectangle are replicated once to fill up one SLR (SLR0 on the FPGA chip). At this stage, routing is not copied due to relocation limitations in RapidWright. Vivado is invoked to generate a placed and routed netlist for SLR0. (2) Second, the placement and routing from SLR0's implementation are replicated across the two other SLRs. Here, RapidWright permits the replication of the routed network as the dies are exactly identical. However, RapidWright requires 80 GB of heap space for this step.
IV. RESULTS
RapidLayout is implemented in Java to enable seamless integration with the RapidWright Java APIs. We use the Java library Opt4J [12] as the optimization framework for NSGA-II and Simulated Annealing (SA). CMA-ES is implemented with the Apache Commons Math Optimization Library [13] 3.4 API using the CMAESOptimizer class. NSGA-II naturally supports multi-objective optimization, while SA and CMA-ES are configured to use the product of the two objectives (wirelength² and bounding box size) as a single objective. We run our experiments on a 32-thread Intel i7 CPU with 32 GB of RAM running Ubuntu 16.10, with multi-threading enabled in Opt4J. We use Vivado Design Suite 2018.3 for control logic placement, inter-site routing, and timing analysis.
A. Performance and QoR Comparison
We compare the performance and QoR of evolutionary algorithms for the placement task against the conventional simulated annealing algorithm. For NSGA-II and CMA-ES, population size and mutation rate are chosen empirically and determine the convergence rate. CMA-ES is configured to update only the diagonal elements of the covariance matrix to accommodate our high-dimensional problem. For annealing, the cooling schedule has a great influence on performance. We experimented with multiple cooling strategies from linear to exponential, and chose the linear cooling schedule as it yields the best result. For each placement algorithm, we run the optimization to convergence 100 times and collect the results. All heuristics are initialized randomly.
In Figure 8a, we plot total runtime and final optimized wirelength and maximum bounding box size for the different solution combinations possible from the 100 runs. We see some clear trends here: (1) NSGA-II takes ≈1.5× more time than SA and delivers wirelength improvements of 27% (61% wirelength² benefit), but has 19% larger bounding box sizes. (2) CMA-ES takes 4.2× less time than SA with 10% average wirelength improvements and ≈11% smaller bounding box sizes. (3) An alternate NSGA-II method discussed later in Section IV-B2 with a reduced search space takes roughly 18% less time than SA, but delivers 20% wirelength improvements and only 18% larger bounding box sizes.
In Figure 8b, we see the convergence rate of the different algorithms when observing wirelength and bounding box sizes. NSGA-II clearly delivers better QoR for wirelength after 15k iterations, while CMA-ES delivers smaller bounding box sizes within a thousand iterations. Across 100 runs, wirelength reduction trends have a tighter spread, but bounding box scaling shows much noisier behavior, with the exception of CMA-ES. This (1) makes it tricky to rely solely on bounding box minimization and (2) suggests a preference for CMA-ES for containing critical paths within bounding boxes. Finally, in Table I, we compare average metric values across the 100 runs with manual placement. CMA-ES runs the fastest, in just over a minute, and has ≈2× less wirelength and a 13% smaller bounding box than manual placement.
B. Parameter Tuning for Evolutionary Algorithms
In this section, we discuss optimizations to the NSGA-II and CMA-ES algorithms to explore quality-runtime tradeoffs.
1) CMA-ES Parameter Sensitivity: CMA-ES has two configurable parameters: the initial coordinate-wise standard deviation σ and the population size. The optimal initial standard deviation σ is the estimated distance from the initial point to the optimum. A small σ value may result in early termination at a local optimum. As the optimal σ value is difficult to determine beforehand, we perform a sensitivity analysis of these two parameters.
We plot wirelength versus σ and population for CMA-ES in Figure 9. For each combination of parameters, the experiment is run 10 times and the minimum wirelength is taken. We observe that population has little effect on wirelength, while σ greatly affects the result quality. A small σ (less than 0.2) is more likely to get stuck in local minima, while a σ larger than 1.2 may lead to non-convergence. The optimal range of σ lies between these extremes, and the optimal σ value is likely to be tied to the density and interleaving pattern of hard block columns.
2) NSGA-II Reduced Genotype: As seen previously in Figure 8, NSGA-II achieves better quality of solution at the price of longer runtime. We seek to alleviate the runtime disadvantage of NSGA-II by pruning the genotype. As per the genotype design, the distribution and location genotypes take up a large portion of the composite genotype, and they demand special legalization steps. However, for high-utilization designs, distribution and location are less influential since resources are nearly fully utilized. Therefore, we reduce the genotype to mapping only for NSGA-II, and uniformly distribute and stack the hard blocks from bottom to top. As a consequence of this tradeoff, we observe only a 20% wirelength improvement and 10% runtime reduction over SA. In the convergence plot of Figure 8, we discover that the reduced genotype does not reduce the number of iterations needed; the bulk of the runtime improvement comes from reduced genotype decoding and legalization work.
C. Pipelining
Finally, we explore the effect of evolutionary algorithms on pipelining cost for one experiment with average behavior across trials. In Figure 10, we show the improvement in frequency as a function of the number of pipeline stages inserted along the long wires by RapidLayout. We note that CMA-ES is able to deliver the best improvements in frequency, with NSGA-II a close second. For the 650 MHz URAM-limited frequency target, CMA-ES only needs one stage of pipelining, while NSGA-II needs two stages and SA needs three stages. When counting the total number of registers, the CMA-ES result requires ≈7% fewer registers to achieve 650 MHz operation than annealing. The manual, RapidWright-incompatible VU37P floorplan, when mapped to the VU11P with pipelining, does not deliver 650 MHz operation. If we inspect the wirelength and bounding box metrics in Table I, we can clearly see that the smaller bounding box size possible with CMA-ES correlates with its superior clock frequency behavior.
D. Portability
RapidLayout is capable of delivering high-quality placement results on devices with different sizes, resource ratios, or column arrangements. We migrate the accelerator implementation to another device, the Virtex UltraScale+ VU13P, which has 33% more hard block resources. We map 640 convolutional units on the entire device, and achieve 100% URAM utilization, 93.75% for DSP48, and 95.2% for BRAM. We use CMA-ES for placement and 2-stage pipelining to get a 653.6 MHz operating frequency.
There is a 6% increase in runtime for a 33% increase in hard block resources, with the bulk of the saved runtime due to RapidWright's Module Instance placement feature. While the overall resources increase by 33%, placement and site-routing runtime remains nearly unchanged for a single SLR. SLR replication on the VU11P costs 3 min, while the same process costs 5 min on the VU13P. Therefore, the overall runtime on the larger chip barely increases.
V. CONCLUSIONS
We show how to outperform conventional simulated annealing and manual placement for hard block intensive systolic array designs on an FPGA with evolutionary algorithms on metrics such as runtime, wirelength, and clock period. We formulate the placement of hard blocks as an optimization problem that targets the twin objectives of wirelength and maximum bounding box. RapidLayout leverages the RapidWright framework to deliver an end-to-end placement and routing flow that generates the final netlist in ≈5-6× less time than Vivado and eschews manual placement effort. When compared to a simulated annealing baseline, the NSGA-II evolutionary algorithm achieves a 27% reduction in average wirelength while taking 1.5× longer, whereas CMA-ES speeds up convergence by 4.2× while maintaining a 10% average wirelength reduction. CMA-ES also delivers 650 MHz URAM-limited frequency of operation with ≈7% fewer registers than annealing through the use of RapidWright-supported post-placement pipelining of long wires. RapidLayout will be open-sourced to the community.
