Benchmarks play a key role in FPGA architecture and CAD research, enabling the quantitative comparison of tools and architectures. It is important that these benchmarks reflect modern designs which are large scale systems that make use of heterogeneous resources; however, most current FPGA benchmarks are both small and simple. In this paper we present Titan, a hybrid CAD flow that addresses these issues. The flow uses Altera's Quartus II FPGA CAD software to perform HDL synthesis and a conversion tool to translate the result into the academic BLIF format. Using this flow we created the Titan23 benchmark set, which consists of 23 large (90K-1.8M block) benchmark circuits covering a wide range of application domains. Using the Titan23 benchmarks and a detailed model of Altera's Stratix IV architecture we compared the performance and quality of VPR and Quartus II targeting the same architecture. We found that VPR is at least 2.7× slower, uses 5.1× more memory and 2.6× more wire compared to Quartus II. Finally, we identified that VPR's focus on achieving a dense packing is responsible for a large portion of the wire length gap.
INTRODUCTION
Open-source CAD flows, such as the VTR project [1], are crucial to FPGA research as open-source tools allow the FPGA architecture and CAD algorithms to be easily modified. To obtain accurate CAD or architecture results, however, we need more than an open-source CAD flow. It is essential that the benchmark designs used to exercise a new algorithm or architecture represent the current, and ideally the future, usage of FPGAs. Unfortunately, the most commonly used FPGA benchmark suites are currently composed of designs that are both much smaller than the largest commercial FPGAs, and much simpler than current industrial designs. The MCNC20 benchmark suite [2], for example, has an average size of only 2960 blocks, while the latest commercial FPGAs [3, 4] contain up to 2 million logic cells. Furthermore, half of the MCNC benchmarks are purely combinational, and none of the designs contain hard blocks such as memories or multipliers. The recently released VTR benchmark suite [1] is an improvement, but it still consists of designs with an average size of only 23,400 blocks, which would fill only 1% of the largest FPGAs. Only 10 of the 19 VTR designs contain any memory blocks and at most 10 memories are used in any design. In comparison, Stratix V and Virtex 7 devices contain up to 2,660 and 3,760 memory blocks respectively. Without larger benchmarks, key issues such as CAD tool scalability for very large designs cannot be investigated, and without more up-to-date benchmarks the validity of architecture studies is questionable.
There are many barriers to the use of state-of-the-art benchmark circuits with open-source tool flows. First, obtaining large benchmarks can be difficult, as many are proprietary. Second, purely open-source flows have limited HDL language coverage. The VTR flow, for example, uses the ODIN-II Verilog parser, which can process only a subset of the Verilog HDL; any design containing SystemVerilog, VHDL or a range of unsupported Verilog constructs cannot be used without a substantial rewrite. As well, if part of a design was created with a higher-level synthesis tool, the output HDL is not only likely to contain constructs unsupported by ODIN-II, but is also likely to be very hard to read and rewrite with only supported constructs. Third, modern designs make extensive use of IP cores, ranging from low-level functions such as floating-point multiply and accumulate units to higher-level functions like FFT cores and off-chip memory controllers. Since current open-source flows lack IP, all these functions must be removed or rewritten; this is not only a large effort, it also raises the question of whether the modified benchmark still accurately represents the original design, as IP cores are often a large portion of the design. In order to avoid many of these pitfalls, we have created Titan, a hybrid flow that utilizes a commercial tool, Altera's Quartus II design software, for HDL elaboration and synthesis, followed by a format conversion tool to translate the results into a form open-source tools can process. The Titan flow has excellent language coverage, and can use any unencrypted IP that works in Altera's commercial CAD flow, making it much easier to handle large and complex benchmarks. We output the design early in the Quartus II flow, which means we can change the target FPGA architecture and use open-source synthesis, placement and routing engines to complete the design implementation.
Consequently we believe we have achieved a good balance between enabling realistic designs, while still permitting a high degree of CAD and architecture experimentation. Our contributions include:
• Titan, a hybrid CAD flow that enables the use of larger and more complex benchmarks with academic CAD tools.
• The Titan23 benchmark suite. This suite of 23 designs has an average size of 421,000 blocks, and most designs are highly heterogeneous with thousands of RAM and/or multiplier blocks.
• A comparison of the quality and run time of the academic VPR placement and routing engine to the commercial Quartus II tool. This comparison helps identify how academic tool quality compares to commercial tools, and highlights several areas for potential improvement in VPR.
THE TITAN FLOW
The basic steps of the Titan flow are shown in Fig. 1. Quartus II performs elaboration and synthesis (quartus_map), which generates a Verilog Quartus Map (VQM) file. The VQM file is a technology-mapped netlist, consisting of the basic primitives in the target architecture; see Table 3 for primitives in the Stratix IV architecture. The VQM file is then converted to the standard Berkeley Logic Interchange Format (BLIF), which can be passed on to conventional open-source tools such as ABC and VPR [5, 6]. The conversion from VQM to BLIF is performed using our VQM2BLIF tool. At a high level, this tool performs a one-to-one mapping between VQM primitives and BLIF .subckt, .names, and .latch structures. To convert a VQM primitive to BLIF, the VQM2BLIF tool requires a description of the primitive's input and output pins. VPR also requires this information to parse the resulting BLIF; we store it in the architecture file for use by both tools.
VQM2BLIF can output different BLIF netlists to match a variety of use cases. Circuit primitives such as arithmetic, multipliers, RAM, Flip-Flops, and LUTs are usually modelled using BLIF's .subckt structure, which represents these primitives as black boxes. While this is usually sufficient for physical design tools like VPR, some primitives like LUTs and Flip-Flops can also be converted to the standard BLIF .names and .latch primitives respectively. This allows the circuit functionality to be understood by logic synthesis tools such as ABC. VQM2BLIF also supports more detailed conversions of VQM primitives, depending on their operation mode. This allows downstream tools, for instance, to differentiate between RAM blocks operating in single or dual port modes.
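The two output styles described above can be illustrated with a minimal Python sketch. This is not the VQM2BLIF implementation; the function names, the bit ordering of the LUT mask (bit k of the row index is taken as the value of input k), and the example model and net names are all assumptions made for illustration.

```python
def primitive_to_subckt(model, connections):
    """Emit a black-box BLIF .subckt line for a primitive.

    `connections` maps port names to net names (hypothetical names).
    """
    pins = " ".join(f"{port}={net}" for port, net in connections.items())
    return f".subckt {model} {pins}"


def lut_to_names(inputs, output, mask):
    """Emit a BLIF .names block listing the on-set of a LUT.

    Assumption: bit `row` of `mask` is the LUT output when the inputs,
    read with inputs[0] as the least significant bit, encode `row`.
    """
    n = len(inputs)
    lines = [".names " + " ".join(inputs + [output])]
    for row in range(2 ** n):
        if (mask >> row) & 1:
            bits = "".join(str((row >> k) & 1) for k in range(n))
            lines.append((bits + " 1").strip())
    return "\n".join(lines)


def ff_to_latch(d, q, clk, init=0):
    """Emit a rising-edge BLIF .latch for a simple flip-flop."""
    return f".latch {d} {q} re {clk} {init}"


and_gate = lut_to_names(["a", "b"], "y", 0b1000)
```

Here `and_gate` holds a two-line `.names` block for a 2-input AND, which a synthesis tool can interpret, whereas `primitive_to_subckt` keeps the primitive opaque.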
Some benchmarks make use of bidirectional pins, which cannot be modelled in BLIF. Therefore VQM2BLIF splits any bidirectional pins into separate input and output pins, and makes the appropriate changes to netlist connectivity.
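A minimal sketch of this splitting step is given below. The port representation and the `_in`/`_out` suffixes are hypothetical and chosen only for illustration; the actual naming used by VQM2BLIF is not specified here.

```python
def split_bidir_ports(ports):
    """Replace each 'inout' port with a separate input and output port.

    `ports` maps port name -> 'input' | 'output' | 'inout'.
    The '_in' / '_out' suffixes are an assumption of this sketch.
    """
    result = {}
    for name, direction in ports.items():
        if direction == "inout":
            result[name + "_in"] = "input"
            result[name + "_out"] = "output"
        else:
            result[name] = direction
    return result


def rewire(port, ports, drives_port):
    """Return the port a connection should use after splitting.

    Logic that drives a bidirectional pin attaches to the new output
    port; logic that reads the pin attaches to the new input port.
    """
    if ports.get(port) != "inout":
        return port
    return port + ("_out" if drives_port else "_in")


original = {"clk": "input", "data": "inout", "valid": "output"}
split = split_bidir_ports(original)
```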
It is also important to note that the sizes of benchmarks created with the Titan flow are not limited by the capacity of the targeted FPGA family. Quartus II's synthesis engine does not check whether the design will fit onto the target device, allowing VQM files to be generated for designs larger than any current commercial FPGA. The VQM2BLIF tool also runs quickly, taking less than 4 minutes to convert our largest benchmark.
These factors significantly ease the process of creating large benchmark circuits for open-source CAD tools. For example, converting an LU factorization benchmark [7] for use in the VTR flow [1] involved roughly one month of work removing vendor IP and re-coding the floating point units to account for limited Verilog language support. Using the Titan flow, this task was completed within a day, as it only required the removal of one encrypted IP block from the original HDL which accounted for less than 1% of the design. In addition, since over 68% of the design logic was used by the floating point units, the Titan flow better preserves the original design characteristics.
A concern in using a commercial tool to perform elaboration and synthesis is that the results may be too device or vendor specific to allow architecture experimentation. However, this is not necessarily the case. The Titan flow still allows a wide range of experiments to be conducted, as shown in Table 1. The ability to use tools like ABC to re-synthesize the netlist ensures experiments with different LUT sizes, and even totally different logic structures such as AICs [8], can still occur. RAM is represented as device independent "RAM slices" which are typically one bit wide, and up to 14 address bits deep. These RAM slices are packed into larger physical RAM blocks by VPR, and hence arbitrary RAM architectures can be investigated. Similarly, multiplier primitives (up to 36×36 bits) are packed into DSP blocks by VPR, allowing a variety of experiments. A simple remapping tool could also resize the multiplier primitives if desired. The structure of a logic element (connectivity, number of Flip-Flops, etc.) can also be modified without having to re-synthesize the design, and inter-block routing architecture and electrical design can both be arbitrarily modified.
Compared to VTR, the largest limitation is the inability to add support for new primitive types.
Another use of Titan is to test and evaluate CAD tool quality. Both post-technology mapping tools and logic resynthesis tools can be plugged into the flow.
Titan provides a front-end interface between commercial and academic CAD flows which is complementary to the back-end VPR to bitstream interface presented in [9].
Overall, the Titan flow enables a wide range of FPGA architecture experiments, supports the evaluation of new CAD algorithms on realistic architectures with realistic benchmark circuits, and allows more extensive scalability testing with larger benchmarks.
BENCHMARK SUITE
A wide range of benchmark designs were run through the Titan flow, with the goal of creating a set of large benchmarks representative of modern FPGA usage. Of the 46 benchmarks converted, the 23 largest from a diverse set of application domains were chosen to create the Titan23 benchmark suite described in Section 4.2. While the rest of this paper reports results primarily on the Titan23 benchmark suite, we are releasing the full set of 46 converted benchmarks. We believe that a range of benchmark sizes can be useful during the development stages of new FPGA CAD tools and architectures.
Benchmark Conversion Methodology
To convert a benchmark from HDL to BLIF, the design was first synthesized in Quartus II. For most designs this required no HDL modification, but some required replacing vendor/technology specific IP (e.g. PLLs, explicitly instantiated RAM blocks) with an equivalent Altera implementation, or working around obscure language features. Once the design was synthesized successfully, the resulting VQM file could be passed to VQM2BLIF.
In some cases, benchmark designs required more I/Os than were available on actual Stratix IV devices, preventing the designs from fitting in Quartus II. In these scenarios, the additional I/Os were connected to shift registers whose input/output was connected to a device pin. This is similar to the methodology described in [10].
Some IP blocks, such as the sld mux in some of Altera's JTAG controllers and older DDR memory controllers, are encrypted. These IP blocks were removed from the original HDL to avoid generating an encrypted VQM file. If possible, an equivalent unencrypted IP block was substituted. This was the case for some DDR controllers, since new Altera DDR controllers are not encrypted. Once encrypted IP was removed in the HDL, the design was re-synthesized and the new VQM file passed to VQM2BLIF. In general, only a small portion of the design logic had to be modified or removed.
Titan23 Benchmark Suite
The Titan23 benchmark suite consists of 23 designs ranging in size from 90K-1.8M blocks, with the smallest utilizing 40% of a Stratix IV EP4SGX180 device, and the largest designs unable to fit on the largest Stratix IV device. The designs represent a wide range of real world applications and are listed in Table 2. All benchmarks make use of some or all of the different heterogeneous blocks available on modern FPGAs, such as DSP and RAM blocks.
Comparison to Other Benchmark Suites
The characteristics outlined above make the Titan23 benchmark suite quite different from the popular MCNC20 benchmarks [2], which consist of primarily combinational circuits and make no use of heterogeneous blocks. Furthermore, the MCNC designs are extremely small. The largest (clma) uses less than 4% of a Stratix IV EP4SGX180 device, making it one to two orders of magnitude smaller than modern FPGAs.
Another benchmark suite of interest is the collection of 19 benchmarks included with the VTR design flow. These benchmarks are larger than the MCNC benchmarks, with the largest (mcml) reported to use 99.7K 6-LUTs [1]. Interestingly, when this circuit was run through the Titan flow, it used only 11.7K Stratix IV ALUTs (6-LUTs) after synthesis, indicating the differences between ODIN-II+ABC and Quartus II's integrated synthesis. Additionally, only 10 of the VTR circuits make use of heterogeneous resources, and none use dedicated carry arithmetic. The Titan23 benchmark suite provides substantially larger benchmark circuits that make more extensive use of heterogeneous resources.
Several non-FPGA specific benchmark suites also exist. The various ISPD benchmarks [11] are commonly used to evaluate ASIC tools, but are only available in gate-level netlist formats. This makes them unsuitable for use as FPGA benchmarks, since they are not mapped to the appropriate FPGA primitives. The IWLS 2005 benchmarks [12] are available in HDL format, and the Titan flow enables them to be used with FPGA CAD tools. However, the largest design consists of only 36K blocks after running through the Titan flow, which is too small to be included in the Titan23.
STRATIX IV ARCHITECTURE CAPTURE
Recall that to use the Titan flow (without re-synthesis), the architecture file must use the VQM primitives as its fundamental building blocks. The architecture file can describe an architecture built out of these primitives, which can be combined into arbitrary complex blocks with arbitrary routing. We chose to align the architecture closely with Stratix IV. This allowed us to compare computational requirements and result quality between VPR and Quartus II, and identify possible areas for improvement.
To enable this comparison, a detailed VPR-compatible FPGA architecture description was created for Altera's Stratix IV family of 40nm FPGAs [13] . The Stratix IV device family was selected over the larger, more recent Stratix V family because of the architecture documentation available as part of Altera's QUIP [14] . As detailed below, this process also identified some limitations in VPR's architecture modelling capabilities. Some of the modelled Stratix IV primitives are shown in Table 3 .
Floorplan
Stratix IV is an island style FPGA architecture, where the core of the chip is divided into rows and columns of blocks, and each column is built from a single type of block (LAB, DSP, etc.). The device aspect ratio and average spacing between blocks was determined by viewing an EP4SE820 device, the largest in the Stratix IV family, in the Quartus II floorplanner. An example floorplan is shown in Fig. 2.
Global (Inter-Block) Routing
The global or inter-block routing in Stratix IV uses wires 4 and 20 LABs long in the horizontal routing channels, and wires 4 and 12 LABs long in the vertical routing channels. There are approximately 70% more horizontal wires than vertical wires. In Stratix IV the long wires are only accessible from the short wires and not from block pins. Additionally, Stratix IV allows LABs in adjacent columns to directly drive each other's inputs. While VPR can model a mixture of long and short wires, it assumes the same configuration in both the horizontal and vertical routing channels. Additionally, VPR cannot model Stratix IV's short to long wire connectivity, or the direct-link interconnect between LABs in adjacent columns. As a result, the inter-block routing was modelled as length 4 and 16 wires (the average lengths), with both long and short wires accessible from logic block output pins. Unidirectional routing was used and the channel width was set to 300 wires.
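The modelled segment lengths follow from averaging the horizontal and vertical lengths of each wire class. The short Python sketch below works through this arithmetic; it assumes an unweighted average, which reproduces the lengths 4 and 16 stated above (the roughly 70% surplus of horizontal wires is not used as a weight here).

```python
# Wire lengths (in LABs) reported for the Stratix IV routing channels.
horizontal = {"short": 4, "long": 20}
vertical = {"short": 4, "long": 12}


def modelled_length(kind):
    """Unweighted average of the horizontal and vertical wire length."""
    return (horizontal[kind] + vertical[kind]) / 2


short_len = modelled_length("short")
long_len = modelled_length("long")
```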
Logic Array Block (LAB)
In Stratix IV, each LAB consists of 10 ALMs with 52 inputs from the global routing, and 20 feedback connections from the ALM outputs. Stratix IV uses a half-populated crossbar at the ALM inputs to select from the 72 possible input signals [15, 16]. The LAB has 40 outputs to global routing driven directly by the ALMs. Since no detailed information is available on the exact switch patterns used for the half-populated ALM input crossbars, they were modelled as shown in Fig. 3; otherwise the capture is accurate.
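To make the crossbar sizes concrete, the following Python sketch counts the switch points in a full and a half-populated LAB input crossbar. The alternating connection pattern is purely hypothetical (the real Stratix IV pattern is not public, as noted above), and the figure of 8 inputs per ALM is the ALM input limit quoted in the ALM description.

```python
GLOBAL_INPUTS = 52   # LAB inputs from global routing
FEEDBACK = 20        # feedback connections from ALM outputs
ALMS_PER_LAB = 10
ALM_INPUTS = 8       # inputs per ALM

num_signals = GLOBAL_INPUTS + FEEDBACK   # 72 candidate signals
num_sinks = ALMS_PER_LAB * ALM_INPUTS    # 80 ALM input pins


def half_populated_pattern(signals, sinks):
    """Hypothetical pattern: each ALM input sees every second signal."""
    return {
        sink: [s for s in range(signals) if (s + sink) % 2 == 0]
        for sink in range(sinks)
    }


pattern = half_populated_pattern(num_signals, num_sinks)
full_switches = num_signals * num_sinks
half_switches = sum(len(sources) for sources in pattern.values())
```

Under this pattern every ALM input can select from 36 of the 72 signals, so the crossbar uses half the switch points of a full crossbar while restricting which signals can reach which pin.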
The Stratix IV LAB also includes three chain-like structures (Carry Chain, Share Chain, Register Chain); however, VPR does not currently support dedicated routing chain structures within logic blocks. To ensure these structures were routable in VPR's packer, extra flexibility was added to allow each intermediate segment of a chain to drive a LAB output.
Half of the LABs in a Stratix IV device can also be configured as small RAMs, referred to as Memory LABs (MLABs), which were also modelled. The Fc_in and Fc_out values were set to 0.055 and 0.100 respectively, to match the global routing connectivity in Stratix IV.
Adaptive Logic Module (ALM)
The ALM was modelled as two lcell_comb primitives, each representing a 6-LUT and full adder, along with two dffeas primitives representing flip-flops. The modelled ALM connectivity is shown in Fig. 3. The Stratix IV ALM contains 64 bits of LUT mask, less than what is required by two dedicated 6-LUTs. VPR cannot model this restriction and assumes two 64-bit LUT masks; however, this extra flexibility is expected to have minimal impact on results, since few pairs of 6-LUTs can pack together in one ALM due to the limited number of inputs (8).
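The two constraints discussed above can be checked with a few lines of Python. This is a sketch of the arithmetic only, not of VPR's packer: the helper `fits_input_limit` and the signal names are hypothetical.

```python
LUT_INPUTS = 6
ALM_MASK_BITS = 64   # LUT mask bits available in a Stratix IV ALM
ALM_INPUT_PINS = 8   # input limit of the ALM

# Two independent 6-LUTs would need 2 * 2^6 = 128 mask bits.
mask_bits_needed = 2 * 2 ** LUT_INPUTS


def fits_input_limit(lut_a_inputs, lut_b_inputs, max_pins=ALM_INPUT_PINS):
    """True if two LUTs together use at most `max_pins` distinct signals."""
    return len(set(lut_a_inputs) | set(lut_b_inputs)) <= max_pins


lut_a = ["a", "b", "c", "d", "e", "f"]
shares_four = ["c", "d", "e", "f", "g", "h"]    # 8 distinct signals in total
shares_three = ["d", "e", "f", "g", "h", "i"]   # 9 distinct signals in total
```

Two 6-LUTs must share at least four inputs to satisfy the 8-input limit, which is why few such pairs pack together and why ignoring the mask restriction is expected to matter little.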
DSP Block
The Stratix IV DSP blocks are composed of eight mac_mult primitives (18×18 multipliers) and two mac_out primitives (accumulator, rounding, etc.). In the Stratix IV architecture a mac_out's inputs are only driven by mac_mults. However, similarly to the LAB/ALM chain structures, the mac_out's inputs must also be accessible from the global routing to pack successfully in VPR.
RAM Block
Stratix IV supports two types of dedicated RAM blocks, the M9K and the M144K, each with different maximum depth and width limitations, and supporting ROM, Single Port, Dual Port and Bidirectional Dual Port operating modes. VPR supports non-mixed width RAMs using the memory class directive, but does not provide native support for mixed-width RAMs, such as a rate conversion FIFO configured with a 1K×8 write port and 512×16 read port. To work around this limitation, mixed-width RAMs were modelled by elaborating all supported operating modes in the architecture file.
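The elaboration of mixed-width modes can be sketched as an enumeration of write/read port configurations that share the same storage capacity. In the Python sketch below, the capacity of 8192 bits is taken from the FIFO example above (1K×8 = 512×16), while the list of port widths is illustrative only and is not the actual set of M9K or M144K configurations.

```python
def port_configs(capacity_bits, widths):
    """Return (depth, width) pairs that exactly fill `capacity_bits`."""
    return [(capacity_bits // w, w) for w in widths if capacity_bits % w == 0]


def mixed_width_modes(capacity_bits, widths):
    """Enumerate (write_port, read_port) pairs with differing widths."""
    configs = port_configs(capacity_bits, widths)
    return [(wr, rd) for wr in configs for rd in configs if wr[1] != rd[1]]


# Illustrative widths; the FIFO example uses a 1K x 8 write port
# and a 512 x 16 read port over the same 8192 bits.
modes = mixed_width_modes(8192, [1, 2, 4, 8, 16, 32])
```

Each enumerated pair would correspond to one explicitly described operating mode in the architecture file, which is why the number of modes grows quickly with the number of supported widths.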
SIMPLIFIED ARCHITECTURE
As will be shown in Section 7.3, the detailed architecture capture described above performs very poorly in VPR's packer. As a result, a simplified architecture was created which makes some additional approximations. In the LAB, the half-populated crossbar used for the ALM inputs was replaced with a full crossbar. To avoid limitations related to placing LABs at MLAB locations, it is assumed that all LABs can be used as MLABs. For RAM blocks operating in mixed-width mode, the exact depth and width constraints were relaxed. While these relaxed constraints can potentially allow more RAM slices to pack into a RAM block than is architecturally possible, the RAM block will typically run out of pins before this occurs.
BENCHMARK RESULTS
In this section we use the Titan23 benchmark suite described in Section 4, in conjunction with the Stratix IV architecture capture described in Section 5. This allows us to compare the popular academic VPR tool with Altera's commercial Quartus II software. Using the Stratix IV architecture capture, VPR was able to target an architecture similar to the one targeted by Quartus II, allowing a coarse comparison of CAD tool quality.
Benchmarking Configuration
In all experiments, version 12.0 (no service packs) of Quartus II was used, while an early revision of VPR 7.0 (r1499) was used. The newer version of VPR was selected over VPR 6.0 since it provides substantial improvements to packing performance. During all experiments a hard limit of 48 hours run time was imposed; any designs exceeding this time were considered to have failed to fit. Most benchmarks were run on systems using Xeon E5540 (45nm, 2.56GHz) processors with either 16GB or 32GB of memory. For some benchmarks, systems using Xeon E7330 (65nm, 2.40GHz) and 128GB of memory were used; performance data collected on these machines is not directly comparable to the 16/32GB systems. For each benchmark, both tools were run on the same class of machines.
Since our Stratix IV architecture capture does not yet include timing information, both tools were run in a non-timing-driven mode. In VPR, this meant providing the -timing_analysis off command line option, while in Quartus II the Optimize Timing fitter option was set to off.
To ensure both tools were operating at comparable effort levels, VPR placement was also run with the -fast (inner_num = 1.0) option, which reduces placement effort to a level similar to Quartus II's STANDARD FIT mode. Quartus II supports multithreading, but was restricted to use a single thread to remain comparable with VPR.
Quartus II targets actual FPGA devices that are available only in discrete sizes. In contrast VPR allows the size of the FPGA to vary based on the design size. While it is possible to fix VPR's die size, we allowed it to vary, so that differences in block usage after packing would not prevent a circuit from fitting.
Quality of Results Metrics
Several key metrics were measured and used to evaluate the different tools. They fall into two broad categories.
The first category focuses on tool computational needs, which we quantify by looking at wall clock execution time for each major stage of the design flow (Packing, Placement, Routing), and also the peak memory consumption.
The second category of metrics focuses on the Quality of Results (QoR). We measure the number of physical blocks generated by VPR's packer, and the total number of physical blocks used by Quartus II. Another key QoR metric is wire length (WL). Unlike VPR, Quartus II reports only the routed WL and does not provide an estimate of WL after placement. If a circuit fails to route in VPR, we estimate its required routed WL by scaling VPR's placement WL estimate by the average gap between placement estimated and final routed WL (~40%).
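The estimate for unrouted circuits can be sketched as follows. The sketch assumes that the ~40% gap means routed WL is about 1.4× the placement estimate, and that the gap is an arithmetic mean of per-circuit ratios; both are our reading of the description rather than a documented procedure, and the function names are hypothetical.

```python
def average_gap(placement_estimates, routed_lengths):
    """Mean ratio of routed WL to placement-estimated WL.

    Computed over circuits that routed successfully.
    """
    ratios = [r / p for p, r in zip(placement_estimates, routed_lengths)]
    return sum(ratios) / len(ratios)


def estimate_routed_wl(placement_estimate, gap=1.40):
    """Scale a placement WL estimate by the average gap (assumed 1.4x)."""
    return placement_estimate * gap
```

For instance, under these assumptions a circuit with a placement estimate of 1000 units of wire that failed to route would be assigned an estimated routed WL of about 1400 units.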
Detailed and Simplified Architecture Comparison
Initial attempts to use the detailed architecture capture (Section 5) resulted in most circuits taking over 48 hours to pack in VPR. This prompted the creation of a simplified architecture (Section 6). To quantify the impact of these modifications, VPR's relative performance and QoR on these two versions of the Stratix IV architecture was investigated.
Of the 46 total benchmarks converted, only the smallest benchmark with 7867 blocks (and none of the Titan23 benchmarks in Table 2 ) was able to pack, taking over 47 hours. This was over 400× slower than when using the simplified architecture. The design also used 46% more LABs and had 19% higher estimated WL after placement.
The key component of VPR's poor packing performance on the detailed architecture is the half-populated crossbar feeding the ALM inputs, which leads to many failures of the packer's routability check.
Performance Comparison with Quartus II
Table 5 shows both the absolute run time and peak memory of VPR, and the relative values compared to Quartus II on the Titan23 benchmark suite, using the simplified architecture of Section 6. Quartus II's absolute run time and peak memory across the same benchmarks, while targeting Stratix IV, is shown in Table 6. VPR's run time is dominated by the packing step, which takes on average ~78% of the total run time on benchmarks that completed. In contrast, Quartus II has a more even run time distribution with placement taking the largest amount of time (49%), and with a significant amount of time (22%) spent on miscellaneous actions. For both tools, run time can be quite substantial on the larger benchmarks, taking up to 36.5 hours with Quartus II, and in excess of 48 hours with VPR. It is also clear that Quartus II's run time and memory usage are more consistent than VPR's, generally scaling with the design size.
Looking at the relative run time and peak memory of the two tools in Table 5 , we can draw some further insights. One of the most obvious is that VPR's packer is substantially slower (13.3×) than Quartus II's. Furthermore, the packer's run time is quite volatile, ranging from 4.0× slower in the best case to 34.0× slower in the worst case. A portion of this difference can be attributed to the increased flexibility of VPR's packer, which can target a broad range of architectures, while Quartus II's packer targets a finite set.
Quartus II spends a large portion of its run time during placement, and this is reflected when looking at the relative placement run time of the two tools. Here, VPR's placement engine is faster than Quartus II's, taking 49% less time. There could be several reasons behind this. VPR typically uses fewer LABs than Quartus II (see Section 7.5), which decreases the size of VPR's placement problem. Quartus II also enforces stricter placement legality constraints and uses more intelligent directed moves [17] .
The tools also show a fairly large gap in routing run time, with VPR taking 3.4× longer than Quartus II. In six of the benchmarks, VPR's router was either unable to route the design or total time exceeded 48 hours. One caveat on these results is that VPR is routing high fan-out nets like clocks, while Quartus II is not, since it places these on dedicated clock networks.
As to overall run time, for benchmarks it successfully fit, VPR takes 2.7× longer than Quartus II. However, it should be noted that this result is skewed in VPR's favour, since it does not account for benchmarks which did not complete.
Peak memory consumption is also much higher (5.1×) in VPR. This is quite significant and will often limit the design sizes VPR can handle. It is interesting to note that the largest benchmark that Quartus II will fit (bitcoin miner), uses approximately the same memory in Quartus II as the smallest Titan23 benchmark (neuron) uses in VPR.
Quality of Results Comparison with Quartus II
The relative QoR results for the Titan23 benchmark suite are shown in Table 4 . These results show several trends. First, VPR uses fewer LABs (0.8×) than Quartus II. While this reduced LAB utilization may initially seem a benefit (since a smaller FPGA could be used), this comes at the cost of WL as will be discussed in Section 7.6. Looking at the other block types, VPR uses 2.3× as many DSP blocks and 1.2× as many M9K blocks as Quartus II, showing that Quartus II is better able to utilize these hard block resources. Since only four circuits use M144K blocks in both tools, it is difficult to draw meaningful conclusions.
Routed WL is our best metric for comparing the overall quality of the VPR and Quartus II physical design tools. Somewhat surprisingly, the wire length gap is quite large, with VPR using 2.6× more wire than Quartus II.
Without access to Quartus II's internal packing and placement statistics, it is difficult to identify which step(s) of the design flow are responsible for this difference. However, in Table 4 it appears that circuits where VPR uses substantially fewer LABs than Quartus II often have the largest difference in WL.
Modified Quartus II Comparison
To further investigate this correlation between packing density and WL, we re-ran the benchmarks through Quartus II using several different combinations of packing and placement settings. The impact of these settings on the relative QoR between VPR and Quartus II are shown in Table 7 .
We investigated the effect of telling Quartus II to always pack densely, and the effect of disabling placement finalization. In default mode Quartus II varies packing density based on the expected utilization of the targeted FPGA, spreading out the design if there is sufficient space. Also by default, Quartus II performs placement finalization, where it breaks apart clusters by moving individual LUTs and Flip-Flops. This indicates that a significant portion of VPR's higher WL is likely due to packing effects, and principally due to a focus on achieving high packing density. We suspect that VPR's packer is sometimes packing largely unrelated logic together to minimize the number of clusters. This appears to be counterproductive from a WL perspective.
For example, consider a LAB that is mostly filled with related logic A, but which can accommodate an extra unrelated register B. During placement, the cost of moving this LAB will be dominated by the connectivity to the related logic A. This could result in a final position that is good for A but may be very poor for the extra register B (i.e. far from its related logic), as shown in Fig. 4a. If this is a common occurrence it could lead to increased WL. A better solution to the above scenario would have been to utilize additional clusters (pack less densely) to avoid packing unrelated logic together, as shown in Fig. 4b. Alternately, if the placement engine was able to recognize the competing connectivity requirements inside a cluster, it could break it apart, much like Quartus II's placement finalization.
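A toy calculation illustrates the effect. The Python sketch below uses half-perimeter wire length (HPWL) as the WL measure, which is a common placement metric but an assumption here, and all coordinates are invented for illustration.

```python
def hpwl(points):
    """Half-perimeter wire length of a net given its pin coordinates."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


# Invented grid locations of the logic that A and B connect to.
related_to_a = (2, 2)
related_to_b = (20, 15)

# Dense packing: one LAB holds A and the unrelated register B,
# and the placer puts it next to A's related logic.
dense_lab = (2, 3)
dense_wl = hpwl([dense_lab, related_to_a]) + hpwl([dense_lab, related_to_b])

# Less dense packing: B gets its own cluster, placed near its own logic.
lab_a = (2, 3)
lab_b = (20, 14)
sparse_wl = hpwl([lab_a, related_to_a]) + hpwl([lab_b, related_to_b])
```

In this invented scenario the dense packing costs 31 units of wire against 2 for the less dense packing, at the price of one additional cluster.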
CONCLUSION AND FUTURE WORK
First, we have presented Titan, a new hybrid flow that enables the creation of large benchmark circuits for use in academic CAD tools, supporting a wide variety of HDLs and a wide range of IP blocks. Second, we have presented the Titan23 benchmark suite built using the Titan flow. Titan23 significantly improves the state of open-source FPGA benchmarks by providing benchmarks across a wide range of application domains, which are much closer in both size and design style to modern FPGA usage. Third, we have presented a reasonable architecture capture of Altera's Stratix IV family, a modern high performance FPGA architecture. Finally, we have used this benchmark suite and architecture capture to compare the popular academic CAD tool VPR with a state-of-the-art commercial CAD tool, Altera's Quartus II. The results show that VPR is at least 2.7× slower, consumes 5.1× more memory and uses 2.6× more wire than Quartus II. Additional investigation identified VPR's focus on achieving high packing density to be an important factor in the WL difference.
The most obvious limitation of the current comparison between VPR and Quartus II is that both tools were run without timing optimization. In the future, we plan to add timing information to the Stratix IV architecture capture and evaluate the quality of these tools under real timing constraints.
The Titan23 benchmark suite represents a first step forward, but will need to be continually updated to keep pace with increasing FPGA design size and complexity. Therefore we would welcome additional benchmark contributions to cover larger design sizes and a wider range of applications.
Finally, given the substantial gap between VPR and commercial FPGA CAD tools, it is clear that there remains significant room for improvement in the run time, memory usage, and result quality of this academic CAD tool.
