This paper introduces ADAM, an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs, especially during the development process. ADAM has three key elements. The first is a novel approximate maximum common subgraph detection algorithm with linear time complexity to maximize sharing of resources in the merged design. The second is a prototype tool implementing this common subgraph detection algorithm for dataflow graphs derived from Verilog designs; this tool also generates the appropriate control circuits to enable selection of the original designs at runtime. The third is a comprehensive analysis of compilation time versus degree of similarity to identify the optimized user parameters for the proposed approach. Experimental results show that ADAM can reduce compilation time by around 5 times when each design is 95% similar to the others, and the compilation time is reduced from 1 hour to 10 minutes in the case of binomial filters.
INTRODUCTION
FPGA accelerators have been shown to be promising candidates to improve system performance and power efficiency for more than two decades [20, 33, 39]. However, the low productivity in developing FPGA-based applications compared to software development remains a huge obstacle that hinders widespread utilization of FPGA devices in mainstream systems [26].
One of the major challenges when designing applications on FPGA devices is the lack of efficient implementation, optimization and debugging facilities [27]. In particular, compiling a hardware design using standard design tools could involve a tremendous amount of time. This long compilation time limits the amount of implement/optimize-debug-edit cycles [23] per day and, as a consequence, hinders the productivity of the designers. During the optimization process, it is usual to have multiple versions [34] of an FPGA design project being experimented on actual hardware to test the functional correctness or to evaluate their accuracy. In particular, there exist many real-life applications that perform optimization by fine-tuning each version of the design, where the derivation of each version can be independent of each other and the results from one version are not required for deriving the other versions. For instance, Aubury and Luk [2] propose the use of binomial filters to implement and approximate Gaussian filtering on FPGA. The depth of the binomial filter structure can be adjusted in each version to determine the accuracy and the frequency response. Also, Targett et al. [42] carry out a precision and resolution exploration for shallow water equations for climate modeling.
This study reduces the mantissa bitwidth of variables in each design version to balance the tradeoff between precision and accuracy. Other work such as [1] changes the specialized filters in each version of a short read aligner to cater for different sequencing errors and genetic diversity. Such fine-tuning activities can improve the resulting design significantly, but can be time-consuming due to the repeated and prolonged process of placing and routing each design version.
To address the above design optimization challenge, we propose the use of an automatic merger that combines multiple versions of a design project into a single hardware implementation. The proposed merger can identify common computational kernels between versions, perform the necessary merging and generate a final hardware design in linear time. Instead of placing and routing each individual version every time separately as shown in Figure 1 , the developer can implement the generated hardware once and hence improve optimization productivity. We note that this approach is still useful for the scenario where the derivation of each version is dependent on a former one because developers can sometimes predict the possible parameters for the succeeding versions.
Furthermore, based on the statistics from ICFPT 2015 and 2016, 75% of the full papers in the application sections utilize less than half of the resources on FPGA. In other words, there remains adequate area on chip for insertion of extra logic with the proposed merger, especially when there are only minor discrepancies between each version of the design. We collect the statistic from ICFPT instead of other conferences because there are more application-based contributions in this conference. Finally, by relaxing the timing constraints, the proposed merger enables designers to focus on checking functional correctness in hardware which is faster and more accurate than software simulation.
This paper presents ADAM, an Automated Design Analysis and Merging approach to improve the design optimization process. Given i versions of a design, this approach first parses each of them and generates the respective dataflow graphs. Then a maximum subgraph algorithm is applied to determine the common computational kernels among them with linear time complexity. Common signals with different bitwidths across the versions are also analyzed and merged if possible in order to further minimize resource consumption. Finally, the user can select a particular version of design by providing appropriate control signals to the generated hardware implementation. The proposed approach can also be applied to merge unrelated designs targeting a large FPGA.
Since the users do not need to follow the low-level details of the generated hardware implementation, the proposed approach can be considered as an overlay [23] where a virtual programmable intermediate architecture is overlaid on top of the physical fabric as a way to address the productivity challenge.
The main contributions of this work are the following:
• A novel approximate maximum common subgraph detection algorithm with linear time complexity that maximizes the sharing of resources for merging of different design versions (Section 2).
• A prototype tool implementing a common subgraph detection algorithm for dataflow graphs derived from Verilog designs, which in addition generates the appropriate control circuits to enable selection of each design version at runtime (Section 3).
• A comprehensive analysis of compilation time versus degree of similarity to identify the optimized user parameters for the proposed approach (Section 4).
The next section presents the details of the proposed framework. We then describe the prototype tool and evaluate the performance of ADAM in Section 3. A comprehensive analysis of compilation time versus degree of similarity is given in Section 4, and related work is discussed in Section 5. We make conclusions in Section 6.
THE ADAM FRAMEWORK
This section provides a comprehensive overview of ADAM. Figure 2 illustrates the complete workflow of the design merger. To begin with, we consider the dataflow graphs for multiple versions of a design. As there are only minor discrepancies between each version, every dataflow graph will look remarkably similar. To identify the common subgraph between versions, a maximum common subgraph algorithm is launched and the corresponding nodes are merged, including the nodes of the common signal that can be different in bitwidth across different versions of design. Then the combined dataflow graph is directed to the compiler to generate a final hardware design.
The proposed approach has two novel aspects. First, ADAM supports merging of multiple design versions in linear time based on an approximate maximum common subgraph algorithm. Second, it covers merging of common variables that have different bitwidths across versions. In the following subsections, we describe each of the modules within the merger and their interactions in detail.
Dataflow Graph
A dataflow graph is a directed graph where the nodes represent the basic operations and variables of a design while the edges between them represent specific paths that data elements follow [44] . Every hardware circuit can be translated into a dataflow graph and vice versa, since every node in the graph corresponds to a hardware unit that can be allocated on the chip surface and every edge represents a wire between two units.
In the proposed approach, a dataflow graph is first extracted from each version $v$ of the hardware design, where $v = 0, 1, \ldots, i-1$, with a source-to-source compiler. Then, in order to recognize the common computational kernels, a maximum common subgraph algorithm is applied across the dataflow graphs $G_0, G_1, \ldots, G_{i-1}$ of all versions to identify the maximum number of connected hardware elements that can be merged and shared.
Maximum Common Subgraph Algorithm
Essentially, precise detection of maximum common subgraph (MCS) in random graphs is an NP-complete problem. Existing algorithms such as McGregor or Durand-Pasari suffer from prolonged execution latency because of their exponential time complexity [6] . Therefore, such algorithms are inappropriate for adoption in the proposed design merger.
Approximate Algorithm for MCS Detection - The dataflow graph extracted from hardware circuits carries certain properties that can aid in the quick search for MCS. In general, nodes are connected by a few edges since most operators have only one or two parents and one output, and the majority of the nodes are normally labels such as signal or port names. As a result, the extracted dataflow graph is so sparse that an approximate algorithm such as [37] (time complexity: O(n), where n is the number of nodes) can be used to obtain a set of MCS with decent quality.
To approximate the MCS between two graphs $G_a$ and $G_b$, a mapping $M_{ab}$ is constructed from the vertices $v_a \in V_a$ of graph $G_a$ onto the equivalent vertices $v_b \in V_b$ of graph $G_b$. In [37], Rutgers et al. present a greedy algorithm which uses best-first search to traverse the graphs $G_a$ and $G_b$. In each round of search, a vertex $v_a$ is heuristically chosen from $G_a$ so as to find a mapping to a vertex $v_b$ of $G_b$. For every possible $v_b$, the best candidate to choose from is determined by the following heuristic. To begin with, vertices in $V_a$ with fewer possible mapping candidates in $V_b$ are handled first, as the probability of selecting an incorrect vertex decreases with a lower number of candidates. After a vertex $v_a$ is chosen from $V_a$, the selection of the corresponding $v_b$ depends on the similarity of the neighbors of $v_a$ and $v_b$. Lastly, when a round of search completes, the vertex $v_a$ is finished and will not be selected again regardless of the search result.
To initiate the above MCS algorithm, the sets of inputs $I_a$ and outputs $O_a$ of $G_a$ are matched against the sets of inputs $I_b$ and outputs $O_b$ of $G_b$ respectively, and this constructs the initial common vertices in $M_{ab}$. Since every version of the same design is highly similar during the design optimization process, the I/O interface of each version must share some common signals such as the clock or reset input. After initialization, the above heuristic, denoted by Rutgers($G_a$, $G_b$, $M_{ab}$), is launched until all the vertices in $V_a$ are exhausted so as to return the MCS $M_{ab}$. For further information about the approximation algorithm, please refer to [37].
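The following Python sketch illustrates the flavor of such a port-seeded greedy mapping. It is not the algorithm of [37]: the graph encoding (dictionaries of labels and neighbor sets), the name `greedy_map`, the preference for vertices adjacent to the already mapped region, and the tie-breaking rules are all assumptions made for illustration, and the sketch makes no attempt to reach the linear-time bound.

```python
def greedy_map(labels_a, adj_a, labels_b, adj_b, seed):
    """Greedily extend a vertex mapping from graph A onto graph B.

    labels_*: dict vertex -> operation or signal label
    adj_*:    dict vertex -> set of neighbouring vertices (direction ignored)
    seed:     dict of already matched vertices, e.g. the common I/O ports
    """
    mapping = dict(seed)
    used_b = set(mapping.values())
    pending = set(labels_a) - set(mapping)

    def candidates(va):
        # Unused vertices of B that carry the same label as va.
        return [vb for vb in labels_b
                if vb not in used_b and labels_b[vb] == labels_a[va]]

    def score(va, vb):
        # Neighbour similarity: mapped neighbours of va whose image touches vb.
        return sum(1 for na in adj_a[va]
                   if na in mapping and mapping[na] in adj_b[vb])

    def priority(va):
        # Prefer vertices next to the mapped region, then fewest candidates.
        on_frontier = any(na in mapping for na in adj_a[va])
        return (not on_frontier, len(candidates(va)), str(va))

    while pending:
        va = min(pending, key=priority)
        pending.discard(va)  # visited once, whether or not a match is found
        cands = candidates(va)
        if cands:
            best = max(cands, key=lambda vb: (score(va, vb), str(vb)))
            if score(va, best) > 0:
                mapping[va] = best
                used_b.add(best)
    return mapping
```

In this sketch a vertex is accepted only if at least one of its mapped neighbors agrees with the candidate, which keeps the mapped region connected to the seeded ports.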
Obviously, there are several conditions to check before two nodes can be identified as common. First, both nodes need to implement the same operation, and they also have to operate on the same data type. Furthermore, associative operations such as (a + b) + c and a + (b + c) must be extracted before performing MCS detection, and commutative operations such as a + b and b + a must also be recognized as the same to minimize the area cost of the final implementation.
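One way to make these checks concrete is to canonicalize each expression before comparison. The sketch below is an illustration only: the tuple-based expression encoding, the operator sets, and the helper names `canon` and `same_node` are assumptions, not part of the ADAM tool.

```python
COMMUTATIVE = {"+", "*", "&", "|", "^"}
ASSOCIATIVE = {"+", "*", "&", "|", "^"}


def canon(expr):
    """Return a canonical form of a nested (op, operand, ...) tuple.

    Leaves are strings (signal names). Associative chains are flattened and
    commutative operands are sorted, so equivalent trees compare equal.
    """
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [canon(a) for a in args]
    if op in ASSOCIATIVE:
        flat = []
        for a in args:
            # Lift the operands of a nested node with the same operator.
            if isinstance(a, tuple) and a[0] == op:
                flat.extend(a[1:])
            else:
                flat.append(a)
        args = flat
    if op in COMMUTATIVE:
        args = sorted(args, key=repr)
    return (op, *args)


def same_node(op_a, type_a, op_b, type_b):
    """Two nodes are merge candidates only if operation and data type agree."""
    return op_a == op_b and type_a == type_b
```

With this encoding, `(a + b) + c` and `a + (b + c)` both reduce to the single node `("+", "a", "b", "c")`, while a non-commutative operator such as subtraction keeps its operand order.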
MCS Algorithm for Multiple Graphs - Since [37] can only determine the set of MCS between two dataflow graphs, the algorithm has to be launched iteratively until a final set of MCS for every version is obtained. The notation adopted in this section is as follows:
• $i$ is the total number of versions for a given design;
• $G_p$ is the dataflow graph for each version, where $0 \le p \le i-1$;
• $C$ is the set of MCS between every version of dataflow graphs;
• $\bar{C}$ is the negation of $C$, which contains all the uncommon subgraphs;
• iteration $j$ refers to the run of the algorithm that identifies the set $C_q$ with $q = j-2$, where $2 < j \le i$.
To identify the set of MCS between every version of dataflow graphs, an initial set of MCS $C_0$ is obtained by comparing $G_0$ and $G_1$. This newly calculated $C_0$, together with $G_0$ and $G_1$, is matched against $G_2$ to compute $C_1$. This process repeats $i-2$ times until the final $C_{i-2}$ is obtained. Note that the set $C_{i-2}$, which is equivalent to $C$, contains every set of MCS across all $i$ versions of design. Other nodes that are not in any of the MCS fall into the set $\bar{C}$.
Algorithm: Pseudocode of a single iteration of MCS approximation for multiple dataflow graphs. Assume that $C_0$ is already computed for consistent input data format.
Figure: A simple illustration of each iteration of the merging process.
In each iteration, $G_b$ is initialized with the graph to be matched against, while $G_a$ is composed of multiple dataflow graphs across versions, which can be conceptually considered as a single dataflow graph with numerous unconnected subgraphs. After that, the common input and output ports are mapped and inserted into $M_{ab}$, and Rutgers($G_a$, $G_b$, $M_{ab}$) is then executed to compute a partial MCS. Finally, the current MCS and the MCS from the previous iteration are joined to obtain a complete MCS. This final step is crucial because, with the heuristic of Rutgers et al., only one vertex in $G_a$ can be mapped to a given candidate in $G_b$. Yet in reality multiple vertices can be matched because $G_a$ is composed of dataflow graphs from every version. The MCS formulated in the previous iteration provides the information about the rest of the mapped vertices, and hence the union of $C_{j-3}$ and $M_{ab}$ contributes to a complete search result.
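The iteration described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pseudocode: each graph is reduced to a dictionary from vertex to label, the pairwise matcher `match_by_label` is a trivial stand-in for the heuristic of [37], and the function names are hypothetical. The loop index `k` is the index of the version being added, which corresponds to iteration $j = k + 1$ in the text.

```python
def match_by_label(labels_a, labels_b):
    """Stand-in pairwise matcher: pair equal labels, each B vertex used once."""
    free = {}
    for vb, lab in labels_b.items():
        free.setdefault(lab, []).append(vb)
    mapping = {}
    for va, lab in labels_a.items():
        if free.get(lab):
            mapping[va] = free[lab].pop(0)
    return mapping


def merge_versions(graphs, match_pair=match_by_label):
    """Return groups of (version, vertex) pairs that are common to all versions."""
    first = {(0, v): lab for v, lab in graphs[0].items()}
    # C_0: compare G_0 with G_1.
    groups = [(ka, (1, vb)) for ka, vb in match_pair(first, graphs[1]).items()]
    for k in range(2, len(graphs)):
        # G_a: all earlier versions viewed as one graph of unconnected parts.
        g_a = {(p, v): lab for p in range(k) for v, lab in graphs[p].items()}
        m_ab = match_pair(g_a, graphs[k])  # partial MCS against G_b = G_k
        joined = []
        for group in groups:
            # A B vertex is matched by only one member of a group; the group
            # from the previous iteration supplies the remaining members.
            hits = [m_ab[key] for key in group if key in m_ab]
            if hits:
                joined.append(group + ((k, hits[0]),))
        groups = joined
    return groups
```

Groups without a match in the new version drop out of the result, so after the last iteration every remaining group has one vertex per version.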
Since the number of versions $i$ is relatively small, repeating the O(n) heuristic once per additional version keeps the overall time complexity linear in the number of nodes $n$, which means the algorithm is acceptable for the proposed design merger.
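As a rough sanity check of the linear bound, the cost can be estimated under two assumptions that are ours rather than the text's: every version has at most $n$ nodes, and the pairwise heuristic costs at most $c \cdot m$ operations on graphs with $m$ nodes in total. When version $G_k$ is added, the composite graph holds at most $k \cdot n$ nodes, so:

```latex
\[
T(i, n) \;\le\; \sum_{k=1}^{i-1} c\,(k+1)\,n
        \;=\; c\,n\,\frac{(i-1)(i+2)}{2}
        \;=\; O(i^{2} n),
\]
\[
\text{and for a bounded number of versions } i:\quad T(i, n) = O(n).
\]
```

The quadratic factor in $i$ comes only from re-reading earlier versions in each iteration; it is a constant when $i$ is small and fixed.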
Final Dataflow Graph Generation - In order to generate the final hardware, which is logically identical to the originals, every MCS in $C$ is first combined to generate a merged dataflow graph. The inputs and outputs of the merged graph are reconnected to the nodes in $\bar{C}$ as well. Essentially, the inputs to the MCS in $C$ are multiplexed and the sel signal is fed to the output interface. To activate a particular version of the original design, an associated value is asserted at sel so that the correct signal from $\bar{C}$ is directed to the merged hardware. The outputs of the merged node, on the other hand, have to be connected back to the nodes of the versions that originally use the results. Figure 3 displays an example that explains the process of multiplexing.
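To give a feel for the multiplexing step, the sketch below emits Verilog text for one multiplexed input of a merged node. It is an illustration only: the helper `emit_input_mux` and the signal names in the usage example are hypothetical, and only the control signal name sel is taken from the text.

```python
def emit_input_mux(name, width, sources):
    """Return Verilog text for a mux that selects one source per version.

    name:    name of the wire that feeds the merged node
    width:   bitwidth of that wire
    sources: one source signal name per design version, in version order
    """
    sel_width = max(1, (len(sources) - 1).bit_length())
    lines = [f"wire [{width - 1}:0] {name};", f"assign {name} ="]
    for version, src in enumerate(sources[:-1]):
        lines.append(f"    (sel == {sel_width}'d{version}) ? {src} :")
    # The last version acts as the default branch of the conditional chain.
    lines.append(f"    {sources[-1]};")
    return "\n".join(lines)


print(emit_input_mux("add_in0", 8, ["a_v0", "a_v1"]))
```

Asserting a version number on sel then routes that version's own signal into the shared operator, while the other versions' signals are ignored.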
Analysis and Merging of Common Signals/Variables with Different Bitwidths
As our goal is to minimize the resource consumption for the combined implementation, we are also interested in merging the common signals based on their literal name even though they are different in bitwidth across various versions of design.
In the above MCS search, the common signals mentioned are considered to be non-identical because of their discrepancies in bitwidth. In order to merge these signals, the maximum bitwidth of every common signal is first obtained and the value is used to update every node that carries the same variable. After that, the same set of MCS algorithms is applied on $\bar{C}$, which identifies a new group of MCS $C'$ composed only of the newly formed common signals. To provide a clear explanation, another set of notations is adopted in this subsection and they are defined as:
• $C'$ is the set of newly obtained MCS which is composed of the common signals with different bitwidths;
• $\bar{C}' = \bar{C} - C'$ contains all the graphs in various versions that cannot be combined or merged with any of the methods proposed.
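The first step, widening every common signal to its maximum bitwidth, can be sketched as follows. The representation of a version as a dictionary from signal name to bitwidth and the function name `widen_common_signals` are assumptions made for this illustration.

```python
def widen_common_signals(versions):
    """Map each signal name present in every version to its maximum bitwidth.

    versions: list of dicts, one per design version, signal name -> bitwidth
    """
    common = set(versions[0]).intersection(*versions[1:])
    return {name: max(v[name] for v in versions) for name in sorted(common)}


widths = widen_common_signals([
    {"Y": 8, "clk": 1, "t": 4},   # version 0
    {"Y": 16, "clk": 1},          # version 1
])
print(widths)
```

Signals that appear in only some versions, such as `t` here, are left out and stay with the uncommon subgraphs.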
Assignment Nodes - Since every common signal in $C'$ is unique, extra hardware nodes are inserted in the dataflow graph during the merging phase of $C'$ to ensure correctness. This includes appending multiplexers, and partially selecting and sign/zero-extending the low-order bits when the signal appears as an assignment node. A signal or a variable is assigned when it is either attached to the output of an operator, or directly connected to another signal in the dataflow graph. Figure 4 shows an example of the above process when the output signals are attached to an addition operator in two separate versions. Initially, the common signal Y is 8 bits wide in version-0 and 16 bits wide in version-1. Then, the operator is merged and its output is partially selected and sign-extended. This enables the 16-bit signal to imitate an 8-bit signal and produces the same computational result.
As illustrated in the above example, the number of bits selected and the values appended depend on the signal type and the operators attached. Normally, sign extension is applied when the signal adopts signed number representation, while zero extension suffices for unsigned number representation.
Comparison Operator - Furthermore, for every common signal that is connected to a comparison operator (e.g. ==, <, ≤, >, ≥), partially selecting the low-order bits is required. This is due to the fact that comparison is based on left-to-right evaluation, and the sign/zero-extension process performed above will incur an incorrect comparison result if left unattended.
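The effect of partial selection and extension can be checked with plain integer arithmetic. The sketch below models a shared 16-bit adder whose result can imitate an 8-bit version; the helper names and the use of Python integers as two's-complement bit patterns are assumptions made for illustration.

```python
def part_select(value, width):
    """Keep the low-order `width` bits of a bit pattern."""
    return value & ((1 << width) - 1)


def zero_extend(value, from_width, to_width):
    """Zero extension keeps the narrow pattern and clears the upper bits."""
    return part_select(value, from_width)


def sign_extend(value, from_width, to_width):
    """Copy the sign bit of a `from_width` pattern into the upper bits."""
    value = part_select(value, from_width)
    if value >> (from_width - 1):
        value |= ((1 << to_width) - 1) & ~((1 << from_width) - 1)
    return value


def merged_add(a, b, narrow=8, wide=16, signed=True, use_narrow=True):
    """Shared `wide`-bit adder that can imitate a `narrow`-bit version."""
    total = part_select(a + b, wide)
    if not use_narrow:
        return total
    extend = sign_extend if signed else zero_extend
    return extend(part_select(total, narrow), narrow, wide)


# Unsigned: the shared adder wraps exactly like an 8-bit adder would.
assert merged_add(200, 100, signed=False) == (200 + 100) % 256
# Signed: 0x7F + 1 overflows to 0x80, which is extended to 0xFF80.
assert merged_add(0x7F, 1) == 0xFF80
# Before a comparison, only the low-order 8 bits should be examined.
assert part_select(merged_add(0x7F, 1), 8) == 0x80
```

The last assertion mirrors the comparison rule: the extended 16-bit pattern differs from the 8-bit pattern, so the low-order bits are selected again before comparing.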
Connection to the MCS in $C$ - Depending on the original structure of the dataflow graph, the inputs or outputs of each MCS in $C'$ can be connected to the previously formed MCS in $C$, or simply connected to the nodes in $\bar{C}'$. The following description summarizes all possible combinations and provides a detailed explanation for each scenario.
(1) $C'$ and $C$ Unconnected - In this scenario, every input and output of an MCS in $C'$ is connected to the uncommon subgraphs in set $\bar{C}'$. Similar to the multiplexing mechanism shown in Figure 3, the inputs are multiplexed and the sel signal is fed to the control interface. Also, the outputs of the merged graph must be connected back to the nodes in the uncommon graphs that use the calculated results.
(2) Outputs of $C'$ connected to Inputs of $C$ - This is the case where the outputs of an MCS in $C'$ are connected to any input nodes of an MCS in $C$. To link both MCS together, the outputs are first partially selected and sign/zero-extended, which is similar to the example in Figure 4. The multiplexers inserted in Section 2.2 are also slightly modified: the inputs of the original multiplexer in $C$ are disconnected so that the sign/zero-extended outputs and the unmodified outputs can connect to them.
(3) Inputs of $C'$ connected to Outputs of $C$ - In this case, the outputs of an MCS in $C$ can be connected to the inputs of an MCS in $C'$ directly, without the need to introduce extra hardware. This is because the partial-selection and bit-extension process during assignment can always guarantee that a common signal will carry a correct value.
Currently, merging of common signals with multi-bitwidth always takes place regardless of hardware cost, which may be less desirable for low-cost operations such as addition. In the future, we plan to extend the merging heuristic by considering multiplexing versus operator savings to further minimize the final resource consumption.
Optional Timing Optimization
Based on [44], the throughput of a hardware design mainly depends on the number of data items that the design can process in one cycle, and also on the maximum clock frequency that the design can support. Therefore, the proposed design merger provides an optional mechanism for users to perform certain re-pipelining if the dataflow graph is directed acyclic.
It is often hard to fulfill timing constraints when an output signal is connected to many hardware nodes. Since it is difficult for the synthesis and implementation tool to place the hardware nodes in close proximity, the resulting wire length will consequently increase. To address this challenge, the proposed approach can insert registers in a tree-like fashion such that each register drives only a limited number of outputs, if the timing optimization mechanism is activated by the users.
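A tree of registers with bounded fan-out can be sketched as follows. The function name `build_register_tree`, the generated register names, and the fan-out limit used in the example are hypothetical; the text does not state how ADAM chooses the limit.

```python
def build_register_tree(sinks, max_fanout):
    """Group sinks under registers so that no driver exceeds max_fanout.

    Returns (top, levels): `top` lists the nodes driven directly by the
    original output; `levels` lists, from the sinks upward, the groups of
    nodes driven by each inserted register.
    """
    if max_fanout < 2:
        raise ValueError("max_fanout must be at least 2")
    levels = []
    current = list(sinks)
    while len(current) > max_fanout:
        groups = [current[k:k + max_fanout]
                  for k in range(0, len(current), max_fanout)]
        levels.append(groups)
        # One new register per group becomes a load of the next level up.
        current = [f"reg_l{len(levels)}_{g}" for g in range(len(groups))]
    return current, levels


top, levels = build_register_tree([f"sink{k}" for k in range(10)], 4)
print(top)      # registers driven directly by the original output
print(levels)   # which sinks each register drives
```

Each inserted level adds one register stage on the path, so in a real design the other paths have to be balanced accordingly.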
Final Implementation Generation
After detecting and approximating the MCS and merging the common nodes with the methods proposed, the final dataflow graph, which is formed by $C$, $C'$ and $\bar{C}'$, can be supplied to the source-to-source compiler to generate the final implementation.
Usually, the compiler can produce the final hardware that is in the same language as the original design. However, depending on the needs of the designers, the compiler can be extended to produce the corresponding source code in another programming language such as Chisel [3] or Verilog to promote productivity. 
PROTOTYPE TOOL & BENCHMARKS
Prototype tool for Pyverilog
With the objective of improving designers' productivity during the optimization process of FPGA implementations, the key goal of ADAM is to merge every version of a design automatically while minimizing resource consumption. To demonstrate the feasibility and viability of ADAM, we prototype the proposed approach with Pyverilog [41] to support the functionality mentioned in Section 1 and Section 2 as a proof of concept.
Pyverilog is an open-source toolkit that provides register transfer level design analysis and code generation of Verilog HDL. Written in the Python programming language, Pyverilog incorporates multiple libraries such as a parser, a dataflow analyzer and a Verilog code generator that are useful to realize the proposed design merger. In our prototype, we use the given parser and dataflow analyzer to generate the dataflow graph for each version of a Verilog-based design. Then, we approximate and combine the MCS iteratively in linear time using the algorithm presented by Rutgers et al. [37]. To further optimize the tool efficiency, we also perform the search and the necessary merging of the common signals that are different in bitwidth during the above iterative MCS detection. Optional timing optimization is performed by analyzing the fan-out of each output. The final dataflow graph is processed by the Verilog code generator to produce a final, Verilog-based hardware description.
Moreover, the decision to implement the proposed approach based on Verilog is mainly a consideration for design productivity and tool portability. Verilog is one of the most-used design languages to describe a hardware structure at the register transfer level for FPGA-based implementations. In addition, Verilog and VHDL are usually used as an intermediate representation for open-source or vendor EDA tools in modern high-level synthesis and next-generation HDL research [41].
With Pyverilog extended to support the proposed approach, we run the design merger on HP EliteDesk 800 G2 Tower PC with Intel 
Benchmarks from VTR
Experimental Setup - We select several parameterizable Verilog designs from the VTR Benchmarks [28, 36] and automatically combine them with the prototype merger in order to understand its implications in terms of real-life applications. These applications include bgm, LU8PEEng and array of diffeq1, which provide macros or parameters for users to explore different hardware structures and to offer multiple versions of a single design. Table 1 illustrates the configuration details for these applications. In diffeq1, each version is obtained by adjusting the bitwidth of all signals, while for bgm and LU8PEEng the macros BITS and PRECISION are altered respectively so that different precision can be used to calculate the final results. As lowering the precision and changing the corresponding macros eliminate certain parts of the original circuit, the resulting dataflow graphs vary across different versions. Additionally, the adjustment of the macros changes the width of several signals, and hence creates common signals with different bitwidths for merging.
Finally, the generated hardware and original hardware are synthesized and implemented individually using Vivado with the default settings, and data about the area cost and compilation time are collected subsequently. Also, the maximum frequency for each implemented hardware is obtained by specifying different timing values in the constraint file and compiling separately until the tool fails to meet the timing constraint. Optional timing optimization is not activated for these benchmarks.
Experimental Results - For each application, the area cost and the maximum frequency for every version, including the combined ones, are displayed in Table 1. The percentage values are relative to available resources of the targeted FPGA device. As expected, the reduction in bitwidth of certain signals between versions contributes to a decrease in total resource consumption, and sometimes improves the maximum frequency of the implemented hardware. The generated hardware, on the other hand, shares similar properties in terms of area cost and timing when compared to the original implementations. The resources consumed are increased only by around 2% with reference to the Artix-7 AC701 FPGA, which is one of the smallest FPGAs in the Xilinx 7-series. Moreover, the maximum frequencies in diffeq1 and bgm are reduced by 10 to 12%, which is moderate given the functionality that ADAM provides. The maximum frequency supported by the merged hardware in LU8PEEng, on the contrary, is improved by around 4 times when compared to the original implementation. This unexpected result arises from a timing and fan-out issue similar to the one mentioned in Section 2.4. Originally, the register recResult in LU8PEEng is assigned by a wide multiplexer where the inputs are connected to repeating subsets of the same signal. Such assignment incurs a large fan-out and subsequently limits the maximum frequency of every version of implementation. Nevertheless, the bitwidth of recResult is defined with macro PRECISION and as a result, it is resolved as a common signal with different bitwidths during the merging process. The insertion of registers for zero-extension increases the number of driving gates for recResult and hence the fan-out is reduced, which in turn improves the overall timing. We note that such an improvement in the maximum frequency can be obtained by fan-out optimization [18], which can be applied in addition to dataflow graph merging.
Finally, Figure 5 shows the total compilation time of the generated hardware versus the sum of compilation time of each hardware application. The time recorded includes the duration of synthesis as well as implementation.
Case Study: Binomial Filters
This subsection presents a case study on one of the applications mentioned in Section 1: binomial filters. Such filters are efficient structures based on binomial coefficients to realize Gaussian filtering on FPGA. There are numerous possible variations of the basic binomial filter structure and therefore an analysis of the accuracy and frequency response is required when implemented on FPGA [2]. In particular, an analysis with actual hardware is important for such filters because it usually provides more accurate results, such as frequency response with respect to signal inputs, when compared to software simulation. An example of a binomial filter used in this experiment is shown in Figure 6. The structure of the binomial filter is derived from the polynomial $(1 + z^{-1})^n$, and it can be implemented with a cascade of adders with one of the inputs delayed by a register. Such a cascade is arranged in a pipeline structure where the depth is given by the parameter $n$. The quality of the approximation to a Gaussian filter depends on $n$, where the error is reduced to a small value for large filters. For further information about binomial filters, please refer to [2].
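The cascade of adders described above can be modelled in a few lines of Python. Each stage adds its input to a copy delayed by one sample, so the impulse response of $n$ stages consists of the binomial coefficients. This is a behavioral sketch with a hypothetical function name; it uses unbounded Python integers rather than the 64-bit datapath of the experiment.

```python
def binomial_filter(samples, n):
    """Pass samples through n cascaded stages of y[k] = x[k] + x[k-1]."""
    data = list(samples)
    for _ in range(n):
        delayed = 0  # contents of the stage register, reset to zero
        out = []
        for x in data:
            out.append(x + delayed)
            delayed = x
        data = out
    return data


# Impulse response of a depth-4 filter: the binomial coefficients of order 4.
assert binomial_filter([1, 0, 0, 0, 0, 0], 4) == [1, 4, 6, 4, 1, 0]
```

Changing the depth `n` is the only difference between versions in this model, which is why consecutive versions share most of their stages.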
Experimental Setup - We populate multiple binomial filters on the FPGA to allow parallel processing where each of them supports 64-bit calculation. The FPGA is populated with 32 filters so that the total resource consumption is around 40%. As mentioned above, the depths of the filters need to be fine-tuned to determine the accuracy of the binomial filters. Therefore, in this experiment, the depth of the filters is varied in each version while the target frequency is fixed at 100 MHz. Similar to the previous benchmarks, optional timing optimization is not activated in this case study, and the generated hardware and the original design versions are synthesized and implemented individually using Vivado with the default settings. Table 2 illustrates the configuration details for every version and also the corresponding implementation results.
Evaluation Results - The area cost of every version and of the combined hardware are shown in Table 2. Obviously, the reduction of depth contributes to a decrease in total area cost in each version, while the resource consumption of the combined hardware remains competitive compared to the originals. The LUTs and registers consumed are only increased by around 2% with reference to the target Artix-7 AC701 FPGA. This clearly showcases the efficiency of the MCS approximation algorithm since the bitwidth is set to be identical across versions, and the merging of common signals with different bitwidths is not executed in this case study.
The total compilation time, on the other hand, is presented in Figure 7 . Similar to the VTR benchmark, the time recorded includes the duration of synthesis as well as implementation. From the figure, it can be seen that the speedup in compilation time is around 5.9 times when compared to the combined compilation time of all the originals. In particular, the overall compilation time is reduced from 1 hour to around 10 minutes. Such a significant result is due to the increase in version counts and also the relatively high similarity between versions. It shows that the MCS algorithm proposed in Section 2 is able to identify most of the common vertices among all the dataflow graphs, and this contributes to a promising speedup in compilation time with only a minor increase in resource consumption. Also, it is expected that the overall compilation speedup will be more significant if more versions are supplied to the proposed design merger. Finally, the execution time for MCS detection and dataflow graph merging is only 1.43 seconds and is insignificant compared to the synthesis and compilation time.
EVALUATION
As the above experiment is based on the architecture of the applications provided by VTR Benchmarks and binomial filters, the variations or the degree of similarity between versions cannot be adjusted randomly. In this evaluation, we explore the relationship between compilation time and degree of similarity by varying the number of design versions i and its resource consumption on FPGA.
Evaluation Setup - We populate the FPGA with multiple diffeq1 modules so that 30%, 40% and 50% of the FPGA slices are initially occupied by each unmerged design version. After that, we introduce discrepancies between versions by changing the signal names literally. This enables a fine-grained adjustment of the degree of similarity between versions. Then all the design versions are applied to the prototype merger to generate a merged design, which is passed to Vivado subsequently to record the compilation time. The original versions are also synthesized and implemented separately in order to make a comparison.
Essentially, the compilation time of the unmerged designs is the sum of the synthesis and implementation times of all the versions. For the merged design, the compilation time simply refers to its own synthesis and implementation time. The degree of similarity is defined as the proportion of computational hardware that is common between versions. We note that this definition is not expressed in terms of dataflow graph nodes, because each node can represent a different type of hardware with a different area cost.
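The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the evaluation: the function names, the optional merge-overhead parameter and the timings are all hypothetical.

```python
from typing import Sequence


def unmerged_compile_time(per_version_seconds: Sequence[float]) -> float:
    """Total time when every version is synthesized and implemented separately."""
    return sum(per_version_seconds)


def compilation_speedup(per_version_seconds: Sequence[float],
                        merged_seconds: float,
                        merge_overhead_seconds: float = 0.0) -> float:
    """Ratio of the summed per-version time to the single merged run.

    merge_overhead_seconds is an assumed extra term for the time spent
    merging the designs before the merged run starts.
    """
    total_merged = merged_seconds + merge_overhead_seconds
    if total_merged <= 0:
        raise ValueError("merged compilation time must be positive")
    return unmerged_compile_time(per_version_seconds) / total_merged


# Hypothetical timings in seconds, not measurements from the paper:
# seven versions at 300 s each versus one merged run of 600 s.
versions = [300.0] * 7
print(compilation_speedup(versions, merged_seconds=600.0))  # prints 3.5
```

Because the numerator grows with every added version while the denominator is a single run, the ratio improves with the version count as long as the merged design still fits on the device.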
Evaluation Results - Figure 8 displays a summary of the experimental results, which demonstrates the scalability of the proposed approach. Note that 100% similarity refers to the scenario in which every design version is logically equivalent, and it indicates the maximum achievable compilation speedup. The soaring compilation time in the figure corresponds to placement and routing failures, in which the merged hardware is larger than the area of the given FPGA.
Originally, the total compilation time is linearly proportional to the number of versions, as indicated by the flat lines in the figure, whereas the compilation time of the generated hardware is independent of the version count. Since extra logic is only introduced when there are variations between versions, the synthesis and implementation time of the merged designs depends purely on the degree of similarity. Thus, the compilation time increases as the similarity decreases, until the FPGA runs out of resources for the generated hardware.
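The point at which the FPGA runs out of resources can be illustrated with a simple area model. The sketch below is an assumption made for illustration, not the model used by ADAM: it supposes that the common hardware is instantiated once, that each version adds only its non-common part, and that multiplexer and control overhead is negligible. The function names are hypothetical.

```python
def merged_area_fraction(area_per_version: float,
                         similarity: float,
                         versions: int) -> float:
    """Estimate the fraction of the device occupied by the merged design.

    Assumed model: the common part (similarity * area) is shared by all
    versions, and each version contributes its own non-common part.
    Multiplexer and control logic is ignored.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if versions < 1:
        raise ValueError("at least one version is required")
    shared = similarity * area_per_version
    unique = (1.0 - similarity) * area_per_version * versions
    return shared + unique


def fits_on_device(area_per_version: float,
                   similarity: float,
                   versions: int) -> bool:
    """True if the estimated merged design does not exceed the device area."""
    return merged_area_fraction(area_per_version, similarity, versions) <= 1.0


# Seven versions, each occupying 50% of the slices when unmerged.
for similarity in (1.0, 0.95, 0.85, 0.5):
    print(similarity, fits_on_device(0.5, similarity, 7))
```

Under this model, lower similarity or a larger per-version footprint pushes the estimate past the device capacity sooner, which matches the qualitative trend described above.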
It is also noticeable that the compilation time of the generated hardware is largely similar regardless of the number of versions when the versions are 85% to 100% similar. The compilation time is around 400 s to 1000 s, which is at least 3 times faster than separate compilation when there are seven versions of the same design. The compilation speedup can be further improved to around 5 times if the versions are 95% similar. Provided that there is adequate space on chip, the improvement is expected to be larger when more designs are merged. Finally, we note that although the performance numbers are based on diffeq1, parameters such as relative compilation speedup are important considerations for other applications when designers employ ADAM for FPGA design optimization.
RELATED WORK
The concept of supporting multiple versions is described in [43], where conservation cores, i.e. specialized processors that focus on reducing energy, are designed to run both past and future versions of code. However, the notion of versions in [43] is different from that in our work, since each version in the proposed approach is independent of the others. In other words, all the variations between versions are already known at the time when designers need to synthesize and implement the hardware.
Automated dataflow graph merging has been extensively studied in the context of runtime reconfiguration, high-level synthesis and instruction set extension. Fazlali et al. [9, 10] propose a datapath merging algorithm based on approximating the maximum weighted clique to shorten the bitstream and to reduce reconfiguration time. Voss et al. [44] present a cost-driven heuristic to minimize the area cost within an HLS application. Other work such as [5, 30, 35, 40, 45] focuses on resource sharing of multiple instruction set extensions (ISEs) for extensible base processors. For example, a path-based heuristic approach is presented in [5], in which a set of ISEs is transformed into a hardware datapath. The maximal subsequences problem is then applied to maximize area reduction. Zuluaga and Topham later extend the work by introducing latency constraints in the merging process [45]. Similarly, a heuristic based on the construction of a compatibility graph is proposed in [30], and a non-exact method is suggested to perform datapath merging. This heuristic is also employed in [35] to increase area reduction by accounting for the cost of multiplexers. Since merging latency is not a primary concern in most of the work mentioned (the time complexity ranges from exponential to polynomial), the corresponding algorithms are less appropriate for ADAM. A linear time heuristic is proposed in this paper to minimize the merging time, because reducing compilation time is an important objective of the proposed approach.
On the other hand, researchers have tackled the challenge of prolonged hardware implementation, optimization and debugging runtime in many different ways. For example, overlay architectures have been leveraged to offer faster compilation as well as improved programmability and runtime management. Recently, overlays with different granularity ranging from virtual FPGAs [4, 8, 15, 22], soft processors [16, 17, 32] to CGRA overlays [7, 11, 19, 26, 27] and GPU-like overlays [21] have been proposed.
In addition, some have addressed the challenge from a design methodology perspective. In [24, 25], the authors propose the use of pre-built hard macros and a modular design flow to minimize the placement and routing process. A similar approach is also presented in [14], where a library of precompiled macros is constructed for HLS. Finally, some researchers have devoted their efforts to low-level FPGA EDA tools to improve implementation speed. In [31], the authors accelerate the placement and routing process by making quality-runtime tradeoffs. The implementation runtime can also be improved by parallelizing the placement algorithm [13, 29]. Dynamic partial reconfiguration is leveraged in [38] and [12] to shorten runtime by effectively reducing the user design size.
Compared to these contributions, the proposed approach represents an orthogonal solution that improves designers' productivity by eliminating the need to repeatedly perform placement and routing for different design versions. It is possible to use ADAM together with the above optimization techniques to reduce compilation time, and such opportunities will be explored in the next section.
CONCLUSION AND FUTURE WORK
A new approach, ADAM, is proposed for merging multiple FPGA designs into a single design to support rapid functional evaluation. ADAM is based on a novel approximate maximum common subgraph detection algorithm with linear time complexity, which is developed to maximize the sharing of resources after merging designs. Preliminary results show that ADAM can reduce compilation time by 3 to 5 times. Further research includes studying additional optimizations such as adopting pre-placed macros for ADAM, extending ADAM to support multi-chip implementations, signal merging for floating-point numbers, and the inclusion of additional applications and incremental compilation to evaluate the proposed approach. Finally, we note that the concept of multiple graph analysis and merging can be applied to multiple designs at the dataflow graph level for many purposes. In this work, we focus on just one of these possibilities, which is to improve design productivity.
