To simulate large-scale biological models on a reconfigurable FPGA-based biochemical simulator, reducing the required hardware resources is essential. This paper proposes a method that combines common terms in the rate law functions appearing in biochemical models and generates a shared hardware module used for numerical integration. In this approach, two functions are combined at the tree structure level, followed by pipeline scheduling and arithmetic module binding. The evaluation shows that this approach reduces hardware resources by 31.4% on average at the cost of a 14.4% degradation in throughput.
INTRODUCTION
Mathematical modeling and simulation of biological processes are now essential for understanding cellular activity at the system level, based on knowledge accumulated in wet experiments. Although many biochemical simulators such as E-Cell [1] and Virtual Cell [2] have been developed, simulating practical models takes a long time and requires large computational resources. To cope with this problem, an FPGA-based biochemical simulator, ReCSiP, has been developed [3] [4]. ReCSiP features deeply pipelined hardware modules that compute reaction rates and are automatically generated from a given model description [5]. However, since biochemical models contain a wide variety of reaction rate functions, these modules require a large FPGA area and tend to restrict the extraction of reaction-level parallelism.
This paper presents a novel method for the automated design of hardware modules that compute reaction rates, focusing on hardware resource optimization. In this approach, the rate law functions in a model description are first converted into tree structures, and common sub-trees are found and combined. Pipelined hardware is then synthesized from the combined tree structure through arithmetic scheduling.
2. RECSIP
ReCSiP hardware is a PCI board carrying a Xilinx XC2VP70 and 8 chips of 18 Mbit QDR-1 SRAM. The simulator on the FPGA consists of Solver modules, which calculate changes in the concentrations of biochemical substances by solving the differential equations of the reactions. Multiple Solver modules are allocated in the FPGA and are connected to each other by a communication switch.
As illustrated in Fig. 1, a Solver consists of two modules, an Integrator and a Solver Core. The Solver Core calculates the velocity of a reaction, and the Integrator performs numerical integration using the Solver Core. Although the Solver Core requires many floating-point arithmetic operations, high throughput can be achieved by statically and completely scheduling the pipeline structure. A Solver Core is reaction-specific hardware that takes concentrations of substances and coefficients from three input ports (X for concentrations and the others for coefficients) and outputs the velocity. Biologists often want to run multiple simulations on the same target with different parameters, and this pipelined structure exploits such thread-level parallelism efficiently. In addition, reducing the hardware cost of Solver Cores frees FPGA area for additional Solver Cores that execute in parallel.
3. SOLVER CORE COMBINING ALGORITHM
Systems Biology Markup Language (SBML), an XML-based language commonly used for biological modeling, defines 33 frequently used rate law functions as predefined functions [6]. Our early work revealed that some of the predefined functions have similar structures and can be combined into a single Solver Core by sharing common arithmetic modules [5]. In this paper, a generic combining method that supports any rate law function, including non-predefined ones, is proposed.
Fig. 1. Solver module
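The idea of finding a shareable sub-tree in two rate law functions can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the tuple encoding of trees, the choice to match on operator structure only (ignoring leaf names, on the assumption that inputs are fed through the ports), and the two example expressions are all assumptions made here.

```python
COMMUTATIVE = {"+", "*"}

def shape(node):
    """Canonical operator structure of a tree; leaves become '_'.

    A node is a leaf name (str) or a tuple (op, left, right).
    Operands of commutative operators are ordered so a*b matches b*a.
    """
    if isinstance(node, str):
        return "_"
    op, left, right = node
    a, b = shape(left), shape(right)
    if op in COMMUTATIVE and repr(a) > repr(b):
        a, b = b, a
    return (op, a, b)

def size(s):
    """Number of operators in a shape."""
    return 0 if s == "_" else 1 + size(s[1]) + size(s[2])

def operator_subtrees(node):
    """Yield the shape of every operator-rooted sub-tree."""
    if isinstance(node, str):
        return
    yield shape(node)
    yield from operator_subtrees(node[1])
    yield from operator_subtrees(node[2])

def largest_common_subtree(tree_a, tree_b):
    """Largest operator structure that occurs in both trees, or None."""
    shapes_b = set(operator_subtrees(tree_b))
    common = [s for s in operator_subtrees(tree_a) if s in shapes_b]
    return max(common, key=size, default=None)

# Hypothetical rate laws: Vm*S/(Km+S) and k*A/(K+A) - k2*P
tree_a = ("/", ("*", "Vm", "S"), ("+", "Km", "S"))
tree_b = ("-", ("/", ("*", "k", "A"), ("+", "K", "A")), ("*", "k2", "P"))
best = largest_common_subtree(tree_a, tree_b)
print(best, size(best))  # the whole of tree_a (3 operators) is shareable
```

A real combiner would also have to insert multiplexers where the two trees diverge, which this sketch does not model.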
Arithmetic operators in the combined tree are temporally scheduled with the As Late As Possible (ALAP) scheduling algorithm. When the critical paths of the two original trees differ in length, the shorter one is adjusted to the longer one. However, as mentioned in Section 2, the input bandwidth of a Solver Core is limited to three input ports, so a pure ALAP schedule is sometimes infeasible. In this case, the order of data input has to be considered, since it can affect performance. Basically, data required by operators with earlier start times are input first. Then, priority is given to data that are fed to operators in the combined sub-tree. After the input order is decided, the final scheduling of the arithmetic operators is fixed by modifying the ALAP start times of some operators.
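For readers unfamiliar with ALAP scheduling, the unconstrained case can be sketched as below. The operator names and latencies are hypothetical, and the three-port input constraint and the subsequent adjustment of start times described above are deliberately not modeled.

```python
def alap_schedule(successors, latency):
    """Return (start_times, total_length) for an ALAP schedule.

    successors: dict mapping an operator to the operators that consume
                its result (missing key means the operator is an output).
    latency:    dict mapping every operator to its latency in cycles.
    """
    tail = {}

    def longest_tail(op):
        # Longest path, in cycles, from the start of `op` to the end.
        if op not in tail:
            tail[op] = latency[op] + max(
                (longest_tail(s) for s in successors.get(op, ())), default=0
            )
        return tail[op]

    total = max(longest_tail(op) for op in latency)
    # Starting any later than this would lengthen the critical path.
    start = {op: total - longest_tail(op) for op in latency}
    return start, total

# Hypothetical operators for (a*b)/(c+d), with assumed latencies in cycles
latency = {"mul": 4, "add": 3, "div": 10}
successors = {"mul": ["div"], "add": ["div"]}
start, total = alap_schedule(successors, latency)
print(start, total)  # {'mul': 0, 'add': 1, 'div': 4} 14
```

The adder has one cycle of slack, so ALAP delays it to cycle 1, which is the kind of freedom the input ordering step can then use or override.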
The final step is a simple optimization of arithmetic resource binding. Operators of the same type that are used in different states can share a single arithmetic module. When multiple operators of the same type are used in the same state, multiple arithmetic modules are generated.
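The binding rule above implies that the number of modules of each type equals the largest number of operators of that type active in any one state. A minimal sketch follows, assuming each operator occupies its module for exactly one state; the schedule shown is hypothetical.

```python
from collections import Counter

def modules_required(schedule):
    """Count arithmetic modules needed per operator type.

    schedule: iterable of (operator_type, state) pairs, one per operator.
    Operators of one type in different states share a module; operators
    of one type in the same state each need their own module.
    """
    per_state = Counter(schedule)  # (type, state) -> number of operators
    needed = {}
    for (kind, _state), count in per_state.items():
        needed[kind] = max(needed.get(kind, 0), count)
    return needed

# Hypothetical schedule: three multipliers, two of them in state 1
schedule = [("mul", 0), ("mul", 1), ("mul", 1), ("div", 2)]
print(modules_required(schedule))  # {'mul': 2, 'div': 1}
```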
EVALUATION AND DISCUSSION
To evaluate the effect of the combining method, the algorithm was implemented in C++ (gcc 3.4.6) and applied to 14 of the 33 SBML predefined rate law functions. The generated Verilog files were mapped onto the XC2VP70-6FF1517C, which is mounted on the ReCSiP-2 board, using the Xilinx ISE 8.2i tool. The frequency (MHz), latency (clock cycles), execution time (ns), throughput (mega reactions per second), and number of required slices of the non-combined Solver Cores for these 14 functions are summarized in Table 1. The bottom three lines show the distribution (best, worst, and average) for each evaluated item.
Performance Comparison
The number of two-function combinations of these 14 functions is $\binom{14}{2} = 91$. Table 2 shows the distribution of the implementation results for the 91 combined Solver Cores. The best frequency of 136.6 MHz is achieved by combined functions including the pair of UMAI and UNII. Meanwhile, the pair of UMAI and UCIR shows the worst frequency of 114.5 MHz. Compared to the results for non-combined Solver Cores in Table 1, the combined Solver Cores show a lower average frequency, which comes from the extra multiplexers needed to switch between the combined functions.
The latency of the combined Solver Cores ranges from 64 (UAII and UCII) to 80 (UUCI and UCIR) clock cycles, which is also worse on average than that of the non-combined Cores. When Solver Cores with different latencies are combined, the shorter one must be adjusted to match the longer one, resulting in performance degradation. The execution time of the combined Solver Cores is degraded as well. While the best combination of UAII and UCII achieves 468.6 ns, the same as the fastest non-combined UCII Core, the average execution time is degraded by 6.7%. The throughput of a Solver Core is determined by its frequency and its pipeline pitch, which is the interval between calculations for consecutive reactions. Since the pipeline pitch reflects the number of arguments of the corresponding function, throughput is affected when the combined functions have different numbers of arguments. The throughput of the combined Solver Cores is degraded by 14.4% on average.
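The relationship between these metrics can be sketched as follows. Both formulas are assumptions rather than definitions given in the text: execution time is taken as latency divided by frequency, and throughput as frequency divided by pipeline pitch. The pitch value in the example is hypothetical.

```python
def execution_time_ns(latency_cycles, frequency_mhz):
    """Time for one reaction to traverse the pipeline, in nanoseconds.

    Assumed relation: latency (cycles) / frequency (MHz) gives microseconds.
    """
    return latency_cycles / frequency_mhz * 1000.0

def throughput_mreactions(frequency_mhz, pipeline_pitch):
    """Mega reactions per second, assuming one result every `pipeline_pitch` cycles."""
    return frequency_mhz / pipeline_pitch

# 64 cycles at 136.6 MHz gives about 468.5 ns
print(round(execution_time_ns(64, 136.6), 1))
# Hypothetical pitch of 2 cycles at 120 MHz
print(throughput_mreactions(120.0, 2))
```

The first result is close to the 468.6 ns reported for the UAII and UCII pair, which suggests the assumed latency/frequency relation is consistent with the reported figures; the text does not state that pair's frequency explicitly.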
Hardware Costs
To evaluate the influence of function combining on hardware costs, we use a metric called the resource reduction ratio, defined as $\left(1 - \frac{S_{A,B}}{S_A + S_B}\right) \times 100$, where $S_A$, $S_B$, and $S_{A,B}$ are the numbers of slices for function A, function B, and the combined function of A and B, respectively. Fig. 5 shows the distribution of the resource reduction ratios achieved by the 91 combined functions, in which the combinations are sorted by resource reduction ratio along the horizontal axis. While the highest resource reduction ratio stands at 49.8% (UMR and UUCR), indicating that combining cuts almost half of the hardware cost, the worst ratio is -6.2% (UUCI and UAII), which means an increase in hardware. The average resource reduction ratio is 31.4%.
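The metric is straightforward to compute. The sketch below uses hypothetical slice counts chosen only to show a positive and a negative ratio; they are not measurements from the evaluation.

```python
def resource_reduction_ratio(slices_a, slices_b, slices_combined):
    """Resource reduction ratio in percent: (1 - S_AB / (S_A + S_B)) * 100."""
    return (1.0 - slices_combined / (slices_a + slices_b)) * 100.0

# Hypothetical slice counts, for illustration only
print(round(resource_reduction_ratio(3000, 2000, 3500), 1))  # 30.0
print(round(resource_reduction_ratio(1000, 1000, 2100), 1))  # -5.0
```

A negative value, as in the second call, means the combined Solver Core is larger than the two separate cores together.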
To analyze these results in detail, we focus on how many arithmetic modules are shared by function combining at both the tree structure level and the pipeline structure level. Table 3 reports these numbers for selected pairs, including the worst pair (UUCI and UAII). The reduction ratios of arithmetic modules at the pipeline structure level do not directly reflect those at the tree structure level, because of the pipeline scheduling and optimization described in Section 3.2. For instance, for the best pair, the original UMR and UUCR Solver Cores require 28 arithmetic modules in total at the tree structure level, and our combining method reduces them to 15 modules, a reduction ratio of 46.4%. Meanwhile, only 15 arithmetic modules are required when the two original Solver Cores are individually pipelined, since some arithmetic modules are already shared within each core. The combined Solver Core requires 7 arithmetic modules after pipeline scheduling, a reduction ratio of 53.3%.
For the 35th pair, a module reduction ratio of 26.3% is achieved at the tree structure level, and this increases to 36.3% at the pipeline structure level. Although Solver Cores do not consist only of arithmetic modules, the module reduction ratio is approximately consistent with the final resource reduction ratio in this case.
The pair of UUCI and UAII, which shows the worst result, achieves nearly the same module reduction ratio at the tree structure level. At the pipeline structure level, however, the module reduction ratio is only 10.0%, with 10 arithmetic modules reduced to 9. Breaking down the change in arithmetic modules, three multipliers are merged into one by the combining, but the number of dividers increases by one. Gaining an extra divider while saving two multipliers leads to the final resource reduction ratio of -6.2%. These results suggest that the resource reduction achieved by combining Solver Cores depends on much more than node sharing in tree-structured rate law functions.
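The module reduction ratios quoted for the best and worst pairs follow directly from the module counts given in the text, assuming the ratio is one minus the quotient of the module count after combining to the count before combining:

```latex
\begin{align*}
\text{UMR and UUCR, tree level:} \quad & \left(1 - \tfrac{15}{28}\right) \times 100 \approx 46.4\% \\
\text{UMR and UUCR, pipeline level:} \quad & \left(1 - \tfrac{7}{15}\right) \times 100 \approx 53.3\% \\
\text{UUCI and UAII, pipeline level:} \quad & \left(1 - \tfrac{9}{10}\right) \times 100 = 10.0\%
\end{align*}
```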
CONCLUSION AND FUTURE WORK
This paper presented a method for the automated design of a Solver Core module that computes biochemical reaction rates, featuring a mechanism for sharing common terms in the given rate law functions. In this approach, two Solver Cores are combined at the tree structure level, followed by pipeline scheduling and arithmetic module binding.
Although only ALAP pipeline scheduling is discussed in this paper, we will evaluate the effect of applying more advanced scheduling algorithms [5] together with the combining technique. In addition, evaluating simulation performance with actual biochemical models and combining three or more trees remain future work.
