Computational fluid dynamics (CFD) is an important tool for designing aircraft components. FaSTAR (Fast Aerodynamics Routines) is one of the most recent CFD packages and has various subroutines. However, its irregular and complicated data structure makes it difficult to execute FaSTAR on parallel machines because of memory access problems. The use of a reconfigurable platform based on field programmable gate arrays (FPGAs) is a promising approach to accelerating memory-bottlenecked applications like FaSTAR. However, even with hardware execution, a large number of pipeline stalls can occur due to read-after-write (RAW) data hazards. Moreover, it is difficult to predict when such stalls will occur because of the unstructured mesh used in FaSTAR. To eliminate this problem, we developed an out-of-order mechanism for permuting the data order so as to prevent RAW hazards. It uses an execution monitor and a wait buffer. The former identifies the state of the computation units, and the latter temporarily stores data to be processed in the computation units. This out-of-order mechanism can be applied to various types of computations with data dependency by changing the number of execution monitors and wait buffers in accordance with the equations used in the target computation. An out-of-order system can be reconfigured by automatically changing the parameters. Application of the proposed mechanism to five subroutines in FaSTAR showed that its use reduces the number of stalls to less than 1% of the number without the mechanism. With a reasonable amount of overhead, execution was 2.6 times as fast as in-order execution and 2.9 times as fast as software execution on an Intel Core 2 Duo processor.
key words: computational fluid dynamics (CFD), field programmable gate array (FPGA), scientific computations, reconfigurable hardware, out-of-order system
Introduction
Computational fluid dynamics (CFD) is a widely used numerical analysis tool for simulating fluid behavior. It is particularly attractive for designing aircraft components such as engines and bodies because of its cost efficiency compared with actual testing using a wind tunnel experiment. CFD is typically executed on parallel computers such as supercomputers and large clusters due to its vast amount of floating-point computations.
The FaSTAR (Fast Aerodynamics Routines) CFD package, developed by JAXA (Japan Aerospace Exploration Agency), is one of the most recent generic CFD packages and supports a number of subroutines with automatic grid data generation. It uses an unstructured mesh as the grid data form for representing a complicated simulation space. However, efficient execution on highly parallel machines is difficult because of its consecutive accesses using pointers and the irregular memory access patterns caused by the unstructured mesh [1]. Since economical accelerators like GPGPU (general-purpose computing on graphics processing units) and Cell Broadband Engine rely on parallel data processing, they are not suitable for accelerating applications with complicated data dependency [2], [3]. Reconfigurable systems consisting of multiple field programmable gate arrays (FPGAs) are a promising solution to the parallel processing problem. Although FPGAs have been considered unsuitable for scientific computing due to the vast number of floating-point calculations, advances in FPGA technology have made their use feasible [4]. Maxeler [5] and Convey [6] are representative examples of accelerators using multiple FPGAs for scientific applications. Additionally, heterogeneous clusters using FPGAs and other types of accelerators have expanded the range of potential target applications [7], [8].
Several research efforts have investigated the use of a reconfigurable platform to accelerate CFD. For example, Sano et al. [9] and Kocsardi et al. [10] focused on using FPGAs to accelerate CFD execution, and Morishita et al. [11] focused on speeding up the UPACS accelerator. However, their target application was a stencil with regularity, and FaSTAR has no regularity. We investigated the use of a multi-FPGA platform called FLOPS-2D (Flexibly Linkable Object for Programmable System - 2 Dimensional) to accelerate FaSTAR execution.
We tackled this problem by designing a mechanism for out-of-order execution. As there are other problematic subroutines in FaSTAR and similar problems in other applications, we extended the mechanism to a general out-of-order (OoO) mechanism that can be used for all possible subroutines in FaSTAR. This was made possible by designing an OoO generator that automatically changes the parameters so that an OoO mechanism suited to the target subroutine is generated. This mechanism can be applied to similar problems in other simulation packages.
The rest of this paper is organized as follows. Section 2 discusses work related to this study. Section 3 describes FaSTAR and the effects of data dependency. Section 4 introduces an implementation example of the proposed architecture. Section 5 describes the out-of-order mechanism, and Sect. 6 discusses the evaluation results. Finally, Sect. 7 summarizes this work with a conclusion.
Copyright © 2014 The Institute of Electronics, Information and Communication Engineers
Related Work
Various studies using FPGAs have examined pipelined accumulator problems. De Dinechin et al. studied two common situations in which the flexibility of FPGAs allows one to design application-specific floating-point operators [12]. Their work addresses applications involving the addition and sum-of-products of a large number of floating-point values. A novel architecture was proposed that improves the area/accuracy tradeoff. However, this work targets general floating-point computations and is not suited to a practical software package like FaSTAR.
In addition, Zhuo et al. proposed reduction circuits using deeply pipelined operators on FPGAs to resolve the data hazards that occur during sequential reduction operations [13]. They identified two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. However, the proposed work is more generic and exposes a finer control of the accuracy and performance tradeoff. Another work proposed a scalable FPGA array with a bandwidth reduction mechanism [14]. It constructs a systolic computational memory array, which is given a minimum of programmability to provide flexibility and high productivity for various computing kernels and boundary computations. Although satisfactory performance is achieved, the implementation is for single-precision floating-point calculation and is based on a systolic architecture.
Another method of reducing pipeline stalls is static permutation, or data reordering, which was proposed by Nagy et al. [15]. However, in our case, a typical static permutation for improving software execution has already been applied in the FaSTAR software package. The OoO mechanism proposed here removes the stalls that still occur after the static permutation.
FaSTAR
FaSTAR is a CFD package developed by JAXA to simulate a compressible flow using unstructured grids. Its source code is written in Fortran 90, and it has a message passing interface (MPI). Users can select among the supported solution methods by choosing certain subroutines and can run simulations in parallel without specific software tuning. The configuration is determined by a user-prepared parameter file. Grid data files are also prepared by the user beforehand.
The unstructured grids used by FaSTAR result in irregular and unpredictable memory access patterns. To access a grid, FaSTAR prepares a face2cell array. If a face number is given as the array index, the numbers of the two contiguous grids (the cell-A number and the cell-B number) sharing that face are obtained. Example grids and their access patterns are shown in Fig. 1, where x and n are the face number and the grid number, respectively. Faces are accessed in ascending order. When face x is accessed, grid n is accessed as cell-A, and grid n-2 is accessed as cell-B. Access to face x+1 in the next step means that grid n is accessed as cell-A, and grid n-1 is accessed as cell-B. Face x+2 is accessed in the third step, and grid n is treated as cell-A, while grid n+1 is treated as cell-B. This means that the memory access patterns have a certain degree of spatial locality, but no predictability.
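The lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the concrete values chosen for x and n, the dictionary representation, and the helper names `cells_for_face` and `access_trace` are hypothetical and do not come from the FaSTAR source.

```python
# Hypothetical face2cell table for the three faces discussed in the text.
# x and n are arbitrary values chosen only for illustration.
x = 0    # first face number
n = 10   # grid number shared by the three faces

face2cell = {
    x:     (n, n - 2),   # face x:   cell-A = n, cell-B = n-2
    x + 1: (n, n - 1),   # face x+1: cell-A = n, cell-B = n-1
    x + 2: (n, n + 1),   # face x+2: cell-A = n, cell-B = n+1
}

def cells_for_face(face):
    """Return the (cell-A, cell-B) grid numbers that share the given face."""
    return face2cell[face]

def access_trace(faces):
    """List (face, cell-A, cell-B) in ascending face order."""
    trace = []
    for face in sorted(faces):
        cell_a, cell_b = cells_for_face(face)
        trace.append((face, cell_a, cell_b))
    return trace

trace = access_trace(face2cell)
```

In this trace, grid n appears as cell-A for three consecutive faces, which is the kind of short-range reuse that later causes data hazards, while the cell-B numbers follow no fixed stride.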
Target Subroutines
We first profiled FaSTAR run time and investigated the data flow. Analysis of the results revealed that a particular subroutine ran inefficiently due to many stalls. A memory access pattern with data locality in the subroutine resulted in frequent pipeline hazards, which degraded performance. The unique memory access of FaSTAR, which has no regularity, makes this problem difficult to eliminate.
From the results of our initial profiling, we identified the subroutines that accounted for a large portion of the execution time. One of the most complicated computations was performed by the Advection term subroutine, which accounted for 23.5% of the execution time. It computes flux and calculates the total flux for each grid. It supports several approaches to computing flux, and whichever one is used, there are two processing steps in common: prepare the data and calculate the total flux. The Advection term subroutine prepares the required data from memory and then computes flux by using one of several computations. The Surface Integral (or Surface) subroutine then calculates the total flux. The Surface subroutine has data dependency while the others do not, so it can create a bottleneck in the computation. Fortunately, this subroutine is not too large to be implemented on an FPGA, so it was selected as the first target.
The Surface subroutine uses the finite volume method (FVM) to discretize the space. The product of vertical flux and area for all faces in a grid are summed using the following equation:
where k is a face number in a grid, F_k is a five-dimensional flux vector on face k, Q is a conserved quantity in a grid, k_max is the number of faces in the grid, and ds_k is the area of face k. Equation (1) is translated into Fortran source code in the form of four equations as follows:
In these equations, calculation is performed using the cell-A number (A_n) and the cell-B number (B_n) given by the face2cell array, unlike Eq. (1), in which calculation is performed using the face number. The flux summation (flux_sum) for a grid is obtained by summing the products of area and flux over all faces of the grid. The initial flux summation value for each grid is zero. In the source code, flux is represented as an array with five elements, and each element can be accessed by using its number as an index, e.g., flux_3. Like flux, flux_sum is a two-dimensional array for which the indices indicate a vector element and a grid number. The computation is performed using double-precision floating-point variables.
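A minimal sketch of this accumulation is given below. It is not the FaSTAR source: the function name `surface_integral`, the zero-based indices, and the list layout are assumptions made for illustration, and the sign convention (subtract for cell-A, add for cell-B) is assumed from the description of Processing Modules A and B in the Subroutine Implementation section.

```python
DIM = 5  # the flux vector has five elements

def surface_integral(face2cell, area, flux, num_grids):
    """Accumulate area * flux into flux_sum for the two grids sharing each face.

    face2cell[f] -> (cell_a, cell_b); area[f] -> float; flux[f] -> list of DIM floats.
    """
    # flux_sum[i][g]: element i of the flux summation for grid g, initially zero.
    flux_sum = [[0.0] * num_grids for _ in range(DIM)]
    for face, (cell_a, cell_b) in enumerate(face2cell):
        for i in range(DIM):
            value = area[face] * flux[face][i]
            # Read-modify-write on entries chosen by face2cell: this is the
            # data dependency that produces RAW hazards in a pipeline.
            flux_sum[i][cell_a] -= value   # assumed sign for cell-A
            flux_sum[i][cell_b] += value   # assumed sign for cell-B
    return flux_sum

# Two faces that both touch grid 0 as cell-A (hypothetical data).
example = surface_integral(
    face2cell=[(0, 1), (0, 2)],
    area=[2.0, 0.5],
    flux=[[1.0] * DIM, [4.0] * DIM],
    num_grids=3,
)
```

Because both faces update grid 0, the second update must read the result of the first, which is harmless in software but forces a pipelined adder to wait.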
As shown in Eq. (2), the Surface subroutine computes flux_sum using temporary results obtained in previous iterations. Such data dependency reduces the computational efficiency because a value that is still being processed cannot be read again until that processing has finished; otherwise, the results are incorrect [16]. In addition, degraded performance is also observed in other subroutines: (2) Green Gauss, (3) summation of fluxes' eigenvalues (Sum Eigen), (4) maximum of fluxes' eigenvalues (Max Eigen), and (5) coefficient initialization (Coefficient). Although these subroutines are similar to Surface, they use different physical quantities, and their calculations differ. The Max Eigen subroutine finds the maximum eigenvalue by comparison, while the others, like Surface, add values. The objective of the work reported here is efficient execution of these subroutines on an FPGA platform.
Subroutine Implementation

Architecture for Target Computation
Given how data are accessed from external memory, we use BlockRAM (BRAM), which is distributed high-speed RAM implemented on an FPGA, for the target subroutine. Some data are accessed several times within a short period, so holding them in BRAM is better than reading them from external memory such as SDRAM on each access, which is time-consuming. With BRAM, data are obtained within one clock cycle following the request. Data can thus be accessed as quickly as possible by prefetching them into BRAM.
The architecture for implementing the Surface subroutine on an FPGA is shown in Fig. 2. The lower diagram in Fig. 2 shows the structure of Processing Module B. Processing Module A has a similar structure, but it performs subtraction instead of addition. As represented in Eq. (2), the value, which is the product of area and flux, is added to the read data, and the result is called flux_sum. This process is enabled by the generation of addresses for the RAM using the Address Generator and by synchronization and control using a newly introduced mechanism. The architectures for implementing the other subroutines on an FPGA differ slightly. For example, for the Max Eigen subroutine, the adder is replaced with a comparator for extracting the maximum value. Processing Module B works as follows.
1. The input data comprise cell-B, area, and flux. cell-B is input into the Address Generator, and the multiplier fetches area and flux.
2. The Address Generator generates the read address from cell-B, and the multiplier outputs the product of area and flux, called the "value". These values arrive at the summation module through a first-in first-out (FIFO) buffer.
3. The read address is obtained from the address FIFO and is used to access the data entry and request the read data from the BRAM. The read data are added to the corresponding value, and the sum is stored in the same entry as accessed previously.
4. After the Processing Module has iterated steps 1-3 for a certain amount of data, the result data are transferred from BRAM to external memory through the I/O Module, and, conversely, prefetched data are stored in BRAM.
The number of processing modules used depends on the problem size. This results in many accesses to external memory. Moreover, it is inefficient for the processing modules to pause while data are transferred from or to external memory. Memory accesses are thus hidden by using a double-buffering mechanism. Double-buffering enables a system to process data and access memory simultaneously by using two RAMs or registers. We applied the double-buffering mechanism to BRAM and allocated two BRAMs to the summation module. One of them is used for summation, while the other stores data and prefetches data. The two BRAMs swap roles after each run of continuous processing. This problem is not difficult to solve because double-buffering is a common technique.
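The role swap between the two BRAMs can be illustrated with a small software analogue. This is a sketch under stated assumptions, not the hardware design: the function name `process_blocks` and the block representation are hypothetical, and the prefetch and the computation run one after the other here, whereas in hardware they overlap in time.

```python
def process_blocks(blocks, compute):
    """Ping-pong between two buffers: one is computed on while the other is refilled."""
    buffers = [None, None]
    results = []
    active = 0
    if blocks:
        buffers[active] = list(blocks[0])        # initial prefetch
    for index in range(len(blocks)):
        standby = 1 - active
        # In hardware the next two steps happen simultaneously.
        if index + 1 < len(blocks):
            buffers[standby] = list(blocks[index + 1])   # prefetch the next block
        results.append(compute(buffers[active]))         # process the current block
        active = standby                                 # swap the roles
    return results

sums = process_blocks([[1, 2], [3, 4], [5]], sum)
```

The point of the arrangement is that the transfer time for block i+1 is hidden behind the computation on block i, so the processing modules need not pause for external memory.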
Performance Degradation
A more difficult problem to solve is the occurrence of pipeline stalls due to read-after-write (RAW) data hazards. They occur in processes with data dependency, and are observed in summation module, which calculates the sum of data obtained in previous processing. When data are requested while they are still being processed, the system must wait until the processing has been completed. Since such stalls occur within short intervals, resource utilization is degraded for arithmetic pipelines.
This problem is illustrated in Fig. 3, which shows an enlarged view of the summation module and example input data (address and value). The numbers in the address boxes represent the addresses of the values in BRAM: the first address is "1". In the first step, the three "1" addresses are extracted from the address queue. However, since the first result is still being computed, as evidenced by the string of 1's, the current computation must wait until the result of the previous computation is stored in BRAM. That is, a pipeline stall occurs. If the next address is "2", the current computation does not have to wait since the pipelined operation unit can accept an input every clock cycle. In computation using an unstructured mesh, RAW hazards occur frequently and thus severely degrade performance. Moreover, when such hazards will occur is unpredictable. We avoid this problem by using a controller, called the "Out-of-Order Mechanism", to stream data between the FIFO and the summation module. As a result, the input data are reordered, and the calculations in the summation module are done more efficiently. Since the original order is irregular, the controller reorders the data dynamically.
Out-of-Order Mechanism
The OoO mechanism changes the order of computation and data access to optimize processing efficiency. Although the purpose is similar to that of Tomasulo's algorithm, which is used in general-purpose CPUs [17], the OoO mechanism is for a hardware accelerator in an FPGA. Thus, the target computational unit is a single pipelined floating-point calculator with a fixed pipeline depth. An OoO mechanism must be provided for each computational unit in an FPGA implementation that is subject to RAW hazards. Although the OoO mechanism presented here is designed on a multi-FPGA platform, it can also be used for other FPGA accelerator designs. Here, its design, mechanism, and operation are introduced.
Design
The OoO mechanism changes the order of data sent to the computational unit without changing the context of the target data flow graph, as illustrated in Fig. 4 . We assume that a sequence of data comes from the input FIFO queues to the computational system in the order of "1", "1", "1", "2", "2", "3", "4", and "4" as shown in figure. A value and corresponding address pair is assumed to be extracted every clock cycle. Since the operational unit is fully pipelined, it can accept input data every clock cycle if there is no hazard. In the example shown, stalls occur when the 2nd (2nd "1"), 3rd (3rd "1"), 5th (2nd "2"), and 8th (2nd "4") pairs are processed. The OoO mechanism changes the order as shown in the figure so that no stall occurs. If a pair cannot be computed at the moment, it is temporarily stored in some registers. This re-ordering is done without changing the context.
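The stall positions quoted for this sequence can be checked with a short in-order model. It is a sketch under assumptions: the function name `in_order_stalls` is hypothetical, and the pipeline latency of 4 cycles is an arbitrary value chosen for illustration (the text does not state the latency for Fig. 4).

```python
def in_order_stalls(addresses, latency):
    """Return (positions of pairs that stall, total stall cycles) for in-order issue.

    A pair stalls while an earlier pair with the same address is still in the
    pipeline, i.e. for `latency` cycles after that earlier pair was issued.
    """
    busy_until = {}    # address -> first cycle at which its result is back in RAM
    cycle = 0
    stalled = []
    stall_cycles = 0
    for position, address in enumerate(addresses, start=1):
        ready = busy_until.get(address, 0)
        if ready > cycle:                      # RAW hazard: wait for the result
            stalled.append(position)
            stall_cycles += ready - cycle
            cycle = ready
        busy_until[address] = cycle + latency  # issue the pair
        cycle += 1
    return stalled, stall_cycles

stalled, stall_cycles = in_order_stalls([1, 1, 1, 2, 2, 3, 4, 4], latency=4)
```

Under this model the 2nd, 3rd, 5th, and 8th pairs stall, matching the positions listed above; the number of lost cycles depends on the assumed latency.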
Two types of registers are used: an execution monitor and a wait buffer. The execution monitor uses a FIFO queue with a depth that is the same as that of the operational unit pipeline including the RAM access stage. When the operation unit starts the computation process, the address of the input value is pushed into the queue, and it is moved forward clock by clock. When the address exits the queue, the operational unit has finished the computation, and the results have been stored in RAM. Unlike a common FIFO queue, the addresses of the execution monitor can be read out in parallel and compared with both the new address from the input FIFO queue and entries from the wait buffer. If the incoming address matches an entry, the value for that address is being processed, so the system must wait until the address has been removed from the execution monitor. The wait buffer holds pairs waiting for computation to finish. That is, a pair in the wait buffer must wait until its address has been removed from the execution monitor. For that to happen, the address of the pair in the wait buffer must be compared with entries in the execution monitor. The implementation of this procedure in an FPGA is described in Sect. 5.2.
Fig. 4 The effect of Out-of-Order system.
The address and value pair in the input FIFO queues is processed as shown in Fig. 5. The address of the incoming pair is compared with the addresses of all entries in the execution monitor. If the address is not found, the pair can be processed without hazard and is thus sent directly to the operational unit. Otherwise, the mechanism compares the addresses in the wait buffer with all entries in the execution monitor. If the address of a pair in the wait buffer does not match that of any entry in the execution monitor, that pair can start computation. It is thus sent to the operational unit, and the incoming pair is stored in the wait buffer instead. If there is no such pair in the wait buffer, the incoming pair is stored in an empty entry of the wait buffer. If there is no empty entry, the OoO mechanism stalls, and input from the input queue is halted.
We designed a generalized OoO mechanism by first investigating how applicable subroutines in FaSTAR differ from each other. Each subroutine uses different operations and has a different vector dimension (dimension), which depends on the equation. Two parameters, latency and dimension, are used to characterize the computation. latency represents the number of clock cycles required for data-dependent processing including RAM access, and dimension represents the number of vector dimensions used in the target calculation. For scalar calculation, dimension is one. As shown in Table 1 , the latency of the Max Eigen subroutine differs from that of the others and the Max Eigen and Sum Eigen subroutines use scalar values.
Mechanism
In the OoO mechanism, the vector elements are processed sequentially, clock by clock. The first element of the next vector is input and processed immediately after the last element of the current vector has been processed. The data are streamed on the basis of the bandwidth between the FPGA and external memory. Thus, a BRAM address is assigned to only the first element of each vector. Figure 6 shows an implementation example of the OoO mechanism with latency = 8 and dimension = 3. That is, the operation unit is assumed to require eight clock cycles for computation, and the target data is assumed to be a three-dimensional vector. The execution monitor has as many entries as latency since the entries hold the addresses being processed in the pipeline of the operation unit. An execution pointer is used to indicate the register corresponding to the top of the FIFO. The pointer is incremented every clock cycle and returns to zero every eight clock cycles, forming a cyclic buffer. The entry to which it points is changed to 'nil' when the pointer is incremented. The OoO mechanism compares the address attached to the first element of the input vector with all entries in the execution monitor every three clock cycles.
Each entry in the wait buffer stores a vector with three elements. Multiple vectors can be stored in the wait buffer because it has a certain number of entries (set), each of which can store a vector. Thus, the total number of registers in the wait buffer is dimension × set. The set parameter is the only one the designer can control since latency and dimension are automatically determined by the target subroutine. The larger the value for set, the fewer the stalls. However, a larger value means that a larger number of comparators are required, which stretches the critical path delay. We examine this trade-off in Subsection 5.3. In the example shown in Fig. 6, set is two: that is, two entries are stored in the wait buffer. If there is no match, the corresponding computation has finished. The content in the wait buffer is thus transferred to the operational pipeline, and its address is pushed into the execution monitor.
Using Ruby [18], we designed an OoO generator to generate Verilog-HDL code with arbitrary values for latency, dimension, and set. The OoO mechanisms used in the evaluation were generated automatically by this generator.
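The cyclic-buffer behaviour of the execution monitor can be modelled in software as follows. This is a behavioural sketch, not the generated Verilog-HDL: the class and method names are hypothetical, the use of 0 as the 'nil' address follows the operation example, and the parallel comparators of the hardware are replaced by a simple membership test.

```python
NIL = 0  # address 0 stands for 'nil'

class ExecutionMonitor:
    """Cyclic buffer holding the addresses currently inside the pipeline."""

    def __init__(self, latency):
        self.entries = [NIL] * latency   # one entry per pipeline stage
        self.pointer = 0                 # execution pointer (top of the FIFO)

    def push(self, address):
        """Record the address of a vector whose first element is issued now."""
        self.entries[self.pointer] = address

    def tick(self):
        """Advance one clock cycle: move the pointer and clear the entry it reaches."""
        self.pointer = (self.pointer + 1) % len(self.entries)
        self.entries[self.pointer] = NIL

    def contains(self, address):
        """True if the address is still being processed (hardware compares in parallel)."""
        return address != NIL and address in self.entries

monitor = ExecutionMonitor(latency=8)
monitor.push(1)
for _ in range(7):
    monitor.tick()
still_busy = monitor.contains(1)    # seven cycles after issue
monitor.tick()
now_free = not monitor.contains(1)  # eight cycles after issue
```

With latency = 8, an address stays visible for eight clock cycles and disappears when the pointer wraps around to its entry, at which point a waiting vector with that address may be issued.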
Operation
The example illustration of OoO mechanism operation in Fig. 7 shows the states of the vectors in the input queue, of the execution monitor, and of the wait buffer for six consecutive three-clock-cycle periods. The N element of vector V is the address of the vector, so the vector is stored at N + M.
Address 0 is used for 'nil.' The first element of vector V1 (V1, 1) is extracted during the first clock cycle (t=0), and V1, V1, V4, V4, V7, and V7 are input in that order. Without the OoO mechanism, stalls would occur three times, resulting in a 5-clock penalty. The OoO mechanism reduces the number of stalls, as shown in Fig. 7.
(1) t=0-2: The first address does not match any address in the execution monitor during the first (0) clock cycle. The system thus transfers the vector directly to the operational unit and pushes its address into the execution monitor.
(2) t=3-5: The next vector is also V1, so the address matches one in the execution monitor. The input vector is thus stored in an entry of the wait buffer. In this case, 'nil' addresses are pushed into the execution monitor.
(3) t=6-8: Since V4 does not match any entry in the execution monitor, it is directly forwarded to the operational unit.
(4) t=9-11: Address V1 is removed from the execution monitor, so the vector in the first entry of the wait buffer is sent to the operational unit and its address is pushed into the execution monitor. At the same time, input vector V4 is stored in the same entry of the buffer since it matches an entry in the execution monitor.
(5) t=12-17: The same operation is repeated for the two incoming V7 vectors.
In short, with the OoO mechanism, 15 elements are processed in 18 clock cycles, while without the OoO mechanism, only 9 elements are processed.
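The element counts in this walk-through can be reproduced with a simplified simulation of the decision procedure from the Design section. It is a sketch under stated assumptions rather than the generated hardware: the function name `simulate` is hypothetical, one decision is made per vector slot (every dimension clock cycles), an address is treated as busy for latency cycles after its first element is issued, and latency = 8, dimension = 3, and set = 2 are taken from the Fig. 6 example.

```python
def simulate(vectors, latency, dimension, sets, cycles, out_of_order=True):
    """Count vector elements issued to the operational unit within `cycles` clocks."""
    busy_until = {}        # address -> cycle at which it leaves the execution monitor
    wait_buffer = []       # addresses of vectors parked until their hazard clears
    pending = list(vectors)
    issued = 0

    def is_busy(address, now):
        return busy_until.get(address, 0) > now

    for now in range(0, cycles, dimension):        # one decision per vector slot
        to_issue = None
        if pending and not is_busy(pending[0], now):
            to_issue = pending.pop(0)              # no hazard: forward directly
        elif out_of_order:
            ready = [a for a in wait_buffer if not is_busy(a, now)]
            if ready:
                wait_buffer.remove(ready[0])
                to_issue = ready[0]                # issue a waiting vector instead
            if pending and len(wait_buffer) < sets:
                wait_buffer.append(pending.pop(0)) # park the incoming vector
        # In-order execution (or a full wait buffer) simply stalls in this slot.
        if to_issue is not None:
            busy_until[to_issue] = now + latency
            issued += dimension
    return issued

inputs = [1, 1, 4, 4, 7, 7]   # addresses of V1, V1, V4, V4, V7, V7
with_ooo = simulate(inputs, latency=8, dimension=3, sets=2, cycles=18)
without_ooo = simulate(inputs, latency=8, dimension=3, sets=2, cycles=18,
                       out_of_order=False)
```

Under these assumptions the model issues five vectors (15 elements) with reordering and three vectors (9 elements) without it, in agreement with the figures quoted above.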
Evaluation
The OoO modules were generated for our evaluation by the OoO generator and were described in Verilog-HDL. The ISE 13.2 software was used for synthesis, and the place & route process was used for physical design. Performance was estimated using the NC-Verilog logic simulator. We used a FLOPS-2D reconfigurable platform consisting of several boards, each with a single Virtex-4 XC4VLX100 FPGA. The boards were connected in a nearest-neighbor mesh structure using serial links as described in [19]. The OoO modules were implemented in the FaSTAR package along with other modules.
Resource Utilization
The available and consumed resources of the FPGA are shown in Table 2. We examined the numbers of slices, flip-flops (FFs), look-up tables (LUTs), DSP48s, and BlockRAMs used. This is the case in which the size of the BlockRAM is 8,192 (8 kB). The whole flux integral module could be implemented on a Virtex-4 XC4VLX100 FPGA. Comparing the BlockRAMs used in the module with those available shows that the size of the RAMs in the flux summation module can be increased. A larger RAM capacity decreases the number of state transitions.
Resource Overhead
The increase in resource utilization (in terms of slices used) caused by the OoO mechanism is shown in Fig. 8 for three of the target subroutines: Sum Eigen, Max Eigen, and Coefficient. Sum Eigen consistently used more slices than Max Eigen. Their rates of increase with the number of sets were similar since both treat scalar data. Coefficient incurred substantially more overhead than the other two since each set stores a six-dimensional vector. In terms of performance improvement, set = 2 is advantageous for most of the target subroutines when used in FaSTAR. When set was chosen so that performance was maximized, the OoO mechanism required from 15% to 26% more slices for each subroutine except Sum Eigen than a comparable implementation without OoO execution. The overhead for Sum Eigen was 47% because of the larger set value required to maximize performance. If the number of FPGAs is limited, the set value can be made smaller at the cost of minor performance degradation. The OoO mechanism proposed here more than doubles performance with a 15%-26% increase in resources. Although the peak performance may decrease because fewer computing modules can be implemented, the OoO mechanism improves the average performance more efficiently than using the same area for additional computing modules.
Procedure
Stall Reduction
First, we focus on the number of stalls caused by RAW hazards and how that number was reduced. We can generate various OoO mechanisms by using the OoO generator. In our evaluation, we generated nine designs with the value of set ranging from 1 to 9. Grid data for 11,564 grids with 22,883 faces were used. Stalls occurred in two parallel pipelines in our calculations because the grid access patterns used cell-A and cell-B. "Number of Stalls" means the sum for the two pipelines. As shown in Fig. 9, the number of stalls was reduced by more than 99% with the proposed OoO mechanism. The "non" on the x axis means the computation was done without the mechanism. Increasing set reduced the number of stalls, but the results for Sum Eigen and Coefficient are interesting: the former follows a smooth curve, while the latter drops steeply. This is attributed to the relationship between latency and dimension. The difference between them is small for the Coefficient subroutine, while it is large for Sum Eigen. When dimension equals latency, RAW hazards do not occur since the first element in the next vector can be processed immediately after the last element in the current vector. If latency is fixed, a smaller dimension means that more wait buffer entries are needed. As the difference between latency and dimension increases, a larger value for set is needed to prevent stalls.
Execution Speed-Up
Next, we focus on execution time. The software was compiled with Intel Fortran Compiler 10.4 and run on a workstation (Intel Core 2 Duo CPU, 2.66 GHz, Linux Kernel 2.6.18) without MPI parallel processing. We measured the elapsed times for the five target subroutines in three different computing environments: software execution, FPGAs with in-order execution, and FPGAs with OoO execution. The execution time was measured while each subroutine was called many times in FaSTAR. The synthesized frequency for each subroutine was between 130 and 150 MHz. Although increasing the number of sets does not always decrease the operational frequency, the frequency tends to degrade as the number of sets increases. As shown in Figs. 10 to 14, the elapsed time was the lowest with the OoO mechanism for all five subroutines. As shown in Fig. 12, the elapsed time can be reduced for a certain number of sets by reducing the number of stalls. However, since increasing the number of sets stretches the critical path, increasing the number further degrades performance, especially for subroutines with longer vectors, like the Surface Integral. Moreover, the number of stalls in the OoO mechanism does not directly influence the total system performance. As a result, performance improvement saturates at about set = 2 except for Sum Eigen.
In terms of the throughput penalty caused by f_max degradation, the OoO mechanism has a small effect on system performance. The degradation arises because a longer latency and more sets increase the number of logic stages in the OoO mechanism. However, pipelining the comparison may reduce the delay, and we plan to investigate its use. On the basis of these results, we estimate that an FPGA can execute the target subroutines in 2.54 sec in total with the proposed mechanism. Without it, the processing time is 6.65 sec, and the elapsed time for software execution is 7.49 sec. This means that a system with an FPGA with the proposed OoO mechanism can execute the subroutines about 2.95 times as fast as a baseline system (Intel Core 2 Duo CPU, 2.66 GHz, Linux Kernel 2.6.18).
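The speed-up figures follow directly from the three reported times. The short calculation below only restates that arithmetic; the variable names are chosen for illustration.

```python
# Reported total times for the five target subroutines, in seconds.
software_s = 7.49   # software execution on the workstation
in_order_s = 6.65   # FPGA with in-order execution
ooo_s = 2.54        # FPGA with the OoO mechanism

speedup_vs_software = software_s / ooo_s   # roughly 2.95
speedup_vs_in_order = in_order_s / ooo_s   # roughly 2.62
```

These ratios correspond to the approximately 2.9-fold and 2.6-fold improvements over software and in-order execution, respectively.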
Conclusion
Five target subroutines in the FaSTAR CFD package running on an FPGA platform were accelerated by using an out-of-order (OoO) mechanism to reduce the number of read-after-write (RAW) hazards and thus the number of stalls caused by the unstructured grid used in FaSTAR. An OoO mechanism with a structure appropriate for the execution unit and wait buffer is automatically generated by adjusting the parameters to match the target subroutine. Application of the OoO mechanism to the five target subroutines showed that the value for set should be two for four of the five subroutines. Use of this mechanism improved performance 2.6 times over in-order execution and 2.9 times over software executed on a workstation with an Intel Core 2 Duo processor. The amount of overhead due to the mechanism was reasonable. Use of a large set value increases the critical path delay. The effect of pipeline processing for address comparison on this delay will be investigated. Additionally, the automatic generator will be evaluated for other packages with different types of equations.
