In this article, a unified VLIW coprocessor, based on a common group of atomic operation units, for Quad arithmetic and elementary functions (QP VELP) is presented. The explicitly parallel scheme of VLIW instruction and Estrin's evaluation scheme for polynomials are used to improve the performance. A two-level VLIW instruction RAM scheme is introduced to achieve high scalability and customizability, even for more complex key program kernels. Finally, the Quad arithmetic accelerator (QAA) with the QP VELP array is implemented on ASIC. Compared with hyper-thread software implementation on an Intel Xeon E5620, QAA with 8 QP VELP units achieves improvement by a factor of 18X.
INTRODUCTION
Most scientific applications operate on double-precision (64-bit) floating-point arithmetic. However, its accuracy is not reliable for future large-scale computation applications. In ExaScale computing, accuracy errors increase up to a level where only 11 significant bits in the mantissa are left [Dou et al. 2010] . Therefore, with the IEEE-754-2008 revision of the standard for the binary floating-point arithmetic [IEEE 2008 ], the quadruple-precision floating-point (Quad: 128-bit) format has been introduced, satisfying the demand for numerical precision of scientific applications, such as climate modeling, supernova simulations, and computational geometry [Bailey 2005 ].
elementary functions is more complex than the basic floating-point operation. The computation complexity of common elementary functions with the Taylor method is given by O(n 0.5 · M(n)) , where M(n) refers to the complexity of the multiplication operation and is O(n 2 ) in the general schoolbook scheme, and n refers to the precision of the result. Therefore, the consumption of hardware resources quickly increases with the computation width.
Second, it is unreasonable and impractical to use large number of silicon to implement all Quad elementary functions in a processor individually. This is because a variety of Quad basic operations and elementary functions appear in the same application. For instance, the SGP4/SDP4 application includes five types of elementary functions, that is, division, square root, reciprocal cube root, sine, and cosine. Likewise, the Chirp Scaling algorithm (CS) [Raney et al. 1994 ], a SAR radar image processing algorithm, includes five types of elementary functions, namely division, square root, logarithm, sine, and cosine. The utilization of hardware becomes lower due to the low-usage frequency of these functions. Moreover, it is difficult to extend in order to evaluate other elementary functions with the one-by-one way.
Third, the latency of Quad elementary functions is long because higher internal working precision and more iteration times are required to obtain an accurate result. Therefore, a hardware algorithm with high convergence is chosen to improve the performance, thereby increasing the hardware cost.
Finally, in some application domains, various complex elementary functions, called "coarser-grain functions" in the current article, are constructed with several common elementary functions and basic operations. These functions, such as log 2 (1 + 2 x ) in a logarithm number system, are the basic functions and limit the performance of the applications. Thus, hardware implementation can help improve the performance of applications in such domains. However, it is difficult to satisfy the variety and versatility of coarser-grain functions with the traditional implementation method.
Based on the analysis of hardware algorithm for Quad elementary functions, we summarize a common group of atomic operations that implement various complex elementary functions. Subsequently, we present a unified Very Long Instruction Word (VLIW) coprocessor for IEEE-754 quadruple-precision elementary functions (QP VELP), which is built with multiple atomic operation units. The coprocessor integrates the VLIW architecture's properties, such as low hardware cost, high performance, and high scalability and customizability, thus satisfying the variety and versatility of complex elementary functions for different applications. In the ASIC implementation, several schemes, such as 128-bit truncated multiplier, two-level VLIW instruction RAM scheme, and Estrin's scheme for the evaluation of a polynomial, reduce hardware cost and improve scalability and performance. Finally, the Quad arithmetic accelerator chip, equipped with the QP VELP array, is developed to accelerate Quad elementary functions and Quad applications. The experimental results show that the accelerator chip, running at 1.39 GHz, outperforms the serial and parallel software implementation on an Intel Xeon E5620 CPU with 2.4 GHz by a maximum factor of 95X and 18X, respectively. Moreover, this chip demonstrates significantly lower power consumption.
This article is based on our previous works in Lei et al. [2011] and the following improvements are presented.
-The instruction set architecture of QP VELP is presented, which is composed of a group of atomic operations for Quad arithmetic. -The two-level VLIW instruction RAM scheme is proposed to reduce the memory requirement for VLIW instructions and achieve high scalability and customizability. -Estrin's scheme is proposed to reduce the execution cycle for the polynomial evaluation and improve the performance of QP VELP.
12:4 Y. Lei et al. -Quad arithmetic accelerator, equipped with a QP VELP array, is implemented in an FPGA prototype and ASIC platform. -Quad basic operations, elementary functions, coarse-grain functions, key program kernels, and CS SAR application are implemented to explore the performance and scalability of QP VELP.
The rest of the article is organized as follows. Section 3 presents the summary of the common atomic operations for Quad elementary. Sections 4 and 5 present the implementation details of the QP VELP and the optimization scheme on QP VELP, respectively. Section 6 illustrates the accelerator chip of the QP VELP array. Section 7 presents the comparison of the performance with software implementation. Section 8 concludes and proposes ideas for future work.
BACKGROUND

IEEE-754 Quadruple-Precision Format
The IEEE-754 quadruple precision floating-point format (Quad), also called binary128, is a 128-bit base-2 format with a 1-bit sign (Sx), a 15-bit bias exponent (BEx), and a 113-bit significand (Fx). The significand is written with an implicit lead bit for normalized value, that is, 1 ≤ Fx < 2. In this article, we support the normal number and rounding off the result to the nearest even number. The value of normal Quad number x is given by x = (−1) Sx × Fx × 2 BEx−16383 .
Definition of Operation Types
The operation types are defined as the following.
-Atomic operation. The basic operation units, such as 128-bit fixed computation operations and condition operations, constitute the Instruction Set Architecture (ISA) of QP VELP. -Quad basic operation. The basic quadruple-precision arithmetic operations are defined, such as Quad addition, subtraction, multiplication, and division. -Quad elementary function. The complex quadruple-precision arithmetic operations, such as Quad square root, exponential, logarithm, and trigonometric functions are defined. -Quad coarser-grain function. The complex elementary functions, such as ln (− ln(x) ) and x 2 + y 2 + z 2 , are constructed with several common Quad elementary functions and Quad basic operations.
Hardware Algorithms for Elementary Function
The main methods used for evaluating elementary functions in hardware are classified into two: noniterative and iterative. The first method includes direct lookup table and polynomial approximations, and the second includes digit and function recurrence. Figure 1 summarizes the features of each method.
The direct lookup table method is suitable for low-precision calculations and can obtain the result in one or two cycles. However, the memory requirement increases exponentially with precision. Even for single-precision arithmetic, tables with the size of 2 32 × 32 = 128G bits are required. The lookup table method is generally combined with other methods to balance the speed and hardware requirements of high-precision arithmetic [Muller 2006 ].
Digit-recurrence methods such as SRT and CORDIC algorithm are implemented with simple addition and shift operations. However, it is linearly convergent and requires long latency to obtain a high-precision result.
Newton's algorithm and polynomial approximation are multiplicative approaches used as major tools for high-precision arithmetic. Newton's algorithm [Brent and Zimmermann 2010] computes functional inverse functions, such as reciprocal, division, and square root. Typically, this method has quadratic convergence, resulting in low latency for high-precision calculation. Polynomial approximation is a general technique used in evaluating numerical functions of one variable, such as exponential, logarithm, and trigonometric functions. The degree of the polynomial is proportional to the precision of the result. Numerous multiplication and addition operations must be performed for high-precision calculation. Therefore, a lookup table is always followed by multiplicative approaches in order to reduce computation complexity. The compromise between lookup table size and Newton's iterations number or polynomial degree should be taken into consideration.
ATOMIC OPERATIONS IN ELEMENTARY FUNCTIONS
In this section, we first introduce the computation steps for elementary functions and summarize the common group of atomic operations for such functions. Based on these atomic operation units, a unified VLIW architecture for computing various Quad elementary functions and coarser-grain functions is built. The unified architecture integrates the high-performance, scalability, and customizability properties of the VLIW technique.
Computation Steps for Elementary Functions
The range reduction scheme is used to improve computational efficiency. The evaluation of the function f at x is divided into three steps: range reduction, evaluation, and reconstruction.
(A) Range Reduction
The first step is to transform x from a Quad number to fixed-point format (i.e., Sx, BEx, and Fx). Subsequently, a two-level range reduction scheme is used to reduce the argument range. In the first level, the value of Fx is transformed into a small value Fx [Fx ∈ [0, 2 i )] based on the properties of f , such as periodicity of the trigonometric function. In the second level, the interval [0, 2 i ) is split into 2 k subintervals, namely
where y is the leading k bits of the binary representation of Fx , and z ∈ [0, 2 i−k ). Thus, the evaluation of f (x) is divided into two parts: f (y) and f (z). Then, the value of f (y) is obtained from a lookup table indexed by y . The computation operations, such as multiplication, addition, and shift, are included in the first level. The lookup table operation is the major operation in the second level.
(B) Evaluation
Generally, the value of f (z) is evaluated using the CORDIC algorithm, Newton's method, or polynomial approximation. For the CORDIC algorithm, only the addition and shift are used. Owing to the linear convergence of this algorithm, the number of iterations (N C ) should be greater than 113 to obtain a Quad result. Newton's method is implemented using the iteration approach. The number of iterations (N N ) satisfies 12:6 Y. Lei et al. the expression given by
where Prec i and Prec r denote the precision of the initial value and result, respectively. For polynomial approximation f (z) = N P i=0 a i × z i , the degree (N P ) satisfies
The two equations indicate that the latency of Newton's method or polynomial approximation is inversely proportional to k. However, the size of the lookup table for f (y) is approximately 128 * 2 k . Thus, a suitable value of k should be selected to balance the computational complexity of f (z) and the storage requirement of f (y). Both Newton and polynomial approximation can be divided into a series of multiplications and additions.
(C) Reconstruction
The value of f (x) is reconstructed from f (y) and f (z), using a simple identity transform, consisting of basic fixed-point operations. The normalized result is finally obtained by counting the leading zeros and rounding the rightmost bit of f (x) to the nearest even number.
Common Atomic Operations
At first, atomic operations are defined next. The Appendix describes the atomic operation sequences. Table I summarizes the types of atomic operations for the Quad basic operations and Quad elementary functions, that employ the combination of polynomial approximation (or Newton method) and lookup table. Thus, for each elementary function, a LOOKUP T operation reduces the iteration of Newton's method or the degree of polynomial, thereby improving the computational efficiency. In the evaluation of the Quad addition, logarithm, and sine/cosine functions, the subtraction of two nearly equal numbers may occur. This process leads to the uncertain number of leading zeros. Thus, the Count LZ and SHIFT operations are executed before normalizing the mantissa into the Quad format.
Based on the preceding analysis, we can decompose the Quad basic operations and elementary functions with a common group of atomic operations. For example, in the atomic operation sequence for Quad multiplication (as shown in the Appendix), multiplication is used to compute the product of mantissas, while an addition is used to compute the exponent. Finally, a normalization operation normalizes the product and exponent into the Quad format. Several types of computation and condition operations are employed in the two-level range reduction performed for the Quad exponential function, as shown in Figure 3 (a). Only the computation operations are included in polynomial evaluation, whereas the lookup table and normalization operations are used in reconstruction. This is the same for other Quad elementary functions. Thus, it is now possible to implement the Quad arithmetic in a unified hardware framework which consists of the multiple atomic operation units.
Advantages of the VLIW Architecture in Supporting Atomic Operations
The atomic operation sequences and data dependencies between these atomic operations vary for different elementary functions. Thus, organizing and utilizing these atomic units effectively is the key technique in designing the hardware structure of a Quad arithmetic. Moreover, realizing the property of the scalability for new functions is another important consideration in ensuring the variety and versatility of complex elementary functions for different applications. The VLIW technique [Fisher 1983 ] is effective in exploiting the Instruction-Level Parallelism (ILP). The technique simultaneously executes numerous operations on multiple basic units based on a fixed schedule sequence. The advantages of the VLIW technique listed next are oriented well to the unified coprocessor for Quad arithmetic based on atomic operation sequences.
-VLIW architecture offers significant computational power with less hardware resources. Since the operation scheduling and conflict resolution are handled by the compiler, the VLIW processor does not need the complicated scheduling hardware. Thus, more logic can be used to build the high-precision arithmetic units that are used to implement the atomic operations. -The static scheduling and optimization of the VLIW architecture efficiently exploit the ILP for the fixed hardware algorithms of Quad basic operations and Quad elementary functions. It is possible to schedule the atomic operations in the design of these functions, such that the data dependencies between atomic operations are maintained under the control of custom VLIW instruction. The resulting explicitly parallel technique of VLIW instruction and pipeline parallel scheme can lead to improved performance. -The VLIW processor can achieve high scalability and customizability. Under the control of specific VLIW instructions, the atomic operation units are combined into special-purpose hardware for Quad elementary functions. Moreover, this custom VLIW processor can be easily extended to implement Quad coarser-grain functions for some domains. Moreover, extending the processor is more efficient than a generalpurpose processor implementation.
IMPLEMENTATION OF THE VLIW COPROCESSOR
This section presents the organization of the VLIW coprocessor for the IEEE-754 Quad elementary functions (QP VELP), the idea of which is similar to the concept of horizontal microcode in Wilkes [1951] . The coprocessor uses the unified hardware to evaluate various Quad basic operations and Quad elementary functions.
The QP VELP Structure
Figure 2(a) shows that the QP VELP mainly consists of a parallel computation module, a control state machine module, a two-level VLIW instruction RAM module, and an update VLIW instruction module. The control state machine module controls the running of QP VELP via the change of two program counters (PC1 and PC2). The update VLIW instruction module controls the initialization of VLIW instruction RAM and allows it to support new elementary functions and key program kernels. The evaluations of Quad basic operations and Quad elementary functions are accomplished by the parallel computation module through the explicitly parallel technique of the custom VLIW instruction. This module consists of four on-chip memory modules, control logic, and multiple basic 128-bit fixed-point arithmetic units.
The four on-chip memory modules comprise the lookup table, ROM-C, and two RAM-M. The Lookup- Table, which is a 512×10-bits ROM with a single port, stores the initial approximation for Newton's method based on the lookup table method, such as the initial reciprocal value for the division operation and the approximation of reciprocal square root for the square root operation. ROM-C, a 512×128-bits ROM with two read ports, stores constants such as the coefficients of polynomial and the initial value for the polynomial approximation method such as exponent and logarithm. Each RAM-M is a 32×128-bits RAM with two read/write ports. So, two RAM-Ms provide four 128-bit read or write ports and serve the role of register file as in a general-purpose processor to buffer the middle data. These on-chip memory modules are the major data sources for the atomic operation units.
The atomic operation units are divided into the functional and computational groups. The first group, which includes the lookup table operation, the count leading zero operation, and the normalize operation, appears once for a given elementary function. Thus, only a single unit is instanced in the QP VELP unit. In Figure 2 (a), the atomic operations in the computational group, which includes the 128-bit multiplier (n M ), adder/subtracter (n A/S ), and shifter (n S ), are executed frequently. The 128×128 pipelined truncated multiplication unit is the largest hardware module in QP VELP and occupies more than 40% of the area. However, Estrin's scheme for the polynomial, described in Section 5.2, fully utilizes the pipeline parallel of the 128-bit multiplier. Thus, n M = 1 achieves good area-performance balance. The utilization of this multiplier is greater than 50%, as shown in Figure 4 (e). The slightly lower hardware costs for the adder/subtracter and shifter, two adder/subtracters (n A/S = 2), and two shifters (n S = 2) are employed to improve the performance. The latency of the 128-bit multiplier is 6 cycle and that of the other atomic operation units are all one cycle.
A fully associative bypass network, as the data bus shown in Figure 2 (a), is designed to connect these on-chip memories and atomic operation units. This network is composed of ten 128-bit 8:1 selectors to provide operands for atomic operation units, and occupies about 9% of the total area in QP VELP. However, this network can provide 1280-bit internal bandwidth, so these operation units can be executed in parallel to improve the performance of QP VELP. 
128-Bit Truncated Multiplier
The QP VELP uses a 128-bit internal interface and computation precision. It is required that the error should be less 2 −128 that for all basic operation units. Therefore, a 128-bit truncated multiplier is presented to reduce resources, delay, or power consumption. The idea is to save the computation of some of the less significant columns in the multiplication array, under the condition where the error is smaller than 2 −128 . Next, we determine the maximum number of calculated columns, denoted by ( p + k) in Figure 2 (b). The error of truncated multiplier comes from two sources: approximation error (ε A ) and rounding error (ε R ). Here, ε A is the sum of the truncated columns that 12:10
And ε R is the error of rounding the ( p + k)-bits intermediate result to p bits, with the rounding to nearest mode that satisfies ε R ≤ 2 −129 . Thus, ε A ≤ 2 −129 and d ≤ 120. Figure 2(b) shows that the 128-bit truncated multiplier uses eight guard bits. It consists of a 64×64 multiplier, two 32×32 multipliers, four 16×16 multipliers, twentyfour 8×8 multipliers, and some logic. The area of this multiplier is approximately 56% of the full 128-bit multiplier.
Two-Level VLIW Instruction RAM Scheme
The VLIW instruction RAM stores the custom VLIW instruction sequence for each elementary function. It is an important part that ensures the good performance, flexibility, and hardware cost of the QP VELP. In the current study, a two-level VLIW instruction RAM scheme is presented, as shown in Figure 2 (a).
The first level of VLIW instruction RAM, called the compressed VLIW instruction module, stores the custom VLIW instruction sequences for Quad basic operations (Quad addition and Quad multiplication), common elementary functions (Quad division, square root, exponential, logarithm, sine, and cosine), and parts of coarser-grain functions. Since Quad basic operations and common Quad elementary functions are the basic components for Quad applications, the VLIW instruction sequences for these Quad operations are solidified into QP VELP with ROM. Thus, QP VELP can accomplish basic Quad arithmetic without updating VLIW instructions.
In order to enhance the scalability of QP VELP, part of the compressed VLIW instructions RAM is implemented with RAM. Special VLIW instruction sequences can be custom for some Quad coarser-grain functions, which are the basic functions in one domain. The Update VLIW instruction module loads these sequences into VLIW instruction RAM. Therefore, the QP VELP can implement these coarser-grain functions with high efficiency.
The second level of VLIW instruction RAM is designed for the extension of some coarser-grain functions and key program kernels on QP VELP. For example, the coarser-grain function e x − x can be divided into two continuous Quad operations (Quad exponential and Quad addition), so the VLIW instruction sequences for Quad exponential and addition are called ordinally to evaluate e x − x. The second level is built with an address sequence RAM, which stores the initial address of scheduled functions in compressed VLIW instructions RAM. Thus, each word in this level includes a control field and an address field, as shown in Figure 2(c) .
The two-level VLIW instruction RAM is reloaded by the update VLIW instruction module to implement coarser-grain functions and key program kernels for some applications. The width of this configure path is 64 bits.
A. Compression scheme for VLIW instruction
The code density is the primary disadvantage of the VLIW architecture. In each cycle, the one-to-one correspondence among functional units should be kept in the custom VLIW instructions. This process introduces the "horizontal" no-ops for unused units, which are the main sources of inefficiency for code density. Therefore, in the QP VELP design, a code compression scheme based on the flag is presented in the first level of VLIW instruction. We built one flag, called Sel Fu, for each functional unit in the U se Flag RAM. The Sel Fu was set to 1 if the corresponding functional unit was used in this cycle. Otherwise, it was set to 0. A control signal RAM was built for each unit (Fu i ), which merely stored the used control signal, indicating that the corresponding flag was valid (Fu i = 1). The address generator logic was then built to access the corresponding signal RAM. The control field for each unit is defined in Figure 2(c) , where sel x is used to select the data source for the computation unit, addr x is the address field for on-chip memory, A/S is the selecting signal for addition or subtraction, and en a is the written enable signal.
There is a need to add a flag and address generator logic for each functional unit in the proposed code compression scheme. However, the distributed control signal RAM for each functional unit is advantageous to the ASIC placement and layout. Moreover, this scheme only stores the control signal for a used functional unit; thus, it can avoid wasting the VLIW instruction RAM fully. The compression ratio is relative to the frequency of use (P 
Implementation of Elementary Functions
We built the atomic operation sequence to ensure the scalability and customizability of QP VELP. Each elementary function was considered to be in isolation. Then, we customized the corresponding VLIW instruction sequence, which was integrated into QP VELP, to yield the desired result and achieve high performance. In the subsequent discussion, we took the Quad exponential function as an example to illustrate the details of customizability for elementary functions.
Step Table ( 
(C) Tradeoff of area and performance function with a series of atomic operations that can be executed by basic arithmetic units.
For example, the Quad exponential algorithm (y = e x ) is based on the lookup table and the Taylor series. As shown in Figure 3(a) , the atomic operation sequence consists of three steps. The two-level range reduction scheme was carefully designed to guarantee the precision of results. The trade-off is between the size of lookup table and the evaluation latency of the Taylor series, which is used in Estin's scheme. The precision of an initial value of 6 indicates better balance, thus, the lookup table is organized into a 64 × 128 ROM and the degree of polynomial is set at 15.
Step 2, Mapping the atomic operation sequence to the VLIW instructions. In this step, we mapped the atomic operation sequence into the custom VLIW instructions in accordance with the data dependence among them. Through static scheduling and optimization, we used as few VLIW instructions as possible to obtain the lowest latency. As in the exponential example in Figure 3(b) , and the schedule of polynomial approximation with Estrin's scheme, the pipeline of the multiplier and the explicitly parallel atomic operation units are used to improve performance.
Step 3, Packing VLIW instructions into the hardware functions. We used two methods to pack the custom VLIW instructions into the QP VELP. For common Quad elementary functions, the corresponding VLIW instructions are solidified into ROM in the second level of VLIW instruction RAM. For the extension of coarser-grain functions and key program kernels, the VLIW instructions are written into the first (or second) level of VLIW instruction RAM with the help of the update VLIW instruction module.
OPTIMIZATION SCHEMES ON QP VELP
In this section, we improved the performance and scalability of the QP VELP by introducing Estrin's scheme for polynomials and by using the extension scheme of coarsergrain functions, respectively.
Estrin's Scheme for Polynomial
As described in Section 2, polynomial approximation is a significant method for the evaluation of elementary functions. In QP VELP, two adders with one stage pipeline and a 128-bit pipelined multiplier with six stages evaluate the polynomial p(x) = a 0 + a 1 x + · · · + a n x n , where n is polynomial degree. Polynomials can be evaluated using numerous schemes, such as Direct, Horner, and Estrin's schemes.
The Horner scheme is written as
and is implemented using fused multiply and add blocks. It requires the least computation cost (n additions and n multiplications). However, the Horner scheme is a sequential structure that necessitates in serial the addition, shift, and multiplication operations. Thus, 7n cycles are required for the Horner scheme on QP VELP. For the Direct scheme, 2n-1 multiplications are required to evaluate the components (x i and a i x i , 1 ≤ i ≤ n). However, more internal parallelism can be exploited with a pipelined multiplier. The evaluation order as illustrated in Figure 3 in Lei et al. [2011] can be used to schedule as many operations as possible between two data dependence operations to reduce data hazard stalls and keep the pipeline full. As shown in Figure 4 (c), when the degree of a polynomial (n) is greater than 8, 2n + 12 cycles are required to evaluate this polynomial.
Estrin's scheme attempts to overcome the serialization of the Horner scheme while still being reasonably close to the optimal in terms of the number of multiplications. This scheme can be written as [Muller 2006 ]
where the number of multiplication operations is n+ log 2 (n) as shown in Figure 4 (d), and the multiplication and addition can be performed in parallel or in a pipelined fashion. Similar to the Direct scheme, to keep the pipeline full, parallelism among the multiplications and additions must be exploited by finding sequences that can be overlapped in the pipeline. First, we build the dataflow graph for Eq. (1), where the nodes refer to operations in Eq. (1), and the weight of the edge is the latency of the corresponding operation. The graph of a degree-7 polynomial is taken as an example in Figure 4(d) . Then, the maximum latency of each operation to the final result is evaluated through the modified DAG-shortest-paths algorithm [Cormen et al. 2001 ]. Finally, we schedule the operations according to the maximum latency and the data dependence among them. Thus, the operations in the critical path are executed first to obtain a lower bound on the latency. Figure 4 (c) shows the polynomial execution cycle using the three schemes. Note that Estrin's scheme executes fewer multiplication operations; thus, it is better than the Direct scheme (14 versus 21) and is employed in the current study.
Extension Scheme for Coarser-Grain Functions
We can easily customize the VLIW instruction sequences in the QP VELP to evaluate coarser-grain functions for several special application domains. Customization can be achieved in two ways.
For some coarser-grain functions, it is composed of multiple Quad basic operations and common elementary functions, such as log 2 (1 + 2 x ) = ln(1 + e x·ln 2 )/ ln(2) in the logarithm number system and ln(− ln(x)) in the Monte Carlo algorithm. The first method is to call the multiple VLIW instruction sequences of common elementary functions, and initializes the call sequences in the second level of the VLIW instruction RAM. Sequential execution is required for the multiple common elementary functions. For example, to evaluate y = ln(1 + e x·ln 2 )/ ln(2), we can execute the VLIW instruction sequences of Quad multiplication (t1 = x · ln 2), Quad exponential function (t2 = e t1 ), Quad addition (t3 = 1 + t2), Quad logarithm function (t4 = ln(t3)), and Quad multiplication (y = t4 * (1/ln (2))) , and x 2 + y 2 + z 2 , which is used to evaluate the distance or velocity in three-dimensional simulation. Compared with the first method, this approach achieves higher performance through one-time reduction of the normalization operation and the scheduling of more ILP in the design of the VLIW instruction sequence.
Moreover, the QP VELP provides a scalability mechanism to accelerate the Quad key program kernels. The implementation method is the same as the first method for Quad coarser-grain functions. In our experiment in Section 7, we use the QP VELP to evaluate key program kernels in the SAR radar application.
ACCELERATOR CHIP OF THE QP VELP ARRAY
This section presents the structure and ASIC implementation of the accelerator chip with a QP VELP array.
Structure of Accelerator Chip
The performance of Quad arithmetic is improved using a Quad arithmetic accelerator chip (QAA) equipped with a QP VELP array; this is used to evaluate batch Quad functions. Figure 5 shows that this accelerator mainly comprises the following modules: task allocation, result recycle, PE array, large result buffer, DMA controller, and other control logic.
The task allocation module assigns computation tasks to each PE in a round-robin mode. This is meant to balance the tasks and guarantee that results are evaluated in order. At the same time, the task allocation module writes the number of executed PE into Result Order fifo. The result recycle module receives the results from the result FIFO, specified by Result Order fifo, and writes them into the result buffer, which is a 8192 × 128 RAM with one port.
The PE processor array consists of 4n FIFOs (n initial data FIFOs, n result FIFOs, and 2n function FIFOs) and n QP VELP units. At the start, each QP VELP initiates the same functions in VLIW instruction RAM. Therefore, these units can independently complete Quad elementary and coarser-grain functions, as well as key program kernels.
The active DMA mode implements data communication between the Quad arithmetic accelerator chip and the host. The DMA controller module controls the accelerator and the data transfer between memory and the accelerator chip without occupying CPU time. The implementation procedure is described next. In summary, the running time of the Quad arithmetic accelerator (T ACC ) includes time for the startup DMA (T Startup DM A ), computation (T Cal ) reading the original data from the memory to the accelerator (T Read ), and writing the results in the memory (T Write ). Generally, the start-up DMA time is the latency between the CPU and accelerator chip and is smaller than other parts. With the DMA controller, reading operations can overlap with the accelerator calculation, so that the total running time is
For the evaluation on large data, T Startup DM A is far less than the max{T Read , T Cal } and T Write . Thus, in this case, the running time is proportional to amount of calculated and transmission data.
Synthesis of Accelerator Chip on ASIC
The proposed hardware design, including the QP VELP unit and Quad arithmetic accelerator chip, is implemented on ASIC using structural-level Verilog HDL, which is simulated with ModelSim 6.5h, and synthesized using Synopsys Design Complier with TSMC 45-nm CMOS standard cell libraries respectively. All the syntheses on ASIC are tuned for delay optimizations with maximum effort and reported after placement and routing.
Table II provides the area, power, and delay estimation results of the QP VELP unit; it also shows the Quad arithmetic accelerator chip equipped with a varying number of QP VELP units. Although the design optimization technique reduces the hardware cost, the area for the 128-bit pipeline truncated multiplier takes the largest area consumption module, occupying roughly 42.5% of the QP VELP. Meanwhile, most of the other areas implement the on-chip memories in the parallel computation module and the two-level VLIW instruction RAM. The achievable maximum frequency of QP VELP unit is 1.43 GHz. Meanwhile, the critical delay occurs in the first stage of the 128-bit multiplier to generate multiple partial products. The QP VELP unit can be exploited for a deeper pipeline by performing more stages in order to achieve higher clock frequency. However, evaluating polynomial and elementary functions requires more cycles, thereby enlarging the latency for the given functions.
For the Quad arithmetic accelerator chip with 4 QP VELP units and with 8 QP VELP units, 64.6% and 78.1% of area is used to instantiate the QP VELP units, respectively. In order to reduce the area cost, the result buffer is implemented with one port RAM. It consumes about 21.1% of area for an accelerator chip with 8 QP VELP units. The maximum power dissipation of 1.61 W for the 8 QP VELP units is much lower than the 80 W consumed by an Intel Xeon E5620 CPU.
In the following, this accelerator with 8 QP VELP units, run at 1.39 GHz, is employed to accelerate scientific applications with Quad arithmetic.
EXPERIMENTS
We verified the correctness of the proposed designs through a prototype of the QP VELP unit and Quad arithmetic accelerator with the QP VELP array built on the FPGA board. Both the FPGA prototype and ASIC accelerator chip use active DMA mode to transfer data between host and accelerator. The differences lie in the bandwidth and latency. The performance of the ASIC implementation is predicted on the condition that the bandwidth and latency between the host and ASIC accelerator are similar to the DDR3 1333 MHz memory, that is, bandwidth is about 10GB/s and latency is about 65-ns [Kaseridis et al. 2011] .
Next, we applied the 256-bit arithmetic in MPFR library [Fousse et al. 2007 
FPGA Prototype
As shown in Figure 6 , the FPGA prototype mainly consists of two FPGAs (i.e., Virtex-5 XC5VLX50T-1FF1136 and Virtex-6 XC6VLX760-1FF1760) and two 2GB DDR2 DRAM modules. The XC6VLX760 chip implements the proposed design. The XC5VLX50T chip provides a link between the XC6VLX760 and host PC through an eight-line PCI-express bus with a bandwidth of 900MB/s.
The QP VELP unit and Quad arithmetic accelerator with eight QP VELP units on XC6VLX760-1FF1760 FPGA synthesized by ISE 12.3 achieved maximum frequency of 225 MHz and 198 MHz, respectively, which are much lower than that of the ASIC implementation, because the circuit is emulated with a large number of configurable elementary blocks and network of wires on FPGAs.
Precision and Performance of QP VELP
Precision comparison. For each function, a corresponding VLIW algorithm is designed, in which the internal precision is carefully planned and the guard words are used to guarantee the results' accuracy, thus faithful rounded results [IEEE 2008] , the error of which is smaller than 1 ULP (unit of the last place), can be obtained for all inputs. However, it is very difficult to guarantee correctly rounding, in which the error is smaller than 0.5 ULP, for all inputs because of the so-called Table Maker's Dilemma [Lefevre et al. 1998 ], which requires exceedingly high intermediate accuracy to obtain a correct rounded result. Specifically, Lefever and Muller reported that the worst case in correct rounding requires 158 bits of the intermediate accuracy in the double-precision exponential function [Lefevre et al. 2001] . Therefore, in the current study, one bit error may occur due to the rounding operation, and the percentage of correct rounding is over 99% for the basic Quad operations and common Quad elementary functions.
For the second implementation method of coarser-grain functions, such as the CF1-CF5 shown in Figure 7 (a), the VLIW instruction is customized similar to that of the common elementary functions; moreover, rounding operations are executed only once. Thus, a faithful rounded result (one bit uncertainty at most) is guaranteed for all inputs. For the first implementation method, such as CF6-CF9, a two-bit uncertainty occurs for a very small part of the results due to the accumulation of rounding errors.
Performance comparison. Table III compares the performance of the Quad basic operation and common elementary functions on an Intel CPU and ASIC.
The Intel Fortran performance in Table III is the average time of 100,000,000 random inputs. Compared to this performance, one QP VELP unit with 1.4 GHz can achieve a 5.7x average performance improvement. The current study used the polynomial approximation (or high-order Newton method) with high convergence for hardware implementation, and maximized the highest possible level of parallelism with Estrin's Lei et al. [2011] , the execute cycle for Quad elementary functions is reduced by a factor of 2 to 12 due to Estrin's scheme for the polynomial evaluation. If one QP VELP module is integrated into the CPU for Quad arithmetic and the datapath of CPU is 128 bits, we can read a 128-bit data from the register file of the CPU or write a 128-bit result to the register file in one clock cycle. Therefore, for twooperand basic Quad operations in QP VELP three cycles are added to read two 128-bit operand and write one 128-bit result, and to one-operand Quad elementary functions in QP VELP two cycles are added to read one 128-bit operand and write one 128-bit result. The average performance boost with this hardware module will be 5.2×.
Due to the strong data dependency between the basic operations for Quad elementary functions, the available ILP is limited. Therefore, the Quad elementary functions are latency-dominated kernels. In the current article, Estrin's scheme is employed to provide more ILP for the evaluation of the polynomial and a pipeline parallel schedule is used in QP VELP to improve the utilization of hardware. The average ILP for common Quad elementary functions is about 1.1 and the utilization of 128-bit truncated multiplier reaches 41.4%.
There are many NOP operations in our scheduled VLIW instructions due to lowering of ILP, so the two-level VLIW instruction RAM scheme in QP VELP for common Quad elementary functions greatly reduces the storage requirements for VLIW instructions, and the average compression ratio for these functions is approximately 42.3%
Performance of Quad Arithmetic Accelerator
In this subsection, we evaluated the performance and scalability of the Quad arithmetic accelerator using Quad basic operations, elementary functions, coarser-grain functions, and several key program kernels on a set with 10,000,000 random inputs. We also used the SAR radar application on a 4k × 4k SAR row data. The Intel Xeon E5620 CPU supports a hyper-threading technique (i.e., four physical cores and eight virtual cores). Thus, in order to compare the performance of the Quad arithmetic accelerator, we implemented two versions of software approaches, namely, single-thread and eightthread based on the OpenMP parallel technique. For multithreading implementation, the test dataset is divided into eight one-dimensional arrays with the same size, and the Sections directive in OpenMP is used specify the enclosed section for the evaluation of the elementary function in one test array. 
(A) Quad Basic Operations and Elementary Functions
Comparing the single-thread and the eight-thread parallel software approaches, the Quad arithmetic accelerator achieved an average speedup factor of 25.9 and 5.1, respectively, as shown in the front part of Figure 7(a) . However, the speedup for Quad addition and multiplication operations is smaller than that for Quad elementary functions. Since the overhead for communication is bigger than computation overhead, the performance for Quad basic operations is limited by the IO bandwidth. As shown in Figure 7 (d), the ratio of execution time to the total running time for Quad addition and multiplication are 22% and 16%, respectively.
(B) Coarser-Grain Functions
A larger computation granularity induces a smaller ratio between the communication and computation overheads. Thus, performance of coarser-grain functions is better than that of the Quad elementary functions. As shown in Figure 7(d) , the average ratio of execution time to total time is 82.6% for coarser-grain functions. This accelerator outperforms the serial and the parallel software approach by factors of 27.6 to 94.9 and 5.5 to 17.9, respectively. The performance of the second way in the implementation of coarser-grain functions (CF1-CF5) is higher than that of the first way (CF6-CF9). This is because more atomic operations can be scheduled and more ILP can be exploited in the VLIW instruction sequence design.
(C) Key Program Section and CS Application
We used several key program kernels in two applications (i.e., UZ1 and CS), to evaluate the scalability of the Quad arithmetic accelerator on a larger computation granularity. The two key program kernels in the UZ1 application, which is used to solve unbalanced stiff equations and simulate the nonequilibrium phenomena of radiation, include division, square root, exponential and logarithm functions, as described in Zhou et al. [2008] . The proposed accelerator outperforms the parallel software approach by factors of 10.3× and 12.5×, respectively.
As shown in Figure 8 , the program flow diagram of the SAR radar algorithm (Chirp scaling algorithm) consists of four two-dimensional (2D) FFT, three compression factor generators, and three factor compression computations. The key operation in the 2D FFT is the butterfly computation, and that in factor compression is the complex multiplication. Figure 7 also presents the equations for the three compression factors. Therefore, the entire SAR radar application can be decomposed into the aforesaid five key program kernels.
Compared with three compression factor generations, the FFT butterfly and complex multiplication are merely composited with Quad addition and multiplication operations. In addition, the computation granularity is small. Therefore, the FFT butterfly and complex multiplication demonstrated lower performance than the factor generations. However, relative to only the addition and multiplication operations, the ratio between the communication overhead and computation overhead is reduced. Thus, the Quad arithmetic accelerator outperforms the parallel software by a factor of 7.7×.
In the SAR radar application on a 4k × 4k SAR row data, the five key program kernels have a ratio of 18:3:1:1:1. The FFT butterfly computation serves as a significant factor in the entire SAR radar application. Therefore, the performance of the Quad arithmetic accelerator for SAR radar applications is similar to that for the FFT butterfly computation. However, since the data dependency is more complex in the SAR radar application, the speedup between parallel and serial software is only 3.7×. Therefore, the proposed accelerator can outperform the parallel version by a factor of 9.3×.
CONCLUSIONS
In this article, we have presented a custom VLIW coprocessor for the IEEE-754 Quad basic operations and elementary functions in the unified hardware, equipped with multiple basic atomic operation units. Several optimization schemes, such as the 128-bit truncated multiplier, the two-level VLIW instruction RAM scheme, and Estin's scheme for polynomial evaluation, are proposed to achieve high performance, scalability, and customizability with less hardware cost. Finally, the FPGA prototype and ASIC implementation of the Quad arithmetic accelerator are built with a QP VELP array.
The proposed accelerator provides advantages in terms of better performance and power consumption in the evaluation of Quad elementary and coarser-grain functions, as well as Quad applications.
Our future work aims to explore the feasibility and potential performance improvement of equipping the processor with a Quad arithmetic accelerator. Moreover, we will further explore the autogeneration technique of VLIW instructions in a compiler.
