Abstract-Applications in non-conventional number systems can benefit from accelerators implemented on reconfigurable platforms, such as Field Programmable Gate Arrays (FPGAs).
I. INTRODUCTION
A hardware accelerator is a co-processor, or an Application Specific Processor (ASP), that is connected to the computer's Central Processing Unit (CPU) via a standard bus [1].
Normally, the accelerator is a unit optimized for a set of numerical computations, and it can efficiently run all the applications requiring these computations. Graphics Processing Units (GPUs) are an example of such accelerators. Originally introduced to off-load graphics from CPUs, GPUs have evolved into general-purpose many-core systems [2].
The raw power of state-of-the-art GPUs is quite impressive [3].
However, processing "non-conventional" data, such as the very long integers and modular arithmetic used in cryptography, or financial computations requiring the decimal number system, can benefit from Application Specific Processors. These ASPs can be implemented on Field Programmable Gate Arrays (FPGAs). FPGA accelerators can be fine-tuned to match the algorithm exactly, and FPGAs are easy to reconfigure according to the application.
Decimal arithmetic is usually implemented by software routines, as binary floating-point cannot represent most decimal fractions exactly and does not always round correctly [4]. However, software operations run 100-1000 times slower than the corresponding binary operations implemented in hardware. For these reasons, support for decimal representation was added in the revised IEEE standard 754 [5], and some companies are already commercializing processors which include decimal units [6], [7].
In this work, we show that the accounting typically done by telephone companies can be accelerated with a decimal processor implemented on an FPGA. As a case study, we consider a telephone billing application: the TELCO benchmark [8].
The results show that the execution of the benchmark on the FPGA-based accelerator is about 10 times faster than the execution on the CPU.
978-1-4577-0516-8/11/$26.00 ©2011 IEEE
II. THE TELCO BENCHMARK
The pseudo-code of the accounting algorithm of the TELCO benchmark [8] is listed in Fig. 1.
Finally, the benchmark program computes the totals of the call costs and of the applied taxes.
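The accounting steps can be sketched in software with Python's `decimal` module. This is a minimal sketch, not the benchmark's C program: the rate and tax constants are the values of the public TELCO benchmark and are assumptions here, as is the truncation of the taxes to the cent; the function names `bill_call` and `bill_all` are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

# Assumed constants (values of the public TELCO benchmark, not given in this text).
BRATE = Decimal("0.0013")   # basic rate per second, local calls ("L")
DRATE = Decimal("0.00894")  # distance rate per second, distance calls ("D")
BTAX = Decimal("0.0675")    # basic tax, applied to every call
DTAX = Decimal("0.0341")    # distance tax, applied to distance calls only
CENT = Decimal("0.01")


def bill_call(duration_s, call_type):
    """Return (price, basic tax, distance tax, total) for one call."""
    rate = BRATE if call_type == "L" else DRATE
    p = Decimal(duration_s) * rate
    pr = p.quantize(CENT, rounding=ROUND_HALF_EVEN)  # Pr = RoundtoNearestEven(P)
    # Assumption: taxes are truncated to the cent.
    b = (pr * BTAX).quantize(CENT, rounding=ROUND_DOWN)
    d = Decimal("0.00")
    if call_type == "D":
        d = (pr * DTAX).quantize(CENT, rounding=ROUND_DOWN)
    return pr, b, d, pr + b + d


def bill_all(calls):
    """Accumulate the four totals over an iterable of (duration, type) pairs."""
    totals = [Decimal("0.00")] * 4
    for duration_s, call_type in calls:
        totals = [t + v for t, v in zip(totals, bill_call(duration_s, call_type))]
    return tuple(totals)
```

For instance, `bill_call(50, "L")` yields a price of 0.06 because the exact product 0.065 is a tie and is rounded to the even cent, whereas binary floating-point cannot represent 0.065 exactly.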
III. THE HARDWARE ACCELERATOR

A. The Hardware Platform
The hardware accelerator is implemented on the Xilinx Virtex-5 LX330T FPGA. This FPGA is embedded on the Alpha Data ADM-XRC-5T2 board and is connected to the host PC via the PCI Express bus (Fig. 2). The CPU of the host PC is the Intel Core2 Duo processor clocked at 3 GHz.
To transfer data between the board and the PC, Direct Memory Access (DMA) is used. This allows access to the system memory for reading and writing independently of the CPU.
To further improve the performance, burst mode is used, which gives the DMA controller exclusive access to the bus without interruption. The implementation of the DMA functions, along ...
Fig. 1. Pseudo-code of the accounting algorithm (fragment):
  if (call type = L) P = duration * Brate;
  else P = duration * Drate;
  Pr = RoundtoNearestEven(P);
  B = Pr * Btax;
  ...
B. Telco Processor
The structure of the ASP of the TELCO processor, detailed next, is shown in Fig. 3.
C. Telco Application Specific Processor
The ASP of Fig. 3 can be considered as divided into three main parts:
- Calculation of call price.
- Calculation (in parallel) of Btax and Dtax (if any).
- Calculation of the total cost.
1) Calculation of call price: Although we can transfer 8-digit BCD numbers, six decimal digits are sufficient (10^6 seconds ≈ 11.5 days) to represent the duration n of a call. The call and tax rates are stored (hardwired) in the processor (the FPGA's look-up tables). For the call rate r, selected by a multiplexer depending on the type of the call, a 5-digit fractional number is sufficient.
The product p = n × r is computed by a 6 × 5 BCD multiplier similar to the one described later in Section III-E. The product is an 11-digit BCD number with 5 fractional digits, and it must be rounded to the nearest cent, as explained in Section II, to obtain p_r.
... adders can be found in [9]. The output is an 8-digit BCD number (6 integer and 2 fractional digits). Six decimal digits for the integer part of the total cost (one million) should be adequate for the application. However, if the number of calls is huge (several millions), the number of digits can be easily extended in the accumulation stage.
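The rounding of the call price can be illustrated with scaled integers: the rate is held as an integer number of 10^-5 units, so the product carries 5 fractional digits, and 3 of them are discarded when rounding to the cent. This is only a behavioural sketch on Python integers, not the BCD datapath; the name `round_price_to_cents` and the scaled-integer encoding are assumptions made for illustration.

```python
def round_price_to_cents(n_seconds, rate_scaled):
    """Price in cents for a call of n_seconds.

    rate_scaled is the rate multiplied by 10**5 (5 fractional digits),
    e.g. 894 for a rate of 0.00894 (an assumed example value).
    """
    product = n_seconds * rate_scaled          # scaled by 10**5, at most 11 digits
    cents, discarded = divmod(product, 1000)   # drop 3 of the 5 fractional digits
    # Round to nearest, ties to even.
    if discarded > 500 or (discarded == 500 and cents % 2 == 1):
        cents += 1
    return cents
```

With a 6-digit duration and a 5-digit rate, the largest product is 999999 × 99999, which has 11 digits, in agreement with the width stated above.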
D. BCD Addition
The addition of two BCD numbers ... This can be implemented with any binary prefix network.
3) A final stage to compute the final sum (modulus 10).
Details of the modulus 10 adder can be found in [9].
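As a behavioural reference for BCD addition, the sketch below adds two packed-BCD words digit by digit, with a ripple carry and the usual +6 correction. It is deliberately not the prefix-network adder of [9]: it only shows the result such an adder must produce. The function names and the packed encoding (4 bits per digit in a Python integer) are assumptions made for illustration.

```python
def to_bcd(value):
    """Pack a non-negative integer into BCD, 4 bits per decimal digit."""
    return int(str(value), 16)


def from_bcd(word):
    """Unpack a BCD word back into an ordinary integer."""
    return int(format(word, "x"))


def bcd_add(a, b, digits=8):
    """Add two packed-BCD words; return (sum, carry_out)."""
    result, carry = 0, 0
    for i in range(digits):
        da = (a >> (4 * i)) & 0xF
        db = (b >> (4 * i)) & 0xF
        s = da + db + carry            # binary sum of the two digits, 0..19
        carry = 1 if s > 9 else 0      # decimal carry into the next digit
        if carry:
            s = (s + 6) & 0xF          # skip the six unused codes (same as s - 10)
        result |= s << (4 * i)
    return result, carry
```

For example, `bcd_add(to_bcd(1234), to_bcd(5678))` returns the BCD word for 6912 with no carry out.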
E. BCD Multiplier
The parallel shift-and-add multiplication algorithm is based on the identity
$$p = X \cdot Y = \sum_{i=0}^{n-1} X \cdot y_i \cdot 10^i$$
where, for decimal operands, $y_i \in [0, 9]$ and $X$ is an n-digit BCD vector. To avoid complicated multiples of X, the multiplier Y is normally recoded so that only the multiples X, 2X and 5X (and the respective negative multiples) are necessary [9]. That is, each multiplier digit is recoded as $y_i = y_{Hi} + y_{Li}$ with $y_{Hi} \in \{0, 5, 10\}$ and $y_{Li} \in \{-2, -1, 0, 1, 2\}$, as indicated in Table I, and the partial product is formed by adding the two multiples. For example, if $y_i = 8$, the corresponding partial product is obtained as
$$8X = 10X + (-2X)$$
A BCD multiplier scheme is shown in Fig. 4. It consists of four blocks:
1) Precomputation, where the multiples of X are computed.
2) Partial product generation, where each BCD digit $y_i$ selects the corresponding multiples (partial product) according to Table I.
3) Adder tree, where all the partial products are accumulated. There are several alternatives for the accumulation of partial products, as reported in [9], [10], [11]. We opted for the scheme of [9].
4) Carry-save → BCD conversion, a carry-propagate adder similar to the one of Section III-D.
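The digit recoding and the first three blocks can be mimicked on ordinary Python integers. Table I is not reproduced in this text, so the mapping below is an assumption: it is the only split of each digit 0..9 into a high part in {0, 5, 10} and a low part in {-2, ..., 2}. The sketch accumulates the partial products with plain addition and does not model the carry-save adder tree or the BCD encoding; the function names are hypothetical.

```python
def recode_digit(y):
    """Split a digit y in 0..9 into (yH, yL), yH in {0, 5, 10}, yL in {-2..2}."""
    if y <= 2:
        yh = 0
    elif y <= 7:
        yh = 5
    else:
        yh = 10
    return yh, y - yh


def multiply_recoded(x, y, n_digits):
    """Multiply x by an n_digits-digit y using only the multiples X, 2X, 5X, 10X."""
    # Precomputation of the multiples (10X is just X shifted by one digit).
    multiples = {0: 0, 1: x, 2: 2 * x, 5: 5 * x, 10: 10 * x}
    acc = 0
    for i in range(n_digits):
        yi = (y // 10**i) % 10
        yh, yl = recode_digit(yi)
        low = multiples[abs(yl)]
        partial = multiples[yh] + (low if yl >= 0 else -low)  # partial product
        acc += partial * 10**i                                # weighted accumulation
    return acc
```

For the digit 8 the sketch selects 10X and -2X, matching the example given in the text.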
F. The Hardware Implementation
The accelerator of Fig. 2 is implemented on the FPGA of the Alpha Data board. The ASP of Fig. 3 is implemented with three 8 × 8 BCD-digit multipliers and a final 8-digit BCD carry-propagate adder (CPA). The rates and taxes are chosen by setting multiplexers. The ASP is pipelined into 8 stages (3 stages are necessary for the multipliers), and it can sustain a maximum frequency of 127 MHz. However, the accelerator is clocked at 80 MHz, which is the frequency of the DMA. Data are read from the input FIFO buffer into the accelerator, and the total (partial) cost is queued in the output buffer and sent back to the CPU for logging and to verify the functionality of the processor. As the input FIFO buffer can be read only every second clock cycle, the effective maximum frequency of operation is 40 MHz (25 ns per element), resulting in a processor latency of 8 × 25 = 200 ns.
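The timing figures above follow from simple arithmetic, reproduced in the sketch below. The per-element period and the latency match the numbers in the text; the total time for a stream of elements is an idealized estimate that assumes a full pipeline and ignores all DMA and host overheads, so it is a lower bound rather than a measured result. The function name and its parameters are hypothetical.

```python
def pipeline_timing(clock_mhz=80.0, cycles_per_element=2, stages=8,
                    n_elements=1_000_000):
    """Return (period_ns, latency_ns, ideal_total_ms) for the pipelined ASP."""
    # One element enters the pipeline every `cycles_per_element` clock cycles.
    period_ns = 1000.0 / clock_mhz * cycles_per_element
    # Each element crosses all the pipeline stages.
    latency_ns = stages * period_ns
    # Idealized: first result after the latency, then one result per period.
    ideal_total_ms = (latency_ns + (n_elements - 1) * period_ns) / 1e6
    return period_ns, latency_ns, ideal_total_ms
```

With the default arguments the period is 25 ns and the latency is 200 ns, and one million elements would take about 25 ms at the maximum throughput under these assumptions.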
Because the FPGA is quite large, only a small fraction of the logic is utilized. However, as the bottleneck is the data transfer from the host computer, the available space of the device cannot be utilized to further parallelize the algorithm.
IV. EXPERIMENTAL RESULTS
The experiment consists of running the TELCO benchmark on the CPU of the host PC, and of processing the list of call durations on the TELCO processor.
The execution time when running the C program on the CPU is 1.5 seconds for a set of one million calls (elements ...
... differently from CPUs and GPUs, the processor can feature special operators (e.g., BCD adders and multipliers).
As a case study, we chose an accounting application requiring the use of decimal arithmetic.
The results of the execution of the benchmark on the FPGA accelerator are compared with those of the benchmark execution on the CPU of the host PC. Considering the computation time (decimal part), the accelerator speed-up is about 10 times. The FPGA computation time per element is 5 to 7 times the one achievable at the maximum throughput. Therefore, by redesigning the accelerator I/O interface to handle larger data sets, we should be able to increase the ASP throughput and further improve the speed-up over the CPU execution.
