Abstract: With the growing popularity of decimal computer arithmetic in scientific, commercial, financial and Internet-based applications, hardware realisation of decimal arithmetic algorithms is gaining importance. Hardware decimal arithmetic units now serve as an integral part of some recently commercialised general purpose processors, where complex decimal arithmetic operations, such as multiplication, have been realised by rather slow iterative hardware algorithms. However, with the rapid advances in very large scale integration (VLSI) technology, semi- and fully parallel hardware decimal multiplication units are expected to evolve soon. The dominant representation for decimal digits is the binary-coded decimal (BCD) encoding. The BCD-digit multiplier can serve as the key building block of a decimal multiplier, irrespective of the degree of parallelism. A BCD-digit multiplier produces a two-BCD-digit product from two input BCD digits. We provide a novel design for such a multiplier and show some of its advantages in BCD multiplier implementations.
Introduction
Decimal computer arithmetic is preferred in decimal data processing environments such as scientific, commercial, financial and Internet-based applications [1]. The ever-growing need for processing power in applications with intensive decimal arithmetic cannot be met by conventional, slow, software-simulated decimal arithmetic units [1]. Their hardware counterparts, as an integral part of recently commercialised general purpose processors [2], are therefore gaining importance. Binary-coded decimal (BCD) encoding of decimal digits has conventionally dominated decimal arithmetic algorithms, whether realised in hardware or in software.
Research on the hardware realisation of decimal arithmetic has not yet matured, and there is room for improvement in hardware algorithms and designs. For example, the state-of-the-art BCD multipliers, for computing $X \times Y$, use iterative multiplication algorithms [3, 4], where the partial products (i.e. the product of one BCD digit of the multiplier $Y$ and the multi-BCD-digit multiplicand $X$) are generated one at a time and added to the previously accumulated result. Each partial product may be directly generated as one BCD number in $[0, 9] \times X$, or may be composed of a few easy multiples of the multiplicand (e.g. $7X = 4X + 2X + X$) [5]. The latter approach tends to increase the depth (measured by the maximum number of equally weighted BCD digits) of the partial product tree per BCD digit of the multiplier, which in general leads to slower partial product accumulation. By using fast and low-cost BCD-digit-by-BCD-digit multipliers, however, the former approach may lead to less costly BCD multipliers.
Erle et al. have given three reasons for using decimal digit-by-digit multipliers for partial product generation: fewer cycles, less wiring and no need for registers to store multiples of the multiplicand [4]. With the rapid advances in VLSI technology, semi-parallel and fully parallel BCD multipliers will soon be attractive, in which more than one partial product (semi-parallel) or all partial products (fully parallel) are generated at once and accumulated in parallel. The BCD-digit multiplier can be an integral building block of a BCD multiplier, whether it realises a sequential, semi-parallel or fully parallel multiplication algorithm. Alternative approaches are based on either slow accumulation of easy multiples [5], or costly retrieval of the product of BCD digits from look-up tables [6, 7].
General BCD multiplication
A general conventional paper and pencil view of decimal multiplication is depicted in Fig. 1, where in this figure and throughout the paper uppercase (lowercase) letters are used for decimal (binary) digits. Each decimal digit product $X_i \times Y_j$ is represented by two decimal digits $P_{ij}^h$ and $P_{ij}^l$ such that the former weighs ten times as much as the latter; hence the superscripts $h$ (for high) and $l$ (for low). Using BCD encoding for all decimal digits of Fig. 1 leads to a general BCD multiplication scheme, for which a hardware implementation may be achieved by one of the following sequential, semi-parallel or fully parallel approaches.
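The paper and pencil scheme can be modelled arithmetically in a few lines. The following Python sketch is illustrative only: the names `digit_product` and `multiply_digits` are hypothetical, digits are stored least-significant first, and plain integers stand in for BCD digits. It splits every digit product into its high and low digits and sums them with their decimal weights.

```python
def digit_product(x, y):
    """Return the (high, low) decimal digits of the product of two decimal digits."""
    assert 0 <= x <= 9 and 0 <= y <= 9
    return divmod(x * y, 10)


def multiply_digits(xs, ys):
    """Multiply two digit lists (least-significant digit first), paper and pencil style."""
    total = 0
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            high, low = digit_product(x, y)
            # The low digit has weight 10^(i+j); the high digit weighs ten times as much.
            total += low * 10 ** (i + j) + high * 10 ** (i + j + 1)
    return total
```

For instance, `multiply_digits([3, 2, 1], [6, 5, 4])` models 123 times 456.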
Sequential realisation
The product is generated in a register initialised to zero. In each iteration the multi-digit multiplicand is multiplied by one decimal digit of the multiplier, and the resulting partial product is accumulated in the product register, followed by a one-digit right shift. Depending on the partial product generation approach, to be discussed later, there may be equally weighted BCD digits in the representation of a single partial product (e.g. $P_{01}^h$ and $P_{02}^l$ and similar pairs in Fig. 1 give rise to two-deep partial products). Therefore the accumulation step may actually be equivalent to a multi-operand BCD addition. Fig. 2 depicts an abstract exemplary hardware realisation, where the three-operand BCD addition box receives a two-deep partial product and the one-deep accumulated result.
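The accumulate-and-shift loop can be sketched behaviourally as follows. This is a minimal arithmetic model under stated assumptions, not the datapath of Fig. 2: the function name `sequential_multiply` is hypothetical, Python integers replace BCD registers, and each partial product is formed with an ordinary integer multiplication rather than by digit multipliers or precomputed multiples.

```python
def sequential_multiply(x, y_digits):
    """Iterative decimal multiplication.

    x        : multiplicand as a non-negative integer
    y_digits : multiplier digits, least-significant digit first
    """
    n = len(y_digits)
    register = 0                                # product register, initialised to zero
    for y in y_digits:
        partial_product = x * y                 # one multiplier digit times the multiplicand
        register += partial_product * 10 ** n   # accumulate into the upper part of the register
        register //= 10                         # one-digit right shift
    return register
```

After all multiplier digits have been consumed, the register holds the full product; the right shift in each iteration is what aligns successive partial products.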
Semi-parallel realisation
This is similar to the previous one, except that in each iteration more than one digit of the multiplier takes part in partial product generation, which leads to a deeper multi-operand BCD addition. A two-digit-at-a-time realisation with a five-operand addition is depicted in Fig. 3.
Fully parallel realisation
Here, all partial products are generated at once and reduced together to two partial products to be added by a BCD adder. This case may be illustrated as in Fig. 1.
The problem of multi-operand BCD addition, as needed in the realisation of partial product reduction/accumulation, is generally discussed in [8]. For BCD partial product generation, however, one can think of two approaches:
BCD digit multiplication
This follows the conventional paper and pencil approach: a two-BCD-digit product may be looked up in a table addressed by the bits of two BCD digits of the multiplier and multiplicand [7], or a direct BCD-digit-by-BCD-digit multiplier may be realised.
Precomputed easy multiples of the multiplicand
A straightforward approach is to generate all ten possible multiples of the multiplicand at the outset of the multiplication process. Then, in each iteration, a ten-way selector controlled by a BCD digit of the multiplier selects the appropriate partial product, which is added to the result accumulated so far.
To reduce the number of precomputed multiples, and thus save multiple generation and selection hardware, a clever design is presented in [3], where only the multiples $2X$, $4X$ and $5X$ of the multiplicand are precomputed. With these easy multiples and the multiplicand itself, all the required ten multiples can be derived by at most one carry-free BCD addition without necessarily using any redundant BCD representation. The clever observation, leading to selection of the latter three multiples, is that each BCD digit of the multiplicand, when multiplied by 2, 4 or 5, results in a pair consisting of a decimal carry and a BCD digit $(c, W = w_3 w_2 w_1 w_0)$, such that for multiple 2, $c \le 1$ and $w_0 = 0$; for multiple 4, $c \le 3$ and $w_1 = w_0 = 0$; and for multiple 5, $c \le 4$ and $W$ is either 0 or 5. The latter characteristic guarantees that addition of carries to the equally weighted BCD digits will not generate any further carries.
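That every multiplier digit can be served by at most one addition of the stored multiples can be checked by enumeration. The sketch below is only an arithmetic illustration: the name `decompose` and the tuple `EASY` are hypothetical, the same multiple is allowed to be used twice (as in 8X = 4X + 4X), and nothing is implied about how [3] actually selects the operands or about the carry-free property.

```python
from itertools import combinations_with_replacement

# Coefficients of the stored multiples: X, 2X, 4X and 5X.
EASY = (1, 2, 4, 5)


def decompose(d):
    """Express the decimal digit d as a sum of at most two coefficients from EASY."""
    if d == 0:
        return ()
    if d in EASY:
        return (d,)
    for a, b in combinations_with_replacement(EASY, 2):
        if a + b == d:
            return (a, b)
    raise ValueError(f"digit {d} needs more than one addition")


for digit in range(10):
    print(digit, decompose(digit))
```

The loop raises no exception for any digit in [0, 9], so one addition of two stored multiples is always sufficient at the arithmetic level.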
BCD digit multiplication
The BCD encoding of decimal digits [0, 9] maps the latter set to [0000, 1001] such that $x_3 x_2 x_1 x_0$, as the BCD encoding of $X$ ($0 \le X \le 9$), satisfies the arithmetic equation
$X = 8x_3 + 4x_2 + 2x_1 + x_0$.
The digit-product function $p$ (two BCD digits in, two BCD digits out) may be realised, in a straightforward manner, by an eight-input, eight-output combinational logic or a $256 \times 8$ look-up table. But practical constraints on area and latency call for better optimised designs. An alternative design may use a standard $4 \times 4$ unsigned binary multiplier generating an 8-bit binary output, which should be corrected to two BCD digits with the same arithmetic value. Given that the product value belongs to [0, 81], its most significant bit (weighted $2^7$) is always zero. Fig. 4 depicts the regular partial product generation and reduction process of this multiplier.
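The look-up table option can be made concrete with a small behavioural model. The sketch below is an assumption-laden illustration, not a circuit from this paper: the names `build_lut`, `LUT` and `bcd_digit_multiply` are hypothetical, the table is addressed by the four bits of one digit concatenated with the four bits of the other, and the entries at invalid BCD addresses are simply left at zero as don't-cares.

```python
def build_lut():
    """Build a 256-entry, 8-bit-wide table of two-BCD-digit products."""
    table = [0] * 256                      # invalid BCD addresses stay 0 (don't-care)
    for x in range(10):
        for y in range(10):
            high, low = divmod(x * y, 10)
            table[(x << 4) | y] = (high << 4) | low   # pack the two BCD digits into 8 bits
    return table


LUT = build_lut()


def bcd_digit_multiply(x, y):
    """Return the (high, low) BCD digits of x * y for BCD digits x and y."""
    entry = LUT[(x << 4) | y]
    return entry >> 4, entry & 0xF


# The binary product of two BCD digits never exceeds 81, so bit 7 is always zero.
assert max(x * y for x in range(10) for y in range(10)) < 2 ** 7
```

Only 100 of the 256 addresses are ever used, which is one way to see why a table is a costly realisation of this function.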
In binary parallel multiplication, there are several techniques for partial product reduction (e.g. [9, 10]) and final product computation. For wide word operands (e.g. popular $54 \times 54$ bit multipliers [11]), the latter techniques show considerable efficiency. But in decimal multiplication, because radix 10 is not a power of 2, one needs to generate BCD partial products to be followed by BCD multi-operand addition. Therefore we need localised reduction trees, as in Fig. 4, for each BCD-digit multiplication, and alternative customised reduction techniques for better performance, to be discussed in Section 4. In this section, we proceed with converting the binary product $p_6 p_5 p_4 p_3 p_2 p_1 p_0$ to its equivalent BCD product $BC$, as depicted in Fig. 5. Although general binary-to-BCD conversion is extensively addressed in the literature (e.g. [12-14]), we have managed to design a special, simpler and faster, binary-to-BCD converter as depicted in Fig. 6. The first row in Fig. 6 depicts the required circuitry, whose correctness has been checked through VHDL (Very high speed integrated circuit Hardware Description Language) simulation; black-filled boxes show the critical delay path. Here, we only note that because $0 \le p_6 p_5 p_4 p_3 p_2 p_1 p_0 \le 81$, the following logical relations hold
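For readers unfamiliar with binary-to-BCD conversion, the generic shift-and-add-3 method is sketched below for a 7-bit input. This is the textbook technique, swapped in purely for illustration; it is not the special converter of Fig. 6, and the function name `binary_to_bcd` is hypothetical.

```python
def binary_to_bcd(p):
    """Convert a 7-bit value 0 <= p <= 99 to (tens, units) BCD digits by shift-and-add-3."""
    assert 0 <= p <= 99
    tens = units = 0
    for bit_index in range(6, -1, -1):          # consume p from its most significant bit
        # A digit of 5 or more would exceed 9 after doubling, so add 3 beforehand.
        if units >= 5:
            units += 3
        if tens >= 5:
            tens += 3
        # Shift the (tens, units) pair left by one bit and bring in the next bit of p.
        tens = ((tens << 1) | (units >> 3)) & 0xF
        units = ((units << 1) | ((p >> bit_index) & 1)) & 0xF
    return tens, units
```

A dedicated converter can be simpler than this general scheme because the input of a BCD-digit multiplier is confined to products of two digits, at most 81.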
BCD partial product reduction
In BCD encoding of decimal digits, bit strings 1010 to 1111 are not used. This leads to some bit interdependencies, which may be beneficial in designing a simpler and faster partial product tree for BCD digit multiplication.
Definition 1 (BCD constraint): Given that a BCD digit $X = x_3 x_2 x_1 x_0$, because $0 \le X \le 9$, does not assume all 16 possible bit strings, the constraints $x_3 x_2 = 0$ and $x_3 x_1 = 0$ hold.
Using the latter constraint on the bits of both $X$ and $Y$, the partial product tree of Fig. 4 may be redrawn as the one in Fig. 7, where $+$ is used to indicate a logical OR operation. Note that the items in the tree of Fig. 7 have been produced by adding the items in the relevant columns of Fig. 4, using the BCD constraint for simplifications.
Summation of the four operands in the third column from the right (i.e. the position of $p_2$) may produce a carry to the position of $p_4$ or a carry to the position of $p_3$, represented respectively as $x_2 y_2 x_1 y_1 x_0 y_0$ and $c$ in Fig. 8, where $c$ is easily derived as
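The constraint can be checked exhaustively. Since $0 \le X \le 9$, a set bit $x_3$ forces $x_2 = x_1 = 0$, which is the pair of conditions $x_3 x_2 = 0$ and $x_3 x_1 = 0$. The short Python sketch below (helper names `bits` and `satisfies_bcd_constraint` are hypothetical) confirms that these two conditions hold for every valid digit and fail for every unused bit string.

```python
def bits(v):
    """Return the bits (x3, x2, x1, x0) of a 4-bit value."""
    return (v >> 3) & 1, (v >> 2) & 1, (v >> 1) & 1, v & 1


def satisfies_bcd_constraint(v):
    """True when x3*x2 = 0 and x3*x1 = 0 for the 4-bit string v."""
    x3, x2, x1, _ = bits(v)
    return (x3 & x2) == 0 and (x3 & x1) == 0


valid = [v for v in range(16) if satisfies_bcd_constraint(v)]
print(valid)
```

The two conditions therefore characterise the BCD digits exactly, which is what allows them to be used freely when simplifying the partial product tree.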
To compute the binary product $p_6 p_5 p_4 p_3 p_2 p_1 p_0$, we use a carry look-ahead logic to add the items in positions $p_3$ to $p_5$, as depicted in Fig. 9. It turns out that, because of the BCD constraint defined above, no carry passes through the position of $p_5$. The overall delay of the circuits of Figs. 6 and 9, when cascaded, amounts to ten logic levels, where the black-filled gates show the critical path.
In iterative multiplication, the system-determined iteration cycle-time often allows for slower BCD digit multipliers. Therefore one may focus on area optimisation. The logic of Fig. 10 also depicts a binary product BCD digit multiplier, but with more delay and less area compared to that of Fig. 9. Note that although $p_6$ is the slowest output of the circuit in Fig. 10, the critical delay path of the whole multiplier, realised by cascading the circuit of Fig. 6 at the output of the circuit of Fig. 10, goes through $p_5$, and the overall delay amounts to 13 logic levels. We will show, in the next section, that the iterative BCD multipliers of previous works can easily accommodate the latency of our area-optimised BCD digit multiplier.
We have not encountered any direct implementation of BCD digit multipliers in the literature, except for look-up table implementations (e.g. [6, 7]). The latest work based on a decimal digit-by-digit multiplier converts the BCD operands to signed digits in $[-5, 5]$ and uses a signed-digit-by-signed-digit multiplier on a word-by-digit basis to generate the partial products, which are also represented by signed digits [4]. That work does not provide any area or time measures that can be used as a basis for comparison.
To compare our results with other published works, we have designed iterative BCD multipliers based on the delay-optimised and area-optimised BCD-digit multipliers of the previous section. One hardware realisation of BCD multipliers [3] uses the iterative approach with precomputed easy multiples as explained in Section 2. Our approach to partial product generation is different from that of [3], but both designs use the same method for partial product accumulation. Therefore, for the sake of accurate comparison, we deemed it enough to run simulations only on the first part of the multipliers, and measured the areas of the partial product generation logic for the two approaches through simulation based on a 0.25 μm complementary metal oxide semiconductor (CMOS) standard process. We had to use our own version of the equations for the multiple $5X$ because of seemingly wrong equations in [3]. For further explanation of this claim see the appendix. It turns out that the areas of our delay-optimised and area-optimised designs are 13% and 30% less than that of [3], respectively.
The partial product generation based on the easy multiples is faster than our partial product generation scheme with a latency of 13 logic levels, as derived in Section 4. In a pipelined design, however, this is not a disadvantage, because the partial product generation takes up one stage of the pipeline, whose cycle-time is determined by the latency of the slowest pipeline stage. That stage happens to be the partial product reduction stage, with 13 logic levels as explained below. The iterative BCD multiplier of [3] uses a special (4:2) compressor for partial product reduction. This function is realised by a seven-logic-level BCD digit adder (implemented based on the design in [15]) followed by a six-logic-level simplified one, where the second operand is a single bit.
More regularity in VLSI implementation may be considered as another advantage of our approach. The reason lies in using only one cell (i.e. the BCD-digit multiplier) in the whole partial product generation logic. In the easy multiples method, by contrast, different cells for different multiples and 4n-bit four-way multiplexers are used.
Another iterative multiplier [16] uses redundant decimal digits for the representation of intermediate partial products. It operates at a 14% higher clock frequency than that of [3] and ours, but requires 77% more area than that of [3], and certainly much more than ours.
Conclusion
We have designed a novel BCD-digit multiplier cell that can be used in conventional iterative BCD multiplier circuits. We showed that this design alternative leads to 30% savings in the area of the partial product generation logic. It neither affects the rest of the multiplier circuitry nor adds to the overall delay of a pipelined implementation. Our design leads to a more regular VLSI implementation, and does not require special registers for storing easy multiples. Further research is ongoing on the efficient use of the designed BCD-digit multiplier in semi- and fully parallel BCD multipliers.
Appendix
The equations provided in [3], for computing $5 \times X$, seem to be faulty. For example, try $5 \times 1 = (0101) \times (0001)$, which leads to 1 (0001), where the correct result is obviously 5 (0101).
The correct set of equations may be derived as follows. Let $X = X_{n-1} \ldots X_i X_{i-1} \ldots$
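As an arithmetic reference against which any gate-level equations for this multiple can be tested, the digit-wise behaviour of multiplication by five can be sketched as follows. This is not the set of equations of [3] nor the derivation of this appendix; the function name `times_five` and the least-significant-first digit order are assumptions made for the example. Each digit contributes a value of 0 or 5 in its own position and a carry of at most 4 to the next position, so the per-position sum never exceeds 9.

```python
def times_five(digits):
    """Multiply a decimal number by 5, digit by digit.

    digits : decimal digits of X, least-significant digit first
    returns: decimal digits of 5X, least-significant digit first (one digit longer)
    """
    result = []
    carry_in = 0
    for d in digits:
        w = 5 * (d & 1)               # own-position contribution: 0 for even d, 5 for odd d
        c = d >> 1                    # carry into the next position, at most 4
        result.append(w + carry_in)   # at most 5 + 4 = 9, so no further carry arises
        carry_in = c
    result.append(carry_in)           # most significant digit of the result
    return result
```

For the single digit 1, the function returns the digits of 5, matching the expected value quoted in the counterexample.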
