A study of decimal left shifters for binary numbers  by Gonzalez-Navarro, Sonia et al.
Information and Computation 216 (2012) 47–56
A study of decimal left shifters for binary numbers✩
Sonia Gonzalez-Navarro a,∗, Javier Hormigo a, Michael J. Schulte b
a Department of Computer Architecture, E.T.S.I. Informatica, University of Malaga, 29071, Malaga, Spain
b Department of Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, WI 53706, USA
Article info
Article history: Available online 30 March 2012
Keywords: Decimal arithmetic; Computer arithmetic; Decimal shifter; Barrel shifter; Hardware designs; Binary integer decimal
Abstract
The importance of decimal floating-point (DFP) arithmetic has been growing in recent
years, and specifications for it are included in the revised IEEE 754 Standard for
Floating-Point Arithmetic (IEEE 754-2008). IEEE 754-2008 specifies a binary encoding for decimal
significands, in which the significands of DFP numbers are represented as unsigned binary
integers. For this representation, which is commonly referred to as the binary integer
decimal (BID) encoding, fast decimal left shifting of a binary number is useful for operand
alignment, normalization, overflow avoidance, and quantize operations. A decimal left shift
of an unsigned binary integer, I, by S digit positions corresponds to multiplying I by $10^S$.
This paper presents the theory and design of decimal left shifters for binary numbers. The
designs perform decimal left shifting using optimized constant multiplications by selected
powers of ten. We propose and analyze different combinational and sequential decimal
left shifter architectures for binary numbers. The designs are compared to one another in
terms of area and delay using both theoretical estimates and synthesis results for 16-digit
(54-bit) binary inputs and 4-bit shift amounts. The results indicate that an optimized
radix-2 decimal shifter implementing partial shifters in carry-save format has 80% less area and
68% less delay than a lookup table followed by a binary multiplier to compute $I \times 10^S$,
when both designs are optimized for delay. When both designs are optimized for area, the
radix-2 decimal shifter has 89% less area and 31% less delay.
© 2012 Elsevier Inc. All rights reserved.
1. Introduction
Decimal floating-point (DFP) arithmetic is important in many applications due to its ability to represent decimal fractions
exactly and mimic manual calculations that perform decimal rounding. One disadvantage of binary floating-point arithmetic
is that it cannot exactly represent many decimal fractions, such as 0.01 and $10^{-35}$ [1]. It also does not provide correct
decimal rounding. For many applications, the representation error and lack of decimal rounding are not a problem. However,
several applications require decimal arithmetic operations and rounding across a wide range of exponents. Such applications
include currency conversion, billing, insurance, tax calculations, and banking. One study estimates that errors from using
binary floating-point arithmetic can result in an accumulated error of over $5 million per year for a large billing system [2].
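The representation error is easy to observe in software. The following minimal Python sketch is an illustration only and is not part of the original study; it uses the standard `decimal` module to expose the value that a binary64 variable actually stores for the literal 0.01.

```python
from decimal import Decimal

# Decimal(float) converts the binary double exactly, exposing the stored value.
stored = Decimal(0.01)
# Decimal(str) keeps the decimal fraction exactly.
exact = Decimal("0.01")

print(stored)           # the nearest binary64 value, which is not exactly 0.01
print(stored == exact)  # False
print(exact * 100)      # exactly 1.00 in decimal arithmetic
```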
Due to the importance of DFP arithmetic, the IEEE revised its standard for floating-point arithmetic (IEEE 754-2008 [3])
to include specifications for DFP formats and operations. With IEEE 754-2008, the value of a finite DFP number, x, is:
$$(-1)^{S_x} \times C_x \times 10^{E_x - \mathit{bias}}$$
✩ This research was supported by the Spanish grant CICYT TIN 2006-01078.
* Corresponding author.
E-mail addresses: sonia@ac.uma.es (S. Gonzalez-Navarro), fjhormigo@uma.es (J. Hormigo), schulte@engr.wisc.edu (M.J. Schulte).
http://dx.doi.org/10.1016/j.ic.2011.09.002
Fig. 1. Decimal shifting through a table lookup and a multiplier.
where $S_x$ is the sign bit, $E_x$ is a biased exponent, bias is a constant value that makes $E_x$ non-negative, and $C_x$ is the
significand, which is also referred to as the coefficient. IEEE 754-2008 specifies two encodings for DFP significands: (1) a binary
encoding, known as Binary Integer Decimal (BID), and (2) a decimal encoding, known as Densely Packed Decimal (DPD).
With the BID encoding, the significand is represented using an unsigned binary integer. With the DPD encoding, the
significand is represented using an unsigned decimal integer, in which three decimal digits are encoded using ten bits [4].
With either encoding, the significand of a DFP number is not normalized, which means that a single DFP number may have
multiple representations. More details on the DFP formats and operations are provided in [3].
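The following Python sketch illustrates the value formula and the existence of multiple representations of one number. It is only an illustration: the function name `dfp_value` is hypothetical, and the bias of 398 is assumed here to be the decimal64 value.

```python
from fractions import Fraction

def dfp_value(sign, coeff, biased_exp, bias):
    """Value of a finite DFP number: (-1)^sign * coeff * 10^(biased_exp - bias)."""
    return (-1) ** sign * coeff * Fraction(10) ** (biased_exp - bias)

BIAS = 398  # assumption: exponent bias of the decimal64 format

# Two different (coefficient, exponent) pairs that denote the same number 1.5.
a = dfp_value(0, 15, BIAS - 1, BIAS)    # 15   x 10^-1
b = dfp_value(0, 1500, BIAS - 3, BIAS)  # 1500 x 10^-3

print(a, b, a == b)
```

The second representation is the first one with its coefficient decimal left shifted by two digits and its exponent decremented by two.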
Several hardware designs for decimal floating-point arithmetic have been developed [1,5–13]. Furthermore, IBM recently
released decimal floating-point hardware on their POWER6, z9, z10 and z196 processors [14–17]. Most previous designs for
DFP arithmetic have focused on significands with a radix of ten, such as those that use Binary Coded Decimal (BCD) or DPD
encodings. Recently, however, hardware designs for DFP numbers that use the BID encoding have been presented [18–20].
Decimal left shifting of a binary number is important in several BID-based DFP units. For example, in DFP addition and
subtraction, one of the input operands may need to be left shifted so that both operands have the same exponent [9,19]. In
DFP division and square root, it is useful to normalize the input operands through a left shift operation that removes leading
zeros [7,13,10,21]. When DFP multiplication and division produce a result with an exponent that is too large to represent in
the destination format, an attempt is made to avoid overflow by left shifting the significand and decrementing the exponent
until either the exponent comes within range or the significand can no longer be left shifted without exceeding the result
precision [11,12]. When performing the DFP-specific operation quantize(x, y), if the exponent of x is greater than the
exponent of y, the significand of x is left shifted and its exponent is decremented until it equals the exponent of y or the
maximum result precision is achieved [3,9].
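The effect of quantize on the significand can be observed with Python's standard `decimal` module, which follows the same decimal arithmetic model in software. This is only an illustration of the operation's behavior and says nothing about the BID hardware itself.

```python
from decimal import Decimal

x = Decimal("1.5")    # coefficient 15, exponent -1
y = Decimal("0.001")  # only the exponent of y (-3) is used by quantize

q = x.quantize(y)

# The coefficient has been left shifted by two digits (15 -> 1500)
# and the exponent decremented from -1 to -3; the value is unchanged.
print(q.as_tuple())
```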
Decimal left shifting of an unsigned decimal integer, I, by S digit positions corresponds to multiplying I by $10^S$. In this
paper we use the notation DLS(I, S) to denote that I is left shifted by S decimal digit positions, such that DLS(I, S) =
$I \times 10^S$. Previous designs for BID arithmetic units have performed decimal left shifting by S digit positions through a table
lookup to obtain $10^S$, followed by a binary multiplication to obtain $I \times 10^S$ [19], as shown in Fig. 1.
Although this approach has the advantage that the binary multiplier used for decimal shifting can be used for other
operations, it has the disadvantage that the binary multiplier may become a bottleneck in the design if too many BID-based
DFP operations try to access it simultaneously.
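As a behavioral reference, the table-lookup approach can be sketched in a few lines of Python. The names `POW10` and `dls_table_mult` are hypothetical, and the 16-entry table assumes the 4-bit shift amounts used in this paper.

```python
# Lookup table holding 10^S for every 4-bit shift amount S = 0..15.
POW10 = [10 ** s for s in range(16)]

def dls_table_mult(i, s):
    """DLS(I, S) = I * 10^S via a table lookup followed by a binary multiply."""
    if not 0 <= s < len(POW10):
        raise ValueError("shift amount out of range")
    return i * POW10[s]

print(dls_table_mult(568, 7))
```

This model is useful as a golden reference when checking the staged shifters described in the following sections.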
This paper presents the theory and design of decimal left shifters for binary numbers. These designs perform decimal left
shifting using optimized constant multiplications by selected powers of ten. A wide variety of combinational shifters, which
differ in terms of the number of stages they require and the number and sizes of shifts in each stage, are studied. In addition,
pipelined and iterative (word-serial) implementations are also considered. The designs are compared to one another in
terms of area and delay using both theoretical estimates and synthesis results for 16-digit (54-bit) binary inputs and 4-bit
shift amounts. The remainder of this paper is organized as follows. Section 2 discusses the general technique of decimal
left shifting and presents the design of a radix-2 shifter with four stages that can perform up to 15 digit shifts. Section 3
presents higher radix shifters that require fewer, but more complex, stages than the radix-2 shifter. Section 4 presents
pipelined and word-serial decimal left shifters. Section 5 presents theoretical estimates and synthesis results for a wide
variety of decimal left shifters, followed by our conclusions in Section 6.
2. Decimal barrel left shifter
A decimal barrel left shifter is a combinational unit with two inputs: an m-bit unsigned binary integer, I, and an n-bit
unsigned binary shift amount, $S = s_{n-1}s_{n-2} \ldots s_1 s_0$. The output, O, is an m-bit unsigned binary integer that represents the
binary input I shifted S digit positions to the left, such that $O = \mathrm{DLS}(I, S) = I \times 10^S$, where $0 \le S \le 2^n - 1$.
Fig. 2. Radix-2 barrel shifter for up to 15-digit shifts.
The designs of our decimal barrel shifters are similar to designs for binary barrel shifters. In both binary and decimal
shifters, a fixed set of partial shifts is implemented, such that the complete shift is performed by a series of partial shifts.
For example, a 14-bit left binary shift might be implemented using partial left shifts of 8, 4, and 2 bits. Generally, the partial
shift amounts are $2^i$ for i = 0 to n − 1, and bit $s_i$ indicates whether a partial left shift by $2^i$ should be performed. This can be
expressed mathematically as:
$$\mathrm{DLS}(A, S) = \mathrm{DLS}(\mathrm{DLS}(\cdots \mathrm{DLS}(\mathrm{DLS}(\mathrm{DLS}(A, s_0 \cdot 2^0), s_1 \cdot 2^1), s_2 \cdot 2^2), \ldots), s_{n-1} \cdot 2^{n-1}) \tag{1}$$
Eq. (1) is implemented by a series of cascaded 2-to-1 multiplexers and partial left shifters, as shown in Fig. 2. The inputs
of each multiplexer are the output of the previous multiplexer and that same output left shifted by a fixed amount equal to $2^i$. Bit
$s_i$ controls the multiplexer that takes as one of its inputs the output of the previous multiplexer left shifted by $2^i$. With
this radix-2 left shifter architecture, the number of m-bit 2-to-1 multiplexers required for any shift amount up to $S_{max}$ is
$\lceil \log_2(S_{max}) \rceil$.
The description above is valid for binary and decimal barrel shifters. The main difference is how the partial shifts are
implemented. For a binary barrel shifter, the partial shifts are performed by hardwiring the multiplexer inputs to the
appropriate bits from the previous multiplexer's output. In the case of decimal shifting, however, the partial shifts cannot just
be hardwired, since decimal left shifting corresponds to multiplication by powers of ten. Nevertheless, for each multiplexer
the power of ten to multiply by is a constant. Thus an optimized constant multiplier circuit can be designed for each partial
shift.
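The cascade of Eq. (1) can be modeled behaviorally as follows. This is a minimal Python sketch with a hypothetical function name; the constant multipliers are modeled with ordinary integer multiplication rather than with the hardwired shift-and-add circuits discussed next, and the stages are applied in index order (the hardware may reorder them).

```python
def dls_radix2(i, s, n=4):
    """Radix-2 decimal barrel left shifter: n cascaded mux stages."""
    if not 0 <= s < (1 << n):
        raise ValueError("shift amount out of range")
    x = i
    for k in range(n):
        shifted = x * 10 ** (1 << k)         # constant multiplier: x * 10^(2^k)
        x = shifted if (s >> k) & 1 else x   # 2-to-1 multiplexer controlled by s_k
    return x

print(dls_radix2(568, 7))
```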
Each constant multiplication by a power of ten may be implemented by binary shifts and additions. If the constant by
which to multiply is denoted as C, then its bit $c_j$ being equal to one implies the addition of the input shifted j bits to
the left. Therefore, the number of additions required to perform multiplication by a constant C depends on the number of
ones in the binary representation of C [22]. For example, to perform a decimal left shift of 3 decimal digits, the constant
by which to multiply is $C = 1000_{(10)} = 1111101000_{(2)}$, so six shifted copies of the input have to be added. To reduce
the number of additions, a canonical radix-2 signed-digit representation of C is used. With this approach, C is recoded to
the digit set {−1, 0, 1} using the minimum number of non-zero digits. This reduces the number of operations, but may
require both additions and subtractions to perform constant multiplication. In Table 1, the binary representation of C and
its canonical signed-digit representation are shown for the first fifteen powers of ten. The table also shows the number of
operands that need to be added or subtracted to perform constant multiplication by C.
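The recoding can be sketched in Python as follows. The code computes the non-adjacent form, which is the standard way to obtain a canonical signed-digit representation with a minimal number of non-zero digits; the function names are hypothetical, and this is not necessarily the procedure the authors used to build Table 1.

```python
def csd_digits(c):
    """Canonical signed-digit (non-adjacent form) digits of c, LSB first."""
    digits = []
    while c > 0:
        if c & 1:
            d = 2 - (c & 3)  # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def csd_operands(c):
    """Number of shifted operands to add or subtract when multiplying by c."""
    return sum(1 for d in csd_digits(c) if d != 0)

print(csd_digits(1000)[::-1])                         # MSB first: 1024 - 32 + 8
print([csd_operands(10 ** k) for k in range(1, 9)])   # [2, 3, 3, 4, 6, 5, 8, 6]
```

The counts printed for $10^1$ to $10^8$ can be compared with the #Operands column of Table 1.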
To implement a decimal barrel shifter capable of performing left shifts up to fifteen decimal digit positions, four constant
multipliers are required for multiplication by $10^1$, $10^2$, $10^4$, and $10^8$ (see Fig. 2). The last column of Table 1 shows the
number of values to be added/subtracted to implement these multipliers. To reduce the critical path delay of the entire
decimal shifter, an adder tree with outputs in carry-save format can be used to implement each constant multiplier. It is
important to note that this doubles the required number of multiplexers, since sum and carry words have to be transmitted
from one partial shift stage to the next. Furthermore, the number of values to add is doubled in each stage, except for the
first stage. A carry propagate adder (CPA) is needed at the bottom of the decimal shifter if a binary representation of the
output is desired. Since this CPA is the same for any decimal shifter architecture (including the solution based on a table
lookup and a binary multiplier), we assume a carry-save output for the theoretical study of the shifters presented in this
paper.
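The carry-save idea can be illustrated with a bitwise model of a 3:2 counter. The sketch below is a generic Python illustration with hypothetical function names, not the authors' circuit: it multiplies by 100 = 64 + 32 + 4 using three hardwired shifts and one 3:2 counter, and leaves the result as a sum word and a carry word.

```python
def csa_3to2(a, b, c):
    """Bitwise 3:2 counter: reduces three operands to a sum and a carry word."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def times100_carry_save(x):
    """x * 100 in carry-save form, using the binary form 1100100 of 100."""
    return csa_3to2(x << 6, x << 5, x << 2)

s, carry = times100_carry_save(568)
print(s + carry)  # the final addition plays the role of the CPA
```

No carry propagates inside `csa_3to2`; the only carry-propagating addition is the final `s + carry`.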
The architecture for a radix-2 decimal left shifter for up to 15 digit positions is shown in Fig. 2, in which the first stage
implements an 8-digit partial shift. The architectures of the partial decimal shifters for the radix-2 decimal left shifter are
shown in Fig. 3. These architectures use carry-save 3:2 counters and 4:2 compressors. Each partial shifter performs
multiplication by the corresponding power of ten using hardwired binary shifts and additions/subtractions. For example, the
8-digit partial shifter, labeled DLS(I, 8), performs:
$$\mathit{in\_sh8} \cdot 10^8 = \mathit{in\_sh8} \cdot 2^{27} - \mathit{in\_sh8} \cdot 2^{25} - \mathit{in\_sh8} \cdot 2^{19} - \mathit{in\_sh8} \cdot 2^{17} - \mathit{in\_sh8} \cdot 2^{13} + \mathit{in\_sh8} \cdot 2^{8} \tag{2}$$
Table 1
Canonical signed-digit representation for the first 15 powers of ten.
C | Binary representation of C | Canonical signed-digit representation of C | #Operands
10^1 | 1010 | 1010 | 2
10^2 | 1100100 | 1100100 | 3
10^3 | 1111101000 | 10000101000 | 3
10^4 | 10011100010000 | 10100100010000 | 4
10^5 | 11000011010100000 | 101000101010100000 | 6
10^6 | 11110100001001000000 | 100010100001001000000 | 5
10^7 | 100110001001011010000000 | 100110001001011010000000 | 8
10^8 | 101111101011110000100000000 | 1010000010100010000100000000 | 6
10^9 | 111011100110101100101000000000 | 1000100101001010100101000000000 | 9
10^10 | 1001010100000010111110010000000000 | 1001010100000101000010010000000000 | 8
10^11 | 1011101001000011101101110100000000000 | 10100101001000100010010010100000000000 | 10
10^12 | 110100011010100101001010001000000000000 | 10001100100101100101001010001000000000000 | 12
10^13 | 10010001100001001110011100101010000000000000 | 10010001100001010010100100101010000000000000 | 12
10^14 | 10110101111001100010000011110100100000000000000 | 101001010000110100010000100001100100000000000000 | 12
10^15 | 11100011010111111010100100110001101000000000000000 | 100100100101000001010100100110001101000000000000000 | 14
Fig. 3. Partial left shifters for the decimal radix-2 barrel left shifter with up to 15-digit shifts.
The other partial shifters are similar, but take their inputs in carry-save format, which doubles the number of hardwired
binary shifts and additions/subtractions, but reduces the overall delay of the shifter.
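Eq. (2) can be checked numerically with a few lines of Python. The function name is hypothetical, and the sketch uses ordinary (non-redundant) integer arithmetic instead of the carry-save adder tree of Fig. 3.

```python
def dls8_shift_add(x):
    """Multiply x by 10^8 using the six shifted terms of Eq. (2)."""
    return ((x << 27) - (x << 25) - (x << 19)
            - (x << 17) - (x << 13) + (x << 8))

print(dls8_shift_add(1))    # the constant itself
print(dls8_shift_add(568))
```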
Figs. 2 and 3 indicate that a radix-2 decimal barrel left shifter for up to 15 decimal digits has four 3:2 counters, six
4:2 compressors, and eight 2-to-1 multiplexers. Since the outputs from the partial shifters are in carry-save format, each
multiplexer shown in Fig. 2 requires two multiplexers. The maximum delay of the critical path is $4t_{mux} + 2t_{3:2} + 5t_{4:2}$, where
$t_{mux}$, $t_{3:2}$, and $t_{4:2}$ are the time delays for a multiplexer, 3:2 counter, and 4:2 compressor, respectively. It is important to
note that the placement of the 8-digit partial shifter in the first stage is intentional, since it reduces the total number of
adders/subtractors. Implementing a 1-digit partial shift in the first stage would have added two 3:2 counters and one
4:2 compressor to the design, without changing the critical path delay. The general rule is to perform the partial shift that
requires the most additions/subtractions in the first stage. Since the input to the first stage is in non-redundant format, this
technique reduces the overall number of additions/subtractions.
3. High-radix decimal shifter implementations
In the previous section, the decomposition of the shift amount is performed in a bit-by-bit fashion, such that each bit
of S controls one multiplexer and n levels of partial shifts are needed, where $n = \lceil \log_2(S_{max}) \rceil$. However, a higher-radix
decomposition of S is also possible. With this approach, multiple bits of S are used to select the multiplexer inputs in each
stage. This reduces the number of stages at the expense of increasing the number of partial shifts implemented in each
stage and the number of inputs to each multiplexer. Thus, using a higher radix may improve the delay of the left shifter,
but with a significant increase in area.
Fig. 4. Radix-4 barrel shifter for up to 15-digit shifts.
3.1. Radix-4 implementation
In a radix-4 implementation, two bits of S are used to select the partial shift amount in each stage, as illustrated in the
following equation:
$$\mathrm{DLS}(A, S) = \mathrm{DLS}(\cdots \mathrm{DLS}(\mathrm{DLS}(A, s_0 \cdot 2^0 + s_1 \cdot 2^1), s_2 \cdot 2^2 + s_3 \cdot 2^3), \ldots, s_{n-2} \cdot 2^{n-2} + s_{n-1} \cdot 2^{n-1}) \tag{3}$$
where n is assumed to be even. This equation can be rewritten as
$$\mathrm{DLS}(A, S) = \mathrm{DLS}(\cdots \mathrm{DLS}(\mathrm{DLS}(A, s'_0 \cdot 2^0), s'_1 \cdot 2^2), \ldots, s'_{(n/2)-1} \cdot 2^{n-2}) \tag{4}$$
where $s'_i = s_{2i} + 2 \cdot s_{2i+1}$ and thus $s'_i \in \{0, 1, 2, 3\}$. If a radix-4 design is used, the number of stages is reduced from n to n/2,
but the number of non-zero partial shifts implemented in each stage is increased from 1 to 3. Thus, the total number of
non-zero partial shifts is n for radix-2 and 3n/2 for radix-4. As a result, the area may increase significantly when a radix-4
decimal shifter is used. For example, a radix-4 decimal shifter for left shifts of up to 15 decimal digit positions may be
implemented using decimal shifts of 0, 4, 8, and 12 in the first stage and 0, 1, 2, and 3 in the second stage.
As illustrated in Table 1, the required number of additions/subtractions usually increases for larger partial shift amounts.
The partial shifts required for the radix-2 left shifter (e.g., 1, 2, 4, 8, ...) are also necessary in the radix-4 left shifter. Some
of the additional partial shifts required in the radix-4 design are also larger, and in general more complex, since they are
compositions of the radix-2 shift amounts. This has an important effect not only on area, but also on the delay of each
stage, and consequently on the critical path delay. This effect reduces the advantage of halving the number of stages, as
shown in Section 5.
An initial radix-4 architecture to implement left shifts by up to 15 decimal digit positions (n = 4) has constant multipliers
by $10^4$, $10^8$, $10^{12}$ in the first stage and $10^1$, $10^2$, $10^3$ in the second stage. To minimize the number of adders/subtractors,
the stage with the most additions/subtractions is implemented first. However, this is not the only option for implementing a
radix-4 decimal barrel shifter. Since decimal shift operations (multiplications by powers of ten) fulfill the commutative and
associative properties, any combination of associations of two bits of S can be used to define a radix-4 shifter architecture.
Fig. 4 shows an example of an alternative radix-4 shifter with different partial shift amounts. Taking into account the
complexity of each constant multiplier, the best combination of partial shifts (in terms of area or delay) can be determined.
In Section 5, all possible shift combinations for radix-4 shifters are studied.
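The possible groupings of the four bits of S into two radix-4 stages, and the partial shift amounts each grouping implies, can be enumerated with a short Python sketch. The function names are hypothetical, and the order in which the two stages are placed is a separate design choice that the sketch does not model.

```python
from itertools import combinations

def partial_shifts(bits):
    """All shift amounts selectable by a stage controlled by the given bits of S."""
    amounts = {0}
    for b in bits:
        amounts |= {a + (1 << b) for a in amounts}
    return sorted(amounts)

def radix4_pairings():
    """Splits of bits s0..s3 into two 2-bit groups, each split listed once."""
    bits = (0, 1, 2, 3)
    pairings = []
    for pair in combinations(bits, 2):
        if pair[0] == 0:  # keep s0 in the first group to avoid duplicates
            rest = tuple(b for b in bits if b not in pair)
            pairings.append((pair, rest))
    return pairings

for first, second in radix4_pairings():
    print(first, partial_shifts(first), second, partial_shifts(second))
```

The three groupings produced correspond to the partial shift sets of the three radix-4 designs listed in Table 2.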
3.2. Other radix implementations
The use of radices higher than 4 can also be considered, but it requires a large increase in area. If p bits are used in
each stage, the number of stages is n/p and the number of non-zero partial shifts implemented in each stage is $2^p - 1$.
Therefore, the number of stages decreases linearly with p, but the number of constant multipliers per stage increases
exponentially. Hence, the total number of constant multipliers required for high-radix implementations increases very quickly.
Furthermore, as discussed above, the additional constant multipliers will generally be more complex for higher radices,
which affects both area and delay. For the case of n = 4 in radix-16, only one stage is required, but all possible decimal
shifts from 0 to 15 digits have to be implemented. With a radix-16 decimal shifter that can shift up to 15 digits, the area
required is 22 3:2 counters and 31 4:2 compressors, which is about five times more area than the radix-2 architecture.
The delay of the critical path is $1t_{mux16} + 3t_{4:2}$, which is about half the delay of the radix-2 architecture. Thus, a large
reduction in delay can be achieved, at the cost of an enormous area increase.
Another approach is to use heterogeneous radices, in which different stages may use a different number of bits from S
to select the multiplexer inputs. To guarantee that any number of digits can be shifted, each bit of S should
be used exactly once. For example, in a radix-4-2-2 architecture with up to 15-digit shifts (n = 4), two bits of S are
used in the first stage and the remaining two stages use one bit each. Similarly, in a radix-8-2 architecture, three bits are
used in one stage and the remaining bit is used in the other. When designing these shifters, it is useful to have the higher
radix stages with the simplest partial shifts and leave the larger, more complex shifts for low radix stages.
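A behavioral model that accepts an arbitrary assignment of the bits of S to stages makes the constraint explicit. The following Python sketch uses a hypothetical function name and a stage description chosen for illustration; each stage is given as a tuple of bit indices of S, first stage first.

```python
def dls_staged(i, s, stages):
    """Decimal left shift using stages that each consume a subset of the bits of S."""
    used = sorted(b for stage in stages for b in stage)
    if used != list(range(len(used))):
        raise ValueError("each bit of S must be used exactly once")
    if not 0 <= s < (1 << len(used)):
        raise ValueError("shift amount out of range")
    x = i
    for stage in stages:
        # The stage's multiplexer selects one partial shift amount.
        amount = sum(((s >> b) & 1) << b for b in stage)
        x *= 10 ** amount  # constant multiplier chosen by the multiplexer
    return x

# Radix-4-2-2 grouping V(s1s3 : s0 : s2) and radix-8-2 grouping V(s0s1s2 : s3).
print(dls_staged(3, 13, ((1, 3), (0,), (2,))))
print(dls_staged(3, 13, ((0, 1, 2), (3,))))
```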
4. Sequential shifters
The decimal left shifter designs presented so far are combinational units that will, in all probability, be
embedded in a sequential system. However, the delays of these designs may not meet the clock frequency
of the system in which they are implemented. In this case, one solution is to perform the left shift over several cycles.
To do this, the previous decimal left shifter designs can be pipelined, or a new word-serial (iterative) shifter circuit can be
introduced. In Section 5, synthesis results for several barrel left shifter designs with different numbers of pipeline stages
are shown and discussed in more detail. Next, the word-serial left shifter is presented.
One way to reduce area while still meeting the target clock frequency is to implement a word-serial left shifter. In a
word-serial implementation, the hardware for one iteration is reused several times to obtain the desired result. Specifically, in
each cycle a partial shift of up to $S'_{max}$ digit positions is performed, where $S'_{max} < S_{max}$. The result obtained in one cycle is
stored in a register to be used as the input in the next cycle. Thus, to perform a decimal left shift by S digit positions when
$S > S'_{max}$, the number of clock cycles required is $\lceil S/S'_{max} \rceil$. During $q = \lfloor S/S'_{max} \rfloor$ cycles, partial shifts of $S'_{max}$ digits are performed. If
S is not a multiple of $S'_{max}$, there is a last cycle in which an R-digit partial shift is carried out, where $R = \mathrm{rem}(S, S'_{max})$. Therefore,
the word-serial shift can be expressed mathematically as:
$$\mathrm{DLS}(A, S) = \mathrm{DLS}_1(\mathrm{DLS}_2(\cdots(\mathrm{DLS}_q(A, S'_{max}), S'_{max}), \cdots), R) \tag{5}$$
where the decimal left shift $\mathrm{DLS}_i$ is carried out in clock cycle i.
As an example of iterative left shifting, assume that A = 568, S = 7 and $S'_{max}$ = 3. In this case, three cycles (q = 2 and R = 1)
are needed to obtain the final result. After the first two 3-digit partial shifts, a partial result of A′ = 568,000,000 is
obtained. In the last cycle, a partial shift of one position is performed, giving the final result DLS(568, 7) = 5,680,000,000.
Consequently, a word-serial implementation has variable latency that depends on the shift amount S. The selection of $S'_{max}$
(the maximum shift amount per cycle) has an important impact on the latency and hardware requirements of the circuits. The
larger $S'_{max}$ is, the fewer iterations are needed to obtain the final result, but the longer the clock cycle and the larger the area of
the hardware needed.
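The iteration described above can be sketched behaviorally in Python. The function and parameter names are hypothetical; the sketch counts one cycle per partial shift, so it reports zero cycles for S = 0, a case the text does not discuss.

```python
def dls_word_serial(a, s, step=3):
    """Word-serial decimal left shift; returns (result, number of cycles)."""
    q, r = divmod(s, step)
    cycles = 0
    for _ in range(q):   # q cycles, each shifting by the full step
        a *= 10 ** step
        cycles += 1
    if r:                # one extra cycle for the remaining R digits
        a *= 10 ** r
        cycles += 1
    return a, cycles

print(dls_word_serial(568, 7))  # the worked example: three cycles
```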
In this paper we study two word-serial shifter architectures that perform left shifts of up to 15 decimal digit positions. In
both designs, $S'_{max}$ = 3 is selected as the maximum partial shift amount per cycle. Thus, up to 5 cycles are needed in the
worst case (i.e., when S = 15). There are two reasons for choosing hardware that implements shifts of up to $S'_{max}$ = 3 digits. The first
reason is that a large percentage of decimal operations on decimal64 floating-point operands need only a small amount of
shifting (for example, addition or quantize operations). The second reason is that these partial shifts (0-, 1-, 2- and
3-digit shifts) have the fewest operands to be added (or subtracted), as shown in Table 1.
Fig. 5 and Fig. 6 show the radix-2 and radix-4 word-serial decimal left shifters studied in this paper, respectively. The
radix-2 word-serial left shifter implements a 1-digit partial shifter in its first stage and a 2-digit partial
shifter in its second stage, whereas the radix-4 word-serial left shifter implements 1-digit, 2-digit and 3-digit partial
shifters in a single stage. Both designs implement partial shifters that operate on numbers in carry-save format. This doubles the number of
hardwired binary shifts and additions/subtractions, but reduces the overall delay of the shifters. In addition, each multiplexer
and register shown in Fig. 5 and Fig. 6 needs to be doubled. The hardware required for the radix-2 word-serial shifter in
carry-save format is two 3:2 counters, two 4:2 compressors, six 2-to-1 multiplexers and two registers to store the partial shifts.
Based on a technique described in [23], where a 3:2 counter is one unit, a 4:2 compressor is two units, and a k-input
multiplexer is 0.25 · k units, the total area is roughly estimated as the equivalent of ten 3:2 counters. Similarly, the total delay is estimated
assuming $t_{4:2} = 1.5t_{3:2}$, $t_{mux2} = 0.25t_{3:2}$, and $t_{reg} = t_{3:2}$, where $t_{reg}$ is the time delay for a register. Therefore, the delay of the
critical path of the radix-2 shifter is $3t_{mux2} + 2t_{4:2} + 1t_{3:2} + 1t_{reg} = 5.75t_{3:2}$.
The hardware required for the radix-4 word-serial shifter is four 3:2 counters, three 4:2 compressors, two 2-to-1 multiplexers, two
4-to-1 multiplexers and two registers to store the partial shifts. Therefore, based on the previous technique, the
total area of the radix-4 word-serial shifter in carry-save format is equivalent to fourteen 3:2 counters and, taking into account
that $t_{mux4} = 0.4t_{3:2}$, the delay of the critical path of the radix-4 shifter is $1t_{mux4} + 1t_{mux2} + 1t_{4:2} + 1t_{3:2} + 1t_{reg} = 4.15t_{3:2}$.
Additional hardware is required to implement the control logic of the shifters. Nevertheless,
the area of the control logic can be neglected in the previous approximation. As we will see below, the control involves only
a few bits of S, a small subtractor and a 4-bit register, so the area of these units is small
compared to the area of the hardware needed to shift decimal64 operands. Moreover, since the control logic
works in parallel with the partial shifts, it does not add delay to the critical path.
The control for these word-serial designs is quite simple, as can be seen in Figs. 5 and 6. The number of cycles needed
to perform an S-digit shift is determined as follows: in the first cycle, S is checked to see whether it is no greater than $S'_{max}$ = 3. If
so, only one cycle is needed to perform the decimal left shift. Otherwise, $S'_{max}$ is subtracted from S, giving
a partial amount S′ which is stored in a register. In the following cycles the same process is applied to S′ (the
remaining amount to shift) until S′ < 4. The signals $CT_i$ that control the multiplexers of the datapath are generated using
only some bits of S′.
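The control sequence can be modeled as a simple loop that produces the shift amount applied in each cycle. This Python sketch is only a behavioral reading of the description above, with hypothetical names; it does not model the actual $CT_i$ signal encoding of Figs. 5 and 6.

```python
def control_schedule(s, step=3):
    """Per-cycle partial shift amounts chosen by the subtract-and-compare control."""
    remaining = s
    schedule = []
    while remaining > step:         # for step == 3 this is the test S' >= 4
        schedule.append(step)       # shift by the maximum amount this cycle
        remaining -= step           # small subtractor; S' is kept in a register
    schedule.append(remaining)      # final cycle shifts by what is left
    return schedule

print(control_schedule(7))   # [3, 3, 1]
print(control_schedule(15))  # [3, 3, 3, 3, 3]
```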
Fig. 5. Word-serial (Iterative) radix-2 left shifter with up to 15-digit shifts.
Fig. 6. Word-serial (Iterative) radix-4 left shifter with up to 15-digit shifts.
5. Implementation results and comparison
5.1. Theoretical comparison
In this section the different decimal shifter architectures presented previously are compared to one another in terms of
area and delay using theoretical estimates. This allows a comparison that is independent of implementation technologies,
but does not take into account wire delay and fan-out. Each of the shifters presented is able to left shift by up to 15 decimal
digit positions. Since a single carry propagate adder may be used at the end of each design to produce a non-redundant
binary result, these elements are not included in the theoretical results presented here.
In Table 2, a selected group of designs combining different radices and partial shifts is presented with theoretical area
and delay results. The first column gives the name of each design, where {s3, s2, s1, s0} are the bits of S
used in each stage and ':' is used to separate different stages. For example, V(s1s3 : s0s2) is a radix-4 design in which
s1, s3 control the first stage and s0, s2 control the second stage. The third column shows, in parentheses, the partial
shifts involved in each stage. The total theoretical area and the area corresponding to the multiplexers and registers are
roughly estimated based on a technique described in [23], where a register is one unit, a 3:2 counter is one unit, a
4:2 compressor is two units, and a k-input multiplexer is 0.25 · k units. Similarly, the total delay is estimated assuming
$t_{reg} = 1t_{3:2}$, $t_{4:2} = 1.5t_{3:2}$, $t_{mux2} = 0.25t_{3:2}$, and $t_{mux4} = 0.4t_{3:2}$; higher radix multiplexers are implemented using the previous
ones. The columns labeled "O(A3:2)" and "O(t3:2)" show the total area and delay, respectively, of the multiplexers and the registers of the
different designs.
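The unit model can be written down directly. The following Python sketch uses hypothetical names and takes the counts of 3:2 counters and 4:2 compressors, plus the multiplexer/register contribution, as inputs; the example values are the radix-2 and radix-16 rows of Table 2.

```python
A_32, A_42 = 1.0, 2.0  # areas in units of one 3:2 counter
T_32, T_42 = 1.0, 1.5  # delays in units of t_3:2

def total_area(n32, n42, other):
    """Area from counter/compressor counts plus mux/register area (O(A3:2))."""
    return n32 * A_32 + n42 * A_42 + other

def total_delay(n32, n42, other):
    """Critical-path delay from element counts plus mux/register delay (O(t3:2))."""
    return n32 * T_32 + n42 * T_42 + other

radix2_area, radix2_delay = total_area(4, 6, 2), total_delay(2, 5, 1)
radix16_area, radix16_delay = total_area(22, 31, 4), total_delay(0, 3, 0.8)

print(radix2_area, radix2_delay)                  # 18 and 10.5 units
print(round(radix16_area / radix2_area, 2))       # normalized area
print(round(radix16_delay / radix2_delay, 2))     # normalized delay
```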
Based on Table 2, we can estimate that the radix-4 designs have between 25% and 30% less delay than the radix-2 design,
but have between 56% and 76% more area. The delay of the radix-16 design is about half the delay of the radix-2 design, but
it requires roughly ﬁve times as much area. Some of the radix-4-2-2 designs, such as V (s1s3 : s0 : s2), V (s0s1 : s2 : s3) and
V (s1s2 : s0 : s3), are promising, since they have less delay than the radix-2 design, with a relatively small increase in area.
Table 2
Theoretical results of area and delay for different shifter architectures.
Design | Radix | Partial shift | Area 3:2 | Area 4:2 | Area O(A3:2) | Area Total | Area Norm. | Delay 3:2 | Delay 4:2 | Delay O(t3:2) | Delay Total | Delay Norm.
V(s3 : s0 : s1 : s2) | 2 | (0,8)-(0,1)-(0,2)-(0,4) | 4 | 6 | 2 | 18 | 1.00 | 2 | 5 | 1 | 10.5 | 1.00
V(s1s3 : s0s2) | 4 | (0,2,8,10)-(0,1,4,5) | 5 | 12 | 2 | 31 | 1.72 | 1 | 4 | 0.8 | 7.8 | 0.74
V(s0s3 : s1s2) | 4 | (0,1,8,9)-(0,2,4,6) | 11 | 8 | 2 | 29 | 1.61 | 3 | 3 | 0.8 | 8.3 | 0.79
V(s2s3 : s0s1) | 4 | (0,4,8,12)-(0,1,2,3) | 8 | 9 | 2 | 28 | 1.56 | 2 | 3 | 0.8 | 7.3 | 0.70
V(s1s3 : s0 : s2) | 4-2-2 | (0,2,8,10)-(0,1)-(0,4) | 3 | 8 | 2 | 21 | 1.17 | 0 | 5 | 0.9 | 8.4 | 0.80
V(s0s1 : s2 : s3) | 4-2-2 | (0,1,2,3)-(0,4)-(0,8) | 4 | 7 | 2 | 20 | 1.11 | 2 | 4 | 0.9 | 8.9 | 0.85
V(s0s3 : s1 : s2) | 4-2-2 | (0,1,8,9)-(0,2)-(0,4) | 9 | 6 | 2 | 23 | 1.28 | 3 | 4 | 0.9 | 9.9 | 0.94
V(s0s2 : s1 : s3) | 4-2-2 | (0,1,4,5)-(0,2)-(0,8) | 6 | 7 | 2 | 22 | 1.22 | 3 | 4 | 0.9 | 9.9 | 0.94
V(s1s2 : s0 : s3) | 4-2-2 | (0,2,4,6)-(0,1)-(0,8) | 4 | 7 | 2 | 20 | 1.11 | 2 | 4 | 0.9 | 8.9 | 0.85
V(s2s3 : s0 : s1) | 4-2-2 | (0,4,8,12)-(0,1)-(0,2) | 6 | 8 | 2 | 24 | 1.33 | 2 | 4 | 0.9 | 8.9 | 0.85
V(s0s1s2 : s3) | 8-2 | (0,1,...,7)-(0,8) | 7 | 10 | 2.5 | 29.5 | 1.64 | 1 | 4 | 0.9 | 7.9 | 0.75
V(s0s1s2s3) | 16 | (0,1,...,15) | 22 | 31 | 4 | 88 | 4.89 | 0 | 3 | 0.8 | 5.3 | 0.50
WS-r2 | 2 | (0,1)-(0,2) | 2 | 2 | 2.5 | 10 | 0.55 | 1 | 2 | 1.75 | 5.75 | 0.54
WS-r4 | 4 | (0,1,2,3) | 4 | 3 | 5 | 14 | 0.77 | 1 | 1 | 1.65 | 4.15 | 0.39
Table 3
Synthesis results for different combinational architectures.
Design | Delay optimized: Area (μm2) | Delay optimized: Delay (ns) | Area optimized: Area (μm2) | Area optimized: Delay (ns)
Table+Mult (TLM) | 31960 | 1.24 | 17024 | 3.61
V(s3 : s0 : s1 : s2) | 10281 | 1.18 | 4569 | 2.44
V(s1s3 : s0s2) | 14395 | 1.00 | 8059 | 2.71
V(s0s3 : s1s2) | 14658 | 0.95 | 8219 | 2.66
V(s2s3 : s0s1) | 15261 | 0.93 | 7602 | 3.12
V(s1s3 : s0 : s2) | 12411 | 1.08 | 6300 | 2.61
V(s0s1 : s2 : s3) | 11926 | 1.03 | 5449 | 2.4
V(s0s1s2 : s3) | 19065 | 0.91 | 10631 | 2.25
V(s0s1s2s3) | 47335 | 0.69 | 26003 | 2.17
The word-serial designs reduce both the area and the clock cycle time. In the case of the radix-2 word-serial implementation,
both area and delay are almost halved compared to the combinational radix-2 design (V (s3 : s0 : s1 : s2)). Thus, this design
could be attractive if, in most decimal operations that use left decimal shifting, the shift amount is less than 6 (since at
most two cycles would then be needed and the delay would be similar to that of the combinational radix-2 design, with half
the area).
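The cycle count behind this argument can be illustrated with a small behavioural model of a word-serial shifter. This is a minimal sketch under stated assumptions, not the hardware design itself: the function name `word_serial_shift` is hypothetical, each cycle is assumed to shift by as many digits as possible up to a per-cycle maximum of 3 (the largest shift reachable with the (0,1)-(0,2) partial shifts of WS-r2), and a shift of zero is assumed to take one cycle.

```python
def word_serial_shift(value, shift, max_step=3):
    """Behavioural model: return (value * 10**shift, cycles used).

    ASSUMPTIONS: at most `max_step` decimal digits are shifted per cycle,
    the largest possible step is taken each cycle, and even a zero shift
    occupies one cycle.
    """
    if shift < 0:
        raise ValueError("shift amount must be non-negative")
    cycles = 0
    remaining = shift
    while True:
        step = min(remaining, max_step)
        value *= 10 ** step      # one pass through the partial shifters
        remaining -= step
        cycles += 1
        if remaining == 0:
            return value, cycles


if __name__ == "__main__":
    for s in (0, 3, 5, 6, 15):
        result, cycles = word_serial_shift(7, s)
        print(f"shift={s:2d} cycles={cycles} result={result}")
```

With these assumptions, shifts below 6 complete in at most two cycles, while the largest 4-bit shift amount (15) needs five.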
5.2. Synthesis results
In this section, practical implementations of decimal barrel shifters for decimal64 operands with significands encoded in
BID format are studied. For this case, the significand has 54 bits (to represent up to 16-digit decimal numbers) and the shift
amount can vary from 0 to 15 decimal digits. A selected group of the decimal shifters presented in this paper has been
implemented in Verilog, simulated using ModelSim 6.0, and synthesized using Synopsys Design Compiler and the TSMC
65 nm library, in which one cell unit has an area equal to 1 μm². For comparison purposes, a decimal shifter that uses a
table lookup and a binary multiplier tree has also been implemented. All designs produce their output in a non-redundant
binary format.
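The operand sizes quoted above, and the function computed by the radix-2 shifter V (s3 : s0 : s1 : s2), can be checked with a short behavioural model. This is a sketch in Python rather than the Verilog used for the implementations; the names `STAGES` and `decimal_left_shift_radix2` are hypothetical, the stage order is taken from the partial shifts listed for this design in Table 2, and the model does not truncate the result to a fixed output width.

```python
DIGITS = 16        # decimal64 significand digits
SHIFT_BITS = 4     # shift amounts 0..15

# Number of bits needed for the largest 16-digit decimal integer.
SIGNIFICAND_BITS = (10 ** DIGITS - 1).bit_length()

# (bit index of the shift amount, digits shifted when that bit is set),
# in the stage order (0,8)-(0,1)-(0,2)-(0,4) listed in Table 2.
STAGES = ((3, 8), (0, 1), (1, 2), (2, 4))


def decimal_left_shift_radix2(value, shift):
    """Multiply `value` by 10**shift using one constant multiplier per stage.

    ASSUMPTION: purely functional model; no output-width truncation.
    """
    if not 0 <= shift < 2 ** SHIFT_BITS:
        raise ValueError("shift amount must fit in 4 bits")
    for bit, digits in STAGES:
        if (shift >> bit) & 1:
            value *= 10 ** digits   # constant multiplication by a power of ten
    return value


if __name__ == "__main__":
    print("significand bits:", SIGNIFICAND_BITS)
    print(decimal_left_shift_radix2(123, 13))
```

Each stage either passes its input through or multiplies it by a fixed power of ten, so the four stages together cover every shift amount from 0 to 15.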
First, we present the implementation results of different combinational shifters modeled at a very high level, so that the
synthesis tool can optimize them easily. In these designs, the input/output signals of the partial shifters are in non-redundant
format, although the synthesis tool uses optimized redundant adder trees internally. Table 3 summarizes the area
and delay results for the selected combinational shifters when optimized for two different goals: minimum area or minimum delay.
As shown in Table 3, the radix-2 design is much smaller than the design based on a table lookup and a binary tree
multiplier (TLM), which is between 3 and 4 times larger. The radix-2 design is also faster: when optimized for area, its
delay is about a third lower than that of TLM. The higher-radix V (s0s1s2 : s3) and V (s0s1s2s3) shifters have the lowest
delay when all designs are optimized for delay. When optimizing for area, the radix-2 architecture has the lowest area.
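The comparison between TLM and the radix-2 design can be recomputed directly from the Table 3 entries. The snippet below is only an arithmetic check; the dictionary `TABLE3` and the helper `ratios` are hypothetical names, and the numbers are copied from the table.

```python
# design: (area delay-opt [um^2], delay delay-opt [ns],
#          area area-opt [um^2],  delay area-opt [ns]), copied from Table 3
TABLE3 = {
    "TLM": (31960, 1.24, 17024, 3.61),
    "radix-2": (10281, 1.18, 4569, 2.44),
}


def ratios(reference, design):
    """Element-wise ratio reference/design for the four table columns."""
    return tuple(r / d for r, d in zip(reference, design))


area_dopt, delay_dopt, area_aopt, delay_aopt = ratios(TABLE3["TLM"], TABLE3["radix-2"])

if __name__ == "__main__":
    print(f"TLM/radix-2 area:  {area_dopt:.2f} (delay opt.), {area_aopt:.2f} (area opt.)")
    print(f"TLM/radix-2 delay: {delay_dopt:.2f} (delay opt.), {delay_aopt:.2f} (area opt.)")
```

Both area ratios fall between 3 and 4, and the area-optimized delay ratio corresponds to the radix-2 design having roughly two thirds of the TLM delay.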
It is important to note that the area results obtained through synthesis roughly coincide with the theoretical area
estimates. The delay results obtained from synthesis for the higher-radix implementations, however, are much larger than
predicted by the theoretical estimates. This may be due to increased wire delays and fan-out in the higher-radix
architectures, which are not taken into account by the theoretical delay estimates. Also, the tool may have a more difficult
time synthesizing the higher-radix designs because they are more complex.
Table 4
Synthesis results for different combinational architectures in carry-save format.
| Design | Delay optimized: area (μm²) | Delay optimized: delay (ns) | Area optimized: area (μm²) | Area optimized: delay (ns) |
|---|---|---|---|---|
| Table+Mult (TLM) | 31960 | 1.24 | 17024 | 3.61 |
| V (s3 : s0 : s1 : s2) | 6430 | 0.39 | 1910 | 2.5 |
| V (s1s3 : s0s2) | 6739 | 0.43 | 1885 | 3.11 |
| V (s0s3 : s1s2) | 7090 | 0.43 | 1885 | 3.2 |
| V (s2s3 : s0s1) | 6729 | 0.42 | 1892 | 3.28 |
| V (s1s3 : s0 : s2) | 8076 | 0.39 | 1862 | 2.97 |
| V (s0s1 : s2 : s3) | 7622 | 0.41 | 1900 | 3.01 |
| V (s0s1s2 : s3) | 8318 | 0.45 | 2281 | 3.06 |
| V (s0s1s2s3) | 7319 | 0.49 | 3378 | 2.92 |
Table 5
Synthesis results for different pipelined shifters when optimizing for area.
| Design | 2 stages: area (μm²) | 2 stages: delay (ns) | 3 stages: area (μm²) | 3 stages: delay (ns) | 4 stages: area (μm²) | 4 stages: delay (ns) |
|---|---|---|---|---|---|---|
| Table+Mult (TLM) | 22086 | 1.51 | 26032 | 1.03 | 29493 | 0.78 |
| V (s3 : s0 : s1 : s2) | 2574 | 0.89 | 2930 | 0.79 | 3527 | 0.73 |
| V (s1s3 : s0s2) | 2645 | 0.91 | 3054 | 0.88 | 3714 | 0.74 |
| V (s1s3 : s0 : s2) | 2584 | 0.89 | 3032 | 0.88 | 3701 | 0.75 |
| V (s0s1s2 : s3) | 2975 | 0.95 | 3422 | 0.92 | 4079 | 0.73 |
| V (s0s1s2s3) | 4482 | 0.89 | 4580 | 0.87 | 5234 | 0.74 |
Table 6
Synthesis results for word-serial shifters.
| Design | Delay optimized: area (μm²) | Delay optimized: delay (ns) | Area optimized: area (μm²) | Area optimized: delay (ns) |
|---|---|---|---|---|
| WS-r2 | 5647 | 0.38 | 2938 | 2.22 |
| WS-r4 | 5929 | 0.38 | 3009 | 2.22 |
In a second implementation of the architectures, the partial shifters use carry-save input/output signals and were
modeled specifically using Wallace trees (as mentioned earlier, the final output is produced in a non-redundant binary
format). Table 4 summarizes the area and delay results for the different architectures.
As shown in Table 4, in this case the barrel shifter designs achieve an even larger area reduction compared to the design
based on a table lookup and a binary tree multiplier (TLM) (TLM is between 5 and 9 times larger than the radix-2 design)
and also achieve a significant reduction in delay (TLM is between 0.5 and 3 times slower than the radix-2 design). The radix-2
V (s3 : s0 : s1 : s2) and the higher-radix V (s1s3 : s0 : s2) shifters have the lowest delay when all designs are optimized for
delay. When optimizing for area, the V (s1s3 : s0 : s2) architecture has the lowest area.
It is important to note that the area results obtained through synthesis are slightly better than the theoretical estimates.
This may be because the tool reuses some components when implementing a shifter (for example, some Wallace tree modules
are reused to implement different partial shifts in the radix-16 shifter). Again, the delay results obtained from synthesis
for the higher-radix implementations are slightly larger than predicted by the theoretical estimates. As stated above, this
may be because the tool has a more difficult time synthesizing the higher-radix designs, which are more complex.
As mentioned in Section 4, we also evaluate several of the shifter architectures presented previously when they are pipelined
for high clock frequencies. Table 5 shows the results obtained for different numbers of pipeline stages when the designs are
optimized for area. As can be seen, the different barrel shifter architectures have less area and delay than the
multiplier-based design (TLM). Note that the radix-2 design remains a good candidate, in terms of area and delay, for use in
BID-based decimal floating-point arithmetic units.
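The Table 5 entries can be turned into latency and clock-frequency figures with a few lines of arithmetic. The sketch below assumes that the "Delay" column of Table 5 is the minimum clock period of the pipelined design and that the total latency is the number of stages times that period; neither assumption is stated explicitly in the text. The names `TABLE5` and `pipeline_metrics` are hypothetical.

```python
# design: {stages: (area [um^2], delay [ns])}, copied from Table 5
TABLE5 = {
    "TLM": {2: (22086, 1.51), 3: (26032, 1.03), 4: (29493, 0.78)},
    "V(s3:s0:s1:s2)": {2: (2574, 0.89), 3: (2930, 0.79), 4: (3527, 0.73)},
}


def pipeline_metrics(stages, period_ns):
    """ASSUMPTION: `period_ns` is the clock period; latency = stages * period."""
    return {
        "latency_ns": stages * period_ns,
        "freq_ghz": 1.0 / period_ns,   # 1 / ns = GHz
    }


if __name__ == "__main__":
    for design, configs in TABLE5.items():
        for stages, (area, period) in configs.items():
            m = pipeline_metrics(stages, period)
            print(f"{design:15s} {stages} stages: area={area:6d} um^2, "
                  f"f={m['freq_ghz']:.2f} GHz, latency={m['latency_ns']:.2f} ns")
```

Under these assumptions, deeper pipelining raises the clock frequency of every design while increasing both the area and the total latency.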
Table 6 shows the results for the two word-serial shifter architectures studied in the paper. The two designs have the same
delay under each optimization goal (area or delay). However, the radix-2 word-serial architecture has less area in both cases,
because the radix-4 design implements one more partial shifter. When optimized for delay, these designs have less area and
delay than the combinational radix-2 shifter. In contrast, when a shift greater than S′max = 3 is issued, the latency can be up
to 5 times the latency (i.e., the delay) of a radix-2 combinational circuit. Therefore, a word-serial design is a reasonable
choice for a system in which most shifts are known not to exceed S′max.
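The "up to 5 times" figure can be reproduced from the synthesis numbers with a simple latency model. This is a hedged sketch: it assumes that a shift of S digits takes ceil(S / S′max) cycles (at least one), that the cycle time is the delay-optimized word-serial delay from Table 6, and that the combinational baseline is the delay-optimized carry-save radix-2 design from Table 4. The choice of baseline is an assumption made because it yields a ratio close to five; the names below are hypothetical.

```python
import math

CYCLE_NS = 0.38          # Table 6, delay-optimized word-serial designs
COMB_RADIX2_NS = 0.39    # Table 4, delay-optimized V(s3:s0:s1:s2) (assumed baseline)
S_MAX_PER_CYCLE = 3      # largest shift handled in one cycle (S'max)


def word_serial_latency_ns(shift):
    """Latency of a word-serial shift, assuming ceil(shift / S'max) cycles."""
    if shift < 0:
        raise ValueError("shift amount must be non-negative")
    cycles = max(1, math.ceil(shift / S_MAX_PER_CYCLE))
    return cycles * CYCLE_NS


if __name__ == "__main__":
    for s in range(16):
        latency = word_serial_latency_ns(s)
        print(f"shift={s:2d} latency={latency:.2f} ns "
              f"ratio={latency / COMB_RADIX2_NS:.2f}")
```

For shifts of at most three digits the word-serial latency matches a single cycle, while the worst case (a shift of 15 digits) takes five cycles.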
6. Conclusions
This paper has presented the theory and design of several decimal left shifters for binary numbers, in both combinational
and sequential versions. The designs perform decimal left shifting using optimized constant multiplications by selected
powers of ten. The combinational designs differ from one another in the number of shifting stages and in the number
and size of the partial decimal shifts in each stage. In addition, pipelined and word-serial shifters have been introduced to
allow higher clock frequencies. When the designs are optimized for area, a radix-2 combinational decimal shifter has the
lowest area and delay and requires roughly 9 times less area than a lookup table followed by a binary multiplier. When the
designs are optimized for delay, a radix-2 word-serial decimal shifter has the lowest delay and area, with the drawback that
its latency can be up to five times greater than that of a radix-2 combinational shifter. In general, all the combinational
and pipelined barrel shifter architectures presented here have less area and delay than a table lookup followed by
a binary multiplier. Thus, it may be beneficial to implement these shifters in BID-based decimal floating-point arithmetic
units that have a binary multiplier. This can help to avoid contention in the binary multiplier, which is also used for
rounding, significand multiplication, and other functions.
