In arithmetic circuits for digital signal processing, radixes other than two are often used to make circuits faster. In such cases, radix converters are necessary. However, radix converters tend to be complex. This paper considers design methods for p-nary to binary converters. First, it considers Look-Up Table (LUT) cascade realizations. Then, it introduces a new design technique, called arithmetic decomposition, which uses LUTs and adders. Finally, it compares the amount of hardware and the performance of radix converters implemented on FPGAs. 12-digit ternary to binary converters on Cyclone II FPGAs designed by the proposed method are faster than those designed by conventional methods.
Introduction
Arithmetic operations of digital systems usually use radix two [9]. However, in digital signal processing, p-nary (p > 2) numbers are often used for high-speed operations [2], [6]. In such cases, conversions between binary numbers and p-nary numbers are necessary. Such an operation is called radix conversion [3], [8]. Various methods exist to convert p-nary numbers into binary numbers. Many of them require a large amount of computation. In particular, when the radix conversion is implemented by a random logic circuit, the network tends to be quite complex [7]. Radix converters can also be implemented by table lookup, that is, by storing the conversion table in a memory. This method is fast but requires a large memory.
In [11], LUT cascade realizations [10] of binary to ternary converters, ternary to binary converters, binary to decimal converters, and decimal to binary converters are presented. In [13], the concept of weighted-sum functions (WS functions) is used to design radix converters by using LUT cascades.
In this paper, we consider the design of circuits that convert p-nary numbers into binary numbers by using arithmetic decomposition [12]. We also consider the implementations on field programmable gate arrays (FPGAs). For readability, we use examples for p = 3; however, the methods can be easily extended to any prime number p. A preliminary version of this paper was presented at ISMVL-2006 [4].
Radix Converter

Radix Conversion
Definition 1: Let an n-digit p-nary number be $\vec{x} = (x_{n-1}, x_{n-2}, \ldots, x_0)_p$, and let an m-digit q-nary number be $\vec{y} = (y_{m-1}, y_{m-2}, \ldots, y_0)_q$. Given a vector $\vec{x}$, the radix conversion is the operation to obtain $\vec{y}$ that satisfies the relation:
$$\sum_{i=0}^{n-1} x_i p^i = \sum_{j=0}^{m-1} y_j q^j, \qquad (1)$$
where $x_i \in P$, $y_j \in Q$, $P = \{0, 1, \ldots, p-1\}$, and $Q = \{0, 1, \ldots, q-1\}$.
Let $\vec{y} = (y_{m-1}, y_{m-2}, \ldots, y_0)_2$, $y_i \in \{0, 1\}$, be the output functions of a p-nary to binary converter. Then, when p is a prime number, $y_i$ depends on all the inputs $x_i$ $(i = 0, 1, \ldots, n-1)$. When p is not a power of two, we have an incompletely specified function: when we implement a p-nary to binary converter, unused input combinations exist. Usually, we assign 0s to the undefined outputs.
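As an illustration of Eq. (1), the following Python sketch converts a digit vector from radix p to radix q by first evaluating the left-hand side and then extracting the q-nary digits. The function names `to_value` and `radix_convert` are hypothetical and are introduced only for this example; digit vectors are written most significant digit first, as in Definition 1.

```python
def to_value(digits, radix):
    """Value of a digit vector given most significant digit first."""
    value = 0
    for d in digits:
        if not 0 <= d < radix:
            raise ValueError(f"digit {d} is out of range for radix {radix}")
        value = value * radix + d
    return value


def radix_convert(x, p, q, m):
    """Convert the p-nary digits x into m q-nary digits (both MSD first)."""
    value = to_value(x, p)
    y = []
    for _ in range(m):
        y.append(value % q)   # extract the least significant q-nary digit
        value //= q
    if value != 0:
        raise ValueError("m digits are not enough to hold the value")
    return y[::-1]


# (2, 1, 0, 2)_3 = 65 = (1000001)_2
print(radix_convert([2, 1, 0, 2], p=3, q=2, m=7))
```

This is a software model of the specification only; it says nothing about how the conversion is realized in hardware.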
Example 1:
In the case of a ternary to binary converter, we use the binary-coded-ternary code to represent a ternary number. That is, 0 is represented by (00), 1 is represented by (01), and 2 is represented by (10). Note that (11) is the unused code. Table 1 is the truth table of a two-digit ternary to binary converter. In the binary-coded-ternary representation, (11) is an undefined input, and the corresponding output is a don't care. In Table 1, the inputs in the binary-coded-ternary representation are denoted by $\vec{z} = (z_3, z_2, z_1, z_0)$. A straightforward way to implement a radix converter is to apply a logic synthesis tool directly to Eq. (1).
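The truth table described in this example can be generated mechanically. The sketch below is a minimal illustration, assuming that $(z_3, z_2)$ encodes the upper ternary digit and $(z_1, z_0)$ the lower one; the function name `bct_truth_table` is hypothetical, and don't-care outputs are represented by `None`.

```python
from itertools import product


def bct_truth_table(n_digits=2):
    """Rows (input bits, output bits or None) of a ternary to binary converter."""
    n_out = (3 ** n_digits - 1).bit_length()
    rows = []
    for z in product((0, 1), repeat=2 * n_digits):
        # Group the input bits into 2-bit digit codes, most significant digit first.
        digits = [2 * z[i] + z[i + 1] for i in range(0, len(z), 2)]
        if 3 in digits:
            rows.append((z, None))  # (11) is the unused code: output is don't care
            continue
        value = 0
        for d in digits:
            value = 3 * value + d
        y = tuple((value >> b) & 1 for b in reversed(range(n_out)))
        rows.append((z, y))
    return rows


for z, y in bct_truth_table(2):
    print(z, "-" if y is None else y)
```

For two digits, 9 of the 16 input combinations are defined and the remaining 7 are don't cares.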
Example 2:
Consider the 8-digit ternary to binary converter (8ter2bin). In this case, we realize
$$\sum_{i=0}^{7} 3^i x_i. \qquad (2)$$
Figure 1 shows the circuit for 8ter2bin produced by a logic synthesis tool. Note that $3^2 x_2$ is implemented as $8x_2 + x_2$ and $3^1 x_1$ is implemented as $2x_1 + x_1$, but $3^i x_i$ $(i = 3, 4, \ldots, 7)$ are implemented by multipliers. Also, a cascade adder is used to obtain the result. Since the coefficients are constants, the multipliers can be replaced by adders. However, the wiring of the resulting circuit is rather complex.
(End of Example)
Realization by a Single Memory
The simplest realization uses a single memory that stores the truth table of the radix converter.
An n-digit p-nary number takes values from 0 to $p^n - 1$. A binary number requires $m = \lceil \log_2 (p^n - 1) \rceil$ bits to represent the number. When the input is represented by a binary-coded p-nary number, let d be the number of bits for an input digit. Then, we have the relation: $d = \lceil \log_2 p \rceil$. In this paper, we will consider networks that convert p-nary numbers into binary numbers, where each input is represented by a binary-coded p-nary number. Let SIZE(n, p) be the number of bits to realize the network by a single memory. We have the relation:
$$SIZE(n, p) = 2^{\lceil \log_2 p \rceil \cdot n} \cdot \lceil \log_2 (p^n - 1) \rceil.$$
Figure 2 shows the network that converts n-digit binary-coded p-nary numbers into binary numbers. When the number of digits for the radix converter is large, the memory will be huge.
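To give a feeling for how quickly the single-memory realization grows, the following sketch evaluates the memory size with integer arithmetic. The helper names are hypothetical. As an assumption of this sketch, the output width is computed as the exact bit length of $p^n - 1$; this agrees with the expression for m above except when $p^n - 1$ is itself a power of two (for example p = 3, n = 2).

```python
def bits_per_digit(p):
    """d = ceil(log2 p): number of bits used to encode one p-nary digit."""
    return (p - 1).bit_length()


def output_bits(n, p):
    """Number of bits needed to represent the largest value p**n - 1."""
    return (p ** n - 1).bit_length()


def single_memory_size(n, p):
    """Bits of a single memory: 2**(address bits) words times the output width."""
    return 2 ** (bits_per_digit(p) * n) * output_bits(n, p)


print(single_memory_size(8, 3))    # 8ter2bin:  2**16 * 13 = 851968
print(single_memory_size(12, 3))   # 12ter2bin: 2**24 * 20 = 335544320
```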
LUT Cascade Realization and WS Function
In a direct realization of a p-nary to binary converter, the amount of hardware and the propagation delay increase with the number of input digits. To reduce the amount of hardware, LUT cascade realizations are used in [11], where the outputs are partitioned into groups, and each group is synthesized by using functional decomposition.
From here, we will consider LUT cascade realizations without partitioning the outputs.
LUT Cascade Realization of a Radix Converter
A single memory realization of a radix converter is simple, but requires a large memory. We can reduce the total amount of memory by decomposing the memory. First, we consider the method to decompose the memory into $M_0$ and $M_1$, shown in Fig. 3. In Fig. 3, the lower k digits of a binary-coded p-nary number are connected to the inputs of $M_0$. The outputs of $M_0$ and the upper n − k digits of the input are connected to the inputs of $M_1$. The outputs of $M_1$ represent the converted binary number.
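The two-memory structure can be modelled in software as two lookup tables. In the sketch below, which is only one possible choice of contents, $M_0$ stores the value of the lower k digits and $M_1$ combines that value with the upper n − k digits. The names `build_two_level`, `convert`, and `memory_bits` are hypothetical, digits are written most significant first, and the memory count assumes that the $M_0$ output is encoded with the minimum number of bits.

```python
from itertools import product


def build_two_level(n, k, p=3):
    """Return the tables (M0, M1) of an n-digit p-nary to binary converter."""
    # M0: lower k digits (x_{k-1}, ..., x_0) -> their value, 0 .. p**k - 1.
    m0 = {low: sum(x * p**i for i, x in enumerate(reversed(low)))
          for low in product(range(p), repeat=k)}
    # M1: (output of M0, upper n-k digits) -> converted number.
    m1 = {}
    for r in range(p**k):
        for high in product(range(p), repeat=n - k):
            high_value = sum(x * p**i for i, x in enumerate(reversed(high)))
            m1[(r, high)] = high_value * p**k + r
    return m0, m1


def convert(x, m0, m1, k):
    """x = (x_{n-1}, ..., x_0): look up M0 with the low digits, then M1."""
    high, low = tuple(x[:-k]), tuple(x[-k:])
    return m1[(m0[low], high)]


def memory_bits(n, k, p=3):
    """Total bits of M0 and M1 when every digit is binary coded."""
    d = (p - 1).bit_length()            # bits per input digit
    rails = (p**k - 1).bit_length()     # bits between M0 and M1
    m = (p**n - 1).bit_length()         # output bits
    return 2**(d * k) * rails + 2**(rails + d * (n - k)) * m


m0, m1 = build_two_level(4, 2)
print(convert((2, 1, 0, 2), m0, m1, 2))   # 65
print(memory_bits(8, 4))
```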
By using the functional decomposition theory [9], we can decide whether such a realization is possible.
Definition 2:
Consider the function $f(\vec{X}): P^n \to Q$, where $\vec{X} = (\vec{X}_H, \vec{X}_L)$ is a partition of the variables $(X_0, X_1, \ldots, X_{n-1})$. The decomposition chart for f is a two-dimensional matrix, where the column labels have all possible assignments of elements of P to $\vec{X}_H$, the row labels have all possible assignments of elements of P to $\vec{X}_L$, and the corresponding matrix value is equal to $f(\vec{X}_H, \vec{X}_L)$. The one whose column label values and row label values increase when the label moves from left to right, and from top to bottom, is the standard decomposition chart. The number of different column patterns in the decomposition chart is the column multiplicity.
Fig. 3 m-digit p-nary to binary converter: LUT cascade method.
Note that in an ordinary decomposition chart, the partitions of variables and the order of labels in the columns and rows are arbitrary. However, in the standard decomposition chart, the labels of the columns are in increasing order of $\vec{X}_H = (X_0, X_1, \ldots, X_{k-1})$, and the labels of the rows are in increasing order of $\vec{X}_L = (X_k, X_{k+1}, \ldots, X_{n-1})$.
Lemma 1: Consider the function $f(\vec{X})$ that represents the conversion from p-nary numbers into binary numbers. The column multiplicity of the standard decomposition chart for ...
Example 4: Consider the converter from 4-digit ternary numbers into 7-digit binary numbers. Table 2 shows a standard decomposition chart of it.
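A decomposition chart of this kind can be built and inspected with a few lines of Python. The sketch below assumes that the columns are labelled by the values of the k lower digits and the rows by the values of the remaining n − k digits; the function names are hypothetical. It counts the distinct column patterns directly, without relying on any closed-form result.

```python
def decomposition_chart(n, k, p=3):
    """chart[row][col] for the n-digit p-nary to binary conversion function.

    Columns are indexed by the value of the k lower digits and rows by the
    value of the remaining n - k digits, both in increasing order.
    """
    cols = range(p**k)
    rows = range(p**(n - k))
    return [[r * p**k + c for c in cols] for r in rows]


def column_multiplicity(chart):
    """Number of different column patterns in the chart."""
    return len(set(zip(*chart)))


chart = decomposition_chart(4, 2)       # 4-digit ternary, two digits per side
for row in chart:
    print(row)
print(column_multiplicity(chart))       # 9
```

In functional decomposition, $\lceil \log_2 \mu \rceil$ wires suffice between the two blocks when the column multiplicity is $\mu$, so the 9 patterns found here correspond to 4 connecting bits.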
WS Functions
The weighted sum function (WS function) is a mathematical model of radix converters, bit-counting circuits, and convolution operations [12], [13].
Definition 3:
An n-input WS function [13] is defined as
$$WS(\vec{x}) = \sum_{i=0}^{n-1} w_i x_i,$$
where $\vec{x} = (x_{n-1}, x_{n-2}, \ldots, x_0)$ is the input vector, $\vec{w} = (w_{n-1}, w_{n-2}, \ldots, w_0)$ is the weight vector, and each element is an integer.
Table 2 Standard decomposition chart for a function of 4-digit ternary to binary conversion.
In this paper, we represent a radix converter with a WS function, where inputs are represented by binary-coded p-nary numbers. From here, unless otherwise noted, $w_i$ and $x_i$ denote non-negative integers.
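A WS function is straightforward to model in software. The following sketch, with hypothetical names `ws` and `radix_weights`, evaluates a weighted sum and shows that a radix converter and a bit-counting circuit differ only in the weight vector. Here the vectors are indexed from $w_0$ and $x_0$ upward, which is an arbitrary choice made for this example.

```python
def ws(weights, x):
    """Weighted sum of x; weights and x are listed in the same order."""
    if len(weights) != len(x):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * xi for w, xi in zip(weights, x))


def radix_weights(n, p):
    """Weights of an n-digit p-nary to binary converter: w_i = p**i."""
    return [p**i for i in range(n)]


# Radix converter: (x_0, x_1, x_2, x_3) = (2, 0, 1, 2) represents (2102)_3 = 65.
print(ws(radix_weights(4, 3), [2, 0, 1, 2]))

# Bit counter: all weights are one and the inputs are single bits.
print(ws([1] * 5, [1, 0, 1, 1, 0]))
```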
Next, we will consider the range of WS functions.
Next, assume that the theorem holds for k-variable WS ...
From the induction hypothesis, we have $w_k \le MAX_k + 1$. Hence, we have ...
Thus, the theorem holds.
Next, we will show the necessity. If $w_0 \ge 2$, $MAX_1 = (p-1) w_0$. On the other hand, since $x_0$ can take only p values, ...
Hence, we have the theorem.
(Q.E.D.)
From here, we will consider LUT cascade realizations of radix converters where the total amount of memory is minimum. Figure 4 shows an LUT cascade realization of an n-digit p-nary to binary converter, where each LUT has only one external input $x_i$. In this case, the i-th ...
The next lemma shows a method to detect mergeable LUTs in an LUT cascade.
Lemma 3:
Consider the LUT cascade in Fig. 6, where LUT H has k inputs and k outputs. In this case, without loss of minimality, LUTs H and G can be merged into G'. (Proof: Omitted)
By using Lemma 3, we can find the mergeable LUTs, and reduce the number of combinations to examine when searching for the minimum LUT cascade. In the case of Example 6, we can see that the leftmost two LUTs in (0) can be merged to obtain the LUT cascade in (1). Next, by merging the leftmost two LUTs in (1), we have the LUT cascade in (3). In (3), the number of LUTs is two. So, we need only consider two cascades: the one where these two LUTs are merged and the one where they are not.
The next algorithm finds the LUT cascade for the radix converter with the minimum amount of memory.
By partitioning the outputs into multiple groups, we can realize radix converters with multiple LUT cascades [11]. This method uses binary decision diagrams to find functional decompositions. Example 8 shows LUT cascade realizations obtained by partitioning the outputs.
Example 8:
Realize an 8-digit ternary to binary converter (8ter2bin) by partitioning outputs [11]. Assume that we use LUTs with 10 inputs. Figure 8 shows the cascade realization. The 13 outputs are divided into two groups: the upper cascade realizes the least significant 6 bits, while the lower cascade realizes the most significant 7 bits. The total amount of memory is $2^{10}(7 + 7 + 7 + 6 + 6) + 2^{8}(6) = 35{,}328$ bits.
(End of Example)
To find an optimal solution for a radix converter with many digits, the method of [11] requires a large amount of computation. On the other hand, Algorithm 1 requires only a small amount of computation. In the next section, we will present more compact realizations of the radix converter by using arithmetic decomposition [12].
Realization Based on Arithmetic Decomposition
Arithmetic Decompositions of WS Functions
In the previous section, we presented a design method of radix converters by using LUT cascades. In this section, we will propose design methods of radix converters by using arithmetic decompositions [12] .
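Theorem 3: A WS function can be represented as a sum of two WS functions as follows:
$$WS(\vec{x}) = \alpha\, WS_A(\vec{x}) + WS_B(\vec{x}),$$
where $WS_A(\vec{x})$ and $WS_B(\vec{x})$ are WS functions, and $\alpha$ is an integer. This is the arithmetic decomposition, and $\alpha$ is the decomposition coefficient.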
$\alpha$ can be an arbitrary integer, where $2 \le \alpha < MAX_n$. Let $MAX_B$ be the maximum value of $WS_B(\vec{x})$. $\alpha$ need not satisfy $MAX_B < \alpha$; however, in this paper, we consider only cases where $MAX_B < \alpha$. Next, we will consider properties of WS functions obtained by arithmetic decompositions of the WS function representing a radix converter.
Theorem 4: Consider the arithmetic decomposition of a $WS(\vec{x})$ that represents a radix converter, where $WS(\vec{x}) = \sum_{i=0}^{n-1} w_i x_i = \alpha\, WS_A(\vec{x}) + WS_B(\vec{x})$, $\alpha$ is an integer, $2 \le \alpha < p^n - 1$, $WS_A(\vec{x}) = \ldots$
Arithmetic Decompositions for Different Decomposition Coefficients
A radix converter can be represented as a WS function. By using arithmetic decompositions with different decomposition coefficients $\alpha$, we can obtain different realizations of the radix converter.
In this part, we consider two cases: one uses $2^k$ as the decomposition coefficient, and the other uses $p^k$ as the decomposition coefficient.
(1) When the decomposition coefficient is $2^k$.
In this case, the radix converter is realized as
$$WS(\vec{x}) = 2^k\, WS_A(\vec{x}) + WS_B(\vec{x}).$$
Since the multiplication by $2^k$ can be realized by shifting, it is realized compactly. However, $WS_B$ depends on all the input variables, so the total size of the circuit is not so small.
(2) When the decomposition coefficient is $p^k$.
The multiplication by $p^k$ increases the number of outputs for $p^k\, WS_A$. However, when k = n/2, the number of inputs for each of $WS_A$ and $WS_B$ is half that of the original function, so the total network can be much smaller.
Example 9:
Let us design the 8-digit ternary to binary converter (8ter2bin). Consider two cases where the decomposition coefficients are $\alpha = 2^6 = 64$ and $\alpha = 3^4 = 81$. The ternary number is represented by the binary-coded-ternary code. Table 3 shows the coefficients of the arithmetic decompositions of $3^i$ $(i = 0, 1, 2, \ldots, 7)$. Note that these coefficients are equal to the weights for $WS_A(\vec{x})$ and $WS_B(\vec{x})$. We assume that 11-input cells are available for cascade realization. From Table 3, we have two different realizations for 8ter2bin.
1. When the decomposition coefficient is $2^6$.
• $WS(\vec{x}) = 2^6\, WS_A(\vec{x}) + WS_B(\vec{x})$.
• $WS_A(\vec{x}) = 34x_7 + 11x_6 + 3x_5 + 1x_4 + 0x_3 + 0x_2 + 0x_1 + 0x_0$.
• $WS_B(\vec{x}) = 11x_7 + 25x_6 + 51x_5 + 17x_4 + 27x_3 + 9x_2 + 3x_1 + 1x_0$.
• $WS_A(\vec{x})$ depends on the inputs $\{x_4, x_5, x_6, x_7\}$. The number of inputs is 8. Since the output takes values from 0 to 2(1 + 3 + 11 + 34) = 98, 7 bits are necessary to represent the output. $WS_A(\vec{x})$ has 8 inputs and 7 outputs, so it is implemented by a single cell.
• $WS_B(\vec{x})$ depends on all the inputs $\{x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7\}$, so the number of inputs is 16. Since each weight $w_i$ satisfies the condition of Theorem 1, the output range is [0, 2(1 + 3 + 9 + 27 + 17 + 51 + 25 + 11)] = [0, 288], and 9 bits are necessary to represent it. $WS_B(\vec{x})$ has 16 inputs and 9 outputs. It is implemented by a cascade of three LUTs with 11-input cells.
• The multiplication by $2^6$ can be implemented by shifting 6 bit positions. We add the upper 3 bits of $WS_B$ and the outputs of $WS_A$ by a 7-bit adder. Figure 9 shows the network, which uses memory with 45824 bits and a 7-bit adder.
2. When the decomposition coefficient is $3^4$.
• $WS_A(\vec{x})$ depends on the inputs $\{x_4, x_5, x_6, x_7\}$, so the number of inputs is 8. By Theorem 1, the maximum value of the output is 2(1 + 3 + 9 + 27) = 80. It can be represented by 7 bits. Thus, $WS_A(\vec{x})$ can be implemented by an 8-input 7-output LUT.
• The resulting network uses memory with 5120 bits and a 13-bit adder.
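The weights listed in this example can be reproduced with integer division. The sketch below assumes that each weight is split as $w_i = \alpha a_i + b_i$ with $a_i = \lfloor w_i / \alpha \rfloor$ and $b_i = w_i \bmod \alpha$; this choice matches the numbers given above for $\alpha = 2^6$, but it is an assumption of the sketch rather than a statement of how Table 3 was produced. The function names are hypothetical, and the vectors are indexed from $w_0$ upward.

```python
from itertools import product


def arithmetic_decomposition(weights, alpha):
    """Split each weight w into (a, b) with w = alpha * a + b and 0 <= b < alpha."""
    a = [w // alpha for w in weights]
    b = [w % alpha for w in weights]
    return a, b


def ws(weights, x):
    """Weighted sum of x for the given weights."""
    return sum(w * xi for w, xi in zip(weights, x))


def holds_for_all_inputs(weights, alpha, p=3):
    """Check WS(x) = alpha * WS_A(x) + WS_B(x) for every p-nary input vector."""
    a, b = arithmetic_decomposition(weights, alpha)
    return all(ws(weights, x) == alpha * ws(a, x) + ws(b, x)
               for x in product(range(p), repeat=len(weights)))


weights = [3**i for i in range(8)]      # w_i = 3**i for 8ter2bin
for alpha in (2**6, 3**4):
    a, b = arithmetic_decomposition(weights, alpha)
    print(alpha, a, b, holds_for_all_inputs(weights, alpha))
```

With $\alpha = 3^4$, the nonzero weights of the two parts fall on disjoint sets of digits, which is why each part needs only half of the inputs.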
Arithmetic Decomposition Using the Binary Representation of Inputs
In this part, we will introduce an arithmetic decomposition with respect to the binary representation of the inputs.
Definition 7:
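Let i be an integer. BIT(i, j) denotes the j-th bit of the binary representation of i, where the LSB is the 0-th bit. A positive integer i can be represented by $\lfloor \log_2 i \rfloor + 1$ bits. Thus, we have the relation:
$$i = \sum_{j=0}^{\lfloor \log_2 i \rfloor} 2^j\, BIT(i, j).$$
From this, we have the following:
Theorem 5: A p-nary to binary converter can be represented by
$$WS(\vec{x}) = \sum_{i=0}^{n-1} p^i x_i = \sum_{j=0}^{\lceil \log_2 p \rceil - 1} 2^j \sum_{i=0}^{n-1} p^i\, BIT(x_i, j).$$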
Example 11: Consider the 8-digit ternary to binary converter (8ter2bin). By Theorem 5, $WS(\vec{x})$ can be represented as:
$$WS(\vec{x}) = 2 \sum_{i=0}^{7} 3^i\, BIT(x_i, 1) + \sum_{i=0}^{7} 3^i\, BIT(x_i, 0).$$
Figure 11 shows the circuit corresponding to the above decomposition. Each cell has 8 inputs. Since $\sum_{i=0}^{7} 3^i = 3280$, each cell has 12 outputs. The multiplication by two is implemented by shifting one bit position. The circuit uses 6144 bits of memory and a 13-bit adder.
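The decomposition with respect to the binary representation of the inputs can be checked exhaustively in software. In the sketch below, one table is addressed by a single bit taken from each of the n digits, and the same table is reused for every bit position. The names `bit`, `plane_lut`, and `convert_by_bit_planes` are hypothetical, and digits are written most significant first.

```python
from itertools import product


def bit(i, j):
    """BIT(i, j): the j-th bit of i, where the LSB is bit 0."""
    return (i >> j) & 1


def plane_lut(n, p=3):
    """Table addressed by one bit per digit: (b_{n-1}, ..., b_0) -> sum p**i * b_i."""
    return {bits: sum(p**i * b for i, b in enumerate(reversed(bits)))
            for bits in product((0, 1), repeat=n)}


def convert_by_bit_planes(x, lut, p=3):
    """x = (x_{n-1}, ..., x_0); add the table outputs of all bit positions."""
    d = (p - 1).bit_length()                # bits per binary-coded digit
    total = 0
    for j in range(d):
        plane = tuple(bit(xi, j) for xi in x)
        total += (1 << j) * lut[plane]      # weight 2**j is a shift in hardware
    return total


lut = plane_lut(8)
print(len(lut), max(lut.values()))          # 256 words, largest entry 3280
print(convert_by_bit_planes((2, 2, 2, 2, 2, 2, 2, 2), lut))   # 6560
```

Two copies of a 256-word table with 12-bit entries account for the 6144 bits stated above.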
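We can further reduce the circuit by using Theorem 3, where $3^4$ is the decomposition coefficient:
$$WS(\vec{x}) = 2\left[3^4 \sum_{i=0}^{3} 3^i\, BIT(x_{i+4}, 1) + \sum_{i=0}^{3} 3^i\, BIT(x_i, 1)\right] + \left[3^4 \sum_{i=0}^{3} 3^i\, BIT(x_{i+4}, 0) + \sum_{i=0}^{3} 3^i\, BIT(x_i, 0)\right]. \qquad (6)$$
Figure 12 is the network corresponding to Eq. (6), where each cell has only 4 inputs. The total amount of memory is 576 bits. It uses two 12-bit adders and a 13-bit adder.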
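A software model of the four-cell structure helps to confirm both the function and the memory count. The sketch below is an illustration under stated assumptions: the cells fed by the upper four digits store the product with $3^4$ directly, and, unlike the notation of Definition 1, the input is written least significant digit first to keep the indexing simple. The names `LOW`, `HIGH`, `address`, `convert`, and `memory_bits` are hypothetical.

```python
def bit(i, j):
    """BIT(i, j): the j-th bit of i, where the LSB is bit 0."""
    return (i >> j) & 1


# Address a packs four bits (b_3 b_2 b_1 b_0); the content is sum 3**i * b_i.
LOW = [sum(3**i * bit(a, i) for i in range(4)) for a in range(16)]
# Cells for the upper four digits store the same sums multiplied by 3**4.
HIGH = [81 * v for v in LOW]


def address(digits, j):
    """Pack bit j of four digits (least significant digit first) into an address."""
    return sum(bit(d, j) << i for i, d in enumerate(digits))


def convert(x):
    """x = (x_0, x_1, ..., x_7), least significant digit first."""
    lo, hi = x[:4], x[4:]
    plane1 = HIGH[address(hi, 1)] + LOW[address(lo, 1)]   # first 12-bit adder
    plane0 = HIGH[address(hi, 0)] + LOW[address(lo, 0)]   # second 12-bit adder
    return 2 * plane1 + plane0                            # shift, then 13-bit adder


def memory_bits():
    """Two LOW cells and two HIGH cells, 16 words each."""
    return 2 * 16 * (max(LOW).bit_length() + max(HIGH).bit_length())


print(convert((2, 0, 1, 2, 0, 0, 0, 0)))   # (2102)_3 = 65
print(memory_bits())                        # 576
```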
Implementation on FPGAs
To see the effectiveness of the approach, we implemented various designs of ternary to binary converters on FPGAs, and compared the amount of hardware and performance.
FPGAs and Their Development System
We used an Altera Cyclone II (EP2C5T144C7) FPGA device, which has 13 Embedded Multipliers (EMs) that perform the multiply-and-sum operations, 26 embedded memories (M4Ks), and 4608 logic elements (LEs). Each M4K contains 4096 bits. We used Altera Quartus II V.4.1 as the development tool. We also developed a radix converter synthesis system, shown in Fig. 13, that generates Verilog-HDL code describing various designs, and data for M4Ks. In the FPGAs, LUTs (cells) were implemented by M4Ks, while adders were implemented by LEs. The system generates the HDL code from the specification: the radix p and the number of digits n. We could not implement a radix converter by a single memory because the required memory was too large for this FPGA.
8-Digit Ternary to Binary Converters
Direct Method (DM):
• DM1 directly implements the sum $\sum_{i=0}^{7} 3^i x_i$. Figure 1 is the circuit generated by the Quartus II. After mapping, the Quartus II replaced the multipliers with 7 EMs, and the adders with 66 LEs. It uses no M4Ks.
• DM2 also corresponds to Fig. 1. In this case, however, the Quartus II replaced the multipliers with LEs instead of EMs, so the circuit consists of LEs only. It has 195 LEs, which means that the 7 EMs of DM1 take the place of 129 LEs. It is faster than DM1, since LEs perform constant multiplications faster than EMs.
Arithmetic Decomposition Method (AD):
The system generated Verilog-HDL codes and data for M4Ks.
• AD1 corresponds to Fig. 9, which was obtained with the decomposition coefficient $2^6$. The Quartus II replaced four LUTs with 13 M4Ks, and the adder with 8 LEs.
• AD2 corresponds to Fig. 10, which was obtained with the decomposition coefficient $3^4$. The Quartus II replaced two LUTs with two M4Ks, and the adder with 13 LEs.
• AD3 corresponds to Fig. 11, which was obtained with the arithmetic decomposition using the binary representation of inputs. The Quartus II replaced two LUTs with two M4Ks, and the adder with 12 LEs.
• AD4 corresponds to Fig. 12, which was obtained by the arithmetic decomposition with the coefficient $3^4$ and using the binary representation of inputs. The Quartus II replaced four LUTs with four M4Ks, and the adders with 36 LEs. It is slower than AD3 since the adder is more complex.
Partition of Outputs Method (PAR):
• PAR corresponds to Fig. 8 , which consists of 6 LUTs. The Quartus II replaced 6 LUTs with 11 M4Ks.
In the case of 8ter2bin, we can conclude that AD3 is the best realization: it is the fastest and requires the smallest amount of hardware. Table 5 compares 5 different designs of 12-digit ternary to binary converters (12ter2bin).
12-Digit Ternary to Binary Converters
Direct Method (DM):
• DM1 is similar to Fig. 1 , but uses 15 EMs.
• DM2 is similar to Fig. 1 . Also in this case, it is faster than DM1.
Arithmetic Decomposition Method (AD):
• AD2 is similar to Fig. 10, but the decomposition coefficient is $3^6$. In this case, it uses a 12-input 20-output LUT, a 12-input 10-output LUT, and a 20-bit adder. The Quartus II replaced these LUTs with 30 M4Ks, and the adder with 20 LEs. So, it requires a larger FPGA, EP2C8T144C7, which contains 36 M4Ks.
• AD3 is similar to Fig. 11, but uses a pair of 12-input 19-output LUTs and a 20-bit adder. The Quartus II replaced these LUTs with 38 M4Ks. So, we had to use a larger FPGA, EP2C20F256C7, which contains 52 M4Ks.
• AD4 is similar to Fig. 12, but uses the decomposition coefficient $3^6$. It uses four LUTs with 6 inputs. The Quartus II replaced these LUTs with four M4Ks, and the adders with 57 LEs.
Since AD1 uses too many M4Ks and PAR requires too much computation time, they are not used in the design. In the case of 8ter2bin, AD2 and AD3 are faster. On the other hand, in the case of 12ter2bin, AD2 and AD3 require larger FPGAs, so AD4 is the best choice.
Almost all embedded memories in recent FPGAs are synchronous. M4Ks in Cyclone II FPGAs are also synchronous. So, the circuits require clock signals to operate. Therefore, when we realize radix converters by using LUT cascades, we require at least s clocks to convert a p-nary number into a binary number. Delay [nsec] in Tables 4 and 5 shows the latency.
AD4 is the best choice for implementing radix converters with 12 digits.
Observations
1. In Fig. 9, variables $x_4$, $x_5$, $x_6$, and $x_7$ appear in both the upper and the lower cascades. This decomposition is non-disjoint [1]. On the other hand, in Figs. 10, 11, and 12, each variable appears only once. These decompositions are disjoint [1]. The disjoint decomposition in Fig. 10 is easy to find from Eq. (2) or Fig. 1, while the disjoint decomposition in Fig. 11 is not so easy to find. Also, the decomposition in Fig. 10 produces similar but different sub-circuits, while the decomposition in Fig. 11 produces two identical sub-circuits.
2. These techniques can be combined to design radix converters with more digits, and other arithmetic circuits [5].
3. Since the propagation delays of an adder and an LUT are almost the same, we can achieve higher throughput for AD2, AD3, and AD4 by pipelining them.
Conclusion
In this paper, we presented arithmetic decompositions to design p-nary to binary converters. We used ternary to binary converters to illustrate the idea. We also implemented the converters on FPGAs to confirm the effectiveness of the methods. An interesting direction for future work is the extension to radix converters for signed-digit numbers.
