Abstract-A balanced ternary digit, known as a trit, takes its values in {−1, 0, 1}. It can be encoded in binary as {11, 00, 01} for the direct use in digital circuits. In this brief, we study the decompression of a sequence of bits into a sequence of binary encoded balanced ternary digits. We first show that it is useless, in practice, to compress sequences of more than five ternary values. We then provide two mappings, one to map 5 bits to 3 trits and one to map 8 bits to 5 trits. Both mappings were obtained by human analysis and lead to Boolean implementations that compare quite favorably with others obtained by tweaking assignment or encoding optimization tools. However, mappings that lead to better implementations may be feasible.
I. INTRODUCTION
Ternary encoding of data has been shown to be useful in at least the following contexts: general-purpose computing [1], wireless transmission [2], [3], texture representation in images [4], quantum computing [5], optical supercomputing [6], and artificial neural networks [7], [8].
In many applications, the binary value representing a sequence of {−1, 0, 1} needs to be stored in memory, so finding an encoding that minimizes both the compressor and decompressor is a legitimate goal. However, our own focus is the VLSI implementation of neural networks making use of ternary weights, in which the weight values are written in a memory only once and read almost continuously [9]. In that case, it is necessary to combinatorially produce a sequence of binary encoded balanced ternary values from an encoded binary string.
Our objective is, thus, to determine a mapping (i.e., a one-to-one function that maps binary strings to binary encoded balanced ternary values) which, when implemented as a Boolean multivalued function, leads to factored-form expressions with the least number of Boolean operators and the least number of literals (considering also the outputs of previous operators) as operands of those operators. This factored-form representation is interesting because it approximates the complexity of a gate-level implementation [10]. The only constraint we have is the encoding of the ternary values, given by μ:{−1, 0, 1} → {11, 00, 01}. This choice is appropriate for use in classical two's complement arithmetic circuits, for instance, when these values are directly fed into multipliers or adders [9].
In Section V, we show that it is not useful to compress more than 5 trits on 8 bits, and give the best mappings that we found, i.e., the ones requiring fewer gates, for compressing 3 trits on 5 bits and 5 trits on 8 bits. Note that we do not propose a general algorithm to solve the problem for sequences of any length.

TABLE I: GAIN AND FREE VALUES FOR ENCODING TRITS ON BITS
II. PROBLEM FORMULATION
A ternary digit contains $\log_2(3) \approx 1.585$ bits of information. We compute the maximum theoretical gain that can be obtained by compressing trits in binary. As a sequence of $n$ trits ($n \in \mathbb{N}$) represents $3^n$ values, at least $\lceil \log_2(3^n) \rceil = \lceil n \log_2(3) \rceil$ bits are necessary to encode this sequence in binary. For $n$ trits, the gain compared to the nonencoded sequence, which uses $2n$ bits, is given by

$$u_n = \frac{2n - \lceil n \log_2(3) \rceil}{2n}.$$

Since the bounding sequences $v_n$ and $w_n$, with $v_n \le u_n \le w_n$, increase monotonically toward the same limit, the squeeze theorem yields $\lim_{n \to \infty} u_n = 1 - \log_2(3)/2 \approx 0.2075$. Table I gives the actual gain for small values of $n$. As can be seen, there is not much interest in encoding sequences of more than 5 trits (actually 10 bits) on 8 bits, since this gain of 0.2 is already within ≈0.75% of the maximum achievable gain. The next higher gain, obtained for 17 trits, is given in the table for completeness.
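The gain values discussed above can be recomputed with a few lines of Python. This is a minimal sketch, assuming that $n$ trits are stored on the smallest number of bits $b$ such that $2^b \ge 3^n$ and compared against the $2n$ bits of the uncompressed encoding; the function names are illustrative and do not come from the brief.

```python
from fractions import Fraction


def bits_needed(n: int) -> int:
    """Smallest b such that 2**b >= 3**n, using exact integer arithmetic."""
    return (3 ** n - 1).bit_length()


def gain(n: int) -> Fraction:
    """Relative saving of the compressed form over 2 bits per trit."""
    return Fraction(2 * n - bits_needed(n), 2 * n)


def record_gains(max_n: int):
    """Return (n, bits, gain) for every n whose gain beats all smaller n."""
    best = Fraction(0)
    records = []
    for n in range(1, max_n + 1):
        g = gain(n)
        if g > best:
            best = g
            records.append((n, bits_needed(n), g))
    return records


if __name__ == "__main__":
    for n, bits, g in record_gains(17):
        print(f"{n:2d} trits on {bits:2d} bits: gain = {float(g):.4f}")
```

Under these assumptions, the successive record gains up to 17 trits occur at 3 trits (5 bits), 5 trits (8 bits), and 17 trits (27 bits), which agrees with the cases singled out in the text.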
Given $b$ bits and $t$ trits, we have to determine how to map $2^b$ values onto $3^t$ values so that the multilevel logic implementation is minimized, i.e., uses as few Boolean operators as possible, each with as few inputs as possible. From a combinatorial point of view, these mappings are ordered arrangements, so there are $(2^b)^{\underline{3^t}} = (2^b)!/(2^b - 3^t)!$ of them, where $n^{\underline{m}}$ denotes the falling factorial. We focus on two particular instances of the problem: the mapping of 3 trits on 5 bits, leading to $32^{\underline{27}} \approx 2.2 \times 10^{33}$ possible mappings, and the mapping of 5 trits on 8 bits, leading to $256^{\underline{243}} \approx 1.4 \times 10^{497}$ mappings to choose from.
It is quite clear given this analysis that searching for an optimized mapping of 17 trits on 27 bits is totally impractical.
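The size of the search space can be checked with exact integer arithmetic. The sketch below assumes the count is the falling factorial of $2^b$ taken over $3^t$ terms, which is what `math.perm` computes in Python 3.8 and later; `mapping_count` and `sci` are illustrative helper names.

```python
import math


def mapping_count(n_bits: int, n_trits: int) -> int:
    """Number of ordered arrangements of 3**n_trits items among 2**n_bits codes."""
    return math.perm(2 ** n_bits, 3 ** n_trits)


def sci(value: int) -> str:
    """Rough scientific notation with a one-decimal mantissa.

    Illustrative only: a mantissa that rounds up to 10.0 is not renormalized.
    """
    digits = str(value)
    mantissa = round(int(digits[:4]) / 1000, 1)
    return f"{mantissa}e{len(digits) - 1}"


if __name__ == "__main__":
    print(sci(mapping_count(5, 3)))
    print(sci(mapping_count(8, 5)))
```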
Even for our two cases of relatively small size, given the size of the search space, exhaustive search is not an option, and finding the optimal solutions is statistically unlikely since multilevel optimization is an NP-complete problem [11] .
1063-8210 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
III. RELATED WORK
This problem may seem to be a fairly well-known one, but to our surprise, the work of mapping a bit string representing a subset of its possible values to a subset of bit strings of smaller size containing all permutations does not seem to be documented in the literature.
The most extensive surveys on logic synthesis and input-output encoding and state assignment [10], [12] target slightly different problems, making the available techniques not easily transposable. Another approach to obtain state assignment, by using symbolic representation of the states as proposed in [13], reaches optimal solutions with more than the minimal number of bits, whereas it is critical in our case to stick to the minimum number of bits. Chapter 7.5 of De Micheli's book [14] is dedicated to these encoding problems, but again, the problems that are solved are sufficiently different from ours to make the approaches inappropriate.
Output encoding is also the subject of [15]. That work cites all the relevant works on the topic of encoding targeting logic minimization. We refer the interested reader to its bibliography, to avoid too many citations in this correspondence. It assumes that the mapping is already chosen, although part of the output uses symbols instead of bits, and it is the bit strings corresponding to these symbols that are searched for. Its proposal is unfortunately also not generalizable to our problem.
Bearing in mind that our goal is not to devise the best assignment for this problem in its generality, but for the two instances that we believe are useful, we focus on these problems only.
IV. EVALUATION APPROACH
As a first approximation of the logic complexity, we take the number of literals after multilevel minimization and technology mapping on a minimal standard cell library using the Alliance VLSI CAD tools [16]. We also tried Espresso [17] followed by Alliance technology mapping, but the results were not consistent with Synopsys synthesis, as Espresso targets two-level programmable logic array minimization and not standard cells. The library contains exclusively a NOT gate and 2-, 3-, and 4-input AND, OR, NAND, NOR, XOR, and XNOR gates, and uses arbitrary units for area (λ). This step allows us, first, to rank the solutions easily and, second, to make the results presented here reproducible without access to specific proprietary software and cell libraries.
As a first step, we generated 10 000 random assignments for the two interesting cases, using the mapping μ for the ternary values representation. The mapping of 32 values coded on 5 bits to the 27 legal 3-trit values coded on 6 bits led to an average number of gates equal to ≈77.4 with a standard deviation of ≈6.0 after technology mapping. Regarding the assignment of 256 values coded on 8 bits to the 243 legal 5-trit values coded on 10 bits, the average number of gates equals ≈886.8 with a standard deviation of ≈14.8 after technology mapping. We give here the number of gates, but we verified that, given the limited number of gates in the standard cell library, there is a very strong correlation with the area.
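One way to draw such a random assignment is sketched below. It is a hypothetical illustration, not the generator actually used for the experiment: it assumes that each trit word receives exactly one randomly chosen binary code and that the leftover binary codes are simply absent from the table (treated as don't cares). The names `MU` and `random_assignment` are illustrative.

```python
import itertools
import random

# Binary encoding of one trit, following the mapping mu of the brief.
MU = {-1: "11", 0: "00", 1: "01"}


def random_assignment(n_bits: int, n_trits: int, rng: random.Random) -> dict:
    """Return {input code: encoded trit word} for one random assignment.

    Assumption: input codes left unassigned are don't cares and are omitted.
    """
    trit_words = list(itertools.product((-1, 0, 1), repeat=n_trits))
    codes = rng.sample(range(2 ** n_bits), len(trit_words))
    table = {}
    for code, trits in zip(codes, trit_words):
        key = format(code, f"0{n_bits}b")
        table[key] = "".join(MU[t] for t in trits)
    return table


if __name__ == "__main__":
    table = random_assignment(5, 3, random.Random(0))
    for code in sorted(table):
        print(code, "->", table[code])
```

Each such table could then be written out as a truth table and handed to a synthesis flow to obtain the gate counts that are averaged in the text.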
As a second step, and for actual implementation, we use Synopsys Design Compiler targeting the STMicroelectronics 28-nm fully depleted silicon on insulator (FDSOI) standard cell library at an operating point of 0.9 V. The areas and propagation times of the circuits are given as reported by the synthesis tool.
V. ENCODING SOLUTIONS
We tried several automated approaches, all of which failed to minimize the number of gates. The Hungarian algorithm [18] is optimal for the assignment problem, as long as a cost matrix can be provided. However, we were unable to build a relevant cost matrix, because costs are interdependent: Boolean subexpressions are shared. We tried, without success, several cost functions that were variations of the Hamming distance between the trit codes and the binary values. We also tried to tweak state assignment algorithms to perform this assignment instead of state assignment. Overall, these trials produce encodings that, once synthesized, contain half the number of gates of a random assignment, but are still far from the solution we present in the following (at least twice as big for the 8 bits to 5 trits case).
In this brief, we did not consider classical algorithms used for large NP-complete problems such as simulated annealing, genetic algorithms, or tabu search. To perform efficiently, these algorithms need a fast and accurate evaluation of the solutions to deal with the huge number of solutions produced during the search. Unfortunately, an application-specific integrated circuit synthesis on the chosen precharacterized library of logic cells requires at least a minute.
The solutions presented in the following were derived by hand assuming structural properties. Regarding the notation, "−" represents a "do not care," i.e., the value of the signal is irrelevant.
A. Five Bits to Three Trits
Denoting by $t_{5..0}$ the binary encoded ternary values and by $b_{4..0}$ the binary codes, our best assignment solution is:
We now detail how we obtained this solution. The principle is as follows. First, we generate each trit by coding only its magnitude ($t_4$, $t_2$, or $t_0$), i.e., whether it is null or not. The sign is obtained from one input bit only ($b_4$, $b_3$, or $b_2$). The codes are then ($b_4$ …). We call $(t_4, t_2, t_0)$ the magnitude vector. Second, the idea is to gather codes having similar magnitude vectors in sets. In our proposal, the first set contains the eight codes associated with the magnitude vector $(1, 1, 1)$. The second set contains six codes, four associated with the magnitude vector $(0, 1, 1)$ and two associated with $(0, 1, 0)$. In this set, the most significant trit is 0. Thus, $b_4$ can be reused and the magnitude vector can be $(0, 1, b_4)$. Similarly, the third set contains the six codes corresponding to the magnitude vector $(b_3, 0, 1)$. The last set contains the last seven codes, four associated with the magnitude vector $(1, 1, 0)$, two associated with $(1, 0, 0)$, and one with $(0, 0, 0)$. The first six codes can be efficiently expressed by the magnitude vector $(1, b_2, 0)$. When $b_2$ is 0, $b_3$ is unused. It is then used to distinguish $(1, 0, 0)$ from $(0, 0, 0)$. Therefore, we can extend the magnitude vector to $(b_3 + b_2, b_2, 0)$ to cover all cases. These four sets are, respectively, encoded as "11," "00," "10," and "01" using $b_1 b_0$.
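The four sets described above can be turned into a small behavioral model to check that all 27 trit words are reachable. This is a sketch reconstructed from the prose only, under two assumptions that the text does not spell out: "+" in the magnitude vector is read as a Boolean OR, and a sign bit equal to 1 denotes a negative trit (consistent with μ(−1) = 11) while a null trit is always produced as 0 whatever its sign bit. The function names are illustrative.

```python
from itertools import product


def decode_5b_to_3t(b4, b3, b2, b1, b0):
    """Behavioral model: 5 input bits -> 3 trits, most significant first."""
    selector = (b1, b0)
    if selector == (1, 1):
        magnitude = (1, 1, 1)
    elif selector == (0, 0):
        magnitude = (0, 1, b4)
    elif selector == (1, 0):
        magnitude = (b3, 0, 1)
    else:  # selector == (0, 1)
        magnitude = (b3 | b2, b2, 0)
    signs = (b4, b3, b2)
    # Assumption: sign bit 1 means -1; a null magnitude forces the trit to 0.
    return tuple(0 if m == 0 else (-1 if s else 1)
                 for m, s in zip(magnitude, signs))


def coverage():
    """Map every reachable trit word to the list of input codes producing it."""
    hits = {}
    for bits in product((0, 1), repeat=5):
        hits.setdefault(decode_5b_to_3t(*bits), []).append(bits)
    return hits


if __name__ == "__main__":
    hits = coverage()
    doubled = [word for word, codes in hits.items() if len(codes) == 2]
    print(len(hits), "trit words reached,", len(doubled), "by two input codes")
```

With these assumptions, the model reaches every one of the 27 trit words, and 5 of them are produced by two input codes each, which matches the number of don't-care cases mentioned later for the first decoder.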
From classical Boolean optimization [19] and factorization techniques, we derive the following equations:
This solution can be synthesized in 17 gates with an area of 17250 $\lambda^2$ with Alliance. Compared to the randomly generated cases, it is about 4.5 times smaller. Synthesis on STMicro 28-nm FDSOI using the entire standard cell library produces an area of 6.52 $\mu\mathrm{m}^2$ (11 gates instantiated) and a propagation time of 41 ps.
B. Eight Bits to Five Trits
Again, denoting by $t_{9..0}$ the binary encoded ternary values and by $b_{7..0}$ the binary codes, our best assignment solution is:
This mapping, obtained using a strategy similar to the previous one, can be produced using the following Boolean equations:
$t_3 = b_5 \times x_9 + b_4 \times x_8$, $t_4 = x_6 + x_7$, $t_5 = b_6 \times x_7 + b_5 \times x_6$, $t_6 = x_4 + x_5$,
These equations lead to a circuit of 85 elementary gates (98500 $\lambda^2$) using Alliance. Synthesis for STMicro 28-nm FDSOI produces an area of 38.51 $\mu\mathrm{m}^2$ (62 gates instantiated), and a propagation time of 120 ps.
C. Technology Mapping Optimization
As can be seen in the previous assignment tables, some ternary codes are generated by two binary codes because of the "do not cares" (5 ternary codes for our first decoder and 13 for the second one). Further Boolean simplification and technology mapping optimization may be possible by assigning only one binary code per ternary code and specifying a "do not care" output for the unused binary codes. However, we could not find a general optimization pattern to select one of these two possibilities for the 5 (respectively 13) cases so as to minimize the number of gates resulting from the implementation. As the number of possible combinations is $2^5 = 32$ for the first decoder and $2^{13} = 8192$ for the second one, we decided to use a brute-force approach. Indeed, these numbers are small enough that we can synthesize all these cases in a few days. For all synthesized circuits, we plot the critical path as a function of the area in Figs. 1 and 2. The size of each dot is proportional to the number of cases that match a given (area, time) pair.
As expected, there are better solutions than our original hand-derived one, for the area and/or time. However, some of them are worse for both area and time, even though they actually are just subsets of our initial designs. The performance of the optimization process of the synthesis tool (and maybe the targeted technology) is thus key to the quality of the solutions, and none can be said to be the best in all conditions, as the Pareto front in the figures shows. The search for and the choice of the best tradeoff have to be done by the users of the decoder with their tool, technology, and target applications.
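The brute-force enumeration can be sketched as follows. This is an illustration of the principle only: the input dictionary `TOY_GROUPS` is a hypothetical toy table, not one of the mappings of this brief, and `dont_care_variants` is an illustrative name. Each generated table keeps a single binary code per ternary code and marks the other one as a don't care (`None`), so a table with $k$ doubly covered ternary codes yields $2^k$ variants to synthesize.

```python
from itertools import product


def dont_care_variants(groups):
    """Yield one {binary code: trit word or None} table per selection.

    `groups` maps each trit word to the list of binary codes that produce it.
    Codes that are not kept for their word map to None (don't care output).
    """
    words = list(groups)
    for choice in product(*(groups[word] for word in words)):
        table = {}
        for word, kept in zip(words, choice):
            for code in groups[word]:
                table[code] = word if code == kept else None
        yield table


# Hypothetical toy input: two trit words are covered by two codes each.
TOY_GROUPS = {
    "0000": ["000", "100"],
    "0001": ["001"],
    "0100": ["010", "011"],
}

if __name__ == "__main__":
    for variant in dont_care_variants(TOY_GROUPS):
        print(variant)
```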
D. Encoding
For completeness, we also give the number of gates, area, and propagation time for the encoders that match our decoders. If needed, the equations of the encoder can be generated by a Boolean minimization program (e.g., Espresso) using the reverse table as input. Note that since we are focused on reading and decompressing the code, the values written into memory may well be computed offline by software instead of requiring dedicated encoding hardware.

Fig. 3. Impact of decoding (D) on a ternary ANN weight memory. Weights are written once at configuration time, and read simultaneously during inference.
1) 3 Trits to 5 Bits: Synthesizing the reversed version of the mapping of Section V-A with Synopsys on STMicro 28-nm FDSOI gives an area of 18.44 $\mu\mathrm{m}^2$ (29 gates instantiated), and a propagation time of 92 ps.
2) 5 Trits to 8 Bits: Similarly, for the mapping of Section V-B, we obtain an area of 102.7 $\mu\mathrm{m}^2$ (172 gates instantiated), and a propagation time of 178 ps using STMicro 28-nm FDSOI technology.
VI. AREA SAVINGS
As illustrated in Fig. 3 , the area overhead brought by our decoders depends only on the memory width (in bits) and not on the memory depth (number of words). Therefore, no matter how large the decoder, there will always exist a minimum number of words above which the memory area savings are higher than the decoder overhead.
In [20], the authors report an SRAM bit cell area of $A_B = 0.12\ \mu\mathrm{m}^2$ in the technology we use, also for an artificial neural network (ANN) application. Given this information, we can derive rough but credible estimates of the size of a memory cut of $W$ words of $B$ bits each, and decide when it is interesting to use our encoding approach. We denote by $D$ the number of decoders and by $A_D$ the area of one decoder. For the 3-trit case, we have $D = B/6$. The area overhead of the decoders is

$$D \times A_D = \frac{B}{6} A_D.$$

Hence, to obtain a saving of $R$ as a ratio of the original memory size, the condition is

$$W B A_B - \left( W \frac{5B}{6} A_B + \frac{B}{6} A_D \right) \ge R \, W B A_B$$

which simplifies to

$$W \ge \frac{A_D}{A_B (1 - 6R)}. \qquad (2)$$

A similar argument gives in the 5-trit case

$$W \ge \frac{A_D}{2 A_B (1 - 5R)}. \qquad (3)$$

Fig. 4 shows the value of $W$ such that (2) (3-trit case) and (3) (5-trit case) hold for continuous values of $R$. It shows that the 3-trit approach is more interesting for small memories, whereas the highest area savings are obtained with the 5-trit approach. At the intersection point, both approaches bring overall savings of 15.4% with 691 memory words. The need for memories of such size is very common in the context of neural network architectures. For example, in the AlexNet network [21], the largest layers feature 4096 neurons with 4096 weight values per neuron. The corresponding weight memories, both deep and wide, are perfect candidates for ternary compression. The proposed compression approaches, completely devoid of control, are also among the few, if not the only ones, suitable for such wide memories under a sustained throughput measured in Tb/s.
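The break-even depth can be recomputed from the figures quoted in this brief. The sketch below rests on a simple area model that is an assumption of this example: the uncompressed memory occupies $W \times B \times A_B$, the compressed memory is 5/6 (3-trit case) or 8/10 (5-trit case) as wide, and one decoder of area $A_D$ is needed per 6 (respectively 10) uncompressed bits. Function and constant names are illustrative.

```python
A_B = 0.12     # SRAM bit cell area in um^2, value quoted from [20]
A_D3 = 6.52    # area of the 5-bit to 3-trit decoder in um^2 (Section V-A)
A_D5 = 38.51   # area of the 8-bit to 5-trit decoder in um^2 (Section V-B)


def min_words_3trit(r: float) -> float:
    """Memory depth W above which the 3-trit scheme saves a ratio r of area."""
    if not 0 <= r < 1 / 6:
        raise ValueError("r must lie in [0, 1/6) for the 3-trit scheme")
    return A_D3 / (A_B * (1 - 6 * r))


def min_words_5trit(r: float) -> float:
    """Memory depth W above which the 5-trit scheme saves a ratio r of area."""
    if not 0 <= r < 1 / 5:
        raise ValueError("r must lie in [0, 1/5) for the 5-trit scheme")
    return A_D5 / (2 * A_B * (1 - 5 * r))


def crossover():
    """Saving ratio and depth at which both schemes need the same W.

    Obtained by solving A_D3 / (1 - 6R) = A_D5 / (2 (1 - 5R)) for R.
    """
    r = (A_D5 - 2 * A_D3) / (6 * A_D5 - 10 * A_D3)
    return r, min_words_3trit(r)


if __name__ == "__main__":
    r, w = crossover()
    print(f"crossover at R = {100 * r:.1f}% with W = {w:.0f} words")
```

Under this model, the crossover lands at about 15.4% savings and 691 words, in line with the intersection point reported in the text.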
Clearly, any better trits-to-bits mapping associated with an optimized logic synthesis process would enable lowering these thresholds, hence the interest in any approach that could address this class of problems.
VII. POWER CONSIDERATIONS
SRAM memories are known to consume orders of magnitude more power than logic gates [22], even for relatively small memories. This is still the case with the technology we use, although the raw data are not publicly accessible. The power (and area) of memory cuts depends heavily on technological features and architecture-level parameters anyway, so we provide power results as a general trend only.
We ran power simulations with the STMicro 28-nm FDSOI technology, assuming a toggle rate of 50% on the inputs (a standard assumption, but a high one in the context of ternary ANNs, where zero weights dominate). We observed that the 5-trit decoder consumes roughly 8× more than the 3-trit decoder. However, due to the much higher consumption of the SRAM, the 5-trit approach brings better overall power savings than the 3-trit one thanks to its better compression ratio (−20% versus −16.7% in memory width). Both decoding solutions bring around 15% power savings for small memories, e.g., 512 words, with a slight advantage for the 5-trit approach. Higher savings, closer to the 20% limit, can be obtained with the 5-trit approach for deeper and/or wider memories, e.g., around 18% overall power savings are observed with 4000 words.
In the case of external DRAM access, the power consumption of decoding (and even encoding) is so insignificant compared to DRAM operations that the proposed approaches would bring a solid 16.7% power savings for the 3-trit approach and 20% savings for the 5-trit approach, along with a similar reduction in memory size requirements.
VIII. CONCLUSION
In this brief, we address the problem of efficiently decompressing a vector of bits into binary encoded trits. We first show that it is neither necessary nor practical to compress more than 5 trits into 8 bits, and then give two optimized mappings and their corresponding multivalued and multilevel Boolean functions. These mappings were obtained by human reasoning, and no automatic method we could think of gave better or even comparable results. Whether better mappings exist is left as an open problem.
In conclusion, the proposed approaches bring noticeable savings both on area and power, which makes them essential in all classes of applications where ternary values are stored in memory and read frequently.
