## A Design Framework for Invertible Logic

| Authors | Onizawa Naoya, Nishino Kaito, Smithson Sean C., Meyer Brett H., Gross Warren J., Yamagata Hitoshi, Fujita Hiroyuki, Hanyu Takahiro |
| :---: | :---: |
| Journal or publication title | IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems : a publication of the IEEE Circuits and Systems Society |
| Page range | 312-316 |
| Year | 2020-06-22 |
| URL | http://hdl.handle.net/10097/00130848 |

# A Design Framework for Invertible Logic 

Naoya Onizawa, Member, IEEE, Kaito Nishino, Nonmember, IEEE, Sean C. Smithson, Student Member, IEEE, Brett H. Meyer, Senior Member, IEEE, Warren J. Gross, Senior Member, IEEE, Hitoshi Yamagata, Nonmember, IEEE, Hiroyuki Fujita, Nonmember, IEEE, and Takahiro Hanyu, Senior Member, IEEE


#### Abstract

Invertible logic using a probabilistic magnetoresistive device model has recently been presented; it can compute functions bidirectionally and quickly solve several problems, such as factorization and combinatorial optimization. In this paper, we present a design framework for invertible logic circuits. Our approach uses linear programming to create a Hamiltonian library with the minimum number of nodes for small invertible-logic functions. In addition, as the device model is approximated based on stochastic computing in synthesizable SystemVerilog, simulation using the compiled SystemC binary is faster than a conventional SPICE-level simulation, and the model is verified using a field-programmable gate array (FPGA) as a prototype. Using our design framework, several invertible-logic circuits are designed and emulated (verified) in SystemC, exhibiting five orders of magnitude faster simulation than a conventional work.


Keywords: Stochastic computing, Hamiltonian, SystemVerilog model, FPGA

## I. Introduction

Invertible logic has been recently presented for providing a capability of forward and backward operations [1] as opposed to typical binary logic for the forward operation. It is designed based on underlying Boltzmann machines [2] and probabilistic magnetoresistive device models (p-bits) [3] whose input and output signals are represented by random bit streams. The bidirectional computing capability is realized by reducing the network energies of the machines with noise control (e.g. a multiplier could be used as a factorizer in the backward mode). Hence, several challenging problems could be quickly solved, such as integer factorization (e.g. cryptography problems [4]), combinatorial optimization (e.g. wireless sensor networks [5]), and machine learning (e.g. training neural networks [6]).

However, there are several issues in designing large-scale invertible logic circuits. The functions that operate in bidirectional modes are defined by the Boltzmann machine configurations (Hamiltonians). Methods for generating Hamiltonians are limited to small function blocks, such as Boolean logic [7], [8]. Another issue is the simulation speed, which is limited by the complicated device model described at the transistor level. In [9], the model is emulated in software using a microcontroller; however, the simulation speed per sample is not fast (e.g. 100-300 ms).

In this paper, a design framework for large-scale invertible logic is presented in order to tackle the two main issues: network configurations (Hamiltonians) and simulation speed. For the small network configurations, a Hamiltonian library is created based on linear programming (LP), which provides the minimum number of nodes in Hamiltonians for basic functions including adders and nonlinear functions. In addition, Hamiltonians of large functions (e.g. multiplication) can be constructed by adding those of small function blocks. For faster simulations, the probabilistic device model is approximated using stochastic computing [10]-[12] in synthesizable SystemVerilog. Stochastic computing that uses random bit streams realizes area-efficient computation blocks (e.g. multiplication and tanh function) and has been recently used for several applications, such as low-density parity-check decoders [13]-[16], image processing [17], [18], and deep neural networks [19]. As invertible logic may operate as serial computing, stochastic computing efficiently approximates the device model. Therefore, invertible logic can be emulated (verified) in the compiled SystemC environment and verified in the prototyping hardware (FPGA). Using our design framework, two noise-control methods are introduced and discussed in terms of convergence speed.

Our contributions are summarized as follows: 1) the first design framework for invertible logic from specification to simulation, 2) Hamiltonian design using linear programming with the minimum number of nodes, and 3) five orders of magnitude faster simulation than conventional works. The rest of the paper is organized as follows. Section II reviews invertible logic with related works and discusses the current issues of invertible logic design. Section III describes an overview of the proposed design framework for invertible logic. Section IV introduces the creation of a Hamiltonian library using


Fig. 1. Invertible logic: (a) concept, (b) invertible AND, (c) Hamiltonian of invertible AND, and (d) state probabilities when $Y$ is fixed to 0.

LP and a method of designing large-scale invertible logic. Section V models the probabilistic device model (p-bits) using stochastic computing for fast simulation. Section VI introduces two noise-control optimization methods for fast convergence of invertible logic. Section VII evaluates the proposed design framework with the conventional work in several aspects, such as Hamiltonian and simulation speed. Section VIII concludes the paper.

## II. Preliminary

## A. Invertible logic

Fig. 1 (a) shows the concept of invertible logic realized using a Boltzmann machine and probabilistic bits (p-bits) [1]. Invertible logic circuits operate in forward and/or backward modes, where functions are embedded using a Hamiltonian with inputs $\left(x_{i} \in\{0,1\}(1 \leq i \leq p)\right)$ and outputs $\left(y_{i} \in\{0,1\}(1 \leq i \leq q)\right)$. Note that the 2's complement format is used to represent data in invertible logic throughout this paper. For example, an invertible multiplier is capable of multiplication with fixed inputs (forward mode) and factorization with fixed outputs (backward mode). If some of the inputs and outputs are fixed, the invertible multiplier operates as a divider.

Fig. 1 (c) shows a Hamiltonian of a two-input AND corresponding to the gate shown in Fig. 1 (b). There are three nodes, where weight values $(J)$ between nodes and bias values at nodes are given by:

$$
\begin{align*}
& \mathrm{h}_{\mathrm{AND}}=\left[\begin{array}{lrr}
+1 & +1 & -2
\end{array}\right],  \tag{1a}\\
& \mathrm{J}_{\mathrm{AND}}=\left[\begin{array}{rrr}
0 & -1 & +2 \\
-1 & 0 & +2 \\
+2 & +2 & 0
\end{array}\right], \tag{1b}
\end{align*}
$$

where the first two rows correspond to $A$ and $B$ and the last row corresponds to $Y$. Hamiltonians of simple logic gates can be obtained using ground-state spin logic [7], [8]. Given $h$ and $J$, each node (p-bit) probabilistically generates an output $m_{i} \in\{-1,1\}$ $(1 \leq i \leq l)$, where $l$ is the number of nodes. $m_{i}$ is given by the following equations:

$$
\begin{gather*}
m_{i}(t+\tau)=\operatorname{sgn}\left(\operatorname{rnd}(-1,+1)+\tanh \left(I_{i}(t+\tau)\right)\right)  \tag{2a}\\
I_{i}(t+\tau)=I_{0}\left(h_{i}+\sum_{j} J_{i j} m_{j}(t)\right) \tag{2b}
\end{gather*}
$$

where $\operatorname{rnd}(-1,+1)$ is a uniformly distributed random (real) number between $-1$ and $+1$, $\operatorname{sgn}$ is the sign function (with binary $+1$ or $-1$ outputs), and $I_{0}$ is a scaling factor (an inverse pseudo-temperature). As $m_{i}$ is represented in bipolar format, "$m_{i}=+1$" and "$m_{i}=-1$" correspond to logic values of '1' and '0', respectively.
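As a short worked step that is not spelled out in the text, Eq. (2a) fixes the probability that a p-bit outputs $+1$. Assuming only that $\operatorname{rnd}(-1,+1)$ is uniform on $(-1,+1)$, as stated above:

$$
\begin{align*}
P\left(m_{i}(t+\tau)=+1\right) & =P\left(\operatorname{rnd}(-1,+1)>-\tanh \left(I_{i}(t+\tau)\right)\right) \\
& =\frac{1+\tanh \left(I_{i}(t+\tau)\right)}{2}=\frac{1}{1+e^{-2 I_{i}(t+\tau)}}
\end{align*}
$$

A large positive $I_{i}$ therefore drives the node toward logic '1', a large negative $I_{i}$ toward logic '0', and $I_{i}=0$ gives an unbiased coin flip. Increasing $I_{0}$ in Eq. (2b) sharpens this response.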

Energies $(H)$ of invertible logic circuits are given by:

$$
\begin{equation*}
H=-\sum_{i} h_{i} m_{i}-\sum_{i<j} J_{i j} m_{i} m_{j} \tag{3}
\end{equation*}
$$

By controlling noise levels using several parameters, such as $I_{0}$, $H$ ideally decreases to the global minimum, leading to the desired inputs and/or outputs. Fig. 1 (d) shows an example of the two-input invertible AND in the backward mode. With the output $(Y)$ fixed to '0' ("$m_{y}=-1$"), there are three valid states ('ABY'): '000', '010', and '100'. In this simulation, the three valid states are obtained with almost the same probability of $33 \%$.
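The behavior described above can be reproduced with a minimal Python sketch. It evaluates Eq. (3) for the invertible-AND Hamiltonian of Eq. (1) and then clamps $Y$ while updating $A$ and $B$ with Eq. (2). The function names, the sequential update order, the value $I_{0}=1$, the number of sweeps, and the seed are assumptions made for illustration, not details taken from the original simulation.

```python
import itertools
import math
import random

# Invertible-AND Hamiltonian from Eq. (1); node order is (A, B, Y).
H_AND = [+1, +1, -2]
J_AND = [[0, -1, +2],
         [-1, 0, +2],
         [+2, +2, 0]]


def energy(m, h=H_AND, J=J_AND):
    """Energy of a bipolar state m according to Eq. (3)."""
    bias = sum(h[i] * m[i] for i in range(len(m)))
    pair = sum(J[i][j] * m[i] * m[j]
               for i in range(len(m)) for j in range(i + 1, len(m)))
    return -bias - pair


def energy_table():
    """Map each logic state 'ABY' to its energy."""
    table = {}
    for m in itertools.product([-1, +1], repeat=3):
        key = "".join("1" if v > 0 else "0" for v in m)
        table[key] = energy(m)
    return table


def sample_backward(y_fixed=-1, i0=1.0, sweeps=20000, seed=0):
    """Clamp Y, update A and B in turn with Eq. (2), return state frequencies."""
    rng = random.Random(seed)
    m = [rng.choice([-1, +1]), rng.choice([-1, +1]), y_fixed]
    counts = {}
    for _ in range(sweeps):
        for i in (0, 1):  # only the free nodes A and B are updated
            current = i0 * (H_AND[i] + sum(J_AND[i][j] * m[j] for j in range(3)))
            m[i] = +1 if rng.uniform(-1, 1) + math.tanh(current) > 0 else -1
        key = "".join("1" if v > 0 else "0" for v in m)
        counts[key] = counts.get(key, 0) + 1
    return {k: v / sweeps for k, v in counts.items()}
```

Calling `energy_table()` shows that the four valid AND states share the lowest energy, and `sample_backward()` returns frequencies in which '000', '010', and '100' dominate while '110' is rare.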

## B. Related works

Table I summarizes comparisons of logic-family characteristics. Unlike conventional Boolean logic, which realizes only forward operations, invertible logic can realize bidirectional (forward/backward) operations. The number of inputs and outputs is flexible, while computation is deterministic or

TABLE I
COMPARISONS OF LOGIC-FAMILY CHARACTERISTICS.

| Logic family | Operation | Function representation | I/O conditions | Physical realization |
| :---: | :---: | :---: | :---: | :---: |
| Conventional | Forward | Boolean algebra | $n$-input $/ m$-output | CMOS circuit |
| Reversible | Forward / Backward | BDD, matrix representation | $n$-input $/ n$-output | Quantum circuit |
| Invertible | Forward / Backward | Hamiltonian | $n$-input $/ m$-output | Probabilistic circuit |

probabilistic in conventional and invertible logic, respectively. Invertible logic is designed using a probabilistic device model and can be implemented using a magnetoresistive device [1].

Reversible logic circuits are constructed of special gates (such as Controlled NOT (CNOT) or Toffoli gates) having a direct one-to-one mapping of inputs to outputs [20]. While reversible logic gates allow bidirectional circuits to be built, they do not include standard gates (such as AND or OR gates) and require different design methods, such as binary decision diagrams (BDD) [21]. While both reversible and invertible logic circuits reconstruct inputs from a given output value, they differ at fundamental levels. Unlike in invertible logic, the number of inputs in reversible logic must equal the number of outputs, which can require additional inputs/outputs even for a simple AND gate [22]. For physical realization, the reversible-logic gates used in quantum circuits can be converted to standard binary logic, which can in turn be realized in standard CMOS.

## C. Design issues with invertible logic

There are two main issues for large-scale invertible-logic circuits. The first issue is the Hamiltonian design method. Unlike in reversible logic, a large-scale function (e.g. multiplication) is represented by a corresponding Hamiltonian that is designed by adding small Hamiltonians based on ground-state spin logic [7], [8]. However, the variety of available Hamiltonians is limited to small functions, such as AND, and there is no specific design method for creating Hamiltonians corresponding to other functions. The second issue is simulation speed. Small invertible logic circuits have been designed and simulated at the transistor level [1] and on a microcontroller (software) [9], which takes 100-300 ms per cycle of operation. For designing large-scale invertible logic, slow simulation could be a critical issue. In particular, because the noise effect controlled by $I_{0}$ in Eq. (2b) must be tuned for the circuit to converge to a valid state (minimum energy), a parameter search over the noise effect is required. A fast simulator allows designers to find a good noise parameter quickly. In this paper, these two issues are tackled using the proposed design framework for large-scale invertible circuits, which provides a design methodology for a large variety of Hamiltonians, a fast simulation environment, and noise-control optimization.


Fig. 2. Proposed design framework for invertible logic.

## III. DESIGN FRAMEWORK

Fig. 2 shows the proposed design framework for invertible logic. The framework consists of the following five steps.

- First, a circuit design specification is defined, such as desired functions and input/output bit widths.
- Second, a whole network Hamiltonian corresponding to the function is generated based on a Hamiltonian library. The Hamiltonian library is preliminarily created using linear programming (LP) described in Section IV, where Hamiltonians of small invertible logic circuits are included in the library, such as logic functions and adders. The whole Hamiltonian is obtained by adding the small Hamiltonians using our custom Python tool.
- Third, the whole Hamiltonian is converted to the corresponding SystemVerilog model using SystemVerilog primitive modules. The primitive modules are preliminarily designed using stochastic computing [10] described in Section V, where stochastic computing approximates the probabilistic device model. The SystemVerilog model generated using our custom Python tool is synthesizable using commercial EDA tools, such as Synopsys Design Compiler.
- Fourth, a test bench is created with noise control of parameters, such as $I_{0}$. In invertible logic, the convergence speed can change significantly depending on the noise control and its hyperparameters, and a selection of optimum parameters allows the circuit to reach the global minimum energy. Two noise-control methods are introduced in Section VI.
- Fifth, the invertible logic circuit described by the SystemVerilog model is verified (emulated) using Verilator [23], which is faster than SPICE simulations and interpreted Verilog simulations; Verilator compiles the SystemC test benches and the SystemVerilog models. Using this fast simulation environment, the hyperparameters for fast convergence to the global minimum can be optimized (noise-control optimization), as described in Section VI. In addition, quite large invertible circuits can be prototyped and verified on FPGA boards through commercial FPGA design tools, such as Xilinx Vivado.

| A | B | Y | State | $m_{A}$ | $m_{B}$ | $m_{Y}$ | $H$ (Energy) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | Valid | -1 | -1 | -1 | $E_{0}=E_{\min }$ |
| 0 | 0 | 1 | Invalid | -1 | -1 | 1 | $E_{1} \geq E_{\min }+d$ |
| 0 | 1 | 0 | Valid | -1 | 1 | -1 | $E_{2}=E_{\min }$ |
| 0 | 1 | 1 | Invalid | -1 | 1 | 1 | $E_{3} \geq E_{\min }+d$ |
| 1 | 0 | 0 | Valid | 1 | -1 | -1 | $E_{4}=E_{\min }$ |
| 1 | 0 | 1 | Invalid | 1 | -1 | 1 | $E_{5} \geq E_{\min }+d$ |
| 1 | 1 | 0 | Invalid | 1 | 1 | -1 | $E_{6} \geq E_{\min }+d$ |
| 1 | 1 | 1 | Valid | 1 | 1 | 1 | $E_{7}=E_{\min }$ |

Fig. 3. Example of Hamiltonian design of an invertible AND $(Y=A \wedge B)$ using linear programming (LP).


## IV. Hamiltonian design

## A. Hamiltonian library of small invertible logic using linear programming (LP)

Hamiltonians of small function blocks, such as logic gates, are obtained using linear programming (LP). Fig. 3 illustrates an example of the Hamiltonian design of an invertible AND $(Y=A \wedge B)$. There are eight states in total, which are divided into valid and invalid states based on the AND function.

Let us explain the procedure for generating a Hamiltonian using the invertible AND as an example. The inputs $\left(x_{i} \in\{0,1\}(1 \leq i \leq p)\right)$ and the outputs $\left(y_{i} \in\{0,1\}(1 \leq i \leq q)\right)$ are defined as shown in Fig. 1 (a). First, logical values are converted to bipolar format as $m_{i}$. Second, the energy of each state $\left(E_{k}(1 \leq k \leq(l+l(l-1) / 2))\right)$ is defined based on Eq. (3), where $l$ is the sum of the input and output bit widths. In this case, $l$ is 3 and the maximum $k$ is 6. In invertible logic, the energies of the valid states must be equal to the minimum ($E_{\min }$), while those of the invalid states must be larger than $E_{\min }$, as follows:

$$
E_{k}= \begin{cases}-\sum_{i} h_{i} m_{i}-\sum_{i<j} J_{i j} m_{i} m_{j}=E_{\min } & \left(f\left(x_{1} \ldots x_{p}\right)=\left(y_{1} \ldots y_{q}\right)\right) \\ -\sum_{i} h_{i} m_{i}-\sum_{i<j} J_{i j} m_{i} m_{j} \geq E_{\min }+d & (\text { otherwise })\end{cases} \tag{4}
$$

TABLE II
Hamiltonian library generated using LP.

| Function | Minimum number of nodes |
| :---: | :---: |
| AND, OR, NAND, NOR | 3 |
| XOR, XNOR, HA | 4 |
| FA | 5 |
| $n$-bit adder | $(3 n+1)$ |
| $n$-input bitcount | $n+\left\lceil\log _{2}(n+1)\right\rceil$ |
| $n$-bit ReLU | $2 n$ |

where $d$ is the energy difference between $E_{\min }$ and the second minimum energy. Third, the objective function is maximizing $d$ using LP in order to obtain $h_{i}$ and $J_{i j}$ as follows:

$$
\begin{align*}
\text { maximize } \quad & d \tag{5}\\
\text { subject to } \quad & \text { Eq. (4) } \tag{6}
\end{align*}
$$

where $m_{i}$ and $m_{j}$ are constants and $h_{i}, J_{i j}, E$ and $d$ are variables.

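Listing 1. A part of the Python code of the LP with PuLP for the invertible-AND Hamiltonian.

```python
import itertools
import pulp

# All eight (A, B, Y) states in bipolar format
# (reconstructed from Fig. 3; the state table is not shown in the original listing)
arr = list(itertools.product([-1, 1], repeat=3))
pairs = [(r, c) for r in range(2) for c in range(r + 1, 3)]  # (A,B), (A,Y), (B,Y)

problem = pulp.LpProblem('and', pulp.LpMinimize)
# Definition of variables
E = pulp.LpVariable('E', -3, -1, 'Continuous')
h = pulp.LpVariable.dicts('h', range(3), -2, 2, 'Integer')
j = pulp.LpVariable.dicts('j', range(3), -2, 2, 'Integer')  # one weight per node pair


def energy(m):
    """Eq. (3) for one state m (reconstructed; elided in the original listing)."""
    H_h = pulp.lpSum(m[r] * h[r] for r in range(3))
    H_j = pulp.lpSum(m[r] * m[c] * j[k] for k, (r, c) in enumerate(pairs))
    return (-1) * H_h - H_j


problem += energy(arr[0]) - E  # Objective function, built from the first state
for m in arr:  # Constraint conditions
    IN_A, IN_B, OUT_C = ((v + 1) // 2 for v in m)
    if IN_A * IN_B == OUT_C:
        problem += energy(m) - E == 0  # valid state: energy equals E
    else:
        problem += energy(m) - E - 1 >= 0  # invalid state: at least 1 above E
status = problem.solve()  # Solving this LP
```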

Hamiltonians are obtained using LP with PuLP [24]. Listing 1 shows a part of the Python code of the LP for the invertible-AND Hamiltonian. Using this method, the Hamiltonians of the small function blocks listed in Table II are obtained. The number of nodes is the sum of the input and output bit widths. These numbers are the minimum values because no auxiliary bit (node) is used. Note that auxiliary bits are extra bits other than the input and output bits (see Fig. 5 (a)).
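As a cross-check on the formulation of Eq. (4), the following sketch replaces the LP solver with a plain exhaustive search. This is a swapped-in technique chosen so that the example runs with the Python standard library only; it is not the PuLP flow of Listing 1. The integer range $[-2, 2]$ for $h$ and $J$ follows the variable bounds in Listing 1, and all function names are hypothetical.

```python
import itertools

STATES = list(itertools.product([-1, +1], repeat=3))   # (A, B, Y) in bipolar form
PAIRS = [(0, 1), (0, 2), (1, 2)]                        # (A,B), (A,Y), (B,Y)


def is_valid(m):
    """True when the state satisfies Y = A AND B."""
    a, b, y = ((v + 1) // 2 for v in m)
    return (a & b) == y


def energy(m, h, j):
    """Eq. (3) with J stored as one value per node pair."""
    return (-sum(h[i] * m[i] for i in range(3))
            - sum(j[k] * m[r] * m[c] for k, (r, c) in enumerate(PAIRS)))


def search_and_hamiltonian(bound=2):
    """Exhaustively maximize the gap d of Eq. (4) over integer h and J."""
    values = range(-bound, bound + 1)
    best = None
    for coeffs in itertools.product(values, repeat=6):
        h, j = coeffs[:3], coeffs[3:]
        valid = {energy(m, h, j) for m in STATES if is_valid(m)}
        if len(valid) != 1:          # all valid states must share E_min
            continue
        e_min = valid.pop()
        d = min(energy(m, h, j) for m in STATES if not is_valid(m)) - e_min
        if best is None or d > best[0]:
            best = (d, e_min, h, j)
    return best
```

Within these bounds the search returns the same coefficients as Eq. (1), with a gap of $d=4$ between the valid states and the lowest invalid state. For larger functions the search space grows too quickly for enumeration, which is one reason an LP formulation is attractive.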

In addition to the logic functions and the adders, Hamiltonians of several unique functions, such as the bitcount function and the Rectified Linear Unit (ReLU) function, can be obtained using LP. These Hamiltonians are created because these functions are often used for machine learning as building blocks in neural networks [25]-[27]. Both functions are activation functions of neural networks, where the bitcount function is used in binary neural networks. By using these building blocks, invertible logic could be applied to machine learning, especially training neural networks using the bidirectional operations of invertible logic [6].

Fig. 4. Examples of Hamiltonians that could be used for machine learning: (a) bitcount function with 6 inputs of $\left(x_{1}, x_{2}, \ldots, x_{6}\right)$ and a 3-bit unsigned output of $Y=\left(y_{2}, y_{1}, y_{0}\right)$ and (b) 5-bit ReLU function with a 5-bit signed input of $X=\left(x_{4}, x_{3}, x_{2}, x_{1}, x_{0}\right)$ and a 5-bit signed output of $Y=\left(y_{4}, y_{3}, y_{2}, y_{1}, y_{0}\right)$ in 2's complement format.

Fig. 4 (a) shows a Hamiltonian example of a 6-input bitcount function with 6 inputs of $\left(x_{1}, x_{2}, \ldots, x_{6}\right)$ and a 3-bit unsigned output of $Y=\left(y_{2}, y_{1}, y_{0}\right)$ in 2's complement format. The invertible bitcount circuit can realize $Y=\sum_{i=1}^{6} x_{i}$ in forward and backward modes. Fig. 4 (b) shows a Hamiltonian example of a 5-bit ReLU function with a 5-bit signed input of $X=\left(x_{4}, x_{3}, x_{2}, x_{1}, x_{0}\right)$ and a 5-bit signed output of $Y=\left(y_{4}, y_{3}, y_{2}, y_{1}, y_{0}\right)$, where the function of $\operatorname{ReLU}$ is defined by $Y=\max (0, X)$.
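To make the forward and backward meaning of these two functions concrete, the sketch below defines them directly in Python and enumerates the inputs that an ideal backward-mode run could return for a fixed output. It does not build the Hamiltonians of Fig. 4; the function names are hypothetical, and the node-count helpers simply restate the formulas of Table II.

```python
import itertools
import math


def bitcount(bits):
    """Forward bitcount: number of ones among the inputs."""
    return sum(bits)


def relu(x):
    """Forward ReLU on a signed integer."""
    return max(0, x)


def bitcount_preimage(y, n):
    """All n-bit input patterns whose bitcount equals y (backward mode)."""
    return [bits for bits in itertools.product([0, 1], repeat=n)
            if bitcount(bits) == y]


def relu_preimage(y, n):
    """All n-bit two's-complement inputs X with ReLU(X) == y (backward mode)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return [x for x in range(lo, hi + 1) if relu(x) == y]


def bitcount_nodes(n):
    """Node count from Table II: n inputs plus ceil(log2(n + 1)) output bits."""
    return n + math.ceil(math.log2(n + 1))


def relu_nodes(n):
    """Node count from Table II: n input bits plus n output bits."""
    return 2 * n
```

For example, `bitcount_preimage(2, 6)` lists every 6-bit pattern with exactly two ones, and `relu_preimage(0, 5)` lists every 5-bit input that maps to zero, so a backward-mode run has many equally valid answers in both cases.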

## B. Hamiltonian construction for large invertible logic

Hamiltonians of large and/or complicated functions, such as multiplication, cannot be directly generated using LP because of linear separability problems. Hence, auxiliary bits are required to create such Hamiltonians. The whole Hamiltonian can be created by adding small Hamiltonians as follows:

$$
\begin{align*}
h & =\sum_{k} h_{k},  \tag{7}\\
J & =\sum_{k} J_{k}, \tag{8}
\end{align*}
$$

where $h_{k}$ and $J_{k}$ represent a Hamiltonian corresponding to a small circuit, such as AND, HA, and FA.

Fig. 5 shows an example of the Hamiltonian of a three-input AND logic. The Hamiltonian is obtained by adding two Hamiltonians of the two-input AND logic. In this case, there is an additional connection ($c$) that becomes an auxiliary bit. If the Hamiltonian were created directly from the three-input AND logic, the auxiliary bit could be removed, leading to the minimum number of nodes. When the number of nodes is increased by auxiliary bits, the hardware of invertible logic could be larger and the convergence speed could be slower.

Addition of bias ($h$):

|  | $h_{a}$ | $h_{b}$ | $h_{c}$ | $h_{d}$ | $h_{e}$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| AND1 | 1 | 1 | -2 | 0 | 0 |
| AND2 | 0 | 0 | 1 | 1 | -2 |
| Total | 1 | 1 | -1 | 1 | -2 |

Fig. 5. Hamiltonian design example of a three-input invertible AND by adding two Hamiltonians of two-input AND gates: (a) block diagram using two 2-input AND gates, (b) addition of $h$, and (c) addition of $J$.
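The additions of Eqs. (7) and (8) for the three-input AND can be checked numerically with a short sketch. It assumes the node assignment $a, b \rightarrow c$ for the first gate and $c, d \rightarrow e$ for the second gate, which matches the bias rows of Fig. 5; the helper names are hypothetical.

```python
import itertools

H_AND = [+1, +1, -2]                            # Eq. (1a), node order (in0, in1, out)
J_AND = {(0, 1): -1, (0, 2): +2, (1, 2): +2}    # upper triangle of Eq. (1b)


def embed(nodes, total):
    """Place the two-input AND Hamiltonian on the given nodes of a larger network."""
    h = [0] * total
    j = {}
    for local, node in enumerate(nodes):
        h[node] = H_AND[local]
    for (r, c), w in J_AND.items():
        j[(nodes[r], nodes[c])] = w
    return h, j


def add_hamiltonians(parts, total):
    """Eqs. (7) and (8): element-wise sum of the small Hamiltonians."""
    h = [0] * total
    j = {}
    for h_k, j_k in parts:
        h = [a + b for a, b in zip(h, h_k)]
        for pair, w in j_k.items():
            j[pair] = j.get(pair, 0) + w
    return h, j


def energy(m, h, j):
    """Eq. (3) for a bipolar state m."""
    return (-sum(h_i * m_i for h_i, m_i in zip(h, m))
            - sum(w * m[r] * m[c] for (r, c), w in j.items()))


# Nodes: a=0, b=1, c=2 (auxiliary), d=3, e=4; AND1: c = a AND b, AND2: e = c AND d.
h_total, j_total = add_hamiltonians([embed((0, 1, 2), 5), embed((2, 3, 4), 5)], 5)

states = list(itertools.product([-1, +1], repeat=5))
e_min = min(energy(m, h_total, j_total) for m in states)
ground = [tuple((v + 1) // 2 for v in m) for m in states
          if energy(m, h_total, j_total) == e_min]
```

The ground states in `ground` are exactly the assignments in which the auxiliary bit $c$ equals $a \wedge b$ and the output $e$ equals $a \wedge b \wedge d$, which illustrates why adding small Hamiltonians preserves the intended function at the cost of one extra node.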

For designing Hamiltonians of large invertible logic, the circuit architecture is an important factor that can affect the performance of invertible logic. Fig. 6 (a) shows a $4 \times 4$-bit unsigned multiplier architecture based on a simple adder-based structure. This design includes $(2 \times 4)$ inputs, 8 outputs, and 32 internal connections. The Hamiltonian is obtained by adding those of AND and FA generated using LP. The number of nodes in the Hamiltonian is 48. The number of internal connections (auxiliary bits) increases exponentially as the input bit width increases because of horizontal and vertical internal connections. Note that a well-known Wallace-tree structure for fast multiplier design in conventional logic [28] causes a larger number of internal signals (nodes) than the adder-based structure.

In order to obtain a smaller number of nodes, the proposed multiplier is designed using the bitcount circuits as shown in Fig. 6 (b). As there is no internal connection in the bitcount circuit, the vertical internal connections can be eliminated, leading to a smaller number of nodes. In the case of the $4 \times 4$-bit multiplier, the number of internal connections (auxiliary bits) decreases to 26 and hence the total number of nodes decreases to 42. The reduction method is much more effective in larger multipliers.

Fig. 7 compares the number of nodes in invertible multipliers (factorizers). The number of nodes increases exponentially in the conventional adder-based multiplier because horizontal and vertical internal connections (auxiliary bits) are required. As the proposed bitcount-based structure eliminates the vertical internal connections, the number of nodes increases almost linearly, leading to significant reductions in the number of nodes. As a result, the numbers of nodes in the $4 \times 4$-bit and the $12 \times 12$-bit multipliers are reduced by $80.6 \%$ and $89.1 \%$, respectively. The detailed evaluation is described in Section VII-E.

Fig. 6. $4 \times 4$-bit unsigned multiplier architecture for constructing Hamiltonians: (a) adder-based structure and (b) bitcount-based structure (proposed) that decreases the number of vertical interconnections (auxiliary bits).

## V. SystemVerilog model Using Stochastic Computing

## A. Binary and integral stochastic computing

In invertible logic, p-bits operate based on Eq. (2) with bias values ($h$) and weight values $(J)$ of Hamiltonians. In order to realize faster simulations than conventional works at the transistor level [1] and on a microcontroller (software) [9], a SystemVerilog model corresponding to Eq. (2) is created.


Fig. 7. Number of nodes in invertible multipliers (factorizers). The proposed structure realizes almost a linear growth of nodes in proportion to the input bit width while the exponential growth occurs in the conventional structure.
Binary stochastic bitstream (bipolar coding): $S: 1,0,1,0,1,1,1,1$ with $\mathrm{E}[S]=6/8$, which represents $s=2 \cdot \mathrm{E}[S]-1=4/8$.

Fig. 8. Binary and integral stochastic computing in bipolar coding: (a) a multiplier of an integer stochastic bitstream and a binary stochastic bitstream ( $y=a * s$ ), and (b) a stochastic tanh function realized using a saturated updown counter.

The SystemVerilog model is designed based on binary and integral stochastic computing [10], [19].

In stochastic computing, data values are represented by the frequency of '1' in bit streams. Let us denote by $S \in\{0,1\}$ a random bit stream. A real number, $s \in[-1,1]$, is represented by $(2 \cdot \mathrm{E}[S]-1)$ in binary stochastic computing in bipolar format, where $\mathrm{E}[S]$ denotes the expected value of the random variable $S$. In contrast, in integral stochastic computing, one or more bit streams are used concurrently to represent data values in larger ranges than that of binary stochastic computing. Let us denote by $X \in\{-r,-(r-1), \ldots, r\}$ a random stream, where $r \in\{1,2, \ldots\}$. A real number, $x \in[-r, r]$, is represented by $\mathrm{E}[X]$ in signed format, where $\mathrm{E}[X]$ denotes the expected value of the random variable $X$.
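The two encodings can be illustrated with a few lines of Python. The sketch assumes that an integral stochastic stream is formed as the cycle-by-cycle sum of $r$ bipolar-mapped bit streams, which is one common construction and is stated here as an assumption; the second bit stream `S2` is made up for illustration.

```python
def bipolar_value(bits):
    """Value s = 2*E[S] - 1 represented by a binary stochastic bit stream."""
    return 2 * sum(bits) / len(bits) - 1


def integral_stream(streams):
    """Cycle-by-cycle sum of bipolar-mapped bit streams, giving X in {-r, ..., r}."""
    return [sum(2 * b - 1 for b in column) for column in zip(*streams)]


def integral_value(stream):
    """Value x = E[X] represented by an integral stochastic stream."""
    return sum(stream) / len(stream)


S = [1, 0, 1, 0, 1, 1, 1, 1]      # example bit stream with E[S] = 6/8
S2 = [1, 1, 0, 1, 1, 1, 0, 1]     # second, made-up stream for illustration
X = integral_stream([S, S2])      # r = 2 streams, so X lies in {-2, ..., 2}
```

Here `bipolar_value(S)` recovers the value represented by the example stream, and `integral_value(X)` equals the sum of the two bipolar values, which shows how integral streams extend the representable range beyond $[-1, 1]$.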


Fig. 9. Spin-gate circuit (SystemVerilog model) using stochastic computing, which corresponds to Eq. (10).

Stochastic computing realizes several functions, such as addition, multiplication, and nonlinear functions (see [12] for details). Fig. 8 (a) shows a multiplier of an integer stochastic bitstream and a binary stochastic bitstream ($y=a * s$) designed using a two-input multiplexer. Fig. 8 (b) shows a tanh function block (Stanh) using a finite state machine (FSM) in stochastic computing. The tanh function is approximated using Stanh as follows:

$$
\begin{equation*}
\operatorname{Stanh}\left(2 \cdot N_{T}, x\right) \approx \tanh \left(x \cdot N_{T}\right) \tag{9}
\end{equation*}
$$

where $2 \cdot N_{T}$ is the total number of states of the FSM. The Stanh block is designed using a saturated updown counter in hardware.
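A minimal software sketch of such a saturating up/down counter is given below for a binary (bipolar) input stream. The starting state, the output threshold at the middle of the state range, the stream length, and the seed are assumptions for illustration rather than details of the SystemVerilog primitive.

```python
import math
import random


def stanh_stream(bits, n_t):
    """Saturating up/down counter with 2*n_t states; returns the output bit stream."""
    n_states = 2 * n_t
    state = n_t                      # start in the middle of the state range
    out = []
    for b in bits:
        state = min(n_states - 1, state + 1) if b else max(0, state - 1)
        out.append(1 if state >= n_t else 0)
    return out


def estimate_stanh(x, n_t, length=200000, seed=1):
    """Feed a bipolar stream of value x through the FSM and decode the output."""
    rng = random.Random(seed)
    p_one = (1 + x) / 2              # bipolar coding: P(bit = 1) = (1 + x) / 2
    bits = [1 if rng.random() < p_one else 0 for _ in range(length)]
    out = stanh_stream(bits, n_t)
    return 2 * sum(out) / len(out) - 1
```

Comparing `estimate_stanh(x, n_t)` with `math.tanh(x * n_t)` for small `x` shows the approximation of Eq. (9); the two values drift apart as $|x|$ approaches 1, where the counter saturates faster than the ideal tanh.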

## B. Spin-gate circuit for modeling p-bits

Fig. 9 shows a spin-gate circuit (SystemVerilog model) using binary and integral stochastic computing. This model approximates the original equation of Eq. (2) as follows:

$$
\begin{align*}
m_{i}(t+\tau) & \simeq \operatorname{sgn}\left(\tanh \left(I_{i}(t+\tau) \cdot N_{T}\right)\right), \tag{10a}\\
I_{i}(t+\tau) & \simeq h_{i}+\sum_{j} J_{i j} m_{j}(t)+w_{r n d} \cdot \operatorname{sgn}(\operatorname{rnd}(-1,+1)), \tag{10b}
\end{align*}
$$
where $I_{0}$ of Eq. (2) corresponds to $N_{T}$. In addition, the weighted noise source, whose weight is denoted as $w_{r n d}$, is a parameter added to Eq. (2). The weighted noise source is generated using a random number generator [29]. The model is designed as an extended version of [30], [31] that supports controlling $I_{0}$ using $N_{T}$. The inputs and the output of the spin-gate circuits $\left(m_{i}\right)$ are represented in binary stochastic computing in bipolar format as stochastic bit streams, $s_{i}=\left(1+m_{i}\right) / 2$. In contrast, integral stochastic computing is exploited inside the spin-gate circuits in order to deal with integer values of $h$ and $J$. As the model is fully designed using stochastic computing, it is synthesizable for standard digital CMOS circuits.

Fig. 10. Monotonic noise reduction (MNR) with grid search. The noise, $w_{r n d}$, is linearly decreased in order to converge energy to the global minimum.
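A behavioral sketch of one spin gate following Eq. (10) is shown below in Python. It assumes that the tanh block is realized as a saturating counter that accumulates the integer $I_{i}$ each cycle and that the output is the sign of the counter; the class name, the counter range, and the forward-mode test harness are hypothetical and are not taken from the SystemVerilog model.

```python
import random

H_AND = [+1, +1, -2]                              # Eq. (1a), node order (A, B, Y)
J_AND = [[0, -1, +2], [-1, 0, +2], [+2, +2, 0]]   # Eq. (1b)


class SpinGate:
    """One node of Eq. (10): integer sum, weighted noise, saturating counter, sign."""

    def __init__(self, h_i, j_row, n_t, rng):
        self.h_i, self.j_row, self.n_t, self.rng = h_i, j_row, n_t, rng
        self.counter = 0

    def step(self, m, w_rnd):
        noise = w_rnd * self.rng.choice([-1, +1])          # w_rnd * sgn(rnd(-1, +1))
        current = self.h_i + sum(j * m_j for j, m_j in zip(self.j_row, m)) + noise
        # Saturating up/down counter standing in for tanh(I_i * N_T) of Eq. (10a).
        self.counter = max(-self.n_t, min(self.n_t - 1, self.counter + current))
        return +1 if self.counter >= 0 else -1


def forward_and(a, b, w_rnd=1, n_t=4, cycles=50, seed=0):
    """Clamp A and B (logic values) and let the Y node settle."""
    rng = random.Random(seed)
    gate_y = SpinGate(H_AND[2], J_AND[2], n_t, rng)
    m = [2 * a - 1, 2 * b - 1, -1]
    for _ in range(cycles):
        m[2] = gate_y.step(m, w_rnd)
    return (m[2] + 1) // 2
```

With a small noise weight the forward mode is deterministic, since the bias and weights dominate the noise term; larger values of `w_rnd` let the node escape its current state, which is what the noise-control methods exploit in the backward mode.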

## VI. Noise-control optimization

In invertible logic, it is important to control noise effects in order to reach the global minimum of energy (Hamiltonian). To make the node states converge to those at the global minimum, $N_{T}$ and/or $w_{r n d}$ of Eq. (10) can be controlled as noise optimization. In this paper, $w_{r n d}$ is selected as the controlled parameter in two scenarios.

## A. Grid search on monotonic noise reduction (MNR)

To find the optimum control of $w_{r n d}$, a grid search is used as shown in Fig. 10. In the grid-search method, $w_{r n d}$ is linearly decreased using four parameters as follows:

- $RND\_WEIGHT$: the maximum value of $w_{r n d}$
- $RND\_STEP$: the amount of each noise drop
- $N_{s}$: the number of noise drops, defined by $2^{RND\_DECAY}-1$
- $T$: the number of cycles at the same $w_{r n d}$

These four parameters are swept in order to converge the energy to the global minimum. The monotonic noise reduction (MNR) method derives from [32], which monotonically increases $I_{0}$. This method is simple, but it requires a long simulation time to find good noise parameters. The detailed simulation results are summarized in Section VII-D.
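The schedule implied by these parameters can be sketched as follows. The sketch assumes that the initial noise level is held for $T$ cycles before the first drop and that the weight is clamped at zero; both points, as well as the function names, are assumptions rather than details given in the text.

```python
import itertools


def mnr_schedule(rnd_weight, rnd_step, n_s, t):
    """Per-cycle noise weights: hold each level for t cycles, then drop by rnd_step."""
    schedule = []
    for k in range(n_s + 1):                 # initial level plus n_s drops
        level = max(0, rnd_weight - k * rnd_step)
        schedule.extend([level] * t)
    return schedule


def mnr_grid(weights, steps, drop_counts, holds):
    """Enumerate every (RND_WEIGHT, RND_STEP, N_s, T) candidate of the grid search."""
    return list(itertools.product(weights, steps, drop_counts, holds))
```

Each tuple from `mnr_grid` would be passed to `mnr_schedule` and simulated once, so the cost of the grid search grows with the product of the four sweep ranges.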

## B. Tuning Parameter Repeat (TPR) with pulsed noise

In order to reduce the simulation time for finding good noise parameters, Tuning Parameter Repeat (TPR) is introduced. In this method, a short pulsed noise is applied repeatedly, as opposed to MNR based on grid search. Fig. 11 shows a noise control based on TPR with an example of factorization of $756$ $(in\_A \times in\_B)$. In TPR, $w_{r n d}$ is decreased from large to small in one shot. There are three parameters in TPR as follows:

- $RND\_WEIGHT$: the maximum value of $w_{r n d}$
- $RND\_STEP$: the amount of noise drops
- $T$: the number of cycles at large or small noise

Fig. 11. Noise control based on Tuning Parameter Repeat (TPR) with an example of factorization of $756$ $(in\_A \times in\_B)$. A short pulsed noise is repeatedly applied to obtain the correct values. In this case, at the third trial, correct input values of 27 and 28 are obtained.

TABLE III
Comparisons of number of nodes in Hamiltonian.

|  | Conventional [33] | Proposed |
| :---: | :---: | :---: |
| AND | 3 | 3 |
| Full adder | 14 | 5 |
| 32-bit adder | 434 | 97 |

Hence, one shot takes $(2 T+RND\_STEP)$ cycles. As invertible logic is probabilistic, the results (energies) can differ even if the same noise parameters are applied. This example shows a factorization of 756 using TPR with $RND\_WEIGHT=6$, $RND\_STEP=4$, and $T=6$. The tuning parameters were determined using simulations of a small invertible factorizer and can be applied to larger invertible factorizers. In this example, at the first and second trials, the correct input values of $in\_A$ and $in\_B$ are not obtained. In contrast, at the third trial using the same noise parameters, the correct input values of 27 and 28 are obtained. The comparison results with the grid search are summarized in Section VII-D.
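One possible reading of a TPR shot is sketched below: $T$ cycles at the maximum noise, a ramp lasting $RND\_STEP$ cycles, and $T$ cycles at the minimum noise, which gives the stated length of $2T + RND\_STEP$ cycles. The linear shape of the ramp and the minimum level of zero are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
def tpr_shot(rnd_weight, rnd_step, t):
    """One pulse: t cycles at high noise, a rnd_step-cycle ramp, t cycles at low noise."""
    high = [rnd_weight] * t
    # Assumed linear ramp from just below rnd_weight down to just above zero.
    ramp = [rnd_weight * (rnd_step - k) / (rnd_step + 1) for k in range(rnd_step)]
    low = [0] * t
    return high + ramp + low


def tpr_schedule(rnd_weight, rnd_step, t, trials):
    """Repeat the same pulse for a number of trials."""
    return tpr_shot(rnd_weight, rnd_step, t) * trials
```

With the parameters of the example ($RND\_WEIGHT=6$, $RND\_STEP=4$, $T=6$), `tpr_shot` produces one 16-cycle pulse, and `tpr_schedule(6, 4, 6, 3)` covers the three trials discussed above. Because every trial reuses the same short pulse, no per-problem parameter sweep is needed.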

## VII. Evaluation

## A. Comparisons of Hamiltonian

Fig. 12. Simulation results of a seven-input bitcount function in the backward mode with MNR. Given a fixed output $Y$, the seven inputs are correctly obtained at: (a) $Y=2$ and (b) $Y=6$.

Fig. 13. Simulation results of a 10-bit ReLU function in the backward mode with MNR. Given a fixed output $Y$, the 10-bit inputs are correctly obtained at: (a) $Y=121$ and (b) $Y=0$.

The Hamiltonian library is created using LP with PuLP [24] on an AMD Opteron 6282 SE (2.6 GHz), which is used for all the simulations. Table III summarizes the number of nodes in the Hamiltonians in comparison with a conventional work [33]. The conventional method is based on [1], which uses auxiliary bits and a handle bit to create Hamiltonians, resulting in a larger number of nodes. In contrast, the proposed method using LP generates the minimum number of nodes for the Hamiltonians of AND, FA, and the 32-bit adder. The numbers of nodes in FA and the 32-bit adder are reduced by $64.3 \%$ and $77.7 \%$, respectively, in comparison with the conventional method.
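To make the notion of a three-node AND Hamiltonian in Table III concrete, the sketch below brute-forces the ground states of a small Hamiltonian and compares them with the AND truth table. The values of `H_AND` and `J_AND` are an assumption: they are a commonly quoted AND Hamiltonian from the p-bit literature, not necessarily the entry in this paper's library, and the energy convention in `energy` is likewise assumed rather than taken from Eq. (3).

```python
from itertools import product

# Assumed example values over bipolar spins (-1/+1), node order (A, B, C)
# with C = A AND B. Not taken from this paper's Hamiltonian library.
H_AND = [1, 1, -2]
J_AND = [
    [0, -1, 2],
    [-1, 0, 2],
    [2, 2, 0],
]


def energy(m, h, J):
    """Assumed convention: E(m) = -(sum_{i<j} J_ij m_i m_j + sum_i h_i m_i)."""
    n = len(m)
    pair = sum(J[i][j] * m[i] * m[j] for i in range(n) for j in range(i + 1, n))
    bias = sum(h[i] * m[i] for i in range(n))
    return -(pair + bias)


def ground_states(h, J):
    """Enumerate all spin states and return (minimum energy, states at minimum)."""
    energies = {s: energy(s, h, J) for s in product((-1, 1), repeat=len(h))}
    e_min = min(energies.values())
    return e_min, sorted(s for s, e in energies.items() if e == e_min)


def and_truth_table_spins():
    """AND truth table mapped from bits {0, 1} to spins {-1, +1}."""
    to_spin = lambda bit: 2 * bit - 1
    return sorted(
        (to_spin(a), to_spin(b), to_spin(a & b)) for a in (0, 1) for b in (0, 1)
    )


e_min, states = ground_states(H_AND, J_AND)
print(e_min, states == and_truth_table_spins())  # prints: -3 True
```

Exhaustive enumeration like this is only practical for the small functions in the library; it serves as a check that the minimum-energy states of a candidate Hamiltonian coincide with the valid rows of the target function.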

## B. Simulation of invertible logic circuits

Invertible logic circuits are simulated using our SystemVerilog model with the compiled SystemC binary in Verilator [23] and SystemC-2.3.2. Verilator is a fast Verilog-HDL simulator running on C++ and SystemC that accepts synthesizable Verilog-HDL and SystemVerilog. Fig. 12 shows simulated waveforms of a seven-bit bitcount function in the backward mode. The output $Y$ is fixed in order to obtain the correct seven inputs $\left(x_{1}, x_{2}, \ldots, x_{7}\right)$ at $Y=2$ and $Y=6$ with a noise control of MNR, where $\mathit{RND\_WEIGHT}=16$, $\mathit{RND\_STEP}=2$, $N_{s}=7$, and $T=10$ are used. When the energy defined by Eq. (3) reaches the global minimum of $-14$, the correct inputs are obtained.

Fig. 14. Factorization results using the SystemVerilog model with: (a) $(A \times B)=182$ and (b) $(A \times B)=598$. In both cases, the outputs are correctly factorized. The noise-control parameters are obtained based on the grid search.

Fig. 13 shows simulated waveforms of a 10-bit ReLU function in the backward mode with the same noise control used in the previous simulation. When the output $Y$ is fixed to 121 in Fig. 13 (a), the input $X$ reaches the correct value of 121 at the global minimum energy of $-28$. In contrast, when the output is fixed to 0 in Fig. 13 (b), any non-positive input is a correct value because of the function $Y=\operatorname{ReLU}(X)=\max (0, X)$.
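The backward mode can settle on any input consistent with the fixed output, so it is useful to know how many such inputs exist. The sketch below enumerates them for the two functions above. The helper names are hypothetical, and the 10-bit ReLU input is assumed to be a two's-complement signed integer, which the text does not state explicitly.

```python
from itertools import product


def bitcount_preimages(y, n_inputs=7):
    """All n_inputs-bit input vectors whose number of ones equals y."""
    return [bits for bits in product((0, 1), repeat=n_inputs) if sum(bits) == y]


def relu_preimages(y, bits=10):
    """All signed inputs X (assumed two's complement) with max(0, X) == y."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [x for x in range(lo, hi + 1) if max(0, x) == y]


print(len(bitcount_preimages(2)))  # 21 valid input vectors for Y = 2
print(len(bitcount_preimages(6)))  # 7 valid input vectors for Y = 6
print(relu_preimages(121))         # [121]
print(len(relu_preimages(0)))      # 513 under the signed 10-bit assumption
```

Under these assumptions, $Y=121$ has a single correct input, whereas $Y=0$ and the bitcount outputs admit many, which matches the observation that several input values are acceptable in Fig. 13 (b).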

TABLE IV
Simulation time per sample (cycle) in invertible multipliers (factorizers).

| Bit width | Microcontroller [9] | SystemC (Proposed) |
| :---: | :---: | :---: |
| 4-bit | $100-300 \mathrm{~ms}$ | $5.3 \mu \mathrm{~s}$ |
| 10-bit | N/A | $26 \mu \mathrm{~s}$ |
| 32-bit | N/A | $774 \mu \mathrm{~s}$ |

Fig. 14 shows simulated waveforms of the adder-based invertible factorizer built on the architecture of Fig. 6 (a). For the small invertible-logic circuits simulated above, it is easy to reach the global minimum energy with many different noise parameters. In contrast, large invertible-logic circuits, such as invertible multipliers, require specific noise parameters for convergence. To converge to the global minimum, the grid search on MNR is first used to find the optimum control of $w_{rnd}$ in Eq. (10). The total number of cycles spent in the grid search to find the optimum noise parameters on MNR is $9.5 \times 10^{7}$. In the case of $(A \times B)=182$, $w_{rnd}$ is reduced with $\mathit{RND\_WEIGHT}=8$, $\mathit{RND\_STEP}=1$, $N_{s}=7$, and $T=128$. In contrast, in the case of $(A \times B)=598$, $w_{rnd}$ is reduced with $\mathit{RND\_WEIGHT}=16$, $\mathit{RND\_STEP}=1$, $N_{s}=15$, and $T=64$. Because the optimum noise parameters differ depending on the output $(A \times B)$, the convergence time is long even though our fast simulation environment is used. The evaluation of noise control is described in Section VII-D.
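Whether a factorizer output is correct can be checked independently by enumerating all input pairs that multiply to the fixed product. The sketch below does this for the products used in Figs. 11 and 14. The function name is hypothetical, and the 5-bit input width is an assumption chosen because 27 and 28 fit in five bits; the input width used for Fig. 14 is not stated here.

```python
def factor_pairs(product_value, input_bits):
    """All ordered pairs (a, b) of input_bits-wide unsigned integers with a * b == product_value."""
    limit = 1 << input_bits
    return [
        (a, b)
        for a in range(limit)
        for b in range(limit)
        if a * b == product_value
    ]


for value in (756, 182, 598):
    print(value, factor_pairs(value, input_bits=5))
```

For 756 the only pairs within five bits are (27, 28) and (28, 27), so the third trial in Fig. 11 lands on one of two acceptable answers. Products with more small factors, such as 182, admit additional pairs under the same width assumption.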

## C. Comparisons of simulation speed

Table IV summarizes the simulation time per sample (cycle) in a $2 \times 2$-bit invertible multiplier (factorizer). In the conventional work [9], the complicated device models of Eq. (2) are realized using software running on a microcontroller. The resulting sample time is slow, at $100-300 \mathrm{~ms}$. In such an environment, a noise-control optimization of $I_{0}$ for convergence to the global minimum requires a significantly long time.

In contrast, our SystemVerilog model using stochastic computing emulates the device model and is simulated in Verilator and SystemC-2.3.2. As a result, the cycle (sample) time is around $5.3 \mu \mathrm{~s}$, a reduction of around five orders of magnitude. Unlike the conventional work, larger invertible multipliers, such as 32-bit ones, can also be designed and simulated.
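The sample times quoted above can be compared on a logarithmic scale. The short sketch below uses only the 4-bit figures from Table IV; the constant and function names are hypothetical.

```python
import math

# Sample times for the 4-bit case, taken from Table IV (in seconds).
MICROCONTROLLER_SAMPLE_TIME_S = (100e-3, 300e-3)  # reported range for [9]
SYSTEMC_SAMPLE_TIME_S = 5.3e-6                    # proposed SystemC environment


def orders_of_magnitude(slow_s, fast_s):
    """Base-10 logarithm of the speed ratio between two sample times."""
    return math.log10(slow_s / fast_s)


low, high = (
    orders_of_magnitude(t, SYSTEMC_SAMPLE_TIME_S)
    for t in MICROCONTROLLER_SAMPLE_TIME_S
)
print(f"{low:.2f} to {high:.2f} orders of magnitude")
```

With these figures the ratio lies between $10^{4}$ and $10^{5}$, depending on which end of the microcontroller range is used.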

## D. Comparisons of noise control in invertible multipliers

Table V summarizes the total number of cycles and the simulation time of invertible multipliers with the different noise controls described in Section VI. The total number of cycles counts the cycles spent until noise parameters that yield convergence are obtained. Using the grid search, both the total number of cycles and the simulation time increase significantly as the bit width increases. As a result, in larger invertible logic circuits, it is hard to converge to the global minimum and hence to obtain correct values.

In contrast, using TPR, both the total number of cycles and the simulation time increase only modestly as the bit width increases. For the 16-bit invertible multiplier, the simulation is five orders of magnitude faster than with the grid search. The gap in simulation time can grow further as the bit width increases. Hence, TPR would be more effective for larger invertible logic circuits.

## E. FPGA implementation for prototyping

TABLE V
Simulation time of invertible multipliers with different noise controls in SystemC.

|  | Grid search |  | Tuning parameter repeat (TPR) |  |
| :---: | :---: | :---: | :---: | :---: |
|  | Total cycles | Simulation time [s] | Total cycles | Simulation time [s] |
| 4-bit $\times$ 4-bit (8-bit) |  | $2.7 \times 10^{3}$ | $5.9 \times 10^{2}$ | 27 |
| 5-bit $\times$ 5-bit (10-bit) | $9.5 \times 10^{7}$ | $5.2 \times 10^{3}$ | $1.9 \times 10^{3}$ | 65 |
| 6-bit $\times$ 6-bit (12-bit) |  | $7.8 \times 10^{3}$ | $8.9 \times 10^{3}$ | 69 |
| 7-bit $\times$ 7-bit (14-bit) | $7.6 \times 10^{8}$ | $9.4 \times 10^{4}$ | $4.2 \times 10^{4}$ | 120 |
| 8-bit $\times$ 8-bit (16-bit) | $1.2 \times 10^{10}$ | $1.7 \times 10^{6}$ | $2.9 \times 10^{5}$ | 180 |

TABLE VI
Synthesis results of invertible logic circuits in Digilent Genesys 2.

|  | LUT | FF | Number of nodes | Number of non-zero weights in Hamiltonian | Maximum number of connections per node |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 32-bit invertible adder | 9,583 | 2,133 | 128 | 632 | 8 |
| 64-bit invertible adder | 18,324 | 3,584 | 256 | 4,272 | 8 |
| 7-bit bitcount | 1,008 | 404 | 10 | 45 | 9 |
| 10-bit ReLU | 2,277 | 558 | 20 | 74 | 18 |
| 16-bit invertible multiplier (adder-based) | 15,430 | 2,901 | 756 | 1,440 | 16 |
| 32-bit invertible multiplier (adder-based) <br> 32-bit invertible multiplier (bitcount-based) | 62,969 | 9,393 | 3,072 | 6,288 | 32 |

As the SystemVerilog model is synthesizable, invertible logic circuits can be evaluated using an FPGA for prototyping. Table VI summarizes the synthesis results of invertible logic circuits in Xilinx Vivado 2016.4 for Digilent Genesys 2 with a clock frequency of 100 MHz. As the clock frequency is 100 MHz, the sample time is 10 ns, which significantly increases simulation speed in comparison with both the conventional work and the proposed SystemC environment summarized in Table IV. Note that generating bitstream files for the FPGA takes much longer than compiling to the SystemC binary files. Therefore, the SystemC-based environment is useful for small invertible circuits, while the FPGA environment is useful for large ones that require longer simulation time.
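The FPGA sample time follows directly from the clock frequency, assuming one sample per clock cycle as the text implies. The sketch below derives it and compares it with the SystemC sample times of Table IV; the names are hypothetical.

```python
def sample_time_s(clock_hz):
    """Sample time in seconds, assuming one sample per clock cycle."""
    return 1.0 / clock_hz


FPGA_CLOCK_HZ = 100e6  # 100 MHz, as stated for the Digilent Genesys 2 setup

# SystemC sample times from Table IV, in seconds.
SYSTEMC_SAMPLE_TIME_S = {"4-bit": 5.3e-6, "10-bit": 26e-6, "32-bit": 774e-6}

fpga_sample_time = sample_time_s(FPGA_CLOCK_HZ)
speedups = {
    width: t / fpga_sample_time for width, t in SYSTEMC_SAMPLE_TIME_S.items()
}

print(f"FPGA sample time: {fpga_sample_time * 1e9:.0f} ns")
for width, ratio in speedups.items():
    print(f"{width}: about {ratio:.0f}x faster per sample than SystemC")
```

The per-sample advantage grows with circuit size in this comparison because the SystemC sample time increases with bit width while the FPGA sample time is fixed by the clock. Bitstream generation time is not included.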

Regarding hardware resources, the numbers of LUTs and FFs are generally large when the numbers of nodes and non-zero weights in the Hamiltonian are large. Note that the number of non-zero weights is obtained from $h$ and $J$ of the Hamiltonians. When the adder-based and bitcount-based invertible multipliers are compared, the number of nodes in the bitcount-based structure is significantly smaller than that of the adder-based structure, as described in Fig. 6. In contrast, the number of non-zero weights is larger because of the denser matrix $J$. As a result, the bitcount-based invertible multiplier reduces LUTs by $7 \%$ and FFs by $38 \%$ in comparison with the adder-based one.

## VIII. Conclusion

In this paper, we have presented a design framework for large-scale invertible logic. The Hamiltonian library created using linear programming provides the minimum number of nodes in Hamiltonians for basic functions, where the library includes Boolean logic, adders, bitcount, and ReLU functions. As a design example of a large invertible logic circuit, the Hamiltonians for invertible multipliers (factorizers) are constructed using the library, resulting in more than $80 \%$ reduction in the number of nodes in comparison with the conventional structure. For fast simulations, the probabilistic device model used for invertible logic is approximated using stochastic computing in SystemVerilog running with the compiled SystemC binary, providing a reduction of almost five orders of magnitude in simulation time in comparison with the conventional environment. In our fast simulation environment, the tuning-parameter repeat method is introduced as noise-control optimization, reducing the convergence time by five orders of magnitude in comparison with the grid search method.

Invertible logic was presented in 2017 to demonstrate integer factorization [1] and has since been studied in several aspects, such as scalability, applications, and implementations. Scalability has been studied and discussed in [32]; however, optimization algorithms for specific problems (e.g. graph coloring) have not been studied and remain future work. One possible application is machine learning, especially training neural networks using the bidirectional operations of invertible logic [6]. As a future prospect, our design framework would be useful as a design and test tool for implementing invertible logic with probabilistic magnetoresistive devices. In addition, larger invertible logic circuits using stochastic computing with standard CMOS devices could be designed for several applications.

## Acknowledgment

This work was supported by Brainware LSI project of MEXT, Japan, JST PRESTO Grant Number JPMJPR18M5, and CANON MEDICAL SYSTEMS CORPORATION.

## References

[1] K. Camsari, R. Faria, B. Sutton, and S. Datta, "Stochastic p-bits for invertible logic," Physical Review X, vol. 7, July 2017.
[2] G. E. Hinton, T. J. Sejnowski, and D. H. Ackley, "Boltzmann machines: Constraint satisfaction networks that learn," Department of Computer Science, Carnegie-Mellon University, Tech. Rep. CMU-CS-84-119, 1984.
[3] P. Debashis, R. Faria, K. Y. Camsari, J. Appenzeller, S. Datta, and Z. Chen, "Experimental demonstration of nanomagnet networks as hardware for Ising computing," in 2016 IEEE Int. Electron Devices Meeting (IEDM), Dec 2016, pp. 34.3.1-34.3.4.
[4] J. V. Monaco and M. M. Vindiola, "Factoring integers with a brain-inspired computer," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 3, pp. 1051-1062, March 2018.
[5] Z. Zhou et al., "Energy-efficient optimization for concurrent compositions of WSN services," IEEE Access, vol. 5, pp. 19994-20008, 2017.
[6] N. Onizawa, S. C. Smithson, B. H. Meyer, W. J. Gross, and T. Hanyu, "In-hardware training chip based on CMOS invertible logic for machine learning," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 5, pp. 1541-1550, May 2020.
[7] J. D. Biamonte, "Non-perturbative k-body to two-body commuting conversion Hamiltonians and embedding problem instances into Ising spins," Physical Review A, vol. 77, p. 052331, 2008.
[8] J. D. Whitfield, M. Faccin, and J. D. Biamonte, "Ground-state spin logic," Europhysics Letters, vol. 99, no. 5, p. 57004, 2012.
[9] A. Zeeshan Pervaiz, L. Anirudh Ghantasala, K. Camsari, and S. Datta, "Hardware emulation of stochastic p-bits for invertible logic," Scientific Reports, vol. 7, Sept. 2017.
[10] B. R. Gaines, "Stochastic computing systems," Adv. Inf. Syst. Sci. Plenum, vol. 2, no. 2, pp. 37-172, 1969.
[11] B. Brown and H. Card, "Stochastic neural computation I: Computational elements," IEEE Trans. Comput., vol. 50, no. 9, pp. 891-905, Sept. 2001.
[12] V. C. Gaudet and W. J. Gross, Stochastic Computing: Techniques and Applications. Springer International Publishing, 2019.
[13] V. C. Gaudet and A. C. Rapley, "Iterative decoding using stochastic computation," Electronics Letters, vol. 39, no. 3, pp. 299-301, Feb. 2003.
[14] S. S. Tehrani, W. J. Gross, and S. Mannor, "Stochastic decoding of LDPC codes," IEEE Communications Letters, vol. 10, no. 10, pp. 716-718, Oct. 2006.
[15] S. S. Tehrani, S. Mannor, and W. J. Gross, "Fully parallel stochastic LDPC decoders," IEEE Trans. on Signal Processing, vol. 56, no. 11, pp. 5692-5703, Nov. 2008.
[16] S. S. Tehrani, A. Naderi, G. A. Kamendje, S. Hemati, S. Mannor, and W. J. Gross, "Majority-based tracking forecast memories for stochastic LDPC decoding," IEEE Trans. on Signal Processing, vol. 58, no. 9, pp. 4883-4896, Sep. 2010.
[17] A. Alaghi, C. Li, and J. P. Hayes, "Stochastic circuits for real-time image-processing applications," in 50th DAC, May 2013, pp. 1-6.
[18] P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. D. Riedel, "Computation on stochastic bit streams digital image processing case studies," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 449-462, Mar. 2014.
[19] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, "VLSI implementation of deep neural network using integral stochastic computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2588-2599, Oct. 2017.
[20] M. Saeedi and I. L. Markov, "Synthesis and optimization of reversible circuits - A survey," CoRR, vol. abs/1110.2574, 2011. [Online]. Available: http://arxiv.org/abs/1110.2574
[21] A. Zulehner and R. Wille, "One-pass design of reversible circuits: Combining embedding and synthesis for reversible logic," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 5, pp. 996-1008, May 2018.
[22] A. Zulehner and R. Wille, "Make it reversible: Efficient embedding of non-reversible functions," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, March 2017, pp. 458-463.
[23] W. Snyder, P. Wasson, and D. Galbi, "Verilator: Convert Verilog code to C++/SystemC," https://www.veripool.org/wiki/verilator, 2012.
[24] S. Mitchell, S. M. Consulting, and I. Dunning, "PuLP: A linear programming toolkit for Python," 2011.
[25] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 3123-3131.
[26] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://arxiv.org/abs/1603.05279
[27] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, July 2015.
[28] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Transactions on Electronic Computers, vol. EC-13, no. 1, pp. 14-17, Feb 1964.
[29] S. Vigna, "Further scramblings of Marsaglia's xorshift generators," Journal of Computational and Applied Mathematics, vol. 315, no. Supplement C, pp. 175-181, 2017.
[30] K. Nishino, S. C. Smithson, N. Onizawa, B. H. Meyer, W. J. Gross, H. Yamagata, H. Fujita, and T. Hanyu, "Study of stochastic invertible multiplier designs," in 2018 25th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Dec 2018, pp. 649-650.
[31] S. C. Smithson, N. Onizawa, B. H. Meyer, W. J. Gross, and T. Hanyu, "Efficient CMOS invertible logic using stochastic computing," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 6, pp. 2263-2274, June 2019.
[32] K. Y. Camsari, S. Chowdhury, and S. Datta, "Scalable emulation of sign-problem-free Hamiltonians with room temperature p-bits," CoRR, vol. abs/1810.07144, 2019. [Online]. Available: http://arxiv.org/abs/1810.07144
[33] A. Z. Pervaiz, B. M. Sutton, L. A. Ghantasala, and K. Y. Camsari, "Weighted $p$-bits for FPGA implementation of probabilistic circuits," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 6, pp. 1920-1926, June 2019.


Naoya Onizawa (M'09) received the B.E., M.E. and D.E. degrees in Electrical and Communication Engineering from Tohoku University, Japan, in 2004, 2006 and 2009, respectively. He is currently an Assistant Professor in Research Institute of Electrical Communication at Tohoku University, and a JST PRESTO researcher, Japan. He was a postdoctoral fellow at University of Waterloo, Canada in 2011 and at McGill University, Canada from 2011 to 2013. In 2015, he was a Visiting Associate Professor at University of Southern Brittany, France. His main interests and activities are in the energy-efficient VLSI design based on asynchronous circuits and probabilistic computation, and their applications, such as brain-like computers.
He received the Best Paper Award in 2010 IEEE ISVLSI, the Best Paper Finalist in 2014 IEEE ASYNC, Kenneth C. Smith Early Career Award for Microelectronics Research in 2016 IEEE ISMVL, and the MEXT Young Scientists' Prize, Japan in 2020.


Kaito Nishino received the B.E. degree in Electrical and Electronic Engineering from Tohoku University, Japan, in 2017. He received the M.E. degree in communication engineering from Tohoku University, Japan, in 2019. His research interests include the design of invertible logic circuits and their application to stochastic-computing LSI systems.


Sean C. Smithson (S'12) was born in Calgary, AB, Canada, in 1983. He received the B.Eng. (with distinction) and M.A.Sc. degrees in electrical engineering from Concordia University, Montreal, QC, Canada, in 2009 and 2014, respectively. He is currently pursuing the Ph.D. degree with McGill University, Montreal, QC, Canada.


Brett H. Meyer (SM'18) received the BS degree in electrical engineering, computer science and math from the University of Wisconsin-Madison, in 2003, and the MS and PhD degrees in electrical and computer engineering from Carnegie Mellon University, in 2005 and 2009, respectively. He is a Chwang-Seto faculty scholar and assistant professor in the Department of ECE, McGill University. After receiving the PhD, he worked as a post-doctoral research associate in the Computer Science Department, University of Virginia. He has been on the faculty at McGill since 2011. His research interests are focused on the design and architecture of resilient multiprocessor computer systems. He is a Senior Member of the IEEE.


Warren J. Gross (SM'10) received the B.A.Sc. degree in electrical engineering from the University of Waterloo, Waterloo, Ontario, Canada, in 1996, and the M.A.Sc. and Ph.D. degrees from the University of Toronto, Toronto, Ontario, Canada, in 1999 and 2003, respectively. Currently, he is a Professor, Louis-Ho Faculty Scholar in Technological Innovation, and Chair of the Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada. His research interests are in the design and implementation of signal processing systems and custom computer architectures.

Dr. Gross served as Chair of the IEEE Signal Processing Society Technical Committee on Design and Implementation of Signal Processing Systems. He has served as General Co-Chair of IEEE GlobalSIP 2017 and IEEE SiPS 2017, and as Technical Program Co-Chair of SiPS 2012. He has also served as organizer for the Workshop on Polar Coding in Wireless Communications at WCNC 2017, the Symposium on Data Flow Algorithms and Architecture for Signal Processing Systems (GlobalSIP 2014) and the IEEE ICC 2012 Workshop on Emerging Data Storage Technologies. Dr. Gross served as Associate Editor for the IEEE Transactions on Signal Processing and as a Senior Area Editor. Dr. Gross is a Senior Member of the IEEE and a licensed Professional Engineer in the Province of Ontario.


Hitoshi Yamagata received the B.E. and M.S. degrees from the Department of Electronic Engineering, Tohoku University, Sendai, Japan, in 1978 and 1980, respectively, and the Ph.D. degree from the Department of Information Engineering, Tohoku University, in 1983. He joined the medical systems division of Toshiba in 1983 and developed a full-digital X-ray system from 1984. He was a postdoctoral research fellow at the University of California, San Francisco and Davis, from 1987 to 1990. From 1991, he was engaged in research and development of a 1.5T MRI system and a 3D ultrasound system as a system manager. From 2003, he managed software development of clinical applications for medical diagnostic imaging systems including CT, MRI, and ultrasound. He had been a Senior Fellow since 2009. Since CANON acquired the division in 2016, he has been a Fellow at CANON Medical Systems Corporation (CMSC) and is currently a member of the Advanced Research Laboratory at CMSC.


Hiroyuki Fujita has been Director of the Advanced Research Laboratory of Canon Medical Systems Corporation since 2017. He is also a Professor at Tokyo City University and Professor Emeritus of The University of Tokyo, where he served as a Professor in the Institute of Industrial Science for over 38 years. He received the B.S., M.S. and Ph.D. degrees in Electrical Engineering from The University of Tokyo in 1975, 1977 and 1980, respectively. He is currently engaged in the investigation of MEMS/NEMS and applications to bio/nano technology and IoT.

He has published more than 300 academic papers and was a visiting professor in MIT and UC Berkeley. He received many awards including l'Ordre des Palmes Academiques from Government of France, Docteur Honoris Causa from Ecole Normale Superieure de Cachan, The Prize for Science and Technology -Research Category from Ministry of Education, Culture, Sports, Science and Technology, Outstanding Achievement Award from The Institute of Electrical Engineers of Japan and IEEE Robert Bosch Award for MEMS.


Takahiro Hanyu (SM'12) received the B.E., M.E. and D.E. degrees in Electronic Engineering from Tohoku University, Sendai, Japan, in 1984, 1986 and 1989, respectively. He is currently a Professor and Vice Director in the Research Institute of Electrical Communication, Tohoku University. His general research interests include nonvolatile logic circuits and their applications to ultra-low-power and/or highly dependable VLSI processors, and post-binary computing and its application to brain-inspired VLSI systems.
He received the Sakai Memorial Award from the Information Processing Society of Japan in 2000, the Judge's Special Award at the 9th LSI Design of the Year from the Semiconductor Industry News of Japan in 2002, the Special Feature Award at the University LSI Design Contest from ASPDAC in 2007, the APEX Paper Award of Japan Society of Applied Physics in 2009, the Excellent Paper Award of IEICE, Japan, in 2010, Ichimura Academic Award in 2010, the Best Paper Award of IEEE ISVLSI 2010, the Paper Award of SSDM 2012, the Best Paper Finalist of IEEE ASYNC 2014, and the Commendation for Science and Technology by MEXT, Japan in 2015. Dr. Hanyu is a Senior Member of the IEEE.


[^0]:    Manuscript received xxx xx, xxxx; revised xxx xx, xxxx.
    N. Onizawa, K. Nishino, and T. Hanyu are with the Research Institute of Electrical Communication, Tohoku University, Sendai, Miyagi, Japan 980-8577 (e-mail: naoya.onizawa.a7@tohoku.ac.jp, hanyu@riec.tohoku.ac.jp)
    S. C. Smithson, B. H. Meyer, and W. J. Gross are with Department of Electrical and Computer Engineering, McGill University, Montreal, QC, Canada H3A 0E9 (e-mail: sean.smithson@mail.mcgill.ca, brett.meyer@mcgill.ca, warren.gross@mcgill.ca)
    H. Yamagata and H. Fujita are with Advanced Research Laboratory, Canon Medical Systems Corp., Japan (e-mail: hitoshi.yamagata@medical.canon, hiroyuki12.fujita@medical.canon)

    Copyright (c) 20xx IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

