Abstract. A time-domain analog weighted-sum calculation model based on a pulse-width modulation (PWM) approach is proposed. The proposed calculation model can be applied to any type of network structure, including multi-layer feedforward networks. We also propose very large-scale integrated (VLSI) circuits to implement the proposed model. Unlike the conventional analog voltage- or current-mode circuits used in computing-in-memory circuits, our time-domain analog circuits use transient operation in the charging/discharging processes of capacitors. Since the circuits can be designed without operational amplifiers, they can be operated with extremely low power consumption. However, they have to use very high-resistance devices, on the order of giga-ohms. We designed a CMOS VLSI chip to verify weighted-sum operation based on the proposed model with binary weights, which realizes the BinaryConnect model. In the chip, static-random-access memory (SRAM) cells are used for synaptic connection weights. Unlike in ordinary computing-in-memory circuits, high-resistance operation was realized by using the subthreshold operation region of MOS transistors. The chip was designed and fabricated using a 250-nm fabrication technology. Measurement results showed that the energy efficiency for the weighted-sum calculation was 300 TOPS/W (Tera-Operations Per Second per Watt), which is more than one order of magnitude higher than that of state-of-the-art digital AI processors, even though the minimum width of interconnection used in this chip was several times larger than that in such digital processors. If state-of-the-art VLSI technology is used to implement the proposed model, an energy efficiency of more than 1,000 TOPS/W will be possible. For practical applications, development of emerging analog memory devices such as ferroelectric-gate field effect transistors (FeFETs) is necessary.
Introduction
Artificial neural networks (ANNs), such as convolutional deep neural networks (CNNs) [12] and multi-layer perceptrons (MLPs) [3], have shown excellent performance on various tasks including image recognition [3, 11, 5, 27, 13]. However, computation in ANNs is very heavy, which leads to high power consumption in current digital computers and even in highly parallel coprocessors such as graphics processing units (GPUs). In order to implement ANNs at edge devices such as mobile phones and personal service robots, operation at very low power consumption is required.
In ANN models, weighted summation, or multiply-and-accumulate (MAC) operation, is an essential and heavy calculation task, and dedicated complementary metal-oxide-semiconductor (CMOS) very-large-scale integration (VLSI) processors have been developed to accomplish it [26, 20, 25, 10, 2]. As an implementation approach other than digital processors, the use of analog operation in CMOS VLSI circuits is a promising method for achieving extremely low power consumption for such calculation tasks [6, 14, 19, 17]. In particular, computing-in-memory approaches, which achieve weighted-sum calculation by utilizing static-random-access memory (SRAM) circuits, have been popular since around 2016 [18].
Although the calculation precision is limited due to the non-idealities of analog operation such as noise and device mismatches, neural network models and circuits can be designed to be robust to such non-idealities [21, 9, 7]. On the other hand, ANN models with binarized weights or even with binarized inputs have been proposed and their comparable performance has been demonstrated, mainly in applications of image recognition [4, 8]. These models facilitate the development of energy-efficient hardware implementations [19].
The time-domain analog weighted-sum calculation model was originally proposed based on mathematical spiking neuron models inspired by biological neuron behavior [15, 16]. We have simplified this calculation model under the assumption of operation in analog circuits with transient states, and call its VLSI implementation approach "Time-domain Analog Computing with Transient states (TACT)." In contrast to conventional weighted-sum operation in analog voltage or current modes, the TACT approach is suitable for operation with much lower power consumption in the CMOS VLSI implementation of ANNs.
We have already proposed a device and circuit that performs time-domain weighted-sum calculation [23, 28, 22]. The proposed circuit consists of multiple input resistive elements and a capacitor (RC circuit), which can achieve extremely low-power operation. The energy consumption could be lowered to the order of 1 fJ per operation, which is almost comparable to the calculation efficiency in the brain, as long as weighted-sum operation is considered. We also proposed a circuit architecture to implement a weighted-sum calculation with different-signed weights with two sets of RC circuits, one of which calculates positively weighted sums while the other calculates negatively weighted sums [29, 30]. Using a similar time-domain approach, a vector-by-matrix multiplier using flash memory technology was proposed [1].

Fig. 1. Weighted-sum calculation using current sources switched with PWM signals.
Weighted-sum calculation circuits using pulse-width modulation (PWM) signals have previously been proposed [24]. In this paper, we reformulate the weighted-sum calculation model based on the time-domain analog computing approach using PWM signals, called the TACT-PWM approach, and propose its application to ANNs such as MLPs and CNNs with extremely high computing energy efficiency. We also show the design and measurement results of an ANN VLSI chip fabricated using a 250-nm CMOS VLSI technology. The calculation results obtained with the proposed model are compared with ordinary numerical calculation results, and the very high computing efficiency of the chip is verified.
Time-domain weighted-sum calculation circuit model with PWM signals
The basic circuit configuration based on the TACT-PWM approach is shown in Fig. 1. Corresponding to input signals $S_i \in \{0, 1\}$ in the voltage domain, each switched-current source (SCS) outputs current $I_i$ when $S_i = 1$. An SCS can be replaced by a resistor and a diode if the nonlinearity in charging characteristics can be ignored. The total charge amount $Q$ stored at the node of capacitor $C$ charged by $N$ SCSs with inputs $S_i$, each of which has a pulse width of $W_i$, is expressed by

$$Q = \sum_{i=1}^{N} I_i W_i,$$
where $Q$ can be considered as the weighted-sum calculation result with weight $I_i$ and input $W_i$. The node voltage of $C$, $V_c$, is given by $V_c = Q/C$. If $I_i \ge 0$, the energy consumption $E$ of this charging and discharging process is given by $E = C V_c V_{dd}$ ($V_{dd}$ is the supply voltage of the SCSs), where the energy for charging the input capacitance of the SCSs is not included. The weighted-sum calculation circuit and a timing diagram of its operation are shown in Fig. 2. Here, we consider this operation as a weighted-sum calculation with same-signed weights. The circuit consists of a weighted-sum calculation or MAC part and a voltage-pulse conversion (VPC) part. The MAC part consists of SCSs corresponding to the inputs, and is accompanied by parasitic wiring capacitance $C_d$. The VPC part consists of an SCS, two switches, and a comparator with an input capacitance $C_n$. Since the parasitic capacitances $C_d$ and $C_n$ are inevitably included in the circuit, the charged capacitance $C$, which is equal to $C_d + C_n$, should be as small as possible to minimize the energy consumption of the operation.
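The relations stated above (charge as a weighted sum of currents and pulse widths, $V_c = Q/C$, and $E = C V_c V_{dd}$) can be illustrated with a minimal numerical sketch. All numeric values below (nano-ampere currents, microsecond pulse widths, a 100 fF capacitance, a 1 V supply) and the function names are illustrative assumptions, not parameters of the fabricated chip.

```python
def weighted_sum_charge(currents, pulse_widths):
    """Total charge Q = sum_i I_i * W_i (amperes * seconds = coulombs)."""
    if len(currents) != len(pulse_widths):
        raise ValueError("currents and pulse_widths must have the same length")
    return sum(i * w for i, w in zip(currents, pulse_widths))


def node_voltage(charge, capacitance):
    """Node voltage V_c = Q / C."""
    return charge / capacitance


def charging_energy(capacitance, v_c, v_dd):
    """Energy E = C * V_c * V_dd drawn from the supply (valid for I_i >= 0)."""
    return capacitance * v_c * v_dd


# Hypothetical example values (assumptions for illustration only).
currents = [1e-9, 2e-9, 3e-9]          # weights I_i in amperes
pulse_widths = [1e-6, 0.5e-6, 2e-6]    # inputs W_i in seconds
C = 100e-15                            # charged capacitance C = C_d + C_n in farads
V_DD = 1.0                             # supply voltage in volts

Q = weighted_sum_charge(currents, pulse_widths)   # about 8 fC
V_c = node_voltage(Q, C)                          # about 0.08 V
E = charging_energy(C, V_c, V_DD)                 # about 8 fJ

print(f"Q = {Q:.3e} C, V_c = {V_c:.3f} V, E = {E:.3e} J")
```

Because $E = C V_c V_{dd} = Q V_{dd}$, the energy in this sketch depends only on the accumulated charge and the supply voltage, while the capacitance sets the voltage swing produced by a given weighted sum.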
The PWM inputs are given in the input period $T_{in}$, which is arbitrarily determined, such that $W_i \le T_{in}$ for all $i$. If the node voltage $V_c$ at the end of this input period is denoted by $V_{mac}$,

$$V_{mac} = \frac{1}{C} \sum_{i=1}^{N} I_i W_i.$$
In the VPC part, the output PWM signal $S_{out}$ with pulse width $W_{out}$ is generated during the output period $T_{out}$. In this operation, capacitance $C$ is charged up by the SCS with current $I_n$. To minimize the energy consumption in this operation, the VPC part can be separated from the MAC part by $S_n$, and only $C_n$ can be charged up to the threshold voltage $V_\theta$ of the comparator. In this case, to meet the condition that $0 \le W_{out} \le T_{out}$, the current $I_n$ is given by

$$I_n = \frac{C_n V_\theta}{T_{out}},$$
which means that the node voltage $V_n$ increases with a slope of $V_\theta / T_{out}$. When $V_n > V_\theta$, the comparator output $S_{out} = 1$, and after the end of the output period $V_n$ is reset by $S_{rst}$ to the resting state, which is usually zero. Thus, the pulse width of the output signal as a result of weighted-sum calculation is given by

$$W_{out} = \frac{T_{out}}{V_\theta} V_{mac} = \frac{T_{out}}{C V_\theta} \sum_{i=1}^{N} I_i W_i,$$
where it is assumed that 0
If the same input line structures are used regarding the positive and negative weights, the denominator of Eq. (5) is common. Thus, positive and negative weighted calculations are performed separately in different lines, and by subtracting $W_{out}$ for negative weighting from that for positive weighting, the total calculation result is obtained as follows:

$$W_{out}^{+} - W_{out}^{-},$$
where $W_{out}^{\pm}$ are the pulse widths of output signals with positive and negative weighting, respectively. Since the obtained result can be fed into the next circuit, corresponding to the next layer of the network, via a nonlinear transform operation, calculations for ANNs can be achieved. The total energy consumption for the MAC calculation is expressed as follows:
where $E_{mac}$ and $E_{vpc}$ are the energy consumptions of the MAC and VPC parts, $E_i$ and $E_n$ are those for the switching of the SCS at each MAC part $i$ and for the switching of the SCS at the VPC part, respectively, and $P_{cmp}(t)$ is the power consumption of the comparator.
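The voltage-pulse conversion and the signed subtraction described in this section can be sketched numerically as follows. The sketch assumes, following the description above, that the node voltage ramps linearly from $V_{mac}$ with slope $V_\theta / T_{out}$ and that the output pulse lasts from the threshold crossing until the end of the output period, which gives a pulse width of $T_{out} V_{mac} / V_\theta$. The function names and numeric values are hypothetical.

```python
def output_pulse_width(v_mac, v_theta, t_out):
    """Pulse width produced by the VPC part for a given V_mac.

    Assumes a linear ramp V_n(t) = V_mac + (V_theta / T_out) * t and an output
    pulse that starts at the threshold crossing and ends with the output period.
    """
    if not 0.0 <= v_mac <= v_theta:
        raise ValueError("V_mac must lie in [0, V_theta]")
    t_cross = t_out * (v_theta - v_mac) / v_theta   # time at which V_n reaches V_theta
    return t_out - t_cross                          # equals T_out * V_mac / V_theta


def signed_weighted_sum(v_mac_pos, v_mac_neg, v_theta, t_out):
    """Difference between the positive-line and negative-line pulse widths."""
    w_pos = output_pulse_width(v_mac_pos, v_theta, t_out)
    w_neg = output_pulse_width(v_mac_neg, v_theta, t_out)
    return w_pos - w_neg


# Hypothetical example values (assumptions for illustration only).
V_THETA = 0.5     # comparator threshold in volts
T_OUT = 2e-6      # output period in seconds

w_plus = output_pulse_width(0.2, V_THETA, T_OUT)        # about 0.8 us
w_minus = output_pulse_width(0.05, V_THETA, T_OUT)      # about 0.2 us
w_total = signed_weighted_sum(0.2, 0.05, V_THETA, T_OUT)  # about 0.6 us

print(f"W+ = {w_plus:.2e} s, W- = {w_minus:.2e} s, W+ - W- = {w_total:.2e} s")
```

Because both lines share the same $T_{out}$ and $V_\theta$ in this sketch, the difference of the two pulse widths is proportional to the difference of the two accumulated voltages, which is what allows the signed sum to be formed in the time domain.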
CMOS BinaryConnect network circuit based on TACT-PWM approach
On the basis of our TACT-PWM circuit approach, a CMOS circuit using an SRAM cell array structure is shown in Fig. 3(a). This circuit implements a BinaryConnect neural network, which uses analog input values while weights are binary [4]. This circuit consists of a synapse part and a neuron part. The synapse part consists of an SRAM cell array, and each synapse circuit operates as two MAC circuits. Unlike the ordinary SRAM circuits proposed in the concept of computing-in-memory, our SRAM cell circuit outputs a very low current on the order of nano-amperes to guarantee the time constant in the TACT approach [29, 30], and therefore the p-type MOS field effect transistors (pMOSFETs) $M^{\pm}$ supply subthreshold currents to dendrite lines $D^{\pm}$ based on the input from axon lines $A_i$, where axon and dendrite are neuroscientific terms for parts of the biological neuron.
In the neuron part, two VPC circuits perform positive and negative weighting calculations, respectively, and the subtraction result is fed into a rectified-linear-unit (ReLU) function circuit. A detailed explanation follows.
Synapse part
In the synapse part, each SRAM cell shown in Fig. 3(b), which is called here a binary synapse unit (BSU), performs binary weighting when it receives an input pulse $S_i$ as the gate voltage of the pMOSFET $M^{\pm}$, which makes the pMOSFET operate in the subthreshold region. To perform this operation, it is necessary that the SRAM cell be set to a 0 or 1 state based on the training result in a BinaryConnect network.
The BSU has three functions: one-bit memory, a switched current source, and a selector. The one-bit memory function is achieved by the flip-flop, which stores the binary weight $w_i \in \{+1, -1\}$ by setting voltages $V_P^{+}$ and $V_P^{-}$, as follows:
where $V_{dd}$ is the supply voltage. The switched current source with a selector is realized by pMOSFETs $M^{\pm}$ that are connected to dendrite lines $D^{\pm}$, respectively. Since pMOSFETs $M^{\pm}$ operate in the subthreshold region, their drain currents $I_i^{\pm}$ are expressed as follows:
Neuron part
In the neuron circuit, dendrite lines are initialized and reset to ground level by $S_{rst}$ before signals $S_i$ are input to the synapse part. Next, input PWM signals are given during the input time period $T_{in}$, and capacitances $C_{di}$ and $C_n$ are charged. Then, the dendrite lines are separated from the neuron parts by $S_n$. At the same time, the current source $I_n$ is connected to capacitance $C_n$, and thus $C_n$ is charged. When the node voltage of $C_n$, $V_n^{\pm}$, reaches the threshold voltage of the comparator, the output signal $S_{out}^{\pm}$ is generated. The pair of output signals $S_{out}^{\pm}$ is fed into the ReLU function circuit, which simply consists of logic circuits, as shown in Fig. 3(c), and the output PWM signal is only generated when $W_{out}^{+} > W_{out}^{-}$, as shown in Fig. 3(d).
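One way to picture the ReLU behavior in the time domain is the discrete-time sketch below. It assumes that both comparator pulses end together at the end of the output period, so that gating the positive pulse with the inverse of the negative pulse leaves a pulse of width $W_{out}^{+} - W_{out}^{-}$ when that difference is positive and no pulse otherwise. The specific gate (AND with an inverted input) is an assumption made for illustration; the actual logic circuit is the one shown in Fig. 3(c).

```python
def output_pulse(width_steps, period_steps):
    """Boolean waveform that is high for the last `width_steps` samples.

    Models a comparator output that rises at the threshold crossing and stays
    high until the end of the output period (an assumption of this sketch).
    """
    if not 0 <= width_steps <= period_steps:
        raise ValueError("pulse width must lie within the output period")
    return [t >= period_steps - width_steps for t in range(period_steps)]


def relu_gate(s_pos, s_neg):
    """Hypothetical ReLU logic: high only while S+ is high and S- is low."""
    return [p and not n for p, n in zip(s_pos, s_neg)]


def pulse_width(signal):
    """Number of samples during which the signal is high."""
    return sum(signal)


PERIOD = 200  # output period in time steps (illustrative)

# Case 1: W+ > W-, so a pulse of width W+ - W- is generated.
width_on = pulse_width(relu_gate(output_pulse(80, PERIOD), output_pulse(30, PERIOD)))

# Case 2: W+ < W-, so no output pulse is generated.
width_off = pulse_width(relu_gate(output_pulse(30, PERIOD), output_pulse(80, PERIOD)))

print(width_on, width_off)
```

Under these assumptions the gate output width equals max(0, W+ - W-) in time steps, which matches the statement that the output PWM signal is only generated when the positive pulse is wider than the negative one.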
VLSI chip design and measurement results
Using TSMC 250 nm CMOS technology, we designed and fabricated a CMOS VLSI chip of our neural network circuit with ten neurons, each of which has 100 synapses. The layout results and microphotographs are shown in Fig. 4. Measurement results of the input-output relationship in weighted-sum calculation operations at one neuron with 100 synapses are shown in Fig. 5. As shown in Fig. 5(a), weighted-sum operation was approximately achieved and sufficient linearity was obtained. From Fig. 5(b), the deviations in the time domain are ±20 ns, which means that the precision of the calculation is about ±1 %, given that the maximum pulse width is 2 µs. However, an offset and scattering of weighting are clearly observed in Fig. 5(a). These nonidealities are due to variations in the threshold voltages of MOSFETs operating in the subthreshold region in BSUs. Such variations can be compensated for by adjusting the threshold voltages if analog memory devices such as ferroelectric-gate FETs are used in BSUs.
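The precision figure quoted above follows from simple arithmetic, which the short sketch below reproduces. The conversion to a number of distinguishable levels and an equivalent bit count is an added interpretation (it assumes each level occupies a band of twice the deviation) and is not a figure reported from the measurements.

```python
import math


def relative_precision(deviation_s, max_width_s):
    """Timing deviation expressed as a fraction of the full-scale pulse width."""
    return deviation_s / max_width_s


def distinguishable_levels(deviation_s, max_width_s):
    """Assumed level count: full scale divided by a band of +/- deviation."""
    return max_width_s / (2.0 * deviation_s)


DEVIATION = 20e-9   # +/-20 ns deviation read from Fig. 5(b)
MAX_WIDTH = 2e-6    # 2 us maximum pulse width

precision = relative_precision(DEVIATION, MAX_WIDTH)    # about 0.01, i.e. +/-1 %
levels = distinguishable_levels(DEVIATION, MAX_WIDTH)   # about 50 levels
bits = math.log2(levels)                                # between 5 and 6 bits

print(f"precision = {precision:.2%}, levels = {levels:.0f}, bits = {bits:.2f}")
```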
Measurement results of the output pulse width as a function of weighted-sum calculation results followed by the ReLU function in one neuron with 100 synapses are shown in Fig. 6. The average error was 1.5 %, and the maximum error was about 8 %. This error can be decreased by adjusting the deviations of the threshold voltages of MOSFETs operating in the subthreshold region.
The measurement conditions and results for the power efficiency of the fabricated VLSI chip are shown in Table 1. The power efficiency obtained from the measurement was 300 TOPS/W (Tera-Operations Per Second per Watt), which is about 30 times higher than that of state-of-the-art digital AI processors, even though the minimum feature size of the VLSI fabrication technology used was around 10 times larger than that of the digital AI processors. Therefore, if we used the same VLSI fabrication technology as in the digital AI processors, we could obtain a power efficiency of more than 1,000 TOPS/W, or 1 POPS/W (Peta-OPS/W).
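Since 1 TOPS/W corresponds to $10^{12}$ operations per joule, the efficiency figures above can be converted into energy per operation with the small sketch below. The function name is hypothetical, and how an "operation" is counted follows the measurement conditions in Table 1, which are not reproduced here.

```python
def energy_per_operation_joules(tops_per_watt):
    """Energy per operation in joules for an efficiency given in TOPS/W.

    1 TOPS/W = 1e12 operations per second per watt = 1e12 operations per joule.
    """
    if tops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return 1.0 / (tops_per_watt * 1e12)


measured = energy_per_operation_joules(300.0)     # about 3.3 fJ per operation
projected = energy_per_operation_joules(1000.0)   # about 1 fJ per operation

print(f"300 TOPS/W  -> {measured * 1e15:.2f} fJ/op")
print(f"1000 TOPS/W -> {projected * 1e15:.2f} fJ/op")
```

The projected value of roughly 1 fJ per operation at 1,000 TOPS/W is consistent with the order of 1 fJ per operation mentioned in the introduction for the RC-circuit approach.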
Conclusions
In this paper, we proposed a time-domain weighted-sum calculation model based on the TACT-PWM approach with a ReLU activation function. We also proposed VLSI circuits based on the TACT approach to implement the calculation model with extremely low energy consumption. A high energy efficiency of 300 TOPS/W was achieved by the fabricated CMOS VLSI circuit with binary weights using 250-nm CMOS VLSI technology. If we use a more advanced VLSI fabrication technology, which achieves lower parasitic capacitance, the energy efficiency will be improved much further, to over 1,000 TOPS/W. However, the fabricated circuit had insufficient calculation precision, which is mainly due to the characteristic variations of subthreshold operation in MOSFETs. To improve the calculation precision and compensate for such variations, it is necessary to introduce analog memory devices.
As for the neuron parts, the measurement results of the fabricated VLSI chip suggest that the energy consumption of this part is comparable to that of the whole synapse part with 100 inputs. Therefore, it is also necessary to redesign a comparator circuit with much lower power consumption to improve the energy efficiency of the whole calculation circuit. 
