a) warren.gross@mcgill.ca b) nonizawa@m.tohoku.ac.jp
Received February 13, 2018; Revised June 11, 2018; Published October 1, 2018
Abstract: This paper reviews applications of stochastic computing in brainware LSI (BLSI) for visual information processing. Stochastic computing exploits random bit streams to realize area-efficient hardware for complicated functions, such as multiplication and tanh functions, in comparison with binary computation. Using stochastic computing, we implement hardware for several physiological models of the primary visual cortex of the brain, which require such complicated functions. Our vision BLSIs are implemented in the Taiwan Semiconductor Manufacturing Company (TSMC) 65 nm CMOS process and compared with traditional fixed-point implementations in terms of hardware performance and computation accuracy. In addition, an analog-to-stochastic converter is designed using CMOS and magnetic tunnel junctions, which exhibit probabilistic switching behavior, for area/energy-efficient signal conversion to stochastic bit streams.
Introduction
Stochastic computing [1] represents information by a random sequence of bits, called a Bernoulli sequence, and has been exploited for area-efficient hardware implementation. It was first introduced by von Neumann in the 1950s [2] and was fully developed in the 1960s [3]. Since then, however, it has not been as widely used as traditional binary computation. In the 2000s, stochastic computing was applied to low-density parity-check (LDPC) decoders [4], where LDPC codes are known as one of the most powerful classes of error-correcting codes. Stochastic LDPC decoders exhibit powerful error-correcting capabilities with high area efficiency [5] [6] [7]. Recently, stochastic computing has been exploited for many applications, such as image processors [8] [9] [10], digital filters [11] [12] [13], and MIMO decoders [14].
In this paper, we review the applications of stochastic computing in brainware (brain-like) LSI. In our BLSI project, several pieces of brainware computing hardware have been designed and implemented, including physiological models and deep neural networks. In Section 2, stochastic computing is briefly explained and an overview of brainware LSI (BLSI) is given. Among the topics of our BLSI design, three hardware implementations are selected: the analog-to-stochastic converter (Section 3) [15], the simple cell model of the primary visual cortex of the brain (Section 4) [16] [17] [18], and the disparity energy model (Section 5) [19]. Section 6 concludes this paper.
Overview of Brainware LSI (BLSI)

Review of stochastic computing
Stochastic computing operates in the probabilistic domain, where probabilities are represented by random sequences of bits. A probability is given by the frequency of ones in a sequence, so the same probability can be represented by many different sequences of bits. For example, the different sequences (1011) and (1101) represent the same probability, 3/4. There are two mappings for stochastic bit sequences: unipolar and bipolar coding. For a sequence of bits, $a(t)$, denote the probability of observing a '1' by $P_a = \Pr(a(t) = 1)$. In unipolar coding, the represented value, $A$, is $A = P_a$ $(0 \le A \le 1)$, while in bipolar coding the represented value is $A = 2 P_a - 1$ $(-1 \le A \le 1)$.
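The two codings can be illustrated with a few lines of Python. This is a minimal sketch, not part of the reviewed hardware; the function names `unipolar_value` and `bipolar_value` are hypothetical and chosen only for this example.

```python
def unipolar_value(bits):
    """Unipolar decoding: the value is the fraction of ones, P_a."""
    return sum(bits) / len(bits)


def bipolar_value(bits):
    """Bipolar decoding: A = 2 * P_a - 1, so the range is [-1, 1]."""
    return 2.0 * unipolar_value(bits) - 1.0


# Two different 4-bit sequences that both contain three ones.
s1 = [1, 0, 1, 1]
s2 = [1, 1, 0, 1]

print(unipolar_value(s1), unipolar_value(s2))  # 0.75 0.75
print(bipolar_value(s1), bipolar_value(s2))    # 0.5 0.5
```

Only the number of ones matters for the represented value, not their positions, which is why many distinct sequences map to the same probability.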
Stochastic circuit components are summarized in Fig. 1. Figure 1(a) shows a two-input multiplier in unipolar coding realized using a two-input AND gate. The input and output probabilities are represented using bit streams of length $N_{sto}$, where $N_{sto}$ is 10 in this example. $N_{sto}$ clock cycles are required to complete the equivalent of one binary multiplication, and the computation accuracy depends on $N_{sto}$. Figure 1(b) shows a stochastic multiplier in bipolar coding realized using a two-input XNOR gate. A two-input scaled adder is realized using a two-input multiplexer, as shown in Fig. 1(c), where $P_s$ is the probability of selecting one of the two inputs.
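The gate-level behavior of these components can be checked in software. The following sketch assumes ideal, mutually independent bit streams produced by Python's pseudo-random generator instead of hardware random sources, and it uses a much longer stream than the 10-bit example so that the sampling noise is small. All function names are hypothetical.

```python
import random


def bernoulli_stream(p, n, rng):
    """Return n independent bits, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def decode(bits, bipolar=False):
    """Recover the represented value from the frequency of ones."""
    p = sum(bits) / len(bits)
    return 2.0 * p - 1.0 if bipolar else p


def and_multiply(a_bits, b_bits):
    """Unipolar multiplier: a two-input AND gate per clock cycle."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def xnor_multiply(a_bits, b_bits):
    """Bipolar multiplier: a two-input XNOR gate per clock cycle."""
    return [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]


def mux_scaled_add(a_bits, b_bits, sel_bits):
    """Scaled adder: 2-to-1 multiplexer; sel = 1 passes a, sel = 0 passes b."""
    return [a if s else b for a, b, s in zip(a_bits, b_bits, sel_bits)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
N_STO = 100_000          # illustrative stream length (assumption)

# Unipolar coding: 0.5 * 0.4 = 0.2
uni_product = decode(and_multiply(bernoulli_stream(0.5, N_STO, rng),
                                  bernoulli_stream(0.4, N_STO, rng)))

# Bipolar coding: values A = 0.5 and B = -0.4 map to P = (value + 1) / 2
a = bernoulli_stream((0.5 + 1) / 2, N_STO, rng)
b = bernoulli_stream((-0.4 + 1) / 2, N_STO, rng)
sel = bernoulli_stream(0.5, N_STO, rng)                      # P_s = 0.5
bip_product = decode(xnor_multiply(a, b), bipolar=True)      # close to -0.2
bip_sum = decode(mux_scaled_add(a, b, sel), bipolar=True)    # close to 0.05

print(uni_product, bip_product, bip_sum)
```

With $P_s = 0.5$ the multiplexer output represents $(A + B)/2$, so the sum is scaled to stay within the representable range.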
In stochastic computing, hyperbolic tangent and exponential functions are simply realized using finite state machines (FSMs), as shown in Figs. 1(d) and (e), respectively. In the FSM-based functions, the state moves to the right if the input stochastic bit, $x(t)$, is "1" and to the left otherwise. The output stochastic bit, $y(t)$, is determined by the current state. The stochastic tanh function, Stanh, in bipolar coding is defined as follows:
$$\mathrm{Stanh}(N_T, x) \approx \tanh\left(\frac{N_T}{2}\, x\right) \quad (1)$$
where $N_T$ is the total number of states. The average value of the output bit stream approximates the output of the tanh function. The stochastic exponential function, Sexp, is defined in unipolar coding as follows:
where $N_E$ is the total number of states and $G$ determines the number of states generating outputs of "1". In order to combine stochastic circuits with traditional binary circuits, signal converters are required between stochastic bit streams and binary data. Figure 2(a) shows a binary-to-stochastic converter (B2S) that includes a digital comparator and a linear-feedback shift register (LFSR) [20]. In the B2S, n-bit binary signals are compared with n-bit random signals generated by the LFSR to produce stochastic bit streams. Figure 2(b) shows a stochastic-to-binary converter (S2B) in unipolar coding designed using a binary counter. In the S2B, the number of "1"s in the stochastic bit stream is counted, and the value stored in the counter is the converted binary data. In bipolar coding, the absolute values in the counters need to be converted to two's complement values in order to deal with the sign bit.
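The converters and the FSM-based tanh can be sketched in software as follows. Several details are assumptions made for illustration and are not taken from the reviewed designs: the LFSR is 8 bits wide with the primitive polynomial $x^8 + x^4 + x^3 + x^2 + 1$, the comparator outputs "1" when the LFSR state is less than or equal to the binary input, the FSM starts in the middle state and outputs "1" in the upper half of the states, and the Stanh input stream is drawn from Python's pseudo-random generator rather than from the LFSR.

```python
import math
import random


def lfsr8_states(seed=1):
    """8-bit Fibonacci LFSR for x^8 + x^4 + x^3 + x^2 + 1 (period 255)."""
    state = seed  # must be non-zero
    while True:
        yield state
        fb = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (fb << 7)


def b2s(value, n_cycles, seed=1):
    """Binary-to-stochastic: compare an 8-bit value with the LFSR state."""
    gen = lfsr8_states(seed)
    return [1 if next(gen) <= value else 0 for _ in range(n_cycles)]


def s2b(bits):
    """Stochastic-to-binary (unipolar): a counter of ones."""
    return sum(bits)


def stanh(bits, n_states):
    """FSM-based Stanh: step right on 1, left on 0, saturating at both ends.
    The output bit is 1 while the state is in the upper half."""
    state = n_states // 2
    out = []
    for bit in bits:
        state = min(n_states - 1, state + 1) if bit else max(0, state - 1)
        out.append(1 if state >= n_states // 2 else 0)
    return out


# Round trip over one full LFSR period: the counter recovers the input.
print(s2b(b2s(100, 255)))  # 100

# Stanh with N_T = 8 states and bipolar input x = 0.5.
N_T = 8
x = 0.5
rng = random.Random(1)
in_bits = [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(100_000)]
out_bits = stanh(in_bits, N_T)
stanh_out = 2 * sum(out_bits) / len(out_bits) - 1   # bipolar decoding
print(stanh_out, math.tanh(N_T * x / 2))            # FSM output vs. tanh
```

Because a maximal-length 8-bit LFSR visits every non-zero state exactly once per period, the round trip over 255 cycles is exact in this sketch; shorter streams give only an approximation.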
Brainware LSI based on stochastic computing
Recently, brain-inspired computing, such as TrueNorth [21] and deep learning [22], has been actively studied for its highly accurate recognition and classification capabilities, which resemble those of the human brain. Several hardware implementations of brain-inspired computing have been presented in [23, 24], but the energy efficiency of current hardware is significantly lower than that of the human brain. Since 2014, in our BLSI project, we have exploited stochastic computing to design energy-efficient brainware hardware based on physiological models of the brain. The reason for choosing stochastic computing for BLSI is that the human brain works well under severe noise and errors. Although stochastic computing generally introduces errors due to randomness, BLSIs based on stochastic computing are expected to work well, like the human brain. Indeed, a large-scale neuromorphic chip based on stochastic computing has been reported to work well under noise [25].
Our stochastic BLSIs are summarized in Fig. 3. This figure shows the flow of visual information in the human brain. First, electrical signals (information) from the retinas are sent to the primary visual cortex (V1) through the lateral geniculate nucleus (LGN). Then, in V1, features are extracted and the extracted information is distributed to two pathways: the dorsal pathway to the middle temporal (MT) area and the ventral pathway to the inferior temporal (IT) area.
In this paper, hardware implementations of several physiological models designed using stochastic computing are reviewed. First, in Section 3, analog-to-stochastic converters are designed to convert external analog signals to stochastic bit streams [15]. Second, in Section 4, a 2D Gabor filter that shows responses similar to those of simple cells in V1 is designed and fabricated using TSMC 65 nm CMOS technology [16] [17] [18]. Third, in Section 5, a disparity energy model in V1 is implemented, which estimates relative depth using two cameras, like the human brain [19]. Beyond these three topics, BLSI applications of stochastic computing have also been studied for deep neural networks [26] and auditory signal processing [27], which are not reviewed in this paper.
3. Analog-to-stochastic converter using magnetic tunnel junction (MTJ)
Vision chip using analog-to-stochastic converter
In this section, an analog-to-stochastic converter using a magnetic tunnel junction (MTJ) device is explained for massively parallel vision chips [15]. The vision chips are front-end image processors for feature extraction in cognitive computing, as shown in Fig. 4(a), where the analog-to-stochastic converter is used in the signal-conversion block. MTJ devices [28] store one bit of information as a resistance and are often exploited as non-volatile memory, for example in MRAMs [29]. In addition, as the switching behavior between the two resistance states of an MTJ device is probabilistic [30, 31], it can be exploited for random number generators [32, 33] and analog-to-stochastic converters.
Figure 4(b) shows a conventional circuit structure of the analog-to-stochastic converter. Using only CMOS transistors, an analog-to-digital converter (ADC) first converts analog signals to digital signals, which are then converted to stochastic bit streams using a digital-to-stochastic (binary-to-stochastic) converter. In the conventional circuit, the power dissipation of the ADC can be a large portion of the total power dissipation in an image sensor (e.g., 65% in [34]), and the digital-to-stochastic converter tends to be large among the stochastic circuits. To reduce the overhead of the signal-conversion block, an analog-to-stochastic converter is designed in which the analog signals are directly converted to stochastic bit streams, as shown in Fig. 4(c).
Figure 5(a) illustrates the proposed analog-to-stochastic converter using hybrid MTJ/CMOS devices. It consists of a pulse-signal generator, a random bit generator, a counter, and a probability controller. Two parameters, $t$ (the pulse width in time) and $V_{bias}$, are set in a calibration step before the converter is used in order to compensate for the variability of the MTJ and CMOS devices. Suppose that an analog current signal is received from a logarithmic image sensor that realizes a high dynamic range [35, 36]. The random bit generator is designed using three transistors and one MTJ device.
Circuit design using hybrid MTJ/CMOS devices
The switching behavior of the MTJ device between the low resistance ($R_P$, parallel) and the high resistance ($R_{AP}$, anti-parallel) is probabilistic [30, 31], as illustrated in Figs. 5(b) and (c). Suppose that the initial state of the MTJ device is $R_P$. When the analog voltage signal, $V_{ph}$, is generated by the sensor, a write current, $I_W$, is applied for a duration $t$. In this case, the switching probability of the MTJ device, $p_w$, is approximated [30, 31] as follows:
$$p_w = 1 - \exp\left(-\frac{t}{\tau_p}\right) \quad (3)$$
where $\tau_p$ is the switching time constant. The detailed switching behavior is described by the SPICE model [37] used in this paper. Figure 6 shows the circuit operation of the proposed analog-to-stochastic converter. The converter cycles through three phases: write, set, and erase. First, in the write phase, $I_{write}$ is generated to switch the MTJ device with a probability that depends on $V_{ph}$. Second, in the set phase, the read current, $I_{read}$, is generated to read the MTJ resistance, and the output voltage, $V_R$, is determined as follows:
where $R_{MTJ}$ is the resistance of the MTJ device. $V_R$ is stored in the latch next to the converter, as shown in Fig. 5(a). Finally, in the erase phase, the erase current, $I_{erase}$, is generated to switch the resistance back to $R_P$. After the erase phase, the converter returns to the write phase.
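The write, set, and erase cycle can be mimicked with a simple behavioral model. The sketch below is not the SPICE model used in the paper. It assumes the commonly used thermal-activation form $p_w = 1 - \exp(-t/\tau_p)$ for the switching probability, picks a hypothetical $\tau_p$ so that a 4.73 ns pulse gives $p_w = 50\%$, and assumes that a switched device is read out as "0".

```python
import math
import random


def switching_probability(t_ns, tau_p_ns):
    """Assumed thermal-activation form: p_w = 1 - exp(-t / tau_p)."""
    return 1.0 - math.exp(-t_ns / tau_p_ns)


def convert(p_w, n_bits, rng):
    """Run n_bits write/set/erase cycles and return the output bit stream."""
    bits = []
    state = "P"                                  # device starts at R_P
    for _ in range(n_bits):
        if rng.random() < p_w:                   # write: probabilistic P -> AP
            state = "AP"
        bits.append(0 if state == "AP" else 1)   # set: read and latch the bit
        state = "P"                              # erase: switch back to R_P
    return bits


T_WRITE_NS = 4.73
TAU_P_NS = T_WRITE_NS / math.log(2)   # hypothetical value giving p_w = 50%
p_w = switching_probability(T_WRITE_NS, TAU_P_NS)
stream = convert(p_w, 100_000, random.Random(0))
ones_fraction = sum(stream) / len(stream)
print(p_w, ones_fraction)
```

In the actual converter the switching probability is set by the sensor signal through the write current, so $p_w$ would vary with the input rather than being fixed as it is here.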
Simulation results
Figure 7(a) shows simulated waveforms of the proposed analog-to-stochastic converter obtained using NS-SPICE with 90 nm CMOS and the MTJ model [37]. The hybrid 90 nm CMOS and MTJ process is the same as that used for the fabricated chip in [38]. NS-SPICE is a transistor-level simulator that can handle both transistor and MTJ models. The cycle time of the converter is set to 10 ns for generating one random bit, where the write phase is 5 ns, the set phase is 1 ns, and the erase phase is 4 ns. In the write phase, the write current, $I_{write}$, flows for 4.73 ns and no current flows for the remaining 0.27 ns. $I_{write}$ is 236 μA, corresponding to a switching probability, $p_w$, of 50% at room temperature. In this simulation, the proposed converter generates three random bits. In the first and second trials, the resistance of the MTJ device changes from $R_P$ to $R_{AP}$ in the write phase. Hence, the output of the converter, $V_{OUT}$, is "0". In contrast, in the third trial, the resistance of the MTJ device does not change even though $I_W$ is applied to the MTJ device, leading to a $V_{OUT}$ of "1".
Figure 7(b) shows a Monte Carlo simulation result of the proposed analog-to-stochastic converter in the write phase. The number of trials is 100 and $I_{write}$ is 236 μA, corresponding to a $p_w$ of 50%. The simulated waveforms show that the switching behavior of the MTJ device is probabilistic and that the switching timing is random. In this simulation, the resistance of the MTJ device changes from an $R_P$ of 1 kΩ to an $R_{AP}$ of 3 kΩ in about 50% of the trials after a bit is written to the MTJ device.
Figure 8(a) shows the relationship between the switching probability, $p_w$, and the input current, $I_{ph}$, when $V_{bias}$ is 0.4 V and $I_{write}$ is 236 μA. The attempt time, $t$, is varied from 1 to 10 ns. When $t$ is 4.73 ns, the relationship between $p_w$ and $I_{ph}$ is almost linear, realizing a linear analog-to-stochastic conversion. In addition, the MTJ variability is considered, as shown in Fig. 8(b).
In order to compensate for the MTJ variability, two parameters, $V_{bias}$ and $t$, are set in the calibration step. The resistance variability is defined by ΔR. By controlling both $V_{bias}$ and $t$, the relationship between the switching probability and $I_{ph}$ remains almost linear under the MTJ variability.
Stochastic configurable 2D Gabor-filter chip
Review of Gabor filter
Gabor filters [39] are powerful feature-extraction tools that extract oriented bars and edges from images. They have been applied to various image-processing and computer-vision applications, such as face recognition [40] and vehicle verification [41, 42]. The 2D Gabor function (odd phase) is defined as follows:
$$g_{\omega,\sigma,\gamma,\theta}(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \sin(\omega x') \quad (5)$$
where $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$. $\omega$ represents the spatial angular frequency of the sinusoidal factor. $\theta$ represents the orientation of the normal to the parallel stripes of a Gabor function. $\sigma$ is the standard deviation and $\gamma$ is the spatial aspect ratio of the Gaussian envelope.
The 2D Gabor filters exhibit responses similar to those of simple cells in the primary visual cortex (V1) of the human brain, as shown in Fig. 9. In V1, many different simple cells, each activated by specific spatial frequencies and orientations in images, are arranged in the hypercolumn structure. Based on the hypercolumn structure, the human brain can extract many different features, such as edges and lines in images, for object recognition and classification in later stages of the brain. The HMAX model is known as one of the brain-inspired object-recognition models that use Gabor filters [43].
Hardware architecture
Figure 10 shows the hardware architecture of the proposed 64-parallel stochastic configurable 2D Gabor-filter chip. The input images are VGA (640 × 480) grayscale images. As stochastic computation takes $N_{sto}$ clock cycles to complete one computation of a traditional binary implementation, the parallel structure is exploited to hide the long computation cycles. 8-bit input signals (pixels) from grayscale images are stored in the line buffer and are then transferred to one of the 64 parallel stochastic convolution units. The chip supports three values of $N_{sto}$: 64, 128, and 256. In the convolution block for Gabor filtering, the multipliers are realized based on stochastic computing and the adders are designed based on traditional binary computation. The hybrid circuit achieves better computation accuracy than the purely stochastic circuit with an acceptable area overhead [44].
Figure 11(a) shows the block diagram of the stochastic Gabor coefficient generator. The coefficient generator is designed based on the stochastic Gabor (SGabor) function defined as follows:
where $\omega'$ is a constant angular frequency, $\lambda$ is $\omega/\omega'$, and λπ is constant. The original Gabor function in Eq. (5) is approximated as follows:
where $\alpha$ is a constant value for fitting SGabor to the original Gabor function. The stochastic sin function, Ssin, is designed using five Stanh functions based on [16], as shown in Fig. 11(b). Figure 11(b) shows the example with ω = π. The required $\omega$ is controlled by $\lambda$. The stochastic cos function, Scos, is designed in the same way as Ssin.
Simulation and measurement results
Figure 12 shows Gabor functions simulated in MATLAB using SGabor for a kernel size of 51 × 51 with different configurations. The length (in cycles) of the stochastic bit streams for SGabor is $N_{sto}$. In this simulation, $\omega$ and $\theta$ are varied with $N_{sto} = 2^{18}$. Using SGabor, any $\omega$ and $\theta$ can be configured depending on requirements.
Figure 13 shows the test environment of the proposed stochastic configurable 2D Gabor-filter chip fabricated in the TSMC 65 nm CMOS process. The proposed circuit is designed in Verilog HDL, and the chip layout is obtained using Synopsys Design Compiler and Cadence SoC Encounter. The supply voltage is 1.0 V and the area is 1.79 mm × 1.79 mm. The fabricated chip is tested with an FPGA board (Digilent Genesys 2). Images are captured by a VGA camera, and the grayscale input pixels are transferred to the chip through the FPGA. The output pixels of the test chip are sent back to the FPGA and displayed using the FPGA.
Table I shows performance comparisons of the proposed stochastic Gabor filter with related works. It is hard to compare the performance directly because the circuits are designed with different functionalities and configurations. The memory-based methods [45, 46] use fixed coefficients with fixed kernel sizes that are calculated in software in advance, which leads to a lack of flexibility. As opposed to the memory-based circuits, the conventional configurable Gabor filter [47] exploits CORDIC to dynamically generate the coefficients related to the sinusoidal function for flexible Gabor filtering. However, this method has low throughput due to its hardware complexity, and several parameters need to be stored in memory, which rules out power gating. In contrast, the proposed memory-less circuit achieves an order-of-magnitude higher throughput than the conventional configurable Gabor filter while retaining the power-gating capability, leading to zero standby power.
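As a software reference for the odd-phase 2D Gabor function reviewed in this section, the following sketch generates a floating-point kernel. It assumes the standard form of the function, a Gaussian envelope with standard deviation $\sigma$ and aspect ratio $\gamma$ multiplied by $\sin(\omega x')$, and does not model the stochastic SGabor generator of the chip. The function names and the parameter values in the usage lines are hypothetical.

```python
import math


def gabor_odd(x, y, omega, sigma, gamma, theta):
    """Odd-phase 2D Gabor value at (x, y) for orientation theta."""
    xr = x * math.cos(theta) + y * math.sin(theta)     # rotated coordinate x'
    yr = -x * math.sin(theta) + y * math.cos(theta)    # rotated coordinate y'
    envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr)
                        / (2.0 * sigma * sigma))
    return envelope * math.sin(omega * xr)


def gabor_kernel(size, omega, sigma, gamma, theta):
    """Return a size x size kernel (size odd) centered on the origin."""
    half = size // 2
    return [[gabor_odd(x, y, omega, sigma, gamma, theta)
             for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]


# Example: 7 x 7 kernel oriented at 45 degrees (illustrative parameters).
kernel = gabor_kernel(7, omega=math.pi / 2, sigma=2.0, gamma=1.0,
                      theta=math.pi / 4)
for row in kernel:
    print(" ".join(f"{v:+.3f}" for v in row))
```

The kernel is antisymmetric about its center, which is the property that makes the odd-phase filter respond to edges rather than to uniform regions.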
Stochastic disparity energy model
Review of disparity energy model
Measuring the relative depth of objects efficiently in real time is a crucial issue as robotics advances.
A disparity-energy model was presented to express the disparity-selective properties of binocular complex cells in V1, which are responsible for depth perception in the brain [48]. In the disparity-energy model, binocular disparity measures the depth of objects using two images taken from different vantage points and is defined as the difference in the horizontal position of the same object in these two images. This model was shown to be valid in monkeys [49] and to describe well the responses of binocular complex cells in V1 [50]. When an object is perceived by the left and right eyes, its position is horizontally displaced in each of the corresponding images, as illustrated in Fig. 14(a). The brain uses this horizontal disparity, $d$, to estimate the relative depths of objects in three dimensions. Positive and negative disparities (corresponding to farther and closer objects) consequently excite different retinal cells in each eye.
Zero disparity corresponds to objects whose positions are the same from both perspectives and excites corresponding retinal cells in each eye. Figure 14(b) shows the disparity-energy model, which describes how the neural hierarchy in the brain processes this information to detect disparity [48, 51, 52]. The simple cells are approximated using the Gabor filters explained in the previous section. The complex cells, $C_d$, then take the even and odd binocular cell responses, square them, and add them:
$x_L$ and $x_R$ are the horizontal pixel positions for the left and right eye, respectively. $G_{even+}(x) = G_{even}(x)$ if $x > 0$, and 0 otherwise. $G_{even-}(x) = G_{even}(x)$ if $x \le 0$, and 0 otherwise. There are two ways of encoding disparity in the model: position shift and phase shift [53]. In this paper, we only use position shift, where $d$ is defined by the difference in position of the receptive field.
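The position-shift idea can be illustrated with a small one-dimensional example. The sketch below uses the textbook form of the energy model, in which even and odd Gabor responses of the left and right signals are summed, squared, and added, and it pools the energy over all positions of a circular signal. It omits the half-wave rectified terms $G_{even+}$ and $G_{even-}$, so it is an assumption-laden simplification rather than the model implemented in hardware. The sign convention for $d$ and all names and parameter values are hypothetical.

```python
import math
import random


def gabor_1d(radius, omega, sigma):
    """Even (cosine) and odd (sine) 1D Gabor kernels of length 2*radius+1."""
    xs = range(-radius, radius + 1)
    env = [math.exp(-x * x / (2.0 * sigma * sigma)) for x in xs]
    even = [e * math.cos(omega * x) for e, x in zip(env, xs)]
    odd = [e * math.sin(omega * x) for e, x in zip(env, xs)]
    return even, odd


def filter_circular(signal, kernel):
    """Filter response at every position, with wrap-around indexing."""
    n, r = len(signal), len(kernel) // 2
    return [sum(kernel[k + r] * signal[(x0 + k) % n] for k in range(-r, r + 1))
            for x0 in range(n)]


def disparity_energy(left, right, d, even, odd):
    """Pooled energy for a receptive-field position shift of d pixels."""
    n = len(left)
    le, lo = filter_circular(left, even), filter_circular(left, odd)
    re, ro = filter_circular(right, even), filter_circular(right, odd)
    return sum((le[x] + re[(x + d) % n]) ** 2 + (lo[x] + ro[(x + d) % n]) ** 2
               for x in range(n))


def estimate_disparity(left, right, candidates, even, odd):
    """Pick the candidate shift with the largest pooled energy."""
    return max(candidates,
               key=lambda d: disparity_energy(left, right, d, even, odd))


rng = random.Random(0)
left = [rng.random() for _ in range(64)]            # random 1D texture
D_TRUE = 3
right = [left[(x - D_TRUE) % 64] for x in range(64)]  # shifted copy of left
even, odd = gabor_1d(radius=3, omega=math.pi / 2, sigma=1.5)
estimated = estimate_disparity(left, right, range(-8, 9), even, odd)
print(estimated)
```

The energy peaks when the shifted right-eye responses line up with the left-eye responses, so the shift with the largest energy serves as the disparity estimate.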
Stochastic convolution architecture
The key circuit components for designing the disparity energy model are the convolution units used in Gabor filtering. The convolution is defined as follows:
$$y = \sum_{i=1}^{n} a_i x_i$$
where $a_i$ are the coefficients and $x_i$ are the system inputs. Figure 15(a) shows a conventional stochastic architecture of the convolution unit. It consists of AND gates (stochastic multipliers) and a multiplexer (stochastic scaled adder). The drawback of this circuit is that the computation accuracy becomes significantly lower as the number of inputs, $n$, increases. In the Gabor filters of the disparity energy model, 7 × 7 kernels are used to extract features. In this case, $n$ is 49, resulting in low computation accuracy. To achieve high computation accuracy with a large number of inputs, the exponential-based convolution circuit shown in Fig. 15(b) was presented. In the proposed circuit, the exponential compression method transforms the stochastic streams of interest using an exponential function, such that additions become multiplications [54]. The exp(x) and ln(x) functions are approximated using Taylor series expansions. Suppose that $a_i$ and $x_i$ have been properly scaled such that $|x_i| \le 1$ and $|a_i| \le 1$. The set of coefficients $a_i$ is partitioned into a set $a_i^+$ containing the positive coefficients and a set $a_i^-$ containing the absolute values of the negative coefficients.
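One way to see why the multiplexer-based adder loses accuracy for large $n$ is to simulate it. The sketch below assumes unipolar coding with non-negative coefficients and inputs, a uniformly random multiplexer select signal, and ideal independent bit streams; these are simplifications for illustration, and the function names are hypothetical. Because the multiplexer output represents the sum scaled by $1/n$, the recovered sum is multiplied by $n$, and so is the sampling noise.

```python
import random


def mux_convolution(a, x, n_sto, rng):
    """AND-gate multipliers followed by an n-input multiplexer (scaled adder).
    Returns the estimate of sum(a_i * x_i) after rescaling by n.
    Only the selected input is simulated in each cycle, which is equivalent
    because the unselected streams do not affect the output bit."""
    n = len(a)
    ones = 0
    for _ in range(n_sto):
        i = rng.randrange(n)              # multiplexer select, uniform
        a_bit = rng.random() < a[i]       # bit of the stream for a_i
        x_bit = rng.random() < x[i]       # bit of the stream for x_i
        ones += int(a_bit and x_bit)      # AND gate = unipolar multiplier
    return n * ones / n_sto


def rms_error(n, n_sto, trials, rng):
    """RMS error of the recovered sum over random coefficients and inputs."""
    total = 0.0
    for _ in range(trials):
        a = [rng.random() for _ in range(n)]
        x = [rng.random() for _ in range(n)]
        exact = sum(ai * xi for ai, xi in zip(a, x))
        total += (mux_convolution(a, x, n_sto, rng) - exact) ** 2
    return (total / trials) ** 0.5


rng = random.Random(0)
N_STO = 2**6 - 1                              # stream length of 63 cycles
rms_4 = rms_error(4, N_STO, 200, rng)         # small number of inputs
rms_49 = rms_error(49, N_STO, 200, rng)       # 7 x 7 kernel
print(rms_4, rms_49)
```

For a fixed stream length, the absolute error of the recovered sum grows roughly in proportion to $n$. With signed Gabor coefficients the true sum is often small compared with $n$, so this amplified noise can dominate the result.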
Experimental results of disparity energy model
To detect the depths of objects, an experiment similar to that in [50] is set up, as shown in Fig. 16. The two cameras are set up 19 cm apart with an 8-degree angle from the vertical. The fixation point, which is the point at the intersection of the lines of sight of the two cameras, is 66 cm away. At this range, disparities correspond to around 3 cm per pixel. One white pole is placed at the fixation point. To detect disparities of −8 and +8, two white poles are also placed at distances of 42 and 90 cm, respectively, from the center of the cameras. Figure 17 shows the disparity maps for the floating-point, the conventional stochastic, and the proposed stochastic circuits. The lengths of the stochastic bit streams are $2^6 - 1$, corresponding to 6-bit fixed-point precision. To quantify the errors, we obtain 4 additional image pairs with poles at different disparities using a similar setup and manually create ideal disparity maps, based on the positions and dimensions of the poles in the left and right images, against which the error is estimated. Using the conventional stochastic circuit, the disparities are not obtained, unlike in the floating-point result.
The reason is that the computation accuracy of the stochastic convolution unit shown in Fig. 15(a) is significantly lower than that of the floating-point computation. In contrast, using the proposed stochastic circuit, disparities similar to the floating-point results are obtained because of the high computation accuracy of the exponential-based convolution unit.
Hardware evaluation
Table II summarizes the performance of the disparity-energy-model hardware using TSMC 65 nm CMOS technology. For both the fixed-point and stochastic circuits, a 2D 1×100 architecture is synthesized using Cadence RC compiler. The worst-case delay is 5.5 ns and 1.7 ns in the fixed-point and the stochastic circuits, respectively. In the fixed-point design, the interface circuitry includes the input and output registers. In the stochastic design, it includes input registers, linear feedback shift registers (LFSRs) for random number generation, and comparators and counters to convert from the digital to the stochastic domain and back.
To provide a fair comparison, we use the area × delay product (ADP) measure to normalize for the latency of the stochastic system. Note that this stream length allows the stochastic system to outperform the floating-point system even when the performance is averaged over the seed configurations. The stochastic circuit with the interface circuitry achieves a 41.3% reduction in ADP in comparison with the fixed-point circuit.
The dynamic and static power dissipations of the stochastic design are significantly smaller than those of the fixed-point design because of its small area. However, the energy dissipation with the interface is 73% larger than that of the fixed-point design. The reason is that the stochastic circuits take $2^6 - 1$ cycles for a one-cycle operation of the fixed-point design. The energy overhead also comes from the interface, which includes binary-to-stochastic and stochastic-to-binary converters. The overhead can be mitigated using the MTJ-based converters explained in Section 3.
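The latency penalty behind this energy overhead can be estimated with simple arithmetic. The sketch below assumes that each design is clocked at its reported worst-case delay and that energy equals average power multiplied by latency. The implied power ratio is a back-of-the-envelope inference under these assumptions, not a value taken from Table II.

```python
N_STO = 2**6 - 1            # stochastic stream length (cycles per operation)
T_STOCHASTIC_NS = 1.7       # worst-case delay of the stochastic circuit
T_FIXED_NS = 5.5            # worst-case delay of the fixed-point circuit

# Latency of one operation in each design (assumes one cycle = worst-case delay).
latency_stochastic_ns = N_STO * T_STOCHASTIC_NS
latency_ratio = latency_stochastic_ns / T_FIXED_NS

# Energy = power x latency, so a 73% larger energy implies this power ratio.
ENERGY_RATIO = 1.73
implied_power_ratio = ENERGY_RATIO / latency_ratio

print(f"stochastic latency: {latency_stochastic_ns:.1f} ns")
print(f"latency ratio:      {latency_ratio:.2f}x")
print(f"implied power:      {implied_power_ratio:.3f} of fixed-point")
```

Under these assumptions, the stochastic design is far slower per operation, so its much lower power is not enough to offset the longer operating time.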
The average error of the stochastic design is slightly smaller than that of the fixed-point design. As the stochastic circuits exhibit the variability of computation accuracy depending on random bit streams, the minimum and the maximum computation accuracies are also listed.
Conclusion
In this paper, we have reviewed the applications of stochastic computing in brainware for visual signal processing. Two physiological models of V1 in the human brain have been implemented in the TSMC 65 nm CMOS process. Their hardware performance and computation accuracy have been compared with those of fixed-point designs. In addition, an area-efficient analog-to-stochastic converter has been designed in order to mitigate the overhead of converting external analog signals to stochastic bit streams.
Future prospects include the application of stochastic computing to models of the higher-order visual cortex, such as visual attention models.
