Renyuan ZHANG †a), Takashi NAKADA †, Members, and Yasuhiko NAKASHIMA †, Fellow

SUMMARY A programmable analog calculation unit (ACU) is designed for vector computations in continuous-time with compact circuit scale. From our early study, it is feasible to retrieve arbitrary two-variable functions through support vector regression (SVR) in silicon. In this work, the dimensions of regression are expanded for vector computations. However, the hardware cost and computing error greatly increase along with the expansion of dimensions. A two-stage architecture is proposed to organize multiple ACUs for high dimensional regression. The computation of high dimensional vectors is separated into several computations of lower dimensional vectors, which are implemented by the free combination of several ACUs with lower cost. In this manner, the circuit scale and regression error are reduced. The proof-of-concept ACU is designed and simulated in a 0.18 µm technology. From the circuit simulation results, all the demonstrated calculations with nine operands are executed without iterative clock cycles by 4960 transistors. The calculation error of example functions is below 8.7%.
Introduction
The roadmap of very large scale integrated (VLSI) circuit scaling is approaching its end due to physical limits. Application engineers will hardly harvest further benefits in high performance computing from higher integration and frequency in the near future. Meanwhile, the demand for big (even huge) data processing keeps increasing. In this sense, more scalable and efficient computing systems are expected [1]. Approximate computing, which allows a reasonable loss of accuracy but achieves high speed and low power with lower hardware cost, is considered a promising option for efficient computing systems [2], especially on the edge devices of the internet of things (IoT) [3]. By applying approximate computing technologies, many works have reached a rich trade-off between precision and implementation cost in various real-world tasks such as pattern recognition [4] and machine learning [5]. In those works, it was found that an acceptable loss of computational accuracy has no significant impact on the final performance of the entire task.
In order to process massive data approximately but efficiently, efforts have been made on both the software and hardware sides for different benefits [6]. One typical strategy is to carry out calculations by using inexact or faulty circuits [7]. Obviously, the common concern is that the cost reduction must compensate (or more commonly, overcompensate) for the loss of precision. Some conventional analog calculating circuits have been proved sufficiently fast, low power, and compact for specific functions such as multiplication [8]. However, the benefit of those function-specialized analog calculators is easily eaten up by their very poor programmability. More generally, programmable and function-flexible calculators are still in demand. The simplest methodology for general-purpose approximate calculators is reducing the bit-width of conventional digital processors [9], which hardly breaks through the trade-off between gain and loss. To greatly speed up flexible calculations, the target functions can be retrieved by a look-up table (LUT) with analog addressing [10]. When efficient memory blocks such as multi-valued logic (MVL) memories are employed, the LUT-based calculators are scaled down even further [11]. However, increasing the number of operands leads to an explosion of the LUT size. Therefore, those LUT-based approximate calculators are difficult to use for simultaneous vector calculations. An attempt to avoid the LUT explosion was made by approximately retrieving functions with an on-chip neural network (NN) instead of a LUT with full patterns [12]. Although NNs help to reduce hardware cost, only several simple functions are available due to the small scale of the NN. From our early study [13], it is feasible to retrieve arbitrary complex functions in continuous time with a compact circuit scale by implementing advanced regression algorithms such as support vector regression (SVR).
A programmable analog calculation unit (ACU) with only 600 transistors was presented for various computations with two analog operands. In this work, the proposed ACU is expanded to vector calculations. The increase of inaccuracy and hardware cost is illustrated as the number of operands increases. Since the calculations are carried out by regressing the listed samples through SVR, the accuracy and circuit scale strongly depend on the number of support vectors (SVs, known as significant samples). Unfortunately, even if the proposed efficient scheme of SVR is applied, it is very challenging to further reduce the number of SVs when there are too many operands. Therefore, a two-stage architecture is developed to separate a complex multi-variable function into several groups with fewer variables. The free combination of ACUs is implemented to address specific target functions. In this manner, the high dimensional regression is implemented by compact analog circuitry with improved accuracy. A proof-of-concept ACU with nine operands is designed and simulated in a 0.18 µm standard CMOS technology along with the interfaces between the two stages and the special configuration of SVR training. The entire ACU consists of 4960 transistors. From the circuit simulation results, all the example functions, including 3 × 3 convolutions and a square-root function with nine variables, are computed in continuous time with calculation errors below 8.7%.

Copyright © 2019 The Institute of Electronics, Information and Communication Engineers
Prototype of Programmable ACU
A programmable ACU approximately computes arbitrary functions of N analog inputs without any iteration of clock cycles. The SVR algorithm with a Gaussian kernel [14] is realized by VLSI circuits to retrieve complex functions through N dimensional regression as shown in Fig. 1. A set of sample patterns including variables and correct function values is necessary for training the regression network. After the training process, the well-organized regression network is expected to "predict" the function value when the test patterns (variables) arrive. Given a set of N dimensional sample vectors $(X_1, z_1), (X_2, z_2), \cdots, (X_n, z_n)$, where $X_i \in \mathbb{R}^N$ is the sample vector and $z_i \in \mathbb{R}$ is the target function value, the regression for an unknown variable $X$ is expressed, in the standard SVR form with a Gaussian kernel of width parameter $\gamma$ and offset $b$, by:

$$f(X) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \exp\left(-\gamma \lVert X - X_i \rVert^2\right) + b$$
according to the fundamentals of SVR theory. The training process of SVR identifies the support vectors (SVs, the samples with non-zero multipliers $(\alpha_i - \alpha_i^*)$) and the corresponding parameters; it is outside the scope of this paper. The LIBSVM platform [15] is employed to train the SVR network. Obviously, more SVs lead to higher accuracy but increase the hardware cost. Thus, an efficient scheme of SVR is proposed to reasonably reduce the number of SVs.
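The standard SVR prediction with a Gaussian kernel sums one kernel response per support vector, weighted by its multiplier $(\alpha_i - \alpha_i^*)$. The following is a minimal software sketch of that evaluation, not a model of the circuit. The names `gamma` (kernel width) and `bias` (offset), as well as the numeric values in the usage lines, are assumptions for illustration and are not specified in this section.

```python
import math


def gaussian_kernel(x, sv, gamma):
    """Gaussian (RBF) kernel between an input vector and one support vector."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, sv))
    return math.exp(-gamma * dist2)


def svr_predict(x, support_vectors, multipliers, gamma, bias=0.0):
    """Weighted sum of kernel responses, one term per support vector."""
    return bias + sum(
        m * gaussian_kernel(x, sv, gamma)
        for sv, m in zip(support_vectors, multipliers)
    )


# Hypothetical two-input example; the SVs and multipliers are made up.
svs = [(0.0, 0.0), (1.0, 1.0)]
mult = [2.0, -1.0]
print(svr_predict((0.5, 0.5), svs, mult, gamma=1.0))
```

Each term of the sum corresponds to one kernel, so the number of SVs directly sets how many kernel evaluations are needed per prediction.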
Efficient Scheme of SVR
The retrieval of a target function is a special application of regression. The learning samples are generated by pre-computing the specific target function. Namely, the sampling resolution is adjustable, but it leads to a trade-off between regression accuracy and cost. In this work, the number of samples greatly impacts the number of SVs, each of which is addressed by a particular set of circuitry. The efficient scheme of SVR is expected to achieve reasonably high accuracy with fewer SVs. A multi-round training process is introduced to evaluate each SV and eliminate insignificant SVs as follows.
1: Regression samples are listed in a fine resolution as N dimensional vectors;
2: train the SVR model initially, observe the multiplier $(\alpha_i - \alpha_i^*)$ for all the SVs;
3: eliminate one SV with the smallest value of parameter $|\alpha_i - \alpha_i^*|$;
4: replace training samples by the surviving SVs, return to step 2;
5: stop when the number of survivals is reduced to 10 × N.
Finally, the surviving SVs are expected to have the most significant impact on the regression. The final number 10×N of SVs is determined from the trade-off between the accuracy and circuit scale.
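The multi-round elimination above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper trains with LIBSVM, whereas the sketch swaps in a small kernel ridge regression (`train_stand_in`) so that it runs with the standard library alone. The function names, `gamma`, `ridge`, and the 5 × 5 sample grid are all hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def train_stand_in(X, z, gamma=4.0, ridge=1e-3):
    """Stand-in trainer (kernel ridge regression), NOT the LIBSVM SVR solver.
    Returns one multiplier per training sample."""
    K = [[gaussian_kernel(a, b, gamma) + (ridge if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    return solve_linear(K, z)


def prune_support_vectors(X, z, keep, train=train_stand_in):
    """Retrain repeatedly, dropping the weakest SV each round (steps 2 to 5)."""
    X, z = list(X), list(z)
    while True:
        coef = train(X, z)                                 # step 2: train, read multipliers
        sv = [i for i, c in enumerate(coef) if c != 0.0]   # SVs have non-zero multipliers
        if len(sv) <= keep:                                # step 5: stop at the target count
            return [X[i] for i in sv], [coef[i] for i in sv]
        weakest = min(sv, key=lambda i: abs(coef[i]))      # step 3: smallest |multiplier|
        sv.remove(weakest)
        X = [X[i] for i in sv]                             # step 4: survivors become samples
        z = [z[i] for i in sv]


# Hypothetical 2-D example: 25 grid samples reduced to 10 x N = 20 survivors.
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
targets = [math.sqrt(x * x + y * y) for x, y in grid]
svs, multipliers = prune_support_vectors(grid, targets, keep=10 * 2)
print(len(svs))
```

Passing a different `train` callable, for example a wrapper around an actual SVR solver that returns zero for non-support samples, leaves the pruning loop itself unchanged.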
Circuit Implementation
The circuit organization of the ACU is described in Fig. 2. The entire ACU consists of a set of Gaussian kernel circuits and a mixer. Receiving the well-trained parameters from the software side, the kernel circuits are tuned to correspond to different functions. The kernel circuit is designed with a set of p- and n-type MOS transistors with amplifier factors of $K_p$ and $K_n$, respectively. An N dimensional vector in analog voltages of
is input as the operand. N sets of squaring generators are designed to approximately calculate the squared difference between each pair of input voltages. For instance, the square-subtract function between $v_{i1}$ and $v_{j1}$ is reflected by the current $I_1$ or $I_2$ for the first dimension. Assuming $v_{i1}$ is higher than $v_{j1}$, $v_{i1}$ is boosted by a sufficiently small bias current to compensate for the threshold voltage of the n-type MOS transistor. In this manner, the current $I_1$ reflects the squared difference $(v_{i1} - v_{j1})^2$ and $I_2$ is zero. By collecting the current from all the dimensions as
The threshold voltage of the p-type MOS transistor is denoted by $V_{thp}$. In the exponential generation block, $V_{ref}$ is biased to compensate for the influence of $V_0$ (which can be easily generated by another identical I-V converter without any input current). Then, the output current $I_{out}$ is given by $I_{out} = \frac{I_c}{1 + e^{\Delta V}} \approx \frac{I_c}{2} e^{-\Delta V}$. Finally, the result is output in the current mode as:
where $\frac{I_c}{2}$ reflects the parameter $(\alpha_i - \alpha_i^*)$. Since all the results from the Gaussian kernels are in the current mode, it is easy to mix them into the SVR expression by a simple current-mirror based collector:
To program the ACU and specify a target function, a set of voltages for the SVs and the corresponding currents $I_c$ are applied to the kernel circuits according to the training results from the software side. Power gating, as suggested in our previous paper [13], can also be applied to eliminate static power consumption. Several examples from the circuit simulation results are shown in Fig. 3 to illustrate various calculations by the ACU. Two examples in Fig. 3(a) show the regression of cosine and exponential functions with an average error of less than 1.5%. The calculation is processed in continuous time with a delay of less than 370 ns. Another example in Fig. 3(b) shows the circuit simulation results. In this example, 20 SVs are employed to retrieve the square-root function with an average error of 4.3% and a delay of 410 ns. The entire ACU for two-input calculations consists of 600 transistors.
Expansion of Operands
Multi-operand calculations are implemented by simply increasing the number of dimensions and kernels. For instance, 9-D regressions are realized by the proposed ACU with 90 kernels, which are implemented by 9460 transistors. Compared with our previous work on 2-D regression (600 transistors [13]), the circuit scale and inaccuracy greatly increase. Three examples of 9-input functions are illustrated by the circuit simulations shown in Fig. 4. In order to verify the nonlinear regression, the first example is the square-root function $f(x_1, x_2, \ldots, x_9) = \sqrt{x_1^2 + x_2^2 + \cdots + x_9^2}$. From the circuit simulation results, the average regression error for this function is 10.8%. The other two examples are typical convolution kernels in computer vision applications [16]. The Prewitt filter for edge detection is performed as $f(x_1, x_2, \ldots, x_9)$ with an average error of 5.4%; the Gaussian filter for blurring is performed as $f(x_1, x_2, \ldots, x_9) = \frac{1}{16}(x_1 + 2x_2 + x_3 + 2x_4 + 4x_5 + 2x_6 + x_7 + 2x_8 + x_9)$ with an average error of 6.5%. Since the expansion of inputs is realized by increasing the dimensions and number of Gaussian kernels in parallel, the processing delay does not differ noticeably among 1-D (370 ns), 2-D (410 ns), and even 9-D (390 ns) regression. To retrieve functions of high dimensional vectors, the necessary number of SVs increases according to the fundamentals of SVR theory. At the same time, the scale of the kernel circuit addressing a specific SV also increases. Figure 5 shows the increase of average error and number of transistors for the non-linear function $\sqrt{\sum_{i=1}^{N} x_i^2}$ as N increases. To compute a function of multiple inputs simultaneously, all the kernel functions are executed fully in parallel. On the other hand, it is difficult to reduce the number of kernels without degrading the regression accuracy. Therefore, the circuit scale can hardly be reduced for a single large regression network.
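To make the 9-input examples concrete, the sketch below defines the square-root and Gaussian-filter target functions in Python and lists regression samples on a uniform grid. It assumes operands normalized to [0, 1] and a hypothetical helper `make_samples`; neither is specified in the paper. The Prewitt filter is left out because its expression is not given in this section.

```python
import itertools
import math


def root_sum_squares(x):
    """Square-root example: sqrt(x1^2 + x2^2 + ... + xN^2)."""
    return math.sqrt(sum(v * v for v in x))


# Weights of the 3 x 3 Gaussian blurring filter, row by row.
GAUSSIAN_WEIGHTS = (1, 2, 1, 2, 4, 2, 1, 2, 1)


def gaussian_blur(x):
    """Gaussian filter example: weighted sum of nine inputs divided by 16."""
    return sum(w * v for w, v in zip(GAUSSIAN_WEIGHTS, x)) / 16.0


def make_samples(func, n_dims, levels):
    """List (vector, value) pairs on a uniform grid with `levels` points per axis."""
    axis = [i / (levels - 1) for i in range(levels)]
    return [(p, func(p)) for p in itertools.product(axis, repeat=n_dims)]


# Even a coarse grid of 3 levels per operand yields 3**9 sample patterns.
samples = make_samples(gaussian_blur, n_dims=9, levels=3)
print(len(samples))  # 19683
```

The sample count grows as `levels ** n_dims`, which gives a feel for why directly listing samples for many operands becomes expensive.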
Two-Stage Architecture of ACU for Efficient Vector Calculation

Circuitry Organization
In many application fields of approximate computing such as image processing and machine learning, fast and efficient calculation of vectors is demanded. To efficiently implement simultaneous calculation with multiple operands, a two-stage architecture of ACU is proposed as shown in Fig. 6. The N dimensional regression is separated into $\sqrt{N}$ groups of $\sqrt{N}$ dimensional regressions in the first stage; then, an additional $\sqrt{N}$ dimensional regression is employed to retrieve the target function in the second stage. For many applications such as convolutions, the second stage of regression can be substituted by a linear combination of the results from the first stage, which is simply implemented by current collection. Since the inputs of the ACU are voltage signals but the output is in the current mode, an interface circuit between the two stages is necessary to convert the current to a voltage. The current-to-voltage (IV) converter is described in Fig. 7. Only three transistors are needed to perform the linear conversion from the input current $I_{in}$ to the output voltage $V_{out}$. The transistor M1 is biased by a sufficiently small current $I_b$. Then, the drain-to-source voltage of M1 is $V_{ds} = V_X \approx V_{bias} - V_{thn}$, where $V_{bias}$ and $V_{thn}$ are a constant bias voltage and the threshold voltage of the n-type MOS transistor, respectively. Since $V_{out}$ is lower than $V_X$, M1 always operates in the linear region as:
A typical current-to-voltage characteristic is given by the circuit simulation results in Fig. 8. Note that the output voltage of the IV converter does not cover the full input range of the ACU. Thus, the regression of the second stage is regularized as $h(x'_1, x'_2, \ldots, x'_{\sqrt{N}})$, where the mapping $x'_i = a x_i + b$ is obtained by observing the IV converter. More sophisticated IV converter circuits can also be applied for a higher quality and wider range of conversion with a small overhead. However, the limited conversion range is easily compensated by SVR, and an acceptable distortion of conversion has no serious impact on the final results.
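One way to obtain the coefficients a and b of the linear mapping, assuming the converter response is close to linear over the range in use, is a least-squares line fit to observed conversion points. The sketch below is only an illustration: the function names are hypothetical and the observation values are invented, not taken from Fig. 8.

```python
def fit_affine(xs, ys):
    """Least-squares fit of y = a * x + b to paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def regularize(values, a, b):
    """Map first-stage results into the range seen by the second stage."""
    return [a * v + b for v in values]


# Invented observations: normalized first-stage result vs. converter output.
stage1_out = [0.0, 0.25, 0.5, 0.75, 1.0]
iv_out = [0.20, 0.325, 0.45, 0.575, 0.70]
a, b = fit_affine(stage1_out, iv_out)
print(a, b)
print(regularize([0.1, 0.9], a, b))
```

With a and b known, the training samples of the second-stage regression can be listed directly in the regularized domain, so the narrower voltage range is absorbed by training rather than by extra circuitry.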
Circuit Simulations
For proof of concept, an ACU processor with the two-stage architecture is designed and simulated for arbitrary calculations with nine operands. In this case, each ACU in both stages is implemented for three dimensional regression with 30 Gaussian kernel circuits. Four ACUs are organized as the entire processor consisting of 4960 transistors, which is about half of the prototype. Three examples are demonstrated as shown in Fig. 9. Exactly the same functions as those used for verifying the prototype are introduced to investigate the accuracy. Since the nine dimensional calculation is separated into three groups of regression, each regression task is realized by three dimensional SVs. For the square-root calculation (example 1), the first stage of regression corresponds to three sets of square-summation functions; the second stage corresponds to a root-summation function. For the calculations of 3 × 3 filters, the second stage is a simple summation instead of SVR. From the simulation results, the regression errors for all the examples are slightly reduced, and the processing delay of 385 ns is similar to that of the prototype. As a minor benefit of the two-stage architecture, the SVR training time (considered as the "synthesis" of calculation circuits) is also reduced due to the reduction of initial samples and dimensions. In the 9-D regression examples, the SVR training time through the LIBSVM platform on exactly the same computer server is reduced from 40 s to 7 s by using the two-stage regression. There is a potential to expand our proposed ACU to even higher dimensions by employing a hierarchical architecture with multiple stages if necessary. On the other hand, the multi-stage regression is made by mapping the target function into multi-layer combinations of Gaussian kernels. The analog implementations of those kernels are inherently faulty. Thus, the inaccuracy of the kernels accumulates from each layer to the end. From the results in Fig. 10, the error accumulation is compensated by the reduction of dimensions. However, when the number of stages increases, the error would accumulate faster than the compensation. In this sense, multi-stage ACUs are only suggested for applications with robust fault resilience.
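The separation used in these examples can be checked numerically. In the sketch below, exact functions stand in for the trained ACUs, so it only illustrates that the nine-operand targets decompose into three 3-input first-stage functions followed by one second-stage function; it says nothing about regression error. All function names are hypothetical.

```python
import math


def split_groups(x, group_size):
    """Split the operand vector into consecutive groups."""
    return [tuple(x[i:i + group_size]) for i in range(0, len(x), group_size)]


def two_stage(x, stage1_funcs, stage2_func, group_size=3):
    """Apply one first-stage function per group, then combine in the second stage."""
    groups = split_groups(x, group_size)
    partials = [f(g) for f, g in zip(stage1_funcs, groups)]
    return stage2_func(partials)


def square_sum(group):
    """First stage of the square-root example: sum of squares of one group."""
    return sum(v * v for v in group)


def root_of_sum(partials):
    """Second stage of the square-root example: root of the summed partials."""
    return math.sqrt(sum(partials))


def weighted_row(weights):
    """First stage of the Gaussian filter: one weighted row, scaled by 1/16."""
    return lambda group: sum(w * v for w, v in zip(weights, group)) / 16.0


x = tuple(i / 10 for i in range(1, 10))  # arbitrary test operands

sqrt_two_stage = two_stage(x, [square_sum] * 3, root_of_sum)
sqrt_direct = math.sqrt(sum(v * v for v in x))

rows = [weighted_row((1, 2, 1)), weighted_row((2, 4, 2)), weighted_row((1, 2, 1))]
blur_two_stage = two_stage(x, rows, sum)  # second stage is a plain summation
blur_direct = sum(w * v for w, v in zip((1, 2, 1, 2, 4, 2, 1, 2, 1), x)) / 16.0

print(math.isclose(sqrt_two_stage, sqrt_direct), math.isclose(blur_two_stage, blur_direct))
```

For the filter case the second stage reduces to `sum`, which mirrors the statement that a simple summation replaces SVR there.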
The two-stage architecture can be applied to execute regressions of different dimensions. It is helpful to reduce the hardware cost and inaccuracy in general. Figure 10 shows that the growth of circuit scale and regression error is slowed down by applying the proposed two-stage architecture. For N-operand implementations, the number of transistors increases as $O(N)$, whereas it increases as $O(N^2)$ for the prototype. This fact offers the option to expand the operands further with an acceptable hardware cost. Thanks to the reduction of circuit scale, the power consumption of the ACU is also reduced. For the single-stage prototype of the ACU, the operating power greatly increases from 0.011 mW to 3.6 mW when the number of dimensions increases from one to nine. By applying the two-stage architecture, the power for 9-D regression is reduced to 0.88 mW. On the other hand, directly regressing high dimensional samples is very challenging and risky on the side of regression algorithms. When the number of dimensions increases, the regression error and machine learning time will exponentially increase. Thus, partitioning the dimensions into reasonable groups is helpful to limit the regression error.
Comparisons
A comparison among several strategies for circuit implementations of approximate computing is given in Table 1. Our proposed ACU is the only processor offering simultaneous calculations with multiple (more than two) operands. Conventional analog calculators such as multipliers are usually very compact, but the available functions are specialized and limited. Simply reducing the bit-width of conventional binary processors hardly escapes from the trade-off between cost and performance. The LUT-based calculators (both binary and MVL) help to improve the speed but suffer from the scale explosion problem. Neural network (NN) regression indicates a potential to shrink the look-up table. However, due to the inherent properties of simple NN algorithms, the functional complexity and number of operands are limited. Our ACU can be considered a further effort that implements regression on-chip through advanced regression algorithms such as SVR. Complex functions of high dimensional vectors (demonstrated in this paper with nine dimensions, but extendable to more) are realized by a set of ACUs with 4960 transistors in total. Due to the compact circuit scale, a large number of ACUs can be integrated in parallel for applications such as image processing.
Discussions
As with most approximate computing hardware, the use of ACUs depends upon the application field. There is no unique golden answer on the boundary or threshold of calculation accuracy [17]. However, some surveys reported that the quality of service (QoS) is roughly linear in the computational error when the average error is less than 10%, especially in the domain of computer vision [18]. By trading off the implementation cost, users may adopt (or not) ACUs according to the expected QoS limitation.
The proposed two-stage architecture is suggested for increasing the number of dimensions in vector calculations. The mandatory constraint on its use lies in the nature of the target function: the elements in the vector calculation should be separable. Namely, the function should be representable by several independent components, which are implemented by the corresponding ACUs. Fortunately, most of the widely used functions in real-world applications satisfy this constraint, such as convolution, vector product, and summation. In terms of accuracy, the condition under which the two-stage ACU is beneficial depends on the number of dimensions. For calculations of low dimensional vectors, nine for instance, the proposed two-stage architecture achieves higher accuracy for all the test functions. For calculations of very high dimensional vectors, the two-stage architecture is beneficial for functions which can be expressed as a linear combination in the second mapping layer. Nowadays, most very high dimensional vector calculations such as image filters or convolution kernels (typically, from 9-D to 121-D) [16] meet this condition. The future challenge of ACUs lies in the development of a specific memory system to "store" the target functions in terms of analog values. An analog memory block, or an ordinary digital memory along with digital-to-analog converters, can be applied to build such a memory system with some overhead. Owing to their small size, a large number of ACUs are expected to be built in silicon, performing the same function in parallel. Then, the function-oriented memories are shared by all or many ACUs to amortize the overhead. In this sense, a reasonably high parallelism is suggested in real-world applications of ACUs.
Conclusion
In order to efficiently implement approximate calculations of vectors in general, a programmable analog calculation unit is developed in this work. From our early feasibility study, arbitrary functions can be approximately retrieved through the support vector regression algorithm implemented on-chip. However, the circuit scale and calculation error greatly increase when the number of operands increases. A two-stage architecture is proposed to separate target functions of multiple variables into several groups of regressions with fewer dimensions. The free combination of multiple ACUs is applied to realize the corresponding regressions. A current-to-voltage converter is also designed as the interface between two ACUs. The proof-of-concept processor for arbitrary calculations with nine operands is designed and simulated in a 0.18 µm standard CMOS technology. The entire processor consists of 4960 transistors. Several example functions including 3 × 3 filters and square-root functions are employed to verify the behavior of the proposed ACU. All the functions are retrieved in continuous time with an error below 8.7%.
