I. INTRODUCTION
Analog circuits are very appealing for signal processing with relatively low precision requirements, for example in artificial neural networks, because they may far exceed digital circuits of the same functionality in circuit density, speed, and energy efficiency [1, 2]. However, prior designs of individually adjustable memory cells for such analog circuits, for example "synaptic transistors" [3], which are based on the standard CMOS process, had relatively large cells (of area ~10³ F², where F is the minimum feature size), leading to larger time delays and power consumption [4, 5, 6].
An alternative way forward has been enabled by the progress of industrial flash memory technology, now featuring highly optimized floating-gate cells, which may be embedded into CMOS integrated circuits. For example, Fig. 1 shows the "supercell" of the advanced commercial 55 nm ESF3 NOR flash memory from SST Inc. [7]. However, the original ESF3 technology is not suitable for analog applications, because its arrays do not allow for precise individual tuning of the state (floating-gate charge) of each cell. In this paper, we report a redesign of the ESF3 cell arrays, which enables such tuning. The redesigned arrays were used to demonstrate a small vector-by-matrix multiplier operating in the low-power subthreshold mode, with gate coupling of array cells to the input (peripheral) cells. In order to reduce the temperature drift, pertinent to the subthreshold operation of the cells, we have implemented and characterized a differential version of the multiplier, which minimizes the output signal drift.
Fig. 1. TEM image of the cross-section of a "supercell" incorporating two floating-gate transistors with a common source (S) and erase gate (EG) [7].
II. ORIGINAL AND MODIFIED MEMORY CELL ARRAYS

A. Original Array and Cell Programming
The SST NOR flash memory is based on "supercells" with two floating-gate transistors that share the source (S) and the erase gate (EG) but are controlled by different word-line (WL) and coupling (CG) gates (see Fig. 1). In the original ESF3 memory arrays, the cells are connected as Fig. 2a shows, with six row lines per supercell, connecting transistor sources, erase gates, coupling gates, and word-line gates, while each column has only one ("bit") line connecting transistor drains (D).
In this array topology, each cell may be programmed individually, by hot-electron injection into its floating gate. For that, the voltage on the source line (SL in Fig. 2) of the cell's row is increased to 4.5 V (while those in other rows are kept at 0.5 V), with the proper column selected by lowering the bit line (BL) voltage to 0.5 V (while keeping all other bit line voltages above 2.25 V). This process works well for providing a proper digital state, with 1- or even 2-bit accuracy. However, it is insufficient for cell tuning with analog (say, 1%) precision. Unfortunately, the reverse process ("erasure"), using the Fowler-Nordheim tunneling of electrons from the floating gates to the erase gates, may be performed, in the original arrays, only in the whole row, selected by applying a high voltage of ~11.5 V to the corresponding erase gate line (with all other EG voltages kept at 0 V). So, these arrays do not allow for precise analog cell tuning, which unavoidably requires an iterative, back-and-forth (program-read-erase-read-program...) process, with run-time control of the result.
B. Array Modification and Cell Tuning
We have modified the ESF3 memory arrays as shown in Fig. 2b, by connecting the erase gates of all cells of one column with an additional line, while eliminating the row lines connecting these gates. (Note that this redesign is different from the one performed by our group earlier [8] with the 180 nm ESF1 technology, because of a different structure of its supercells.)
In the modified arrays, the analog hot-electron programming of each cell may be performed by applying 10 μs pulses of a fixed amplitude of 4.5 V to the source line of the corresponding row. In this process, the proper column is selected by applying a positive voltage of ~4 V between the erase-gate and bit lines, while keeping this voltage negative for all unselected columns [10]. Fig. 3a documents the inhibition of the unwanted programming process in a half-selected cell as the bit-line (i.e. drain) voltage is increased.
The opposite process of individual analog erasure via the Fowler-Nordheim tunneling is now also possible, by using the new column lines to apply high-amplitude (11.5 V), 0.5 ms pulses to the erase gates of the selected column. The proper row is selected by grounding the corresponding coupling gate line, while keeping a high voltage (+8 V) on these lines of unselected rows. As Fig. 3b shows, such a positive bias inhibits the Fowler-Nordheim tunneling in half-selected cells, due to a relatively high capacitance between the coupling gate and the floating gate of the same transistor [11].
Due to the line rerouting, the array area per cell has nearly tripled (cf. Figs. 2c and 2d). However, even with this increase the area is still as small as 0.33 μm²
C. Subthreshold Operation
The ability to tune the floating-gate cells of the modified arrays continuously is illustrated in Fig. 4, which shows the readout current as a function of the coupling gate voltage for a selected equidistant series of cell states. These semi-log plots have wide quasi-linear segments, corresponding to the nearly exponential behavior of the current in the subthreshold region. In the current range from 100 pA to 30 nA, the subthreshold slope factor n, defined by the well-known relation $I \propto \exp\{qV_{CG}/(n k_B T)\}$, varies only from 5 to 5.1 for all the 15 states shown in Fig. 4. This low variability of n enables the implementation of highly linear signal transfer in gate-coupled current mirrors using these cells [6].
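The slope factor defined above can be estimated from a measured curve by a straight-line fit of ln I against the coupling gate voltage over the quasi-linear range. The following is a minimal sketch of such a fit, not the procedure used for Fig. 4: the function name `slope_factor`, the 25 ˚C default temperature, and the synthetic data are assumptions made for illustration.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C

def slope_factor(v_cg, current, temp_k=298.15):
    """Estimate n from I ~ exp{q V_CG / (n k_B T)} by a least-squares
    line fit of ln(I) versus V_CG (hypothetical helper)."""
    y = [math.log(i) for i in current]
    m = len(v_cg)
    mean_v = sum(v_cg) / m
    mean_y = sum(y) / m
    sxy = sum((v - mean_v) * (yy - mean_y) for v, yy in zip(v_cg, y))
    sxx = sum((v - mean_v) ** 2 for v in v_cg)
    slope = sxy / sxx                 # d(ln I)/dV_CG, in 1/V
    return Q_E / (slope * K_B * temp_k)

# Synthetic check: build an ideal subthreshold curve with n = 5 and recover it.
n_true = 5.0
v_t = K_B * 298.15 / Q_E              # thermal voltage k_B T / q
volts = [0.5 + 0.05 * k for k in range(10)]
amps = [1e-10 * math.exp((v - 0.5) / (n_true * v_t)) for v in volts]
n_est = slope_factor(volts, amps)
```

On real data the fit window would have to be restricted to the subthreshold segment, since the curve bends away from the exponential law at higher currents.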
D. Analog State Retention and Noise
The ESF3 flash technology guarantees a 10-year digital-mode retention at temperatures up to 125 ˚C [7]. To explore its analog-mode retention, we programmed 7 memory cells to 7 different states, with currents from around 100 pA to 100 nA covering the whole subthreshold region, and then continuously monitored their output currents for a day at 85 ˚C, as shown in Fig. 5a. Each point on this panel is an average over 128 samples taken during 16 ms periods. Fig. 5b shows the relative r.m.s. variation of the current during the measurement period for the 7 states shown in Fig. 5a. For larger currents the variation is below 1%, increasing to ~4% only at the lower boundary of the range.
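The two data-reduction steps mentioned above (averaging raw samples into points, then taking the relative r.m.s. variation of the points) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the use of the population (rather than sample) variance normalized by the mean is an assumption, since the exact definition is not stated in the text.

```python
import math

def block_average(samples, block=128):
    """Average consecutive groups of `block` raw samples into one point.
    Trailing samples that do not fill a whole block are dropped."""
    return [sum(samples[i:i + block]) / block
            for i in range(0, len(samples) - block + 1, block)]

def relative_rms(points):
    """R.m.s. deviation of the points from their mean, divided by the mean
    (assumed definition, using the population variance)."""
    mean = sum(points) / len(points)
    var = sum((p - mean) ** 2 for p in points) / len(points)
    return math.sqrt(var) / mean

# Example: two averaged points that deviate by +/-1% from their mean.
example = relative_rms([0.99, 1.01])
```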
E. Temperature Dependence of the Readout Current
In order to fairly characterize the temperature dependence of the cell output current in the subthreshold region, we have programmed 8 cells to 8 different states, equally spread over the useful dynamic range. Then, in 3 different experiments, appropriate coupling gate voltages were applied to each cell, to make the readout currents of all cells equal to, sequentially, 1 nA, 10 nA and 100 nA at 25 ˚C. After that, the temperature was ramped up from 25 ˚C all the way to 85 ˚C, and the readout current of each cell was monitored. Fig. 6 shows the results of these 3 experiments. In accordance with our expectations (and the measured values of n), the currents increased significantly: by more than an order of magnitude for the lowest initial current.
Though in the gate-coupling scheme (see below) these changes are mostly compensated by similar changes in the input (peripheral) transistors, this fact still shows that the temperature sensitivity of the subthreshold current requires special attention (see Sec. IV below).
III. VECTOR-BY-MATRIX MULTIPLIER
To implement the vector-by-matrix multiplication, we have used the gate coupling of the tunable floating-gate cells of each row of the array with a similar "peripheral" cell, with the virtual-bias condition imposed (by external circuitry) on the output (column) wires [1, 6] (Fig. 7). In the subthreshold mode, the output currents $I_j$ are then linear in the input currents $I_i$,
$$I_j = \sum_i w_{ij} I_i, \tag{1}$$
with current-independent proportionality coefficients $w_{ij}$, which are determined by the differences of threshold voltages $V_{th}$ of the array cells and the peripheral transistors:
$$w_{ij} = \exp\left\{ \frac{q\,(V_{th,i} - V_{th,ij})}{n k_B T} \right\}. \tag{2}$$
In turn, each threshold voltage is determined by the analog state (physically, the floating gate charge) of the cell, so that each wij may be adjusted to the desirable value (typically, below 1).
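As an illustration of the relations between weights, threshold voltages, and currents, the sketch below computes a weight from a threshold-voltage difference and then forms the weighted current sums. It assumes the ideal subthreshold law with a common slope factor n for all cells, so that a peripheral cell and an array cell sharing a gate voltage have a current ratio exp{q(V_th,i - V_th,ij)/(n k_B T)}; the function names and the numerical values are hypothetical.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C

def weight(vth_periph, vth_cell, n=5.0, temp_k=298.15):
    """Current ratio of an array cell to its peripheral cell at the same
    gate voltage (assumed ideal exponential law, common n)."""
    return math.exp(Q_E * (vth_periph - vth_cell) / (n * K_B * temp_k))

def vmm(input_currents, weights):
    """Output currents I_j = sum_i w_ij * I_i.
    weights[i][j] couples input row i to output column j."""
    n_cols = len(weights[0])
    return [sum(row[j] * i_in for row, i_in in zip(weights, input_currents))
            for j in range(n_cols)]

# A 2x2 example with hand-picked weights (arbitrary current units).
outputs = vmm([2.0, 4.0], [[1.0, 0.5],
                           [0.25, 0.0]])
```

In this model a cell with a higher threshold voltage than its peripheral transistor has a weight below 1, and equal thresholds give a weight of exactly 1.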
Thus Kirchhoff's current law enables the implementation of the analog vector-by-matrix multiplication [4, 8]. Fig. 9 illustrates the analog tuning capability of the array. All 10×10 array cells have been tuned one by one by an automatic, feedback-controlled application of alternating programming pulses to their source electrodes and erasing pulses to their erase gates. After each tuning pulse, the external control circuitry read out the cell output current at standard bias conditions, and made a decision about the next pulse's destination and amplitude, until the readout current reached the target value with 5% precision [9]. Fig. 9a shows the results of 3 separate experiments of tuning all 100 cells of the array to different target values of the output current: 1 nA (red line), 100 nA (green line), and an exponential function of the cell number, within the range from 100 pA to 1 μA (yellow line). Fig. 9b shows the relative errors achieved in the last experiment. The data show that larger tuning errors (of the order of 10%) take place for smaller target currents, because of the relatively large intrinsic noise of the devices.
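The read-decide-pulse feedback loop described above can be sketched as follows. This is a toy illustration, not the tuning algorithm of Ref. [9]: the `SimCell` class is a hypothetical noiseless stand-in for a real cell, and the rule that sets the pulse strength to half the logarithmic error is an assumption chosen only to make the example converge.

```python
import math

class SimCell:
    """Toy stand-in for a flash cell: the state is ln(I) at the read bias.
    Real pulse responses are nonlinear and noisy."""
    def __init__(self, current):
        self.log_i = math.log(current)

    def read(self):
        return math.exp(self.log_i)

    def program(self, strength):   # adds electrons, lowers the current
        self.log_i -= strength

    def erase(self, strength):     # removes electrons, raises the current
        self.log_i += strength

def tune(cell, target, tol=0.05, max_pulses=200):
    """Apply program or erase pulses until |I - target| / target <= tol.
    Returns the number of pulses used."""
    for pulse in range(max_pulses):
        ratio = cell.read() / target
        if abs(ratio - 1.0) <= tol:
            return pulse
        # Hypothetical amplitude rule: step by half the log-error.
        step = 0.5 * abs(math.log(ratio))
        if ratio > 1.0:
            cell.program(step)
        else:
            cell.erase(step)
    raise RuntimeError("tuning did not converge")

cell = SimCell(1e-7)               # start at 100 nA
pulses = tune(cell, 1e-9)          # tune down to 1 nA within 5%
```

With read noise included, the loop would occasionally overshoot and switch between programming and erasing, which is why both operations must be available per cell.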
IV. TEMPERATURE INSENSITIVE OPERATION
According to Eq. (2), in the coupled-gate operation mode, much of the thermal dependence of the subthreshold current is compensated, but except in the special case wij = 1, the compensation is incomplete. Indeed, our measurements have confirmed, in agreement with this relation, that as the temperature is raised from 25 ˚C to 85 ˚C, a weight wij initially equal to 0.9 increases by ~10%.
However, there is a straightforward way to decrease the temperature sensitivity, at the cost of a two-fold increase of hardware. For that, one can subtract the output currents of two cells (say, those shown in Fig. 7), with their individual weights tuned to, respectively, (wb + w/2) and (wb - w/2). Here w is the desired net weight, and wb is the "bias weight", which may be optimized to suppress the temperature dependence of the new output current. A straightforward analysis of this scheme, using Eq. (2), shows that after such optimization, the temperature drift of the output may be reduced to less than 1% in the [25 ˚C, 85 ˚C] interval, for any weight 0 < wij < 1. Fig. 11 shows the results of our preliminary experiments with this mode, showing drifts not exceeding 2.7% in that temperature interval.
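The bias-weight optimization can be explored numerically with a short sketch. It rests on an idealized assumption that is not taken from the measurements: each weight follows exp{q ΔV_th/(n k_B T)} with temperature-independent ΔV_th and n, so that a weight equal to w0 at T0 becomes w0^(T0/T) at temperature T. The function names, the grid search, and the search range for wb are hypothetical choices, and real devices may drift differently.

```python
T0 = 298.15  # 25 degC in kelvin

def w_at(w0, temp_k):
    """Weight at temp_k for a weight equal to w0 at T0, under the
    idealized model w(T) = w0 ** (T0 / T)."""
    return w0 ** (T0 / temp_k)

def diff_weight(w, wb, temp_k):
    """Net weight of the differential pair tuned to wb + w/2 and wb - w/2."""
    return w_at(wb + w / 2, temp_k) - w_at(wb - w / 2, temp_k)

def max_drift(w, wb, temps):
    """Largest relative deviation of the net weight from w over temps."""
    return max(abs(diff_weight(w, wb, t) / w - 1.0) for t in temps)

def best_bias(w, temps, steps=2000):
    """Grid search over wb in (w/2, w/2 + 1] for the smallest drift."""
    grid = [w / 2 + (k + 1) / steps for k in range(steps)]
    return min(grid, key=lambda wb: max_drift(w, wb, temps))

temps = [T0 + 5.0 * k for k in range(13)]        # 25 degC to 85 degC
wb_opt = best_bias(0.1, temps)
drift = max_drift(0.1, wb_opt, temps)            # differential pair
single = abs(w_at(0.1, temps[-1]) / 0.1 - 1.0)   # one cell, for comparison
```

Comparing `drift` with `single` shows how much of the residual temperature dependence the subtraction removes within this model; the same search can be repeated for other target weights w.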
V. SUMMARY AND DISCUSSION
We have demonstrated a simple prototype of an extremely compact analog vector-by-matrix multiplier, with a-few-percent precision and temperature drift, based on redesigned arrays of the commercial ESF3 NOR flash memory. While we have not yet directly measured the multiplier's bandwidth and energy efficiency, because of our current experimental setup limitations, our estimates, based on experimentally measured parameters of the cells, show that these figures-of-merit should be several orders of magnitude better than those of state-of-the-art digital multipliers with similar precision.
