Abstract-This paper reports a 24×57 correlation filter system for object tracking applications. While digital interfacing of the input and output data enables a standard and flexible way of communicating with pre- and post-processing digital blocks, the multiply-accumulate (MAC) operations are performed in the analog domain to save power and area. The proposed system utilizes non-volatile floating-gate memories to store filter coefficients. The chip was fabricated in a 0.13-µm CMOS process and occupies 3.23 mm² of silicon area. The system dissipates 388.4 µW of power at a throughput of 11.3 kVec/s, achieving an energy efficiency of 25.2 pJ/MAC. Experimental results for a custom filter designed to detect vehicles are presented.
I. INTRODUCTION
ADVANCED correlation filters (CFs) have been employed in a wide variety of image processing and pattern recognition applications such as automatic target recognition (ATR) and biometric recognition [1]. Among those, object recognition and tracking [2]-[4] have received more attention recently. Advances in designing robust but simple CFs that discriminate better between object and background have paved the way for implementing efficient object tracking systems using fewer computational resources.
Although digital realization of such computational systems provides a fast, flexible, and precise solution, it consumes extensive silicon area and power. In fact, it has been shown that computational tasks that require low to moderate signal-to-noise ratios (SNRs) are more efficiently realized in analog than in digital in terms of area and power consumption [5]. One of the most important problems with analog computing systems is noise and offset accumulation, which significantly degrades accuracy. A common way to compensate for the offset in these systems is to manually calibrate the biasing current of the analog memories [6], [7], or to measure the offset for each output, store the offsets in a separate array of analog memories, and then subtract them from the output signals [8]. The former requires one analog memory per array element, and the latter can only compensate for one of the two inputs by fixing the other. In both methods, the remaining offset depends on the programming accuracy.
Apart from the noise- and offset-related issues, fully parallel implementation of such operations requires input/output interface circuitry capable of supplying/acquiring a large number of analog data simultaneously to/from the computational block. Although implementing such interfaces is a challenging task for any system, it becomes more of an issue when dealing with noise- and mismatch-sensitive analog signals. Therefore, it is key to develop configurable digital interfaces that scale easily with the number of inputs and communicate efficiently with other pre- and post-processing blocks.
Several architectures implementing such hybrid/mixed-signal approaches have been reported in the literature. In [9], a bit-serial input/bit-parallel output architecture has been proposed that demands one flash ADC per output and needs further off-chip processing. Moreover, it utilizes DRAM memories, which require constant refreshing. The architecture proposed in [10] employs a large number of SRAM memories and other supporting digital logic circuits, which add excessive power overhead to the system.
In this paper, we present an energy-efficient digital I/O interface solution for an analog correlation operator for linear filtering, which maintains the power and area efficiency of the entire system and scales easily with the number of inputs. Furthermore, the proposed system utilizes non-volatile analog floating-gate memories (FGMs) as storage devices, eliminating the need for DRAM/SRAM memories. The proposed architecture also incorporates techniques to reduce the effects of analog circuit imperfections such as offset and noise.
M. Judy and P. Liu are with The University of Tennessee, Knoxville, TN 37996 USA (e-mail: mjudy@vols.utk.edu; pliu7@vols.utk.edu).
N. C. Poore was with The University of Tennessee, Knoxville, TN 37996 USA. He is now with the Poore Insurance Group LLC, Shenandoah, TX 77384 USA.
T. Yang was with The University of Tennessee, Knoxville, TN 37996 USA. He is now with Analog Devices, Raleigh, NC 77386 USA (e-mail: tyang4@vols.utk.edu).
J. Holleman was with The University of Tennessee, Knoxville, TN 37996 USA. He is now with the University of North Carolina at Charlotte, Charlotte, NC 28223 USA (e-mail: jhollem3@uncc.edu).
C. Britton, D. S. Bolme, and A. K. Mikkilineni are with the Oak Ridge National Laboratory, Oak Ridge, TN 37830 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2018.2819962
1549-8328 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. SYSTEM DESCRIPTION
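The correlation of two vectors w(m) and x(k) is defined as:
$$y(n) = \sum_{m=-M/2}^{M/2} w(m)\, x(n+m) \tag{1}$$
In this equation w(m) is the filter coefficient vector, where −M/2 ≤ m ≤ M/2, x(k) is the input vector, where −M/2 ≤ k ≤ N − 1 + M/2,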
and y(n) is the output vector, where 0 ≤ n ≤ N − 1. It should be noted that in an array with M coefficients and N outputs, the input vector has M − 1 more elements than the output vector. Fig. 1 illustrates a fully parallel implementation of an (M+1)×N correlation filter. M is chosen to be an even number in (1) and Fig. 1 to simplify the illustration, but it can be an odd number as well. In this paper, an M×N prototype CF with M = 24, N = 57, and 80 inputs is presented.
The proposed architecture for a digitally interfaced, fully parallel analog correlation filter system (CFS) is shown in Fig. 2. The main signal processing task, i.e., the correlation function, is implemented in the analog domain using an analog multiplier array. The analog multiplier is realized with a few transistors, which not only results in an energy- and area-efficient system but also significantly reduces the interconnection complexity of the design. Because the signals are current-mode, the summation in (1) is performed by simply connecting the outputs of the multipliers in each column together, according to Kirchhoff's current law (KCL). A correlated double sampling (CDS) technique has been implemented to cancel offset and to reduce the low-frequency noise at the outputs of the analog array.
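As a rough illustration of the array dimensions quoted above, the following Python sketch computes every fully overlapping correlation output for 80 inputs and 24 coefficients, which yields 57 outputs. The function name `correlate_valid`, the zero-based coefficient index (m = 0, ..., M − 1), and the test data are assumptions made for this example; they are not taken from the chip design.

```python
def correlate_valid(x, w):
    """Return y[n] = sum_m w[m] * x[n + m] for every fully overlapping shift."""
    n_out = len(x) - len(w) + 1
    if n_out < 1:
        raise ValueError("input must be at least as long as the filter")
    return [sum(w[m] * x[n + m] for m in range(len(w))) for n in range(n_out)]


M, N = 24, 57  # prototype array size reported in the paper
# Arbitrary illustrative data: M + N - 1 = 80 input samples, M coefficients.
x = [float(k % 7) for k in range(M + N - 1)]
w = [1.0 if m % 2 == 0 else -1.0 for m in range(M)]

y = correlate_valid(x, w)
print(len(x), len(w), len(y))
```

In the chip, each of the N outputs corresponds to one column of multipliers whose currents are summed on a shared wire; the inner `sum` plays that role here.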
This architecture takes advantage of a digital scheme for front- and back-end interfacing, which provides a high-speed, noise- and offset-tolerant, easily scalable, and standard way of communicating between the analog signal processing block and digital processors without undermining the energy efficiency of the system. The time-domain multiplexing (TDM) approach was adopted because it works well with both current- and voltage-mode signals, and allows power- and area-hungry blocks such as analog buffers and data converters to be shared between several inputs and outputs.
The front-end interface converts the digital pixel vector of the input image to an analog pixel vector using current-steering digital-to-analog converters (DACs) and delivers it to the multiplier array. The back-end interface converts the analog output vector back to digital. The filter coefficient vector is stored in an array of FGMs. The FGMs use the thick-oxide I/O devices available in standard CMOS processes. They do not require high-voltage switches or charge-pump circuits for programming, and they are capable of holding their values for days without refreshing. The FGM outputs are shared across N multipliers in each row.
A unique aspect of the proposed system is the direct interfacing of the FGM and the DAC outputs to the multipliers. As will be discussed in detail in Section IV, the multiplier array is constructed of Gilbert multipliers with the first multiplication factor applied to the voltage input and the second multiplication factor applied to the current input. Accordingly, the FGM output, which is a voltage signal, is used as the first multiplication factor, and the DAC output, which is a current signal, is used as the second. Eliminating the need for extra interfaces or converters between the memory elements and the multipliers, as well as between the DACs and the multipliers, has significantly increased the energy efficiency and decreased the circuit complexity of the system. The building blocks of the system are discussed in detail in the following sections.
III. THE FRONT-END INTERFACE
The front-end interface comprises five DACs followed by eighty sample-and-hold (SH) circuits. Every 16 channels share one DAC using a TDM scheme. It should be noted that an 8-bit address space is chosen for input data to account for future growth. The lower 4 bits choose the DAC and the upper 4 bits select the SH channel. Fig. 3 shows the timing diagram of the front-end interface.
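The address split described above can be sketched in Python as follows. The function names `decode_input_address` and `encode_input_address` and the choice to reject DAC indices of 5 and above are assumptions for illustration; the text only states which bits select the DAC and which select the SH channel.

```python
NUM_DACS = 5            # five DACs in the front-end
CHANNELS_PER_DAC = 16   # sixteen sample-and-hold channels share each DAC


def decode_input_address(addr):
    """Split an 8-bit input address into (dac, channel).

    Lower 4 bits choose the DAC, upper 4 bits select the SH channel.
    """
    if not 0 <= addr <= 0xFF:
        raise ValueError("address must fit in 8 bits")
    dac = addr & 0x0F
    channel = (addr >> 4) & 0x0F
    if dac >= NUM_DACS:
        # Assumption: DAC codes 5..15 are spare address space for future growth.
        raise ValueError("DAC index not populated in this prototype")
    return dac, channel


def encode_input_address(dac, channel):
    """Inverse of decode_input_address."""
    if not (0 <= dac < NUM_DACS and 0 <= channel < CHANNELS_PER_DAC):
        raise ValueError("dac or channel out of range")
    return (channel << 4) | dac


# Enumerate every populated address: 5 DACs x 16 channels = 80 inputs.
valid = [encode_input_address(d, c)
         for d in range(NUM_DACS) for c in range(CHANNELS_PER_DAC)]
print(len(valid))
```

Under this layout only 80 of the 256 possible addresses are used, which is one way to read the remark that the 8-bit address space leaves room for growth.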
A. Digital-to-Analog Converter
An 8-bit current-steering DAC was designed to convert the digital input signals to analog signals for the analog computation block. As shown in Fig. 4a, the DAC uses a segmented topology: the 4 MSBs are thermometer coded, and the 4 LSBs are binary weighted. The segmentation helps to reduce differential non-linearity (DNL) and integral non-linearity (INL). An operational transconductance amplifier (OTA) is used in a cascode current sink configuration to increase the output resistance. A detailed description of this design can be found in [11].
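The 4+4 segmentation can be illustrated with a small behavioural sketch in Python. It models ideal, perfectly matched current sources only; the function names and the representation of sources as lists of 0/1 switches are assumptions for this example and say nothing about the actual circuit in [11].

```python
def segment_code(code):
    """Split an 8-bit code into 15 thermometer switches and 4 binary switches."""
    if not 0 <= code <= 255:
        raise ValueError("code must be an 8-bit value")
    msb = code >> 4        # upper 4 bits: how many 16-LSB unit sources are on
    lsb = code & 0x0F      # lower 4 bits: binary-weighted sources (1, 2, 4, 8)
    thermometer = [1 if i < msb else 0 for i in range(15)]
    binary = [(lsb >> b) & 1 for b in range(4)]
    return thermometer, binary


def dac_output_units(code):
    """Ideal output current in LSB units for the given code."""
    thermometer, binary = segment_code(code)
    return 16 * sum(thermometer) + sum(bit << b for b, bit in enumerate(binary))


# At the mid-scale transition 0x7F -> 0x80 a fully binary DAC would swap
# every source; here exactly one extra thermometer source turns on while
# the four binary sources turn off.
before, after = segment_code(0x7F), segment_code(0x80)
print(sum(before[0]), sum(after[0]))
```

Because each thermometer step adds one more identical unit source instead of exchanging one large source for many small ones, mismatch has less effect at the major code transitions, which is the DNL benefit mentioned above.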
B. Current-Mode Demultiplexer/SH

IV. ANALOG MULTIPLIER ARRAY
The four-quadrant Gilbert multiplier circuit is shown in Fig. 5a. The multiplier is formed with NMOS transistors M_1-M_4 operating in the moderate inversion region. The linear region of the multiplier was extended using a voltage-controlled degeneration technique [12]. Using the equation provided in [13], the differential transconductance can be written as:
where I_in is the input current, n is the sub-threshold slope factor, U_T is the thermal voltage, and L is the ratio of the transconductance parameters of M_i and M_{i,a}, i.e., β_i/β_{i,a}, i = 1, 2, 3, 4. The linearity was found to be maximum for L ≈ 2.5. Transistors M_i were chosen to be triple-well devices to eliminate the body effect and hence to improve multiplier linearity with respect to the current input. Moreover, isolation from the substrate improves the noise performance. It should be noted that achieving an 8-bit dynamic range and tolerable mismatch-related errors comes at the expense of using long-channel devices, with the result that each multiplier cell occupies an area of 30 μm × 40 μm. Nevertheless, this design is markedly smaller than a digital counterpart.
A. Power and Speed Performance
The input stage shown in Fig. 5b dominates the frequency response of the CFS. From a small-signal analysis of the self-biased cascode current mirror, the dominant pole is:
where g_m1 is the transconductance of the transistor M_1 and
and j is the number of multipliers connected to the input node. The worst case is when the input node is connected to multipliers in all of the rows (j = M). By substituting the sub-threshold equation for g_m1, the inverse time constant can be written as:
This equation shows that the inverse time constant linearly scales with the input current level and the number of rows in the array. Fig. 5c plots the measured inverse time constant for a given input current level. Total power consumption of the N × M multiplier array is:
Assuming a settling time of 5τ for the array, the power-delay product is then:
This equation shows that the power-delay product of the multiplier array is a linear function of N and a quadratic function of M. It is also independent of the input current, suggesting that operating at higher speed by increasing the input current level does not reduce the energy efficiency. However, increasing the input current level increases the non-linearity because the transistors start moving out of the sub-threshold region.
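The scaling argument can be checked numerically with a deliberately simplified model. Everything in the sketch below is an assumption made for illustration and should not be read as the paper's own expressions: the sub-threshold transconductance is taken as g_m = I_in/(n U_T), the input-node capacitance is taken as M times a per-multiplier capacitance C_i, each multiplier is assumed to draw I_in from a supply V_DD, and the value of V_DD is a placeholder. The values n = 1.42 and C_i = 323 fF are the ones quoted in Section IV-B.

```python
import math

N_SLOPE = 1.42     # sub-threshold slope factor (value quoted in the paper)
U_T = 0.0257       # thermal voltage in volts near room temperature
C_I = 323e-15      # per-multiplier input capacitance in farads (quoted value)
V_DD = 1.2         # placeholder supply voltage, NOT taken from the paper


def time_constant(i_in, m_rows):
    """Assumed model: tau = (M * C_i) / g_m with g_m = I_in / (n * U_T)."""
    g_m = i_in / (N_SLOPE * U_T)
    return m_rows * C_I / g_m


def array_power(i_in, n_cols, m_rows):
    """Assumed model: every one of the N x M multipliers draws I_in from V_DD."""
    return n_cols * m_rows * i_in * V_DD


def power_delay_product(i_in, n_cols, m_rows):
    """Power times a 5-tau settling time, as in the text."""
    return array_power(i_in, n_cols, m_rows) * 5 * time_constant(i_in, m_rows)


base = power_delay_product(10e-9, 57, 24)
print(math.isclose(base, power_delay_product(100e-9, 57, 24)))  # I_in cancels
print(power_delay_product(10e-9, 114, 24) / base)               # doubling N
print(power_delay_product(10e-9, 57, 48) / base)                # doubling M
```

In this model the input current appears once in the power and once in the denominator of the settling time, so it cancels in the product, while M enters both through the number of cells and through the loaded capacitance.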
B. Noise Performance
In the multiplier circuit shown in Fig. 5a, shot noise is the dominant noise source due to the large size of the devices. The noise spectral density of transistors in saturation can be calculated from the following interpolation equation [14]:
where I_s is the saturation current and
Transistors M_i (i = 5, 6, 7, 8) operate in the sub-threshold saturation region, where x = 0, and therefore their noise spectral density can be approximated by:
It can be shown that the total output noise is given by (see the Appendix):
Substituting the equivalent noise bandwidth (ENB) of 1/4τ in the above equation we obtain:
Thus, the RMS signal-to-noise ratio (SNR) can be written as:
where k is the Boltzmann constant and T is the absolute temperature. Based on this equation, increasing the input capacitance leads to a higher SNR, which reflects the well-known trade-off between bandwidth and SNR. From (5) and Fig. 5c, C_i is estimated to be 323 fF. With V_id = 0.1 V, kT = 4.11 × 10^−21 J, and n = 1.42 from simulation, the expected SNR is 63.7 dB.
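The equivalent noise bandwidth of 1/4τ used above is the standard result for a first-order low-pass response. As a brief reminder, and assuming the array behaves as a single-pole system with time constant τ (the symbol H(f) is introduced here only for this derivation), the noise bandwidth follows from integrating the squared magnitude response:

```latex
\begin{aligned}
|H(f)|^{2} &= \frac{1}{1 + (2\pi f \tau)^{2}}, \\
\mathrm{ENB} &= \int_{0}^{\infty} |H(f)|^{2}\, df
  = \frac{1}{2\pi\tau} \Big[ \arctan(u) \Big]_{0}^{\infty}
  = \frac{1}{2\pi\tau} \cdot \frac{\pi}{2}
  = \frac{1}{4\tau},
\end{aligned}
```

where the substitution u = 2πfτ has been used. A larger input capacitance increases τ, narrows this bandwidth, and therefore lowers the integrated noise, which is the bandwidth versus SNR trade-off noted in the text.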
V. NONVOLATILE FLOATING-GATE MEMORIES
An array of floating-gate (FG) memories is employed to store the analog filter coefficients in a differential mode. The schematic of the FG analog memory cell is shown in Fig. 6 [15]. The gates of M_1, M_2, and M_3 and the top plate of the capacitor form the FG. The stored charge on the FG is modified by the injection process through M_1 and the tunneling process through the transistor M_2. Tunneling removes electrons from the FG node while injection adds electrons. Both of these processes change the amount of 
In the tunneling mode, VTUN is connected to 7 V, and VDDT is switched from 3 V to 1 V to reduce the FG voltage and increase the gate oxide voltage, V_ox. In the injection mode, VDDI is switched to 3 V from GND, VTUN is switched to GND, and VDDT is switched back to 3 V to prevent tunneling. The amount of charge added to or removed from the FG is controlled by the pulse width of the VDDI and VTUN signals, respectively. In the read mode, VTUN and VDDI are switched to 0 and VDDT is switched to 3 V to ensure that no tunneling or injection occurs. Upon activation of the 'read' signal, the output of the selected cell is connected to a pad and read by off-chip read-out circuitry. Because the programming process utilizes feedback based on V_o, non-linearities, finite-gain effects, etc. are accounted for. Table I summarizes the FGM operation modes and the control signals. In general, the programming time depends on the number of floating-gate memories and the target values. For this prototype, it takes about one minute on average to program all the floating gates. The RMS error between target and actual values for all of the floating-gate memories was less than 1 mV. This was calculated based on the errors measured at the V_o node. According to a retention test, no measurable leakage was detected after more than five days [16].
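The feedback-based programming described above can be pictured as a measure-and-pulse loop. The Python sketch below is a toy model under several assumptions that are not in the paper: the class `ToyFloatingGateCell`, its linear response to pulse width, the rate and efficiency values, and the choice that tunneling raises and injection lowers the read-out voltage are all hypothetical. Only the general idea (read V_o, compare with the target, apply a tunneling or injection pulse whose width sets the charge change, repeat) follows the text.

```python
class ToyFloatingGateCell:
    """Hypothetical behavioural stand-in for one floating-gate memory cell."""

    def __init__(self, v_out=1.5, rate=0.05, efficiency=0.7):
        self.v_out = v_out            # read-out voltage in volts
        self.rate = rate              # nominal volts per millisecond of pulse
        self.efficiency = efficiency  # unknown-to-the-controller gain error

    def read(self):
        return self.v_out

    def tunnel(self, width_ms):
        # Assumed direction: removing electrons raises the read-out voltage.
        self.v_out += self.efficiency * self.rate * width_ms

    def inject(self, width_ms):
        # Assumed direction: adding electrons lowers the read-out voltage.
        self.v_out -= self.efficiency * self.rate * width_ms


def program_cell(cell, target, tol=1e-3, max_pulses=100):
    """Pulse the cell until its read-out is within tol of target.

    Returns the number of pulses applied.
    """
    for pulses in range(max_pulses):
        error = target - cell.read()
        if abs(error) < tol:
            return pulses
        width_ms = abs(error) / cell.rate   # pulse width from nominal rate
        if error > 0:
            cell.tunnel(width_ms)
        else:
            cell.inject(width_ms)
    raise RuntimeError("cell did not converge")


cell = ToyFloatingGateCell(v_out=1.5)
pulses_used = program_cell(cell, target=0.8)
print(pulses_used, abs(cell.read() - 0.8) < 1e-3)
```

Because each iteration re-measures the output, the loop still converges even though the controller's nominal rate differs from the cell's actual response, which mirrors the remark that feedback on V_o absorbs non-linearities and finite-gain effects.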
VI. THE BACK-END INTERFACE
An array of 57 current mirrors performs differential-to-single-ended conversion. The difference currents are then integrated on capacitors. The analog output voltages are multiplexed into four unity-gain buffers driving four 8-bit SAR ADCs. The timing diagram of Fig. 7 illustrates the sequence in which the ADCs perform the conversion and output digital data.
A. Current Mirrors and I-V Converters
Fig. 8 shows the schematic of the current mirror followed by the I-V converter circuit. An OTA keeps the voltage on the I_op node at V_REF to improve the accuracy. Using (1) and (2), the output voltage of the integrator can be written as:
where C_int is the integration capacitor and T_int is the integration time. The CDS technique was implemented on the CFS chip to cancel offset at the array outputs. The CDS is performed in two phases. In the first phase, the off-chip processor writes the reference input, then updates the analog multipliers and integrates the output current. In the second phase, the processor repeats the same process, this time for the data input. The data input is subtracted from the reference input by switching the plates of C_int between the two phases.
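The effect of the two-phase CDS can be sketched with an idealized Python model. It assumes a constant current integrated on a capacitor (V = I·T_int/C_int) and an offset current that is identical in both phases; the function names and the example values of T_int and C_int are placeholders, not figures from the paper. The sign convention follows the text: the data result is subtracted from the reference result.

```python
import math


def integrate(current, t_int, c_int):
    """Ideal integrator: voltage developed by a constant current."""
    return current * t_int / c_int


def cds_output(i_ref, i_data, i_offset, t_int, c_int):
    """Reference-phase voltage minus data-phase voltage.

    i_offset models an offset current that is the same in both phases.
    """
    v_ref = integrate(i_ref + i_offset, t_int, c_int)
    v_data = integrate(i_data + i_offset, t_int, c_int)
    return v_ref - v_data


T_INT = 1e-6    # placeholder integration time in seconds
C_INT = 1e-12   # placeholder integration capacitance in farads

with_offset = cds_output(0.0, 100e-9, 53.3e-9, T_INT, C_INT)
without_offset = cds_output(0.0, 100e-9, 0.0, T_INT, C_INT)
print(math.isclose(with_offset, without_offset))
```

Any error that is the same in both phases drops out of the difference. Errors that change between the two phases, such as wideband noise, do not cancel in this way.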
B. Analog-to-Digital Converter
A conventional 8-bit SAR ADC was designed to convert the analog output vector to a digital output vector. The 6.5 fF unit capacitors were realized using VNCAP devices and were arranged as an array in a common-centroid configuration to decrease mismatch effects. A single-cycle charge pump circuit [17] was used to boost the gate voltage of the reference switch, which made it possible to use the supply voltage as the reference voltage. The comparator circuit comprises a regenerative circuit followed by a differential-to-single-ended amplifier stage.
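For readers less familiar with successive approximation, the conversion can be sketched behaviourally in Python. This is an idealized binary search with a perfect comparator and DAC; it does not model the capacitor array, the boosted reference switch, or the comparator described above, and the function name `sar_convert` is an assumption for this example.

```python
def sar_convert(v_in, v_ref, bits=8):
    """Ideal successive-approximation conversion of v_in in [0, v_ref)."""
    code = 0
    for bit in reversed(range(bits)):      # test the MSB first
        trial = code | (1 << bit)
        # Keep the bit if the input is at or above the trial DAC level.
        if v_in >= trial * v_ref / (1 << bits):
            code = trial
    return code


for v in (0.0, 0.25, 0.5, 0.999):
    print(v, sar_convert(v, 1.0))
```

Each conversion takes one comparison per bit, so an 8-bit result needs eight comparator decisions regardless of the input value.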
VII. EXPERIMENTAL RESULTS
A prototype 24×57 CFS was designed and fabricated in a 0.13 μm CMOS process. The chip area is 1.7 mm × 1.9 mm. Fig. 9 depicts the annotated chip micrograph. The CFS chip was evaluated using a custom test board interfaced to a PC via an FPGA board. The FPGA board controls the write, update, integrate, and read timings and communicates with the PC through a USB-UART interface. Fig. 10 shows the input/output characteristics for a column of multipliers. The output is taken after the charge integrator, which converts the current output to a voltage. The measured worst-case INL and DNL for the inputs were +4.7/−5.2 and +1.8/−1 8-bit LSBs, respectively, whereas those of the weights were +1/−1.2 and +0.26/−0.27 6-bit LSBs, respectively.
The input-referred offset, current noise, and SNR with and without the CDS offset cancellation technique are depicted in Fig. 11. Since the measured values are the sum of several independent random processes, a Gaussian distribution is assumed here for calculating the mean and the standard deviation. If the number of samples (i.e., outputs) were sufficiently large, the Gaussian distribution would be more evident (the Kolmogorov-Smirnov test also supports the Gaussian assumption). Implementing the CDS technique reduced the offset from 53.3 nA to 4 nA (a 13.2× reduction). The SNR without the CDS is 53.9 dB with a standard deviation of 4.62 dB. The CDS also reduced low-frequency noise (by a factor of 2.7) and improved the average SNR from 53.9 dB to 61.6 dB.
Table II summarizes the specifications of the system. Operating at a 6 MHz write speed and a 2.4 MHz read speed, the entire system achieves an energy efficiency of 25.2 pJ/MAC at a throughput of 11.3 kVec/s. The multiplier array and the unity-gain buffers were the most power-hungry blocks in the system, and therefore they were turned off during the standby intervals. As a result, as shown in Fig. 12, 48 percent of the power was saved.
Table III compares key specifications of this work with similar systems. The 8-bit CFS chip presented in this paper uses non-volatile floating-gate memories to store filter coefficients and achieves better total energy efficiency than other reported systems with similar functionality. It should be noted that the analog array of [6] implements single-quadrant multiplication, while this system implements four-quadrant multiplication, which requires extra transistors and consumes more power.
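The headline figures can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that one output vector corresponds to a full pass through the 24×57 array, i.e. 1368 MAC operations per vector; that interpretation is an assumption, although it reproduces the reported efficiency to within the rounding of the quoted power and throughput.

```python
M_ROWS, N_COLS = 24, 57
POWER_W = 388.4e-6           # reported total power
THROUGHPUT_VEC_S = 11.3e3    # reported throughput in vectors per second

# Assumption: every vector exercises all M x N multipliers once.
macs_per_vector = M_ROWS * N_COLS
energy_per_mac_pj = POWER_W / (THROUGHPUT_VEC_S * macs_per_vector) * 1e12

chip_area_mm2 = 1.7 * 1.9            # die dimensions quoted in this section
offset_reduction = 53.3 / 4.0        # offset without CDS / offset with CDS
snr_gain_db = 61.6 - 53.9            # average SNR improvement from CDS

print(macs_per_vector)
print(round(energy_per_mac_pj, 1))   # about 25.1, versus the reported 25.2
print(round(chip_area_mm2, 2))
print(round(offset_reduction, 1))
print(round(snr_gain_db, 1))
```

The small differences from the quoted 25.2 pJ/MAC and 13.2× are what one would expect if the published power, throughput, and residual offset are themselves rounded values.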
To demonstrate the effectiveness of the designed CFS in object tracking, a custom filter was designed to detect vehicles based on the MOSSE algorithm [2]. Fig. 13 shows the designed filter kernel. This two-dimensional filter is then decomposed into a vertical and a horizontal filter (Fig. 13b).
The test image went through a few preprocessing steps to reduce shadow and intense lighting effects [3]. The preprocessed image rows were scanned into the chip and correlated with the vertical filter to form a temporary output. Afterward, the columns of the temporary output were scanned into the chip and correlated with the horizontal filter to produce the final output, which is expected to exhibit a sharp peak when there is an authentic match between the filter and the target in the input image. Fig. 13d depicts the expected output from an ideal digital filter implemented in MATLAB for a test image from the DARPA VIVID dataset [19], and Fig. 13e shows the measured results. The final outputs of the analog filter match closely with the simulated digital filter, indicating negligible degradation due to noise, mismatch, etc. The analog filtered image shows a strong peak at the target location, which clearly discriminates the target from the background. It is worth noting that this prototype includes only one set of floating-gate memories; however, future versions of the chip will include multiple kernels that can be selected at run-time, so that both the horizontal and vertical filters can be executed with no reprogramming. A similar solution could be implemented using two of the current chips, with one programmed for the horizontal filter and one for the vertical filter.
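The two-pass procedure can be illustrated with a short Python sketch. It assumes the two-dimensional kernel is exactly separable, i.e. equal to the outer product of two one-dimensional filters; the paper does not state how the MOSSE kernel was decomposed, so this is an idealization. The function names and the small integer test image are hypothetical.

```python
def correlate_valid(signal, kernel):
    """1-D correlation over every fully overlapping shift."""
    n_out = len(signal) - len(kernel) + 1
    return [sum(kernel[m] * signal[n + m] for m in range(len(kernel)))
            for n in range(n_out)]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def two_pass_correlate(image, k_rows, k_cols):
    """Pass 1 filters each row with k_rows; pass 2 filters each column of
    the temporary result with k_cols."""
    temp = [correlate_valid(row, k_rows) for row in image]
    filtered_cols = [correlate_valid(col, k_cols) for col in transpose(temp)]
    return transpose(filtered_cols)


def direct_correlate_2d(image, kernel):
    """Reference 2-D correlation used only to check the two-pass result."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]


image = [[(3 * r + 5 * c) % 11 for c in range(6)] for r in range(5)]
k_rows, k_cols = [1, 2, 1], [1, -1]
kernel_2d = [[a * b for b in k_rows] for a in k_cols]   # outer product

print(two_pass_correlate(image, k_rows, k_cols) ==
      direct_correlate_2d(image, kernel_2d))
```

For a separable kernel the two one-dimensional passes give the same result as the full two-dimensional correlation while needing only one-dimensional filter hardware, which is why a single 24-tap array can serve both passes once it is reprogrammed between them.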
VIII. CONCLUSION
A 24×57 correlation filter system has been presented. The proposed system performs MAC operations in the analog domain and provides a standard and scalable digital I/O interface for data transfer. An array of non-volatile floating-gate memories is used to store filter coefficients. The CFS chip dissipates 388.4 μW of power at a throughput of 11.3 kVec/s, achieving an energy efficiency of 25.2 pJ/MAC. The fabricated chip occupies 3.23 mm² of silicon area in a 0.13 μm CMOS process. A custom filter based on the MOSSE algorithm to detect vehicles was programmed into the CFS chip. The result of applying the filter to an image from the DARPA VIVID dataset was presented, showing a strong peak at the location of the target.
APPENDIX
DETAILED MULTIPLIER NOISE ANALYSIS
If we consider the current noise of the tail transistor M_5, as shown in Fig. 14a, the current gain from node A to the positive and negative outputs can be written as:
Therefore, the noise contribution of M 5 to the output is:
As depicted in Fig. 14b, the current noise of transistor M_1 can be split into two correlated noise sources with the same values, one at the drain of M_1 and the other at node A. Thus, the noise contribution of M_1 to the output is:
Finally, the current noise of transistor M_{1,a} can be split into two noise sources at node A and node B, as shown in Fig. 14c. Hence, the noise contribution of M_{1,a} to the output is:
Consequently, the noise contribution of all the transistors to the output can be calculated as:

He was involved in electronics for PHENIX at RHIC, heavy-ion experiments at CERN, detector electronics for the Spallation Neutron Source, solid state neutron detectors, and the first 2-D neutron pixel detector. His career focus has been on custom integrated circuit design for a variety of disciplines. He was involved in the design of fast, low-power mixed-signal CMOS electronics for the fieldable nuclear materials identification system for the U.S. Department of Energy, the Time-Projection-Chamber upgrade for the ALICE collaboration at CERN, radiation-hardened analog integrated circuits for space exploration and reactor accident cleanup applications, and direct-sequence, spread-spectrum radio communications for remote sensor applications. He is currently managing the hardware development for the Authenticatable Container Tracking System for the DOE EM Packaging Certification Program. He has over 100 publications and 21 issued patents in the field of electronics and integrated circuit design. He was a recipient of the 2014 R&D100 Award. He is an Associate Editor for the IEEE TRANSACTIONS ON NUCLEAR SCIENCE.
