Abstract: High power consumption has significantly increased the cooling cost in high-performance computation stations and limited the operation time in portable systems powered by batteries. Traditional power reduction mechanisms have limited traction in the post-Dennard scaling landscape. Emerging research on new computation devices and associated architectures has shown three trends with the potential to greatly mitigate current power limitations. The first is to employ steep-slope transistors to enable fundamentally more efficient operation at reduced supply voltage in conventional Boolean logic, reducing dynamic power. The second is to employ brain-inspired computation paradigms, directly embodying computation mechanisms inspired by the brain, which have shown potential in extremely efficient, if approximate, processing with silicon-neuron networks. The third is "let physics do the computation", which focuses on using the intrinsic operation mechanism of devices (such as coupled oscillators) to do the approximate computation, instead of building complex circuits to carry out the same function. This paper first describes these three trends, and then proposes the use of the hybrid-phase-transition-FET (Hyper-FET), a device that can be configured as a steep-slope transistor, a spiking neuron cell, or an oscillator, as the device of choice for carrying these three trends forward. We discuss how a single class of device can be configured for these multiple use cases, and provide an in-depth examination and analysis of a case study of building coupled-oscillator systems using Hyper-FETs for image processing. Performance benchmarking highlights the potential for significantly higher energy efficiency than dedicated CMOS accelerators at the same technology node.

INTRODUCTION
For the last few decades, power has been a major constraint for very-large-scale integrated circuits. In the past, increases in chip functionality were paid for by lowering the supply voltage and reducing transistor capacitance through the scaling of CMOS technologies. However, with the end of Dennard scaling [4], further reduction of the supply voltage to reduce the power in Boolean logic has become challenging because of increasing leakage power with the ≥ 60 mV/decade subthreshold slope (SS) of CMOS devices. Consequences of this include high cooling cost in high-performance computation nodes and limited operation time in portable battery-powered systems. Furthermore, the resultant shift in the economics of the virtuous cycle of investment in future process nodes holds back further reduction of cost per function. In response to these challenges, there has been rising interest in research on a collection of new devices with < 60 mV/decade SS and new architectures with higher power efficiency, as shown in Fig. 1.
The goal of steep-slope devices is to lower power consumption by lowering the supply voltage, which reduces dynamic power, while keeping leakage current low and ON-current sufficient for driving capability [5], [6], [7], [8].
Reported research on steep-slope devices includes tunneling FETs (TFETs) [1], negative-capacitance FETs (NCFETs) [9], and metal-insulator-transition (MIT) FETs such as the Hyper-FET [2]. The most direct application scenario for these steep-slope devices is similar to that of conventional CMOS: they can be used as Boolean logic devices in which the gate input switches the drain-source path ON and OFF. It should be noted, however, that these devices may exhibit unidirectional conduction [10], [11], hysteresis [2], [9], non-volatility [12], [13], or other second-order effects such as different device capacitance characteristics [14]. While the Hyper-FET can be used directly in this fashion as a MOSFET replacement, this is not the primary focus of this work and, due to its similarity to existing CMOS approaches, will be covered in limited detail.
On the emerging architecture front, rather than the device front, one driving question is what forms computation can and should take going forward. In particular, there has been a renaissance in domain-specific processing, especially in graphics and computer vision, an increasing acceptance of specialized accelerators as part of general-purpose systems, and a willingness to embrace new models. One such architecture is "brain-inspired" computation, such as that used in neuromorphic [15] and other approximate computing platforms [16], [17], [18]. In this paper, we will show that Hyper-FET-based spiking neurons perform a function similar to that of conventional integrate-and-fire (IAF) neurons much more efficiently and at a much lower area cost.
Another attractive feature of some non-Boolean architectures is the notion that they can "let the physics do the computing" [19], [20] and, in so doing, achieve significant efficiency gains so long as the problem can be specified in a manner that matches the physical phenomenon. One such class of non-Boolean architectures for computation is sets of weakly-coupled oscillators. When a number of oscillators are coupled together, they will synchronize if their initial states are sufficiently close. Such synchronized oscillation, namely an attractor basin function, is observed across mechanical (e.g., pendulum), electrical (e.g., electronic oscillators) and human neural systems (e.g., neural oscillators). These synchronized oscillatory systems have been shown to possess associative computational capabilities [19], [21]. In this paper, we will show that Hyper-FET-based coupled oscillators are capable of forming area-efficient and power-efficient computation primitives for a range of applications, especially in image processing. The detailed device operating mechanism, circuit and architecture design, and performance evaluation will also be provided in this paper.
To ensure that investments in these new architectures and devices yield truly efficient systems, co-design of both devices and architecture is required. In this paper, we focus on Hyper-FET-based device modeling and on circuit and architecture design, showing the potential of enabling new computation paradigms for higher power efficiency. The properties of circuits designed using these new devices are well matched to the demands of existing algorithms in image processing and other domains. Device-circuit-algorithm co-design is expected to bring even more benefits to these applications in terms of functionality and power efficiency, since it opens up more degrees of freedom for optimization.
The remainder of the paper proceeds as follows. Section 2 includes the background of the Hyper-FET devices. Section 3 describes how Hyper-FET-based spiking neurons and networks are constructed. Section 4 shows the Hyper-FET-based oscillators, and how oscillator networks' synchronization behaviors can perform computations. Section 5 presents a case study in architecture and device co-design in the form of the implementation of a configurable oscillator network, and provides circuit-level validation that the tunable network effectively approximates a desirable family of mathematical functions. Section 6 presents the system-level approach to building a coprocessor fabric out of these tunable oscillator primitives, and how problems can be mapped to a single tile. Section 7 evaluates the computing efficiency on oscillator arrays compared to CMOS-based accelerators. Section 8 discusses related work and Section 9 concludes.
INTRODUCTION TO HYPER-FET DEVICES AND APPLICATIONS
The key novel feature of a Hyper-FET is the integration of VO2, a resistive switching device (RSD), with the transistor. Vanadium dioxide (VO2) is an insulator-to-metal transition (IMT) material whose resistance responds strongly to external perturbations such as temperature, pressure, and electrical stimulus [22], [23]. Fig. 2a shows the voltage applied across the VO2 device versus the current flowing through it; the material has been shown to exhibit a sharp change in resistivity of up to five orders of magnitude at ~340 K [24]. Circuit-level simulations employ a Verilog-A model that emulates the rapid resistance transition and is based on the calibrated characteristics of the fabricated VO2 oscillator. The Verilog-A model is as follows:
    module VO2(a, b);
      inout a, b;
      electrical a, b;
      parameter real R1 = 1k;
      parameter real R2 = 50k;
      parameter real V1 = 5.9;
      parameter real V2 = 0.5;
      parameter real tt = 100n;
      parameter real ini_type = 0;

The device model shows good agreement with the experimental results of the relaxation oscillator and the coupled oscillator described in Section 4.1. By using a resistor [23] or a MOSFET (as a current source) [3] in series to induce negative feedback, this electrically induced phase transition in VO2 can be modulated dynamically, resulting in an oscillation between the high and low resistive states. Other approaches also model similar resistive switching behaviors [25], [26].

Fig. 2b illustrates the schematic of the experimental Hyper-FET, which consists of a two-terminal VO2 device in series with the source of a Si n-type MOSFET. In the experimental setup, the VO2 device is an external device connected in series with the MOSFET. The applied gate voltage of the MOSFET modulates the channel energy barrier and electrically triggers the abrupt state transition of the VO2 material. This abrupt resistivity change modulates the drain-source current (I_DS) flowing through the MOSFET and induces a negative differential resistance (NDR) across the VO2. The NDR results in internal voltage amplification, giving a steep-slope characteristic that enhances the Hyper-FET performance beyond that of a conventional MOSFET.
Although not shown in the concept schematic in Fig. 2b, Hyper-FETs could also make use of FinFET technology to enable multi-fin structures. Figs. 2c and 2d plot the n-type and p-type transfer characteristics (I_DS versus V_GS), respectively, for the case where the number of fins equals three. The direct comparison of the n-type (p-type) Hyper-FETs with the stand-alone FinFET reveals an improved I_DS-ON/I_DS-OFF ratio over a V_GS range of 0.8 V (−0.5 V), and thus a ~20% (~60%) enhancement in I_DS-ON at matched I_DS-OFF.
It is also noted that the hysteresis in the Hyper-FET I_DS versus V_GS curves may result in hysteretic turn-on/turn-off behavior in logic gates and, consequently, a more complex delay evaluation. Nevertheless, hysteretic logic transfer behavior has been shown to be of great benefit when employed for better noise immunity [13]. Further exploration of Hyper-FET hysteresis in logic circuits, though not covered in this paper, may reveal more potential applications in digital logic design.
Further details on device fabrication can be found in [2]. The resistance values of the metal and insulator states are determined by the device width, length, and thickness, while the voltage conditions are determined by the device length. Significant remaining challenges include the fabrication of large device arrays with limited variability. This challenge must be confronted in two areas. At the growth level, uniform film properties must be controlled across the wafer. Equally important is process optimization to eliminate yield and variability challenges from device to device.
ANALOG SPIKING NEURON NETWORK
Spiking Neuron
Unlike Boolean logic with digital representations and clocked operations, brain-inspired systems exhibit more robustness and reliability based on distributed, event-driven, collective, and massively parallel mechanisms. Such systems make extensive use of adaptation, self-organization, and learning [27]. Efforts to bridge the gap between the scale and performance of mammalian neural networks have turned to emulating certain aspects of the form of biological nervous systems as well as their abstract functionality, with the development of dense arrays of neurons wherein certain portions of the circuit act as axons and synapses. Following the naming of the biological components, an artificial neuron is an Integrate-and-Fire (IAF) unit that receives external excitations from the axons of the preceding neurons through the synapses, as shown in Fig. 3a. Despite decades of research on the implementation of silicon neurons, current artificial neurons are still much larger in physical size and power than a typical human neuron. Considering the large number of neurons in a biological-scale network, this imposes both performance and power-efficiency constraints. Consequently, reducing the power and chip area of artificial neurons is important for implementing larger systems for higher-level tasks.
Given the two resistance states of the RSD, a spiking neuron is constructed by pairing the RSD with a transistor that acts as a configurable impedance. Fig. 3b shows the structure of a spiking neuron cell with the synapse receiving the input spike. The resistance of the RSD, R_M, switches between insulative (R_I) and conductive (R_C) states. To simplify the analysis, the equivalent impedance of the transistor is represented as R_L. R_M and R_L are connected in series as a voltage divider, hence the voltage of the connection node V_O has two stable levels, V_C in the conductive state and V_I in the insulative state.
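The two divider levels can be written out explicitly. The following is a sketch under stated assumptions: the RSD is taken to sit between a supply rail and V_O, R_L between V_O and ground, and the supply symbol V_{DD} is introduced here for illustration and is not defined in the text above.

```latex
V_C = V_{DD}\,\frac{R_L}{R_C + R_L}, \qquad
V_I = V_{DD}\,\frac{R_L}{R_I + R_L}, \qquad
R_I > R_C \;\Rightarrow\; V_C > V_I .
```

Under these assumptions, lowering R_L (for example through a higher gate bias on the transistor) lowers both levels, which agrees with the later observation that increasing V_BIAS lowers V_I.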
As shown in Fig. 3c, the neuron behavior with a pre-excited spike contains three stages: (1) the RSD charges C_L to V_C, (2) the RSD transitions to the insulator state, and (3) R_L discharges C_L to V_I, where it stops. The synapses, on receiving input spikes, reduce the total equivalent impedance from V_O to ground, which lowers the voltage level and triggers the output spike. As shown in Fig. 3d, whenever the neuron receives sufficient input spikes from the synapses, V_O reaches the triggering voltage and the neuron proceeds to a fourth step: (4) the RSD transitions to the conductor state, with which the neuron spikes again.
The basic function of a neuron cell is to generate a spike when it receives excitation above a certain threshold, which in our case is determined by V_BIAS. Figs. 4a and 4c show the case in which the input spike does not trigger an output spike. Increasing V_BIAS lowers the stable voltage V_I, and vice versa. When V_I is closer to (farther from) the IMT condition of the RSD, the neuron needs fewer (more) input spikes to trigger the output spike. Therefore, a higher V_BIAS means that the neuron spikes at a lower threshold on the number of input spikes.
Figs. 4b and 4d show the case in which the neuron, with a higher V_BIAS, spikes for the same number of input spikes.
Neuron Network
A single neuron is a device of extremely limited computational capability; neural network models that solve complex problems demand large numbers of neurons deployed in an interconnected network. Thanks to its compact structure, the crossbar architecture is widely adopted for connections in silicon-neuron networks [28]. Fig. 5a shows the crossbar structure, in which the vertical lines are the axons (spike sources) A_N and the horizontal lines are the outputs of neurons N_M, connected by the synapses Sy_MN with respective weights W_MN. In the following simulations of neuron network behaviors, we use the input pattern shown in Fig. 6a. Fig. 6b shows the neuron spiking behavior with different thresholds. As mentioned in Section 3.1, a higher V_BIAS induces a lower insulative-state voltage V_I, and thus requires fewer input spikes to excite an output spike.
The weights of the synapses can be zero (no connection) or a positive/negative value (positive/negative correlation). A positive (negative) correlation means the neuron is more (less) likely to spike if the source axon spikes. Fig. 5b shows the basic synapse, in which V_ON controls the connectivity and V_P/N switches the synapse between pulling high and pulling low. The synapse has the following operations: Fig. 6c shows that, with the basic synapses, the output spikes can be positively (+) or negatively (−) correlated with the input spikes.
In the rate coding mode the spikes can represent different values, while in the basic mode a spike carries only 1/0 (TRUE/FALSE) information. Fig. 5c shows the advanced hybrid synapse, which can switch between short-term (basic) and long-term (rate coding) modes. The diode-connected transistor conducts in one direction only and charges the memory capacitor C_M when the input spikes occur, and the other transistor, in parallel, works as a switch between the long-term and short-term modes. In the long-term mode V_L/S is low, so C_M is not discharged during the falling edge of the input spike. Therefore, the hybrid synapse has two additional operations: Fig. 6d shows the neuron network behavior in the rate coding mode. For each additional long-term positively correlated (++) input spike, the output spike frequency increases, and vice versa. The spiking neural network can be used in simple applications like pattern matching or event detection [29], or be built into large-scale systems such as convolutional neural networks (CNNs) to support more complex applications like written digit recognition [30] and face detection [31].
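The crossbar behavior in basic (short-term) mode can be illustrated with a purely behavioral sketch. It is not the circuit model: the function name `neuron_outputs`, the integer weights +1/0/−1, and the integer spike-count thresholds are assumptions made for illustration, with a lower threshold standing in for a higher V_BIAS.

```python
from typing import List


def neuron_outputs(axon_spikes: List[int],
                   weights: List[List[int]],
                   thresholds: List[int]) -> List[int]:
    """Behavioral crossbar: one output bit per neuron.

    weights[m][n] is the (assumed) weight of synapse Sy_mn from axon n to
    neuron m: +1 positive correlation, -1 negative correlation, 0 no
    connection. A neuron spikes (1) when its weighted spike count reaches
    its threshold.
    """
    outputs = []
    for row, threshold in zip(weights, thresholds):
        drive = sum(w * s for w, s in zip(row, axon_spikes))
        outputs.append(1 if drive >= threshold else 0)
    return outputs


axons = [1, 1, 0, 1]          # which axons spike in this time step
W = [
    [+1, +1, 0, +1],          # neuron 0: three excitatory synapses
    [+1, -1, 0, +1],          # neuron 1: axon 1 is inhibitory
    [0, 0, +1, 0],            # neuron 2: connected only to the silent axon
]

low_threshold = neuron_outputs(axons, W, [1, 1, 1])    # models a higher V_BIAS
high_threshold = neuron_outputs(axons, W, [3, 3, 3])   # models a lower V_BIAS
print(low_threshold, high_threshold)
```

With the lower threshold, neurons 0 and 1 both spike; with the higher threshold only neuron 0, which receives three excitatory inputs, still spikes.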
OSCILLATORS AND COUPLING
Oscillators that weakly couple, as through a common substrate for mechanical oscillators, or via capacitive coupling among outputs in electrical oscillators, have collective synchronization properties that can be used to perform computation. To date, many of the systems designed to perform computation via oscillator coupling have primarily been intended to perform tasks in the fields of image processing, pattern analysis and computer vision. By utilizing the locking behavior of coupled oscillators as a computational primitive analogous to a distance metric, the systems are capable of performing associative matching functions. The recent development of nano-oscillator based associative memories has further enhanced the potential for oscillator based systems for intelligent information processing (details in Section 8). In most of these works, however, each oscillator network (i.e., the specific topology of and weighting of oscillator coupling) has been constructed in a homogeneous fashion that focuses on solving a specific problem with a given network. In this paper, we examine the ways in which the computational paradigm can be extended, and the networks configured, to support a broader family of functions for a specific domain on a single computing fabric where each tile in the fabric contains a tunable oscillator network.
In the rest of this section, we build upon the introduction to Hyper-FET oscillators in Section 2 and discuss how the timing and degree of synchronization among N weakly-coupled oscillators corresponds to certain many-body computations. We then introduce the particular coupling topology of capacitive coupling with a common output node that we will focus on when considering coupled oscillators.
HyperFET Oscillator
Similar to the spiking neurons, Hyper-FETs can also be used to construct nano-oscillators. Fig. 7a shows the structure of a relaxation oscillator with the RSD, in which the resistance R_M switches between insulative (R_I) and conductive (R_C) states. To simplify the oscillator model, the parasitic capacitance C_P in Fig. 7a is lumped into C_L. As shown in Fig. 7b, the oscillation cycle contains four stages:
(1) the RSD charges C_L to V_C, (2) the RSD transitions to the insulator state, (3) R_L discharges C_L to V_I, and then (4) the RSD transitions back to the conductor state.
Without the phase transition, the system would tend to stay at one of the two stable voltage levels, V_I and V_C. The respective stable currents through the RSD (I_M) are I_I and I_C, where the current in the insulative state (I_I) is lower than that in the conductive state (I_C).
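Under the same series-divider picture, the two stable currents can be sketched as follows. The supply symbol V_{DD} is an assumption introduced for illustration and is not defined in the text above.

```latex
I_I = \frac{V_{DD}}{R_I + R_L}, \qquad
I_C = \frac{V_{DD}}{R_C + R_L}, \qquad
R_I > R_C \;\Rightarrow\; I_I < I_C .
```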
For the system to oscillate, the stable region, determined by I_I and I_C, should be carefully designed to overlap with the conditions of the RSD characteristic transition. As shown in Fig. 7c, the RSD transition should be triggered before the system reaches the stable voltage levels. In the VO2 example, the metal-insulator transition and the insulator-metal transition (MIT and IMT) should occur before the voltage stabilizes; otherwise the system locks at a certain phase and does not oscillate. To realize a VO2-based oscillator the following must hold:
The cycle time of oscillation for a single oscillator is proportional to the RC time constant,
The oscillation of the circuit in Fig. 7a is not controllable. Replacing the resistor with a transistor introduces a control input, producing a voltage-controlled relaxation oscillator (VCRO). The transistor functions as a configurable impedance. Increasing the input voltage V_IN within a certain range reduces the equivalent impedance of the transistor, and the oscillation therefore gets faster. The Verilog-A model of the VO2 device describes the ideal case without variations. The resistance values, R1 and R2, are related to the stable voltages, V_I and V_C, in Eq. (1). As shown in Fig. 7c, the stable voltages (red lines) and the switching conditions (blue lines) substantially determine the oscillation behavior. A slight increase in the resistance of the VO2 device lowers V_C (and V_I), which slows down the rising edge (and accelerates the falling edge). Variations in the switching conditions affect the peak and valley voltages of the oscillation accordingly.
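The four-stage cycle and the effect of the load impedance can be reproduced with a small behavioral simulation. This is a sketch under several assumptions, not the authors' Verilog-A model: R1 and R2 are read as the conductive and insulative resistances, V1 and V2 as the voltages across the RSD that trigger the IMT and MIT, the RSD sits between a 9 V supply and V_O, the transition is instantaneous (the `tt` parameter is ignored), and the values of `v_dd`, `r_load`, and `c_load` are illustrative only.

```python
def simulate_oscillator(r_load, v_dd=9.0, r_c=1e3, r_i=50e3,
                        v_imt=5.9, v_mit=0.5, c_load=1e-9,
                        dt=1e-8, t_end=4e-4):
    """Forward-Euler simulation of V_O for an RSD in series with r_load.

    Returns (trace, imt_times): the V_O samples and the times at which the
    RSD switches from insulator to conductor.
    """
    v_o = 0.0
    conductive = False
    trace, imt_times = [], []
    for k in range(int(round(t_end / dt))):
        v_dev = v_dd - v_o                      # voltage across the RSD
        if not conductive and v_dev > v_imt:    # stage 4: IMT
            conductive = True
            imt_times.append(k * dt)
        elif conductive and v_dev < v_mit:      # stage 2: MIT
            conductive = False
        r_m = r_c if conductive else r_i
        # Net current into C_L: through the RSD minus through the load.
        i_cap = (v_dd - v_o) / r_m - v_o / r_load
        v_o += dt * i_cap / c_load              # stages 1 and 3
        trace.append(v_o)
    return trace, imt_times


def mean_period(imt_times):
    """Average time between IMT events, skipping the start-up cycle."""
    gaps = [b - a for a, b in zip(imt_times[1:], imt_times[2:])]
    return sum(gaps) / len(gaps)


trace_20k, imt_20k = simulate_oscillator(20e3)    # oscillates
_, imt_18k = simulate_oscillator(18e3)            # lower impedance
_, imt_100k = simulate_oscillator(100e3)          # stable level reached first
print(len(imt_20k), mean_period(imt_20k), mean_period(imt_18k), len(imt_100k))
```

In this sketch the lower load impedance gives the shorter period, in line with the VCRO description, while the 100 kΩ load lets V_O settle at its insulative stable level before the IMT condition is met, so the circuit switches once at start-up and then stops.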
Two Coupled Oscillators
Coupled nano-oscillators have been investigated to build associative primitives to accomplish cognitive tasks [32], [33], [34]. As shown in Fig. 9a, in the traditional topology two relaxation oscillators are weakly coupled together with a capacitor. In the proposed coupling network topology in Fig. 9b, each oscillator is linked to one capacitor, and those capacitors are connected to a common node V_OUT. The traditional topology is equivalent to the proposed topology with oscillator count n = 2 and the coupling capacitance split into two (with C_prop = 2 × C_trad). Fig. 9c shows the outputs for the case V_IN1 = V_IN2. The VCROs synchronize at the same frequency, and the phase difference between the outputs V_O1 and V_O2 stabilizes at π. To describe the time during which both oscillators are discharging, T(Pi, Vj) is defined as the time from the peak of V_Oi, Pi, to the valley of V_Oj, Vj.
The current flowing from one oscillator through the coupling capacitors delays the next rising edge of the other oscillator. For instance, the rising of V_O1 induces a current into the C_prop capacitors and causes a delay in the discharging of V_O2, and vice versa. The charge induced on V_O2 by the rising of V_O1 is removed by the discharging current I_D2, which decreases as V_O2 approaches the stable voltage V_I. Consequently, the delay is longer when V_O2 is closer to its next rising edge. That means that when T(P1, V2) < T(P2, V1), the delay makes the discharging of VCRO2 longer than that of VCRO1, and therefore reduces the difference between T(P1, V2) and T(P2, V1). As a result, the system stabilizes at the state T(P1, V2) = T(P2, V1).
In the case of V_IN1 ≠ V_IN2, the VCROs have different oscillating frequencies. However, the oscillations will still couple together if the input voltage difference ΔV is within a certain range. Fig. 9d shows the case of V_IN1 < V_IN2. With the higher input voltage, VCRO2 oscillates faster, and therefore V2 gets closer and closer to P1. However, the rising V_OUT1 induces a delay in the discharging of VCRO2, as previously mentioned, which extends the oscillation period of VCRO2. Because VCRO2 tends to oscillate faster, and the delay prevents it from speeding up, the oscillations are locked at an unbalanced phase difference but still at the same frequency. In an unstable or pre-stable oscillation, there is an ongoing phase change. If the oscillations are going to stabilize, the amount of phase change decreases in each cycle. As shown in Fig. 9d, the phase change is about 1 ms at the beginning and 0.1 ms after five cycles. We note that the phase change is less than 1e-4 ms after 10 cycles, and the system stays locked in phase as simulated for over 1,000 cycles. We declare the system to have stabilized within 10 cycles because 1e-4 ms is close to the time granularity of the simulation. If we further increase the difference between V_IN1 and V_IN2, the oscillators have different frequencies, and one oscillator has a shorter cycle (individual oscillation period plus the coupling delay) than the others. In that case, out-of-phase oscillations can be observed within a few cycles.
N Coupled Oscillators
A key appeal of computing using coupled oscillators is that, by increasing the number of oscillators that are coupled together, the degree-of-difference of vectors can be simultaneously computed. The coherent oscillations are synchronized in the same frequency and stabilized at a constant phase difference to each other, and the coherence of the oscillators is correlated to the similarity of the input voltages. In the experiments, three-coupled oscillators have been measured as shown in Fig. 10a. Figs. 10b and 10c show the case of three-coupled oscillators in simulation.
Similar to two coupled oscillators, three coupled oscillators synchronize at the same frequency and have an equal phase difference of 2π/3 when they have the same input voltages (Fig. 10b). A VCRO with V_IN higher than average tends to oscillate faster than the others, and vice versa. Fig. 10c shows the case of three oscillators with unequal inputs (V_IN1 > V_IN2 > V_IN3). As discussed in Section 4.2, the currents passing through the coupling capacitors are the key to oscillator synchronization, as the voltage rise of one oscillator induces a current that delays the discharging of the others. However, this effect does not always hold. When the period difference is larger than the phase difference of any two oscillators, which means the delay induced by the slower one is not long enough to hold back the rising of the faster one, the synchronization breaks down. Consequently, the frequencies of the oscillators are no longer the same, the V_On of each VCROn is out of phase, and some spikes appear in the amplitude of V_OUT. Thus, for highly different inputs, the measurable degree of difference becomes unstable but still, via the amplitude of the V_OUT spikes, registers as "very different", even though two different "very different" measurements are not themselves directly comparable.
The amplitude of the common node V_OUT is lowest when the inputs to all the oscillators are equal. Fig. 11 shows the output amplitude of three synchronized oscillators. By sweeping V_IN1 and V_IN2, there is a minimum point of amplitude in the 2-D space that corresponds to V_IN3. The minimum points for different settings of V_IN3 form a continuous line in the 3-D space.
In the simulations, up to nine oscillators can be synchronized. Fig. 10d shows the output waveform of nine synchronized oscillators. For these oscillators, the synchronization time increases as the number of oscillators increases. As we go from n = 3 to n = 9, the coupling current induced by the rising of one of the oscillators is now shared by eight oscillators instead of two. Thus, to achieve coupling strengths among nine oscillators equivalent to those for three, larger coupling capacitors C_P are chosen. The enlarged C_P introduces a larger coupling current into the system, which is shared by more oscillators and induces an equivalent delay in each oscillator. As more delays are added to the oscillation, the oscillation period is extended, and thus the frequency at which the collection of oscillators produces measurable outputs is much lower than the frequency of the individual oscillators. The number of cycles the oscillator array requires to stabilize decreases with n, while the period of one cycle increases. For our configurations, the net effect is an increase in absolute time with n.
Phase Information
The swing of V_OUT reflects the interaction of V_O1 and V_O2; V_OUT rises when either V_O1 or V_O2 rises, and falls when both of them fall. In one cycle of the oscillation, there are two pulses of V_OUT, one for the charging of VCRO1 and the other for that of VCRO2. Therefore, the pulses have the same amplitude if V_IN1 = V_IN2. The amplitude of the V_OUT waveform increases if one falling time is shorter than the other, which happens when V_IN1 ≠ V_IN2 within the bounded range of allowable ΔV where the circuit is stable.
The output of synchronized oscillations has three properties. First, the synchronized oscillators generate a stable amplitude corresponding to the degree of match for inputs with low deviation. Second, inputs with high deviation break synchronization, and the amplitude of V_OUT becomes non-uniform. Third, an oscillator within a coupled oscillator network can be intentionally shut down by providing a large or small V_IN outside the oscillation boundaries, as illustrated in Fig. 8.
For inputs corresponding to the first property, the output behavior is close to the mathematical formulation of deviations (e.g., standard/absolute deviation) in the region of synchronization. Fig. 12 compares the simulation results of the oscillator-based deviation with the corresponding mathematical model (standard deviation). The key difference between the two is that the simulated output is less sensitive to higher inputs, which is due to the non-linearity of the g_m of the transistor. To deal with non-synchronizing inputs, we employ thresholding to detect peak amplitudes beyond the acceptable range. In addition, forcing the shutdown of certain oscillators allows an N-oscillator array to emulate K-input functions for K < N.
Amplitude Readout Circuit
Fig. 13a shows the readout circuitry. The input to the readout circuit is V_OUT from the synchronized oscillators, which is DC-biased by the biasing network R_B1 and R_B2. The source follower operating with I_BIAS works as a buffer for V_OUT. Finally, the diode-connected transistor rectifies and follows V_SF to generate the readout voltage V_ANALOG at the load capacitance C_L. Fig. 13b shows the conversion from the V_OUT amplitude to the analog output voltage V_ANALOG that the readout circuit performs.
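For reference, the mathematical deviation measures that the synchronized amplitude is compared against can be written down directly. The sketch below computes only the textbook quantities, not the oscillator response; the function names and the `active` mask (standing in for oscillators that are shut down so that an N-oscillator array handles K < N inputs) are assumptions made for illustration.

```python
import math


def standard_deviation(inputs):
    """Population standard deviation of the input voltages."""
    mean = sum(inputs) / len(inputs)
    return math.sqrt(sum((v - mean) ** 2 for v in inputs) / len(inputs))


def absolute_deviation(inputs):
    """Mean absolute deviation of the input voltages."""
    mean = sum(inputs) / len(inputs)
    return sum(abs(v - mean) for v in inputs) / len(inputs)


def masked_deviation(inputs, active):
    """Deviation over the active inputs only.

    Entries with active[i] == False model oscillators that were shut down
    by driving V_IN outside the oscillation boundaries.
    """
    kept = [v for v, on in zip(inputs, active) if on]
    return standard_deviation(kept)


equal_inputs = standard_deviation([0.5, 0.5, 0.5])
spread_inputs = standard_deviation([0.4, 0.5, 0.6])
two_of_three = masked_deviation([0.4, 0.5, 0.6], [True, False, True])
print(equal_inputs, spread_inputs, two_of_three)
```

Equal inputs give zero deviation, matching the condition under which the V_OUT amplitude is lowest.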
CONFIGURABLE OSCILLATOR COMPUTATIONS
As presented in the previous section, HyperFET oscillators provide a powerful, but limited-flexibility, primitive computational operation. In this section, we extend the computational capabilities of each oscillator by adding additional control inputs to configure its behavior. With these additional inputs, we can now efficiently realize a family of related primitives with a given oscillator network, rather than a single functionality.
The coherence of the synchronized oscillators is defined as the similarity of their oscillation frequencies. As mentioned previously in Section 4.1, the oscillation frequency of a VCRO is determined by V_IN, which is linearly correlated to the discharging current I_D,
Generally speaking, the relation between I_D and V_IN can be configured by changing the resistance and the transistor size. Motivated by the fact that similar values of I_D induce similar oscillation frequencies, we explore the configurable mapping between V_IN and I_D in this section.
Configurability
Base Case
Starting from n = 2, a coupled-oscillator structure is shown in Fig. 14a. The oscillation strength of a VCRO, defined as its individual oscillation frequency when it is not coupled, is positively correlated with the discharging current I_D. Two transistors sized W_1 and W_2 act as voltage-controlled current sources, and two resistors (R_L1 and R_L2) provide biasing current to the system. In the base case, the system is biased in balance and the transistors have equal size. As shown in the right half of Fig. 14b, the lowest V_OUT amplitude lies on the line V_IN1 = V_IN2, the amplitude increases away from that line until the VCROs are no longer coupled, and the maximum peak amplitude is around the amplitude of V_OUT1 or V_OUT2. In this work, a "similarity line" is defined as the collection of conditions (V_IN1, V_IN2) for which the lowest V_OUT amplitude occurs. Essentially, the similarity line represents the condition under which the oscillations are exactly coherent. The similarity line for the base case can be described as L_S: {V_IN1 = V_IN2}. Accordingly, the circuit in Fig. 14a outputs a signal V_OUT whose amplitude relates to the distance between the point (V_IN1, V_IN2) and the corresponding similarity line.
Shifting Case
The biasing resistors $R_{L1}$ and $R_{L2}$ can be designed to be unequal, and the unbalanced biasing results in a shift of $L_S$. Fig. 14c shows the case for an increased $R_{L2}$. When the biasing current decreases with the enlarged resistance, the total discharging current of VCRO2 drops under the condition $V_{IN1} = V_{IN2}$. Therefore, VCRO2 oscillates slower, and the amplitude of $V_{OUT}$ increases because the oscillation is unbalanced. To compensate for the reduced biasing current, VCRO2 needs a higher input voltage; i.e., a voltage shift on $V_{IN2}$ results in the same driving strength as VCRO1. Effectively, a value $V_S$ is subtracted from $V_{IN2}$. As a result, the similarity line is shifted to $L_S: \{V_{IN1} = V_{IN2} - V_S\}$, where $V_S$ is the voltage shift.
Narrowing Case
The transistor size of the VCRO sets the ratio of current change to input difference, and therefore affects the sensitivity of the oscillation frequency. Fig. 14d shows the case for an increased $W_2$. The change in current is proportional to the transistor size according to Eq. (6). Therefore, a given voltage variation induces more deviation at the output with a larger transistor, which narrows the valley. As a result, different sensitivity factors $(a_1, a_2)$ can be assigned to each of the VCROs by giving them different transistor sizes.
In summary, given the configurable behavior of the coupled VCROs, the similarity line can be virtually any linear combination of $V_{IN1}$ and $V_{IN2}$, i.e.,
Generally speaking, the minimum amplitude at the node $V_{OUT}$ occurs when the controlling currents $I_D$ of the two oscillators are equal, giving equal or nearly equal frequencies and enabling the oscillators to couple to a common frequency with equally distributed phases.
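The base, shifting, and narrowing cases can be illustrated with a small numerical sketch. It is a toy model rather than the simulated circuit behavior: it assumes the similarity line has the form a1(V_IN1 - V_S1) = a2(V_IN2 - V_S2), and the names `similarity_distance`, `a1`, `a2`, `v_s1`, and `v_s2` are hypothetical.

```python
def similarity_distance(v_in1, v_in2, a1=1.0, a2=1.0, v_s1=0.0, v_s2=0.0):
    """Toy distance of (v_in1, v_in2) from an assumed similarity line.

    Assumed line: a1 * (v_in1 - v_s1) == a2 * (v_in2 - v_s2).
    a1 and a2 stand in for transistor-size sensitivities; v_s1 and v_s2
    stand in for voltage offsets. Returns 0.0 exactly on the line.
    """
    return abs(a1 * (v_in1 - v_s1) - a2 * (v_in2 - v_s2))


# Base case: equal inputs lie on the line.
base = similarity_distance(0.45, 0.45)

# Shifting case: VCRO2 needs a 50 mV higher input to match VCRO1.
shifted = similarity_distance(0.45, 0.50, v_s2=0.05)

# Narrowing case: around a common 0.45 V operating point, a larger a2
# produces a larger response to the same 20 mV mismatch on V_IN2.
wide = similarity_distance(0.45, 0.47, v_s1=0.45, v_s2=0.45)
narrow = similarity_distance(0.45, 0.47, a2=2.0, v_s1=0.45, v_s2=0.45)
```

In this sketch a larger sensitivity factor makes the same input mismatch look larger, which corresponds to the narrower valley described for the narrowing case.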
Configurable VCRO Module
Based on the features described in Section 5.1, we can build configurable VCRO modules for n-oscillator systems. Fig. 15a shows the structure of the proposed configurable oscillator. The discharging current $I_D$ is provided by three components: two transistors sized $W$ and one resistor ($R_L$),
The transistor sizes of the synchronized oscillators can differ to give the system various $g_m$ ratios. Essentially, the two transistors replace the single transistor in the non-configurable VCRO (Fig. 14a), splitting the input $V_{IN}$ into $V_x$ and $V_y$. Fig. 15b shows the simulation results for one configurable VCRO with swept inputs $V_x$ and $V_y$, coupled to another with fixed inputs $V'_x = V'_y = 450$ mV. In the simulation, the minimum $V_{OUT}$ amplitude occurs when $(V_x + V_y)/2 = 450$ mV, which means $V_x$ must be negatively correlated with $V_y$ for the same $I_D$. Therefore, $V_y$ can serve as a configurable parameter that changes the correspondence between $V_x$ and the $V_{OUT}$ amplitude.
As shown in Fig. 15c, the transistor size can be made adjustable by splitting the transistor into multiple switch-controlled transistors. Similarly, the resistor can be replaced by another transistor to make it reconfigurable.
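The matching condition reported for Fig. 15b can be written as a short sketch. It assumes the two equally sized transistors contribute equally, so that the effective drive is the mean of the two inputs; the function names and the constant `V_REF` are hypothetical.

```python
V_REF = 0.45  # volts; fixed inputs of the reference VCRO in the text


def effective_input(v_x, v_y):
    """Assumed effective drive of a configurable VCRO: mean of its two inputs."""
    return (v_x + v_y) / 2.0


def matching_v_x(v_y, v_ref=V_REF):
    """V_x that satisfies (V_x + V_y) / 2 == v_ref for a chosen V_y."""
    return 2.0 * v_ref - v_y


# Raising V_y lowers the V_x at which the minimum amplitude is expected.
v_x_for_low_v_y = matching_v_x(0.40)
v_x_for_high_v_y = matching_v_x(0.50)
```

The negative correlation between the two inputs is visible directly: each millivolt added to V_y removes one millivolt from the matching V_x.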
Mathematical Expression
Several useful mathematical vector operations can be mapped to the behavior of synchronized VCROs. For a configurable VCRO, the voltages $V_x$ and $V_y$ are negatively correlated for the same oscillation strength in terms of $I_D$. To compare the input $x$ with the target $y$, the numerical value $y$ is inversely mapped to the voltage $V_y$, as shown in Fig. 15d.
To visualize the relation between a configuration change and the resulting function change, we define the behavior of the synchronized oscillators as the analog domain and the configuration parameters as the numerical domain. Specifically, we map the numerical range of the inputs, e.g., 1 to 256, to the active region of the oscillators in terms of voltage range, e.g., 0.35 V to 0.55 V. In Fig. 16, the analog domain and the numerical domain are marked as the blue and red blocks, respectively. In the analog domain, there is an active region with the diagonal similarity line. For the base case, the numerical domain maps perfectly onto the active region of the analog domain. In the shifting case, $y_2$ is effectively subtracted from the input. Therefore, a part of the active region is unused (not mapped), while the remaining space of the numerical domain covers some inactive regions. In the narrowing case, the input number is multiplied by a factor $a_2$. As a result, the extended numerical domain covers more than the active region of the analog domain. Thus, seen from the numerical domain, the similarity region is effectively narrowed by $a_2$. These mapping relations explain the shifting and narrowing in Fig. 14.
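As an illustration of the numerical-to-analog mapping, the sketch below maps the example range 1 to 256 onto the example active region 0.35 V to 0.55 V. The linear form of the mapping and the function names are assumptions made for this sketch; the text only states that y is mapped inversely.

```python
V_LO, V_HI = 0.35, 0.55  # example active region from the text, in volts
N_LO, N_HI = 1, 256      # example numerical input range from the text


def x_to_voltage(x):
    """Assumed linear mapping of the input x onto the active region."""
    frac = (x - N_LO) / (N_HI - N_LO)
    return V_LO + frac * (V_HI - V_LO)


def y_to_voltage(y):
    """Assumed inverse linear mapping of the target y onto the active region."""
    frac = (y - N_LO) / (N_HI - N_LO)
    return V_HI - frac * (V_HI - V_LO)


# With these two mappings, x == y always gives the same mean input voltage,
# the midpoint of the active region.
mean_voltage = (x_to_voltage(100) + y_to_voltage(100)) / 2.0
```

Under these assumptions, equal numerical values x and y land on the midpoint of the active region regardless of their magnitude, which is the property a comparison function needs.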
For the n-synchronized oscillators, there is a similarity line in n-dimensional space, indicating the condition of exact coherence for the n-oscillator system. After the mapping, the similarity line becomes $L_S: \{a_1(x_1 - y_1) = a_2(x_2 - y_2) = \cdots = a_n(x_n - y_n)\}$ in the n-dimensional space, where $a_i$ is proportional to the transistor width $W_i$.
The amplitude of $V_{OUT}$ reflects the deviation of the oscillation frequencies, which are positively correlated with the $I_D$ values. If the vector inputs $X$ and $Y$ are on the similarity line, the amplitude of $V_{OUT}$ is at its minimum. Otherwise, $V_{OUT}$ returns the deviation $D(a(X - Y))$ (within the similarity region).
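A numerical stand-in for the deviation D(a(X - Y)) can make the role of the similarity line concrete. The sketch below uses the population standard deviation of the weighted differences as the deviation measure; this choice, and the name `weighted_deviation`, are assumptions of the sketch and not the circuit's actual transfer function.

```python
from statistics import pstdev


def weighted_deviation(x, y, a):
    """Spread of the weighted differences a_i * (x_i - y_i).

    Returns 0.0 when all weighted differences are equal, i.e. when the
    inputs lie on the similarity line a_1(x_1 - y_1) = ... = a_n(x_n - y_n).
    """
    z = [ai * (xi - yi) for xi, yi, ai in zip(x, y, a)]
    return pstdev(z)


# On the similarity line: every weighted difference equals 2.
on_line = weighted_deviation([3, 5, 7], [1, 3, 5], [1, 1, 1])

# Off the line: the third element differs from the others.
off_line = weighted_deviation([3, 5, 9], [1, 3, 5], [1, 1, 1])
```

Note that the measure is zero anywhere on the line, not only at X = Y, which is why a fixed reference input is needed when an exact match is required.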
Based on the properties of the proposed configurable VCROs, the functions of an n-oscillator system include, but are not limited to:
Measuring the deviation of factor $k$ of a set of input vector $X$, or checking if the deviation is above some threshold. The coefficient $k$ is determined by the coupling and load capacitance, and $k$ for mathematical standard and absolute deviation are 1 and 2, respectively.
(II) $D(aX)$: Finding the matching degree for the n elements of $X$ to a given ratio (
CONFIGURABLE OSCILLATOR APPLICATIONS
To parallelize repetitive computation such as image filtering, a system can be built with an array of processing units. Fig. 18a shows the system diagram for parallel computation with a 10-by-10 array of oscillator-based processing modules, each with nine coupled oscillators. The image, with a large number of analog pixels, is captured by the camera sensor, segmented into windows matching the size of the processing array (10-by-10), and processed by the units in parallel. The control signals, $V_a$ and $Y$, can be either identical or different for each processing unit, depending on the application. In most cases, the control signals come from the higher-level architecture and are fixed for the repetitive computations.
Each of the processing units is an independent 9-oscillator module that can perform the processing in parallel with other modules. As shown in Fig. 18b, the 9-oscillator module is composed of the 9-synchronized oscillators, the readout circuit, and the thresholding circuit (a voltage comparator). In image processing applications, for example, nine input voltages ($V_{x1}, V_{x2}, \ldots, V_{xn}$), corresponding to one center pixel and its eight neighboring pixels, are connected to the oscillators. The control signals, $V_y$ and $V_a$, configure the oscillators for particular functions.
Distance and Deviation
The 9-synchronized oscillators naturally perform a deviation measurement by reflecting the distance between the input point $X$ and the similarity line in n-dimensional space. By fixing one of the inputs to, for example, zero, the similarity line becomes a point in (n-1)-dimensional space, as illustrated in Fig. 17-III.
Image Filtering
To process the pixel values of an image, as in the example in Fig. 18a, the oscillators are controlled by different configuration control values to perform various functions. Table 1 shows the configurations of the oscillators for various functions, and Fig. 19 shows the original and processed images.
Salience and Edge Detection [35]
A salient point is more likely to be an edge if it is located in a region with higher deviation. A pixelwise salience map, obtained by measuring the deviation of each pixel together with its eight neighbors, can therefore be used as edge information.
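A software analogue of the salience map may help clarify the operation. The sketch below computes, for each interior pixel, the population standard deviation of its 3-by-3 window; the deviation measure, the border handling (borders left at zero), and the name `salience_map` are assumptions of this sketch rather than properties of the oscillator module.

```python
from statistics import pstdev


def salience_map(image):
    """Deviation of each interior pixel's 3x3 neighborhood.

    `image` is a list of equal-length rows of numeric pixel values.
    Border pixels are left at 0.0 because their window is incomplete.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = pstdev(window)
    return out


flat = [[10, 10, 10, 10] for _ in range(4)]
step = [[0, 0, 255, 255] for _ in range(4)]
flat_salience = salience_map(flat)
step_salience = salience_map(step)
```

A uniform region yields zero salience everywhere, while windows that straddle the intensity step yield a nonzero value.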
Directional Edge Detection [36]
Another edge detection approach is to detect edges in certain directions. To detect a line in, for example, the vertical (column) direction, we can compute the deviation of three pixels along the row direction. A higher deviation in the horizontal direction indicates a higher likelihood of an edge in the vertical direction.
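The directional variant can be sketched in the same way by restricting the window to three pixels in a row. As before, the use of the population standard deviation and the name `vertical_edge_map` are assumptions made for illustration.

```python
from statistics import pstdev


def vertical_edge_map(image):
    """Deviation of each pixel with its left and right neighbors.

    A high value suggests a vertical edge passing near the pixel.
    The first and last columns are left at 0.0.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(1, cols - 1):
            out[r][c] = pstdev(image[r][c - 1:c + 2])
    return out


# Horizontal stripes: every row is constant, so no vertical edge is reported.
stripes = [[0, 0, 0], [255, 255, 255], [0, 0, 0]]
# Vertical step: the intensity changes along each row.
step = [[0, 0, 255, 255] for _ in range(3)]
stripe_edges = vertical_edge_map(stripes)
step_edges = vertical_edge_map(step)
```

Transposing the window (three pixels in a column) would give the corresponding detector for horizontal edges.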
Dilation and Erosion [37]
By intentionally shifting the input to the oscillation boundary of the VCRO, oscillation occurs only when the input voltage is higher (or lower) than the threshold. Synchronization is not required in this mode; detecting any high bit (or low bit) in a set of five pixels (the central pixel and its four neighbors) yields the dilation (or erosion) filter.
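The logical behavior described above corresponds to binary dilation and erosion with a five-pixel cross. The following sketch shows that logic in software; treating out-of-range neighbors as absent at the image border is a choice made for this sketch, and the function names are hypothetical.

```python
CROSS = ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))  # center and four neighbors


def _cross_values(image, r, c):
    """Pixel values under the cross centered at (r, c), skipping out-of-range cells."""
    rows, cols = len(image), len(image[0])
    return [image[r + dr][c + dc] for dr, dc in CROSS
            if 0 <= r + dr < rows and 0 <= c + dc < cols]


def dilate(image):
    """Output 1 where any pixel under the cross is high."""
    return [[1 if any(_cross_values(image, r, c)) else 0
             for c in range(len(image[0]))] for r in range(len(image))]


def erode(image):
    """Output 1 only where every pixel under the cross is high."""
    return [[1 if all(_cross_values(image, r, c)) else 0
             for c in range(len(image[0]))] for r in range(len(image))]


dot = [[0, 0, 0],
       [0, 1, 0],
       [0, 0, 0]]
dilated = dilate(dot)
eroded = erode(dot)
```

A single high pixel grows into a cross under dilation and disappears under erosion.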
Color Detection [38]
The color information of a pixel can be described by three values in the R-G-B domain. Colors with the same R-G-B ratio appear as the same color at different brightness. Therefore, detection of a certain color can be done with 3-synchronized oscillators. With 4-synchronized oscillators (R-G-B and zero), the color range can be fixed around a specific pixel value, which enables a further application.
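The difference between the 3-oscillator and 4-oscillator configurations can be sketched numerically. The sketch assumes each weight is the reciprocal of the corresponding target channel, so that a pixel with the target R-G-B ratio gives equal weighted values; this weight choice, the max-minus-min spread measure, and the function names are assumptions for illustration. Target channels must be nonzero.

```python
def ratio_mismatch(pixel, target):
    """Three-input case: zero when pixel has the same R-G-B ratio as target."""
    z = [p / t for p, t in zip(pixel, target)]
    return max(z) - min(z)


def value_mismatch(pixel, target):
    """Four-input case: an extra input fixed at zero pins the match to one value."""
    z = [p / t - 1.0 for p, t in zip(pixel, target)] + [0.0]
    return max(z) - min(z)


target = (200, 100, 50)
darker_same_hue = (100, 50, 25)

hue_only = ratio_mismatch(darker_same_hue, target)
hue_and_brightness = value_mismatch(darker_same_hue, target)
exact = value_mismatch(target, target)
```

A darker pixel of the same hue matches in the three-input case but not in the four-input case, which only accepts values near the target itself.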
Weighted Pattern Matching
The $X - Y$ function of the configurable oscillators essentially performs matching between patterns. Independent weights can be achieved by assigning a different $a$ to each oscillator, as shown in Fig. 20. The target pattern $Y$ is configured before sliding the window. Whenever the oscillator-based deviation module finds a matching sequence, the thresholding circuit outputs a bit-0, indicating a low difference between $X$ and $Y$.
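The sliding-window matching can be sketched in software as follows. The sketch assumes one additional input fixed at zero so that only X = Y (and not X = Y plus a constant) counts as a match, and it uses the population standard deviation as the deviation measure; both choices, the threshold value, and the name `match_bits` are hypothetical.

```python
from statistics import pstdev


def match_bits(sequence, pattern, weights, threshold):
    """Slide `pattern` over `sequence`; emit 0 on a match and 1 otherwise.

    For each window X, the weighted differences a_i * (x_i - y_i) plus a
    fixed zero reference are compared; a spread below `threshold` is a match.
    """
    n = len(pattern)
    bits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        z = [a * (x - y) for x, y, a in zip(window, pattern, weights)]
        z.append(0.0)  # reference input fixed at zero
        bits.append(0 if pstdev(z) < threshold else 1)
    return bits


bits = match_bits([1, 2, 3, 4, 2, 3, 4], [2, 3, 4], [1, 1, 1], threshold=0.1)
```

Setting a weight to a larger value makes the corresponding element count more heavily toward a mismatch, which is the effect of assigning different sensitivity factors to the oscillators.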
PERFORMANCE EVALUATION
In this section, the performance evaluation is based on the properties and experimental results with VO$_2$ as the RSD, scaled to a reasonable comparative size, and projected to the feasible operation frequency. First, we compare the performance of a single 9-oscillator module with a customized CMOS application-specific integrated circuit (ASIC) pipelined accelerator designed to perform the same function. Then we compare the array of 100 parallel 9-oscillator modules with the CMOS-based data path from a system perspective.
Modularized Deviation and Scalability
, where
The inputs $X$ and $Y$ for the CMOS ASIC are both 9-element arrays with 8-bit elements, and the input $a$ is a 9-element array with 2-bit elements. The CMOS ASIC baseline deviation module is a 42-stage pipelined accelerator, synthesized using Synopsys Design Compiler [39]. Based on the synthesis, the CMOS ASIC module can operate at up to 500 MHz and consumes 1,100 µW. The channel size of VO$_2$ determines the voltage and resistance, and thus changes the power consumption of the oscillator-based module. Fig. 22a shows the power estimation with the scaling of VO$_2$ channel dimensions. To compare the oscillator performance with 32 nm CMOS technology, we project the power consumption for scaled oscillators with dimensions [W, L] = [60 nm, 36 nm]. We assume that the critical stimulus (here, the electric field) required for triggering the IMT in VO$_2$ remains constant [23]. We also assume that the resistance remains constant for the same aspect ratio. Accordingly, the power consumption is 4.84 µW per module after scaling (synchronized oscillators: 4.59 µW, read-out: 0.19 µW, thresholding: 0.06 µW).
Although the oscillation speed is inversely proportional to the load capacitance, the phase transition time limits how far the operating frequency of the 9-oscillator modules can be increased. We found in simulation that if the oscillation period is too short, the phase transitions of the oscillators may overlap and make the synchronization less predictable. Consequently, a shorter phase transition time is needed for the module to operate faster. Fig. 22b shows 
System-Level Comparison
From the system perspective, the 10 × 10 parallel array of 9-synchronized oscillator modules (Fig. 18a) is used to achieve a higher processing throughput, saving the overhead of data conversion and transmission. Fig. 23 shows the sensor chip data paths, with the array of 100 proposed oscillator-based modules and with the conventional CMOS ASIC accelerator. Thanks to the low gate count of the proposed 9-oscillator module (about 80 transistors, approximately 20 gates, per module), it is more likely to be embedded in the same chip as the sensor units. In the conventional data path constructed from the ADC and off-chip ASIC, the pixel values are converted and transmitted before being processed.
Using multiple 9-oscillator modules in parallel, the proposed oscillator-based accelerator can achieve higher throughput and consume less power than the CMOS-based ASIC accelerator. Table 2 compares the specifications of the proposed oscillator-based and the conventional CMOS accelerators. Operating at 500 Mpixel/s, the power for the state-of-the-art sensor, based on [42] and [43], is 1.6 mW for the sensor and 6 mW for the ADC. For most general-purpose image sensor applications, the ADC resolution is 8 bits or above, so we use 8 bits for the applications in this work. In Fig. 23a, the proposed oscillator-based accelerator processes the pixel values without conversion, consuming 484 µW. In Fig. 23b, the CMOS-based data path with ADC and ASIC consumes 7.1 mW (6 + 1.1 mW), which is over 14X the power of the former. Meanwhile, the processed results, instead of the pixel values, are transmitted off the sensor chip. Compared to the CMOS data path using an 8-bit ADC, only 1/8 (1 bit/pixel versus 8 bit/pixel) of the data transmission bandwidth is used in the oscillator-based processing.
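The power comparison can be checked with a few lines of arithmetic. The sketch assumes the per-module and ASIC figures are in microwatts, consistent with the 484 uW array total quoted in the conclusion and the 1.1 mW ASIC term in the data-path sum; the variable names are illustrative.

```python
# Per-module power breakdown after scaling, in microwatts.
MODULE_POWER_UW = {"oscillators": 4.59, "readout": 0.19, "thresholding": 0.06}

module_uw = sum(MODULE_POWER_UW.values())  # one 9-oscillator module
array_uw = 100 * module_uw                 # 10 x 10 array of modules

asic_uw = 1100.0                           # CMOS ASIC deviation module
adc_uw = 6000.0                            # 8-bit ADC in the CMOS data path
cmos_path_uw = adc_uw + asic_uw            # ADC plus ASIC

module_power_ratio = asic_uw / module_uw       # single module versus ASIC
system_power_ratio = cmos_path_uw / array_uw   # data path versus array

# One thresholded bit per pixel instead of an 8-bit converted value.
bandwidth_fraction = 1 / 8
```

The two ratios reproduce the roughly 227X module-level and over 14X system-level power advantages stated in the text.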
RELATED WORK
Recently, the concept of using coupled oscillator systems to perform non-standard computation has gained significant attention. Several previous works have demonstrated coupled oscillator systems geared towards a variety of computing tasks.
In [44], [45], neural oscillator systems are shown to perform image segmentation based on degree of correlation. Those works focus primarily on the dynamics of neural oscillator networks and how they can be applied using a software model to accomplish the task of image segmentation. Further work on image segmentation using coupled oscillators appears in [46], which explored additional coupling models as well as additional modes of operation (frequency versus phase locking) and coupling (fixed nearest-neighbor versus all-to-all). Work in [47] has shown a system which utilizes both coupled oscillators and cellular neural networks (CNNs) to demonstrate contrast enhancement of images. In that work the CNNs and oscillator networks work together to perform the task of contrast enhancement in support of high quality image segmentation. In addition to segmentation, other forms of image processing have been researched using coupled oscillators. [33] further demonstrates the use of coupled oscillators to perform additional image processing tasks, namely edge detection and visual saliency. Using coupled Kuramoto oscillators [48], [49], [50], that work shows how the locking and synchronization time of a system of oscillators can be used to strongly identify edge pixels in an image and fuzzy regions of stark contrast within an image, which can correspond to visually salient regions.
All of the aforementioned approaches presume an array of many oscillators connected in a fully-connected 2-D network, and the stabilization is slowed by the higher order of dependency among the oscillators. In contrast to the network topology used in previous works, this work focuses on a modular 9-synchronized oscillator unit in which the oscillators are coupled through a common center node and stabilization is faster than in a fully-connected 2-D network. The VO$_2$-based architectures for implementation of the oscillator pairs have been demonstrated with RC time constant model simulations based on real devices fabricated using VO$_2$ [3]. That work relies on arrays of oscillators coupled in pairs and requires a significant portion of CMOS-based circuitry in order to read out the values from the array. In contrast, this work utilizes a different read-out architecture, based on a star-connected topology in which only a single read-out node, rather than each oscillator output node, needs to be monitored, thereby avoiding the complicated readout circuitry of previous approaches.
Template and pattern matching is demonstrated using coupled VCOs in [51]. The work details how the output of such an oscillator array may be used to determine a degree of match between two patches, or vectors, and therefore can be used to detect the template, from a set of templates, which is most similar to a test image. Extending this idea, [52], [53] show how such a correlation engine based on coupled oscillators may be used to implement parts of a much more complex image processing algorithm, HMAX, which is used for feature extraction. The most compute-intensive stages of the HMAX algorithm, Gabor convolution and template correlation, are retargeted to oscillator architectures for processing. In [34], [54], arrays of coupled VCO oscillators are demonstrated in an associative memory architecture for image recall and reconstruction. In those works, the addresses as well as the content of the memory are template images, and indexing is done by finding the template which most closely matches the input. Prior work targets the particular compute-intensive portions of an algorithm with dedicated, fixed-function oscillator accelerators. In contrast, this work explores the possibility of offloading the repetitive computations to a single, multi-function accelerator that is useful across many algorithms.
Those works demonstrate a variety of computational tasks that have been explored using coupled oscillator arrays. However, each of those architectures is geared specifically towards a given task. These rigid systems require multiple oscillator arrays to handle varying computational types as well as input sizes. This work proposes a dynamic architecture which includes dynamic adjustment and modulation of the input using control transistors. This not only allows online retargeting of the array towards different computing tasks, but also supports computations not previously explored, both within and outside of the image processing domain.
CONCLUSION
In this work, we show how HyperFETs, an emerging device based on IMT materials, align with three current power-reducing trends in emerging devices and architectures, namely steep-slope transistors, neuromorphic architectures, and non-Boolean processing paradigms. We describe the utility of HyperFETs as, or as part of, computational primitives in each of the three paths.
We present a case study in utilizing HyperFET-based nano-oscillators for visual computing, and validate a configurable circuit module of synchronized oscillators with multiple image preprocessing functions in addition to basic deviation measurement. Using different configurations, the response of oscillator discharging current to the input voltage is tunable to achieve a broader set of primitives within the oscillator-based processing module. Scaled to a size comparable to current CMOS technology nodes, the proposed 9-oscillator module operates 54X slower, but consumes 227X less power than a CMOS ASIC. The results also show that the 10 × 10 array of 9-synchronized oscillator modules is able to provide comparable throughput (928M Op/s), using only 1/14 of the power (484 µW) of the CMOS-based counterpart.
Baihua Xie received the BS degree in electrical engineering from Nanjing University, China, in 2014. He is a master's student with the Department of Electrical Engineering at the Pennsylvania State University. His research interest is mainly on hardware implementation of neural networks using emerging devices.
Nagarajan Ranganathan received the BE (honors) degree in electrical and electronics engineering from Regional Engineering College, Tiruchirapalli, University of Madras, India, in 1983 and the PhD degree in computer science from the University of Central Florida, Orlando, in 1988. He is currently a professor in the Department of Computer Science and Engineering at the University of South Florida, Tampa. During 1998 to 1999, he was a professor of electrical and computer engineering at the University of Texas at El Paso. His research interests include VLSI system design, design automation, energy and power optimization, biomedical information processing, crisis management and homeland security applications, computer architecture, and parallel computing.
He has developed many special purpose VLSI systems for computer vision, image processing, pattern recognition, data compression, and signal processing applications. He has published more than 200 papers in reputed journals and conferences and is a co-owner of five US patents. He served on the editorial boards for the journals: Pattern Recognition
