On-device intelligence has recently gained significant attention as it offers local data processing and low power consumption. In this research, an on-device training circuitry for threshold-current memristors integrated in a crossbar structure is proposed. Furthermore, alternate approaches of mapping the synaptic weights into fully-trained and semi-trained crossbars are investigated. In a semi-trained crossbar, a confined subset of memristors is tuned and the remaining subset is not programmed. This translates to better resource utilization and lower power consumption compared to a fully programmed crossbar. The semi-trained crossbar architecture is applicable to a broad class of neural networks. System level verification is performed with an extreme learning machine for binomial and multinomial classification. The total power for a single 4x4 layer network, when implemented in the IBM 65nm node, is estimated to be ≈ 42.16 µW and the area is estimated to be 26.48 µm x 22.35 µm.
In 2008, a successful physical implementation of a synapse-like device called the memristor was reported by Strukov et al. [Strukov et al. 2008]. Theoretically, the memristor was introduced by L. Chua in 1971 as a fourth fundamental electrical device that correlates flux and charge in a non-linear relationship [Chua 1971]. The memristor acts like a non-volatile memory element [Borghetti et al. 2010], consumes low energy [Prezioso et al. 2015; Merkel 2017], has a small footprint compared to transistors, and can be integrated in high density crossbar structures [Jo et al. 2010]. A key advantage of the crossbar structure is that it performs the most computationally intensive operations in neural networks (multiply-accumulate) concurrently while consuming a small amount of power compared to conventional implementations [Taha et al. 2014]. These properties make the memristor a natural choice for realizing neural networks efficiently within embedded device constraints. Typically, memristive devices are used to model the bipolar synaptic weights in neural networks. Because a memristor exhibits properties similar to those of a resistor, memristors can represent only a positive range of weights. Thus, either a hybrid CMOS-memristor [Soudry et al. 2015] or two memristors [Alibart et al. 2013; Hu et al. 2014] are used to model the bipolar synaptic weights. Although a synaptic weight modeled with one memristor is easier to train, it demands additional circuitry to generate the bipolar weights. On the other hand, using two memristors to model the synaptic weights reduces the power consumption, but increases the hardware complexity and complicates the training process.
Several research groups have studied the realization of synaptic weights in memristive devices while enabling on-device learning. To realize the synaptic weight with one memristor, Sah et al. proposed a memristor-based synaptic circuit which employs an H-Bridge and a doublet generator to perform positive and negative input-weight multiplication. However, it was not studied in the context of a multi-level network or a crossbar architecture. In 2015, Soudry et al. presented a memristor crossbar that supports on-chip online gradient descent. In this architecture, two transistors and a memristor were used to implement a synapse. This makes the total number of transistors in the crossbar scale linearly with the number of memristors [Soudry et al. 2015]. The use of two memristors to model the synaptic weights was studied by Alibart et al., who proposed a memristor-based single-layer perceptron to classify synthetic patterns of the letters 'X' and 'T'. Their design is trained using ex-situ and in-situ methods [Alibart et al. 2013]. Hasan et al. presented an on-chip training circuit to account for device faults and variability in memristor-based deep neural networks [Hasan et al. 2017]. This network was trained with auto-encoders and backpropagation and simulated in MATLAB for a classification application. When it comes to the extreme learning machine (ELM), which is the neural network algorithm used to verify our proposed architecture, few research groups have studied memristor-based implementations. In 2014, Merkel et al. proposed a memristor-based ELM implementation, but it was not studied within the context of a crossbar structure [Merkel and Kudithipudi 2014]. Later, in 2015, an OxRAM-based ELM architecture was proposed by Suri et al., in which the nano-scale device variability is exploited to design the ELM in an efficient manner [Suri et al. 2015]. Unfortunately, this work does not provide details about the hardware implementation and the training process.
It is also important to mention that most memristor-based neural network architectures proposed in the literature use threshold-voltage memristors. To the best of our knowledge, no design has explored on-device learning for threshold-current memristors integrated in a crossbar architecture.
This paper proposes on-device training circuitry for threshold-current memristors integrated into a crossbar structure. Moreover, the paper presents a different approach for mapping the synaptic weights onto a memristive crossbar such that bipolar weights are obtained. The proposed approach is based on a semi-trained crossbar structure (a combination of trained and untrained memristors), where the trained memristors model the synaptic weights and the fixed ones are used in association with the trained memristors to generate bipolar synaptic weights. The proposed design is simulated in Cadence Spectre and verified for classification applications in MATLAB using binomial (Diabetes and Australian Credit) and multinomial (Iris and MNIST) datasets [Lichman 2013; LeCun 1998]. For a single 4x4 layer network (crossbar and its associated control and training circuitries) implemented in the IBM 65nm technology node, the total power is estimated to be ≈ 42.16 µW, while the area is 26.48 µm x 22.35 µm.
The rest of the paper is organized as follows: Section 2 presents an overview of ELM. Sections 3 and 4 discuss the design methodology and the hardware analysis. The experimental setup is described in Section 5. Section 6 presents the experimental results and Section 7 concludes the paper.
2. OVERVIEW OF ELM
Extreme learning machine (ELM) is a multi-layer feed-forward neural network used in real-time regression and classification applications [Huang et al. 2004]. It has its roots in the random vector functional link (RVFL) networks proposed in 1994 [Pao et al. 1994]. Primarily, ELM is composed of three successive fully connected layers: input, hidden, and output. The input layer is used to present the input data to the network, whereas the hidden and output layers conduct the feature extraction and data classification, respectively. When the input data is presented to the network, it gets relayed to the hidden layer, where all the relevant and important features are stochastically extracted [Auerbach et al. 2014]. This is done by projecting the input data to a high-dimensional space through a large number of hidden neurons [Huang 2014]. The features extracted by the hidden layer are further relayed to the output layer, where the class label associated with the input is identified. A key feature of ELM is that the training is confined to the output layer synaptic weights, whereas the hidden layer weights are randomly initialized and left unchanged [Huang et al. 2006]. This feature speeds up the training in ELM and makes the algorithm attractive for hardware implementations as there is no need for back-propagation. Figure 1 illustrates the high-level architecture of an ELM. At runtime, each example in the input dataset is presented to the network as a pair. Each pair contains an input feature vector $X_p$ and its associated class label $t_p$, where $X_p \in \mathbb{R}^n$, $\forall p = 1, 2, \ldots, L$, and $L$ is the dataset size. Using Equation (1), the network feed-forward output can be computed, where $t_i^*$ represents the predicted output of the $i$-th output unit, $\forall i = 1, 2, \ldots, k$. Here, $k$ and $\eta$ denote the total number of neurons in the output and hidden layers, respectively, $b$ is the bias, and $f$ and $z$ are the activation functions of the hidden and output neurons, respectively.
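The feed-forward pass described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the uniform initialization range, the sigmoid hidden activation, and the identity output activation are all assumptions chosen for the example, and the exact form of Equation (1) is not reproduced here.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def make_hidden_layer(n_inputs, n_hidden, seed=0):
    """Randomly initialise hidden weights and biases; in ELM they are never trained."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_hidden)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    return weights, biases

def hidden_output(x, weights, biases, f=sigmoid):
    """Project the input vector x onto the hidden neurons (assumed activation f)."""
    return [f(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def elm_forward(x, weights, biases, beta, f=sigmoid, z=lambda v: v):
    """Return the k predicted outputs; beta[i][j] links hidden neuron j to output i."""
    h = hidden_output(x, weights, biases, f)
    return [z(sum(b_ij * h_j for b_ij, h_j in zip(beta_row, h)))
            for beta_row in beta]

# Example: 4 input features, 8 hidden neurons, 3 output neurons.
W, b = make_hidden_layer(n_inputs=4, n_hidden=8)
beta = [[0.0] * 8 for _ in range(3)]   # untrained output layer
prediction = elm_forward([0.1, 0.2, 0.3, 0.4], W, b, beta)
```

Only `beta` would be modified during training; `W` and `b` stay at their random initial values, which is what removes the need for back-propagation.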
Using the normal equation, Equation (2) (the hidden layer output inverse, $H^{-1}$, multiplied by the desired output class labels, $T$), to find the output layer weight matrix ($\beta$) is a common method in ELM [Huang et al. 2004; Kasun et al. 2013], as it offers faster convergence compared to its numerical counterpart. However, realizing the matrix inverse in hardware is cumbersome [Perina et al. 2017]. Rather than using the normal equation, the iterative delta rule algorithm [Jacobs 1988] is chosen. In the delta rule, a weight $\beta_{i,j}$ connecting the $j$-th neuron in the hidden layer to the $i$-th neuron in the output layer is updated according to Equation (3), where $\alpha$ is the learning rate, and $h_{i,j}$ and $(t_i^* - t_i^p)$ refer to the input and the output error of the $i$-th neuron, respectively.
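The iterative update can be illustrated with a short sketch. It assumes the common gradient-descent form of the delta rule, in which the weight moves against the product of the hidden activation and the output error $(t_i^* - t_i^p)$; the exact sign convention of Equation (3) is an assumption here, and the function name, the linear output neuron, and the numeric values are hypothetical.

```python
def delta_rule_step(beta, h, t_pred, t_true, alpha=0.1):
    """One in-place update: beta[i][j] -= alpha * h[j] * (t_pred[i] - t_true[i])."""
    for i, (tp, tt) in enumerate(zip(t_pred, t_true)):
        err = tp - tt                      # output error of the i-th neuron
        for j, hj in enumerate(h):
            beta[i][j] -= alpha * hj * err
    return beta

def predict(beta, h):
    """Linear output layer (identity activation assumed for the sketch)."""
    return [sum(b * hj for b, hj in zip(row, h)) for row in beta]

# Repeated updates on one training pair drive the prediction towards the label.
h = [1.0, 0.5]          # hidden-layer outputs (illustrative values)
target = [1.0]
beta = [[0.0, 0.0]]
for _ in range(200):
    delta_rule_step(beta, h, predict(beta, h), target, alpha=0.1)
final_prediction = predict(beta, h)[0]
```

No matrix inverse is needed: each step uses only a subtraction and a multiplication per weight, which is the operation set the hardware has to provide.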
3. DESIGN METHODOLOGY
3.1. Memristive Crossbar Network
To perform the matrix-vector multiplication in ELM, a memristive crossbar is used as it enables high-speed computations while maintaining low power consumption and area overhead. Unfortunately, the memristive crossbar structure offers only a positive range of synaptic weights. Therefore, two memristors are used to model each synaptic weight in a bipolar manner. Typically, this is achieved either by using two crossbars or one crossbar with dual input (in this work, dual input refers to a signal and its negation) [Alibart et al. 2013; Hu et al. 2014; Hasan 2016; Chakma et al. 2018]. Here, both approaches of realizing the synaptic weights in the memristive crossbar and their constraints are discussed. Furthermore, in each section, the proposed optimization approach, called the semi-trained crossbar, is investigated.
3.1.1. Two Crossbars topology. In this topology, two crossbars are employed to generate positive and negative weight ranges [Alibart et al. 2013; Hu et al. 2014]. Figure 2-(a) illustrates the use of two crossbars in emulating the synaptic weights, where each weight value is given by Equation (4). $R_f$ is the feedback resistance of the Op-Amp based subtractor, and $M^+_{i,j}$ and $M^-_{i,j}$ denote the memristor resistance at the crosspoint $(i, j)$ for the left (pink) and right (green) crossbars, respectively. By applying an input vector to the crossbars, the corresponding output will be generated as given by Equation (5), where $\beta$ is the synaptic weight matrix, which can be calculated using Equation (4).
It turns out that mapping the synaptic weights to two crossbar arrays complicates the learning process, as two crossbars need to be trained rather than one. Moreover, a consistent change in the memristors on the positive and negative crossbars must be sustained to ensure network convergence. Therefore, this research suggests a different approach (called semi-trained) to realize the synaptic weights. This approach utilizes only one crossbar associated with one fixed reference line. The reference line can be created either by using memristors or resistors. Figure 2-(b) illustrates the structure of the proposed approach. On the left side, a memristor crossbar is implemented to emulate the synaptic weights. On the right side, one column (denoted by $M^-$) of fixed memristors is used such that positive and negative weight ranges are obtained. By adopting this approach, the number of Op-Amps used at the bit-lines is reduced to almost half, which significantly reduces the hardware resources.
3.1.2. One Crossbar topology. One crossbar is used to model the synaptic weights in this topology. However, in this crossbar array, the input needs to be negated to achieve bipolar input-weight matrix-vector multiplication. Figure 3 depicts the schematic of the one-crossbar structure, where the input vector $X_p$ and its negation ($\sim X_p$) are introduced to the crossbar and are multiplied by the synaptic weight matrix $\beta$, as given by Equation (5). When it comes to training such a structure, again all the memristors in the crossbar need to be adjusted. By adopting the semi-trained approach, the $M^-$ memristors are set to a fixed value, whereas the $M^+$ memristors are trained to achieve the desired network convergence.
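To make the trade-off between fully-trained and semi-trained mappings concrete, the sketch below compares the weight range each one can reach. It rests on several assumptions that are not taken from Equation (4) itself: the weight is assumed to have the differential-conductance form $\beta = R_f (1/M^+ - 1/M^-)$, the resistance bounds (100 kΩ and 250 kΩ) and $R_f$ = 500 kΩ are borrowed from values quoted elsewhere in the paper, and the fixed reference is placed at the mid-point conductance purely as a hypothetical choice.

```python
def weight(m_plus, m_minus, r_f):
    """Assumed form of the weight: beta = R_f * (1/M+ - 1/M-)."""
    return r_f * (1.0 / m_plus - 1.0 / m_minus)

def weight_range(m_plus_bounds, m_minus_bounds, r_f):
    """Return (min, max) weight for given (LRS, HRS) bounds of M+ and M-."""
    lrs_p, hrs_p = m_plus_bounds
    lrs_m, hrs_m = m_minus_bounds
    lowest = weight(hrs_p, lrs_m, r_f)    # smallest G+ against largest G-
    highest = weight(lrs_p, hrs_m, r_f)   # largest G+ against smallest G-
    return lowest, highest

R_F = 500e3                 # feedback resistance (assumed value)
LRS, HRS = 100e3, 250e3     # memristor resistance bounds (assumed values)

# Fully trained: both M+ and M- can be programmed anywhere in [LRS, HRS].
full = weight_range((LRS, HRS), (LRS, HRS), R_F)

# Semi-trained: M- is fixed; the mid-point conductance is a hypothetical
# choice that yields a symmetric bipolar range.
M_FIXED = 2.0 / (1.0 / LRS + 1.0 / HRS)
semi = weight_range((LRS, HRS), (M_FIXED, M_FIXED), R_F)
```

Under these assumptions the semi-trained mapping still covers both signs, but over a narrower interval than the fully trained pair, in exchange for programming only the $M^+$ devices.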
3.1.3. Two Crossbars vs. One Crossbar. Although both crossbar topologies are capable of performing the intended function (bipolar matrix-vector multiplication), each approach imposes different constraints when it comes to hardware. The downside of using two crossbars to emulate the synaptic weights is that the input-weight multiplication is split into two parts. One of them is accomplished via the $M^+$ crossbar, whereas the second is done in the $M^-$ crossbar. Due to this separation, additional constraints are imposed on the network input and its weight range. Figure 4-(a) illustrates a circuit of one neuron with $n$ inputs; the output of the neuron is given by Equation (6) and Equation (7).
Because $V_x$ is computed by the first Op-Amp, its maximum value is always limited by the first Op-Amp's biasing voltages ($V_{dd}$ and $V_{ss}$), i.e. Equation (8) must be satisfied. Consequently, the input and weight ranges are limited, as is the crossbar size. In cases where only one crossbar is used to perform the input-weight matrix-vector multiplication, an additional inverter is needed to negate the input signal. However, in this network the input-weight multiplication is not segregated. Thus, using only one Op-Amp at the output suffices, as shown in Figure 4-(b). The output here is given by Equation (9). Since $t_i^*$ is associated with one Op-Amp, the constraints of Equation (8) do not apply here. Instead, Equation (10) must be satisfied, which implies that every single input feature multiplied by its corresponding weight is evaluated separately to be $\le (V_{dd} - V_{ss})$. This alleviates the constraints that arise when using two Op-Amps. Relaxing the input and weight range constraints gives more flexibility when it comes to hardware implementation. Moreover, large crossbars can be realized. Figure 5 demonstrates the input-weight multiplication performed using the neuron circuits from Figure 4, where all the input features ($X_n$, where $n = 2$) are assigned to $0.3\sin(\omega t)$, $R_f = R_x = 500$ kΩ, and $M^+ = M^- = 250$ kΩ. $V_{out2}$ and $V_{out1}$ denote the output of neuron-(a) and neuron-(b), which can be computed based on Equation (6) and Equation (9), respectively. Although the output in both cases should be the same (= 0 V), neuron-(a) gives an incorrect output as it violates the constraints in Equation (8) (notice that $V_x$ was clipped). This indicates that the neuron circuit in Figure 4-(b) can handle a wider input-weight range than the former.
Fig. 5: Input-weight multiplication performed using the neuron circuits from Figure 4, where all inputs ($X_n$) are assigned to $0.3\sin(\omega t)$, $R_f = R_x = 500$ kΩ, and $M^+ = M^- = 250$ kΩ. $V_{out1}$ and $V_{out2}$ denote the output of the one- and two-crossbar neurons, respectively.
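The clipping effect can be reproduced numerically with a simple behavioral model. The sketch below is an illustration under stated assumptions rather than a reconstruction of Equations (6) to (10): both neurons are modeled as ideal inverting summing amplifiers whose outputs saturate at the supply rails, the rail values of ±1 V are hypothetical, and the function names are invented for the example. The component values are the ones quoted for Figure 5, evaluated at the peak of the input sine wave.

```python
def clip(v, v_ss, v_dd):
    """Saturate an ideal Op-Amp output at its supply rails."""
    return max(v_ss, min(v_dd, v))

def neuron_two_opamps(x, m_plus, m_minus, r_f, r_x, v_ss, v_dd):
    """Two-crossbar neuron: the M+ sum is formed first (V_x), then subtracted."""
    v_x = clip(-r_f * sum(xi / mp for xi, mp in zip(x, m_plus)), v_ss, v_dd)
    v_out = -(r_f / r_x) * v_x - r_f * sum(xi / mm for xi, mm in zip(x, m_minus))
    return v_x, clip(v_out, v_ss, v_dd)

def neuron_one_opamp(x, m_plus, m_minus, r_f, v_ss, v_dd):
    """One-crossbar neuron: the input and its negation are summed in one stage."""
    total = sum(-xi / mp + xi / mm for xi, mp, mm in zip(x, m_plus, m_minus))
    return clip(-r_f * total, v_ss, v_dd)

# Peak of the 0.3*sin(wt) input on both features; rails are hypothetical.
x = [0.3, 0.3]
m = [250e3, 250e3]
V_SS, V_DD = -1.0, 1.0

v_x, v_out_two = neuron_two_opamps(x, m, m, 500e3, 500e3, V_SS, V_DD)
v_out_one = neuron_one_opamp(x, m, m, 500e3, V_SS, V_DD)
```

With these values the ideal intermediate voltage would be -1.2 V, so the first stage saturates and the two-Op-Amp neuron reports a non-zero result, while the single-stage neuron cancels the two contributions before any saturation can occur.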
x:8 A. Zyarah et al.
3.2. Delta Rule Algorithm
To simplify the on-chip learning process, the delta rule algorithm, described by Equation (3), is used. However, realizing the delta rule equation in hardware still requires non-trivial resources (a subtractor and a multiplier). Therefore, this work adopts the simplified delta rule equation used in [Hu et al. 2014] and shown in Equation (11).
Here, the weight changes according to the sign of the gradient only, with a fixed learning rate. Although this procedure slows down convergence, the resources used for the learning circuitry are significantly reduced. By applying the delta rule to the semi-trained crossbar structure, the iterative change in weight value that ensures network convergence can be computed. Recall that each synaptic weight is emulated by a fixed and a tuned memristor. By using Equation (4) and Equation (11), the net change in the memristor required to achieve the desired weight value can be calculated, as in Equation (12).
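A sign-based update and its translation into a resistance change can be sketched as follows. Both functions are illustrative assumptions: the first takes the simplified rule to be a fixed step applied against the sign of the gradient, and the second derives the change in the tuned memristor from the assumed weight form $\beta = R_f (1/M^+ - 1/M^-)$ with $M^-$ held fixed. Neither is claimed to match Equation (11) or Equation (12) term by term, and all names and values are hypothetical.

```python
def sign(v):
    return (v > 0) - (v < 0)

def sign_delta(h_j, t_pred, t_true, step):
    """Fixed-magnitude weight change; only the gradient sign is used."""
    return -step * sign(h_j) * sign(t_pred - t_true)

def memristor_change(m_plus, d_beta, r_f):
    """Resistance change of the tuned memristor for a weight change d_beta.

    With M- fixed, 1/M+_new = 1/M+ + d_beta/R_f, which gives
    dM = -d_beta * M+^2 / (R_f + d_beta * M+).
    """
    return -d_beta * m_plus ** 2 / (r_f + d_beta * m_plus)

# Prediction too low on a positive input: the weight must grow,
# so the tuned memristor resistance must drop.
d_beta = sign_delta(h_j=0.4, t_pred=0.2, t_true=1.0, step=0.5)
d_m = memristor_change(m_plus=250e3, d_beta=d_beta, r_f=500e3)
```

Because the weight is inversely related to the resistance, equal weight steps do not correspond to equal resistance steps, which is one reason a well-controlled programming current matters in the training circuit.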
4. SYSTEM DESIGN AND ANALYSIS
The system level architecture of ELM consists of two main layers: hidden and output. In this section, the main focus is on the output layer, as it has a similar structure to the hidden layer except that the output layer is integrated with the training circuit due to the need for synaptic weight adjustment. Figure 6 shows the architecture of the output layer, which essentially has three parts: memristive crossbar, neuron circuit, and training circuit. The memristive crossbar represents a single layer of ELM in which each column corresponds to one neuron connected to $R_n$ synapses modeled by memristors, where $R_n$ is the number of crossbar rows. The crossbar is responsible for evaluating the input-weight matrix multiplication, whereas the neuron and training circuits are responsible for performing a non-linear transformation on the crossbar bit-line outputs and adjusting the weights of the crossbar, respectively. Primarily, the proposed network runs in two phases: inference and learning. During the inference phase, the input vector ($X_p$) is fed to the network, where it gets multiplied by the synaptic weight matrix ($\beta$) to generate the output vector ($T^*$). The output of the network is evaluated by comparing it to the input class label, and the difference is reported either as logic '1', denoting that $t_i^* > t_i$, or logic '0' otherwise. The output of the error computing unit is stored in a shift register to be used in the learning phase. In the learning phase, the memristor resistances are adjusted according to the sign of the gradient and the learning rate. In this work, the tuning of the memristors is done column-by-column (training each column takes two clock cycles) through a modified Ziksa training circuit, which is modeled by +Tr and −Tr. Ziksa is used to form an H-Bridge across the memristors that require tuning; by allowing the current to flow through the device in both directions, bipolar weight change can be achieved (more details are provided in Section 4.1). It is important to mention that each Ziksa unit is controlled by local row and column units, which in turn are controlled by a global controller. The global controller determines when to enable the inference and learning phases and is responsible for synchronization. In the following subsections, each unit in the system architecture is discussed in detail.
4.1. Ziksa: Training Circuitry
Recall that the simplified delta rule algorithm calls for a fixed adjustment of the memristive weight. In the adopted memristor model, the changes in the memristive state variable depend on the current flowing in the device, as given by Equation (13) [Kvatinsky et al. 2013],
where $w$ is the memristor state variable, $k_{off}$, $k_{on}$, $\alpha_{off}$, and $\alpha_{on}$ are constants, $i_{off}$ and $i_{on}$ are the memristor current thresholds, and $f_{on}$ and $f_{off}$ describe the device window function. As the memristor exhibits properties similar to those of a resistor, the current flowing in the device is limited by its resistance. Thus, to satisfy the learning rule constraints, our previous design of Ziksa is modified in this work to accommodate this issue.
Fig. 7: Ziksa unit for adjusting a threshold-current memristor. The unit is mainly composed of four transistors that sandwich the memristor to allow the training current to flow bi-directionally through the memristor. This allows its value to be incremented and decremented.
Figure 7 illustrates the modified version of Ziksa, in which two current mirrors are used to limit the amount of current flowing in the memristor, which consequently ensures consistent adjustment of the memristor resistance. The circuit works as follows: during the first clock cycle of learning, which involves increasing the weight values in a selected column (this means decreasing the memristor resistance, as $\beta_{i,j} \propto 1/M_{i,j}$), current is provided to the memristor via $T_5$. To ensure a fixed change in the memristor value, this current is limited to $\approx I^-$ by a current mirror formed by $T_{1-2}$ on the other terminal of the memristor. During the second cycle of training, the weight is decremented by allowing the current, $\approx I^+$, to flow in the opposite direction.
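The threshold behavior of the state equation can be illustrated with a short function. The sketch follows the general form of the current-threshold model of [Kvatinsky et al. 2013], in which the state variable only changes when the current exceeds $i_{off}$ or falls below $i_{on}$. The window functions are replaced by constants for simplicity, and every numeric parameter below is a placeholder, not a value from Table I.

```python
def state_derivative(i, k_off, k_on, alpha_off, alpha_on, i_off, i_on,
                     f_off=1.0, f_on=1.0):
    """dw/dt of a threshold-current memristor (window functions held constant).

    Expects i_off > 0 and i_on < 0; currents between the two thresholds
    leave the state untouched.
    """
    if i > i_off > 0:
        return k_off * (i / i_off - 1.0) ** alpha_off * f_off
    if i < i_on < 0:
        return k_on * (i / i_on - 1.0) ** alpha_on * f_on
    return 0.0

# Placeholder parameters chosen only to exercise the three regions.
params = dict(k_off=1.0, k_on=-1.0, alpha_off=1.0, alpha_on=1.0,
              i_off=3.8e-6, i_on=-3.8e-6)

rate_inference = state_derivative(3.0e-6, **params)   # below threshold
rate_positive = state_derivative(4.0e-6, **params)    # above i_off
rate_negative = state_derivative(-4.0e-6, **params)   # below i_on
```

Because the rate depends on how far the current exceeds the threshold, holding the programming current constant (as the current mirrors in Ziksa aim to do) is what makes each training pulse produce a comparable change in state.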
In practice, holding the current through the memristor perfectly constant is difficult with current technology limitations. However, the variation in the current through the memristor while its state changes can be limited by using a cascode current mirror. The variation in the current through the memristor for regular and cascode current mirrors when using $I_{ref}$ = 4 µA is depicted in Figure 8-(a). The corner analysis, which considers fabrication process, ambient temperature, and supply voltage variations, is shown in Figure 8-(b).
4.2. Column and Row Local Control Units
Ziksa transistors are driven by local control units associated with each column and row. Recall that Ziksa has four transistors that form an H-Bridge sandwiching the memristors in the crossbar. Half of the H-Bridge transistors reside in +Tr and the other half in −Tr. Each −Tr is controlled by its associated column controller, shown in Figure 9-(b). The column local control unit consists of a combinational logic circuit that drives the $T_5$ and $T_6$ transistors of −Tr. When the learning phase begins, the column controller receives two signals: ColEn and Polar. The former determines whether a column is selected for training, whereas the latter refers to the system training cycle, which can be either positive or negative. During the positive cycle of training, i.e. Polar = '0', all the weights that need to be incremented are adjusted, while the weights that need to be decremented are tuned in the negative cycle of training. When a column is enabled by ColEn, the $T_5$ and $T_6$ transistors of −Tr are controlled in an alternating way. During the low cycles of the Polar signal, $P_T$ is set low to enable transistor $T_5$, whereas $T_6$ is turned off via $N_T$. This allows the current to flow towards the end-terminal of the crossbar row to increase the weight, and vice versa for the high period of the Polar signal. In the case of +Tr, its transistors are controlled via the current mirrors formed with transistors $T_1$ and $T_4$. The output of +Tr is connected to a tri-state gate controlled by the row controller, illustrated in Figure 9-(a). The row controller is more complex than the column controller because the gradient sign of the delta rule is evaluated here from the input signal (input) and the computed network error (Error). Based on the gradient sign, the memristor resistance is either incremented or decremented. However, this is carried out only when the training process is enabled via TrEn. According to the Polar signal state, the memristor resistances are modulated during the appropriate training cycle. It is important to mention that the ColEn, Polar, and TrEn signals are provided to the local controllers by the global controller.
Fig. 8: (a) The variation in the current supplied to the memristor while sweeping its value between its low and high resistance states. (b) The variation in the current through the memristor during corner analysis.
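The behavior of the two local controllers can be summarized as Boolean functions. The sketch below is a behavioral guess based only on the description above, not the gate-level circuits of Figure 9: it assumes $T_5$ is switched on by an active-low $P_T$ and $T_6$ by an active-high $N_T$, that the input sign and the Error bit are each available as a single bit, and that Error = 1 means the prediction exceeds the label. All function and argument names are hypothetical.

```python
def column_controller(col_en, polar):
    """Return (p_t, n_t) gate signals for the -Tr transistors T5 and T6.

    Assumed polarity: p_t is active-low (0 turns T5 on),
    n_t is active-high (1 turns T6 on).
    """
    p_t = 0 if (col_en and not polar) else 1   # T5 on in the positive cycle
    n_t = 1 if (col_en and polar) else 0       # T6 on in the negative cycle
    return p_t, n_t

def row_controller(input_positive, error, tr_en, polar):
    """Return 1 when the row's tri-state gate should pass the training current.

    The weight is incremented when a positive input produced an output that
    was not above the label (error = 0), or a negative input produced one
    that was (error = 1); otherwise it is decremented.
    """
    increment = bool(input_positive) != bool(error)
    positive_cycle = not polar
    return int(bool(tr_en) and (increment == positive_cycle))

# Positive cycle (polar = 0) on an enabled column: T5 conducts, T6 is off.
signals_positive = column_controller(col_en=1, polar=0)
# A row that needs a weight increase is only enabled in the positive cycle.
enable_in_c1 = row_controller(input_positive=1, error=0, tr_en=1, polar=0)
enable_in_c2 = row_controller(input_positive=1, error=0, tr_en=1, polar=1)
```

Splitting the decision this way keeps the column logic independent of the data: only the row controller needs the input sign and the stored error bit.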
4.3. Global Controller
All the layers in the proposed system architecture are controlled by a global controller, which takes care of data flow and unit synchronization. The global controller runs in three main states. In the first state (Read), the global control unit enables the pass transistors to allow the input signals to propagate through the crossbar to perform the input-weight matrix multiplication. Once this is done, the output of the network is captured and the next state (Train C1) starts. In this state, the first round of training, the positive cycle, is performed. Thus, the signals that control the local controllers must be generated. For the column controller, if a column is selected for training, its associated ColEn is set to '1', whereas the Polar signal is set low, indicating the positive cycle of training. In the case of the row controller, TrEn is set high, which, in association with the input and computed error signs, determines the synaptic weights that need to be incremented. When the global controller enters the third state (Train C2), the negative cycle of training commences, and the same process as in the previous state is repeated except that the Polar signal is set high. The global control unit keeps moving between the second and third states until all the columns of the crossbar are trained. Figure 10 is an algorithmic state machine chart demonstrating the transitions between the states and the output of each one.
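The state sequence described above can be written as a small generator. This is a behavioral sketch, not the chart in Figure 10: the tuple layout, the use of `None` for "no column selected", and the value reported for Polar during the Read state (where the text does not specify it) are assumptions.

```python
def global_controller(n_columns):
    """Yield (state, selected_column, polar, tr_en) for one input sample."""
    # Inference: inputs propagate through the crossbar, no training signals.
    yield ("Read", None, 0, 0)
    for col in range(n_columns):
        yield ("Train C1", col, 0, 1)   # positive cycle: increment weights
        yield ("Train C2", col, 1, 1)   # negative cycle: decrement weights

# One pass over a 4-column crossbar.
schedule = list(global_controller(4))
training_cycles = sum(1 for state, *_ in schedule if state != "Read")
```

For a 4-column crossbar this gives one Read step followed by eight training steps, matching the two clock cycles per column stated earlier.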
5. EXPERIMENTAL SETUP
The Verilog-A memristor model proposed by [Kvatinsky et al. 2013] is employed in this work. The memristor value is set based on the device parameters described in [Fan et al. 2014; Kawahara et al. 2008] such that it meets the network and technology constraints, shown in Table I. The high resistance state (HRS) of the memristor is selected to be 250 kΩ, so that the voltage developed across the memristor does not drive the current mirror transistor out of saturation. On the other hand, to keep the power dissipation through the crossbar network as low as possible, the low resistance state (LRS) is set to 100 kΩ. Recall that the proposed system runs in two phases: inference and learning. In the inference phase, no change in memristor values should occur. Since a threshold-current memristor is adopted to model the synaptic weights, the current through these devices should never exceed the device threshold, to avoid undesired changes in the memristors. In contrast, during the learning phase, the current must cross the device threshold current to adjust the device resistance. According to the experimental setup used in this work, there is a variation in the current through the memristor during the inference phase. However, this variation is estimated to be ≈ 0.6 µA. Thus, by using a current of 4 µA during the learning phase and limiting the input current during the inference phase to 3 µA, no overlap between the two phases can occur.
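The separation between the two phases follows from simple arithmetic on the figures quoted above. The worked check below treats the ≈ 0.6 µA variation as a worst-case increase on top of the 3 µA inference limit, which is an assumption about how the variation is defined; the symbols $I_{inf}$, $\Delta I$, and $I_{learn}$ are introduced here only for illustration.

```latex
\begin{align*}
I_{inf,\max} + \Delta I &= 3\,\mu\text{A} + 0.6\,\mu\text{A} = 3.6\,\mu\text{A} \\
I_{learn} - (I_{inf,\max} + \Delta I) &= 4\,\mu\text{A} - 3.6\,\mu\text{A} = 0.4\,\mu\text{A} > 0
\end{align*}
```

Under this reading, a margin of about 0.4 µA remains between the largest inference current and the learning current, so a device threshold placed inside that gap would separate the two phases.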
The following constraints must be fulfilled for the crossbar pass transistors:
- Limit the input voltage such that $V_{DS} \ll |V_{GS} - V_{Th}|$. This ensures that the transistor is working in the triode region and an undistorted signal reaches the memristors.
- Assume that $K(V_{GS} - 2V_{Th}) \gg G_{mem}(w)$ such that the conductance of the transistor is higher compared to the memristor conductance. Thus, the voltage at the memristor is ≈ the input voltage [Soudry et al. 2015].
6. EXPERIMENTAL RESULTS AND NETWORK VERIFICATION

6.1. Network Verification
To verify the operation of the proposed design, each unit in the network is simulated independently and within a network in the Cadence Spectre environment. Then, the same network is emulated in MATLAB and simulated for classification applications under the same circuit constraints but with different configurations. The benchmarks employed in this work are selected from the UCI library and chosen to be binomial (Diabetes and Australian Credit) and multinomial (Iris). In addition, the standard multi-class hand-written digits dataset, MNIST, is used. Figure 11 depicts the weight distribution of the output layer when using the MNIST dataset and the achieved accuracy for each dataset during the training and inference phases. Furthermore, the variation in accuracy that may occur due to the process variation of memristors and random weight initialization is shown (variation for 5 iterations, each iteration is averaged over 10 runs). Table II compares the achieved accuracy with previous ELM implementations for the same datasets. Although the proposed work offers lower classification accuracy, it has a simpler network, as the number of hidden neurons is much lower. The performance degradation in the network can primarily be attributed to the input voltage and weight range constraints imposed on the network. These constraints are due to the memristors' limited resistance range and the neuron biasing voltage.
6.2. Network Scalability
To estimate the resources needed to map a large neural network to the proposed design, a full custom design of a small-scale (4x4) single layer network is implemented in Cadence using the IBM 65nm technology node. Figure 12 shows the scaling of a single layer neural network as its size is increased exponentially from 2x2 up to 128x128; the scaling tends to be linearly proportional to the total transistor count.
6.3. Power Consumption
The total power consumption of the proposed design is estimated for a 4x4 single layer neural network. The power consumption is evaluated in the Cadence Spectre environment while running the system at 100 MHz. Considering the worst case scenario (all crossbar memristors are set to the low resistance state), the power consumption is estimated to be 40 µW for each 4x4 crossbar and 0.63 µW for the digital circuit (the Op-Amp power is not considered). It is important to mention that the power consumption in a memristor while keeping the device resistance unchanged is assumed to be similar to that of a resistor [Marani et al. 2015]. Table III compares the power consumption of different crossbar architectures and different approaches of realizing the synaptic weights. The power consumption of each architecture is obtained by covering all the input combinations and averaging the results. It can be noticed that the semi-trained two-crossbar architecture offers the minimum power consumption, as it is compact and uses almost half the number of memristors compared to other designs.
7. CONCLUSIONS
This paper investigates a new approach for realizing positive and negative synaptic weights in a crossbar structure using threshold-current memristors. The proposed approach relies on one memristive crossbar to model the weights and uses additional fixed (untrained) columns to generate bipolar weights. Moreover, the paper presents an updated version of the on-device training circuit, Ziksa, which can be used to modulate threshold-current memristors in a crossbar structure. The proposed network is tested for classification applications with binomial and multinomial datasets while considering memristor device variations. It is found that the process variations have a limited effect on the network performance for large datasets. In these cases, the network has better generalization. In scenarios where power efficiency is a constraint, the semi-trained network for the two-crossbar topology is preferred. Future work will investigate the network performance while considering crossbar resistance, noise effects, and other process variations.
