ABSTRACT Emerging Magnetic Random-Access Memory (MRAM) has shown great potential to replace Static RAM (SRAM) and Dynamic RAM (DRAM) in working memories, including cache and main memory. MRAM offers high density, fast access, low standby power, and non-volatility, which makes it attractive for accelerating computation. In this paper, we present a transpose MRAM (T-MRAM), based on the current MRAM family (Spin-Transfer Torque, or STT, and Spin-Orbit Torque, or SOT, MRAM), that supports read/write operations in both the vertical and horizontal dimensions. More importantly, a multi-port 1-Read-1-Write (1R1W) scheme is developed for the proposed T-SOT-MRAM. Thanks to its ultra-fast speed and high power efficiency, the proposed T-SOT-MRAM can be used to accelerate the computation of Convolutional Neural Networks (CNNs) and video coding. Functional validation is performed with a hybrid CMOS/MRAM simulation. According to the evaluation results, the proposed T-SOT-MRAM obtains a 1.8× speedup and a 28% energy saving for 1R1W operations compared to the baseline. In addition, the T-MRAMs consume less leakage power than their counterparts. Furthermore, the smaller cell area of the proposed T-MRAMs allows the capacity to be increased in order to improve performance efficiency.
I. INTRODUCTION
Transpose memory (T-memory) has been widely used for the computation of spiking neural networks (SNNs), video coding, and Convolutional Neural Networks (CNNs) [1]-[3]. In SNNs, the T-memory was used to support both row and column accesses for pre-synaptic row and post-synaptic column updates, respectively [1]. The T-memory enables single-cycle write and read access in both the row and column dimensions, which efficiently enhances on-chip learning. In the inverse discrete cosine transform (IDCT) of high-efficiency video coding, the T-memory was developed to reduce cost and to improve throughput [2]. In addition, the popular CNNs have employed the T-memory to support efficient data transformation and data reuse [3].
The associate editor coordinating the review of this article and approving it for publication was Sourabh Khandelwal.
Normally, the memory cell and block of a transpose memory require additional transistors (i.e., seven or eight transistors) compared to the standard six-transistor memory cell of static RAM (SRAM) in order to support cell-level and block-level transpose read/write operations. However, the large memory cell and the additional transistors of SRAM induce a large area overhead and high leakage power consumption. In addition, the conventional transpose SRAM only supports a 1-read (1R) or a 1-write (1W) operation during the computation. A modified SRAM cell can support the 1R1W operation, but at the expense of discarding the transpose operation [4].
Recent developments in magnetic RAM (MRAM) promote advanced low-power memory architectures [5]. In the MRAM family, the spin-transfer torque MRAM (STT-MRAM) has become one of the most popular candidates for on-chip and off-chip applications [6]-[8]. The typical one-transistor one-resistor (1T1R) bit-cell structure of the STT-MRAM is shown in Fig. 1(a), where both the read and write currents flow through the memory element (i.e., the magnetic tunnel junction, or MTJ). As an alternative to the STT-MRAM, the emerging spin-orbit torque MRAM (SOT-MRAM) has attracted increasing attention due to its faster access speed and lower energy dissipation [9]-[13]. Furthermore, the read and write paths are separated in the SOT-MRAM (see Fig. 1(b)), which provides more flexibility in the design methodology [14]. Typically, an MRAM array only supports regular row-by-row access. By repurposing the peripheral circuits, a specific MRAM array can perform the transpose operation in addition to data storage.
In this paper, we propose a transpose MRAM (T-MRAM) that supports access operations in both the row dimension (horizontal) and the column dimension (vertical) with area- and energy-efficient memory cells. Both STT- and SOT-based T-MRAMs are developed to implement the proposed transpose operation. More importantly, a multi-port 1R1W scheme is developed for the T-SOT-MRAM to improve the access efficiency of the memory array. Potentially, our proposed 1R1W T-MRAM can be used to accelerate the computation of matrix transpose and circulant matrix-based CNNs [3]. We discuss a variety of usages of the proposed T-MRAM by performing hybrid CMOS/MRAM simulation. According to the evaluation results, the proposed 1R1W T-SOT-MRAM can achieve a 1.8× speedup and a 28% energy saving for 1R1W operations compared to the 8-transistor (8T) T-SRAM baseline.
The remainder of this paper is organized as follows. Section II provides the background on the transpose memory, circulant CNNs, and SOT-MRAM. Section III gives the design of the array structure and the operations of the proposed T-MRAM. The experimental setup and evaluation results are presented in Section IV. Finally, Section V concludes this work.
II. BACKGROUND
This section provides an overview of the transpose memory (T-memory) and of circulant CNNs, and introduces the basics of SOT-MRAM.
A. TRANSPOSE MEMORY
Conventionally, the transpose memory (T-memory) was developed based on the SRAM. Additional transpose transistors are added to the standard memory cell to enable the column-based (vertical) operations, as shown in Fig. 2. In Fig. 2(a), two transpose transistors are connected to the vertical bit-lines (BL and BL T), and an additional word-line (WL T) is used to control those two transpose transistors [1]. Fig. 2(b) shows the memory cell of the 7T transpose SRAM, where a decoupling transistor is added for the conventional and transpose read operations [15]. Recently, a 6T block-wise transpose SRAM was developed [3], in which two additional transpose transistors are required to assist the transpose operation at the block level. The works mentioned above are based on SRAM technology, which requires more transistors and consumes high leakage power. Besides, the capacity of the transpose SRAM is limited by the area overhead of the large memory cell [1], [16]. According to the recent literature, the factors that influence the performance of the T-memory include the number of transistors in the transpose cell, the transpose scheme, and the supported functions [3]. These factors affect the T-memory in terms of array size, delay, and energy dissipation. Inspired by circulant-matrix CNNs and the high power efficiency of MRAM, we present the T-SOT-MRAM, where the memory cell consists of only two transistors to support the transpose operation. With the modified structure, our proposed T-SOT-MRAM can perform the multi-port 1R1W operation thanks to the ultra-fast write operation of SOT-MRAM.
B. CIRCULANT CNNS
Traditional deep CNNs are both compute- and memory-intensive. Even though several approaches have demonstrated good performance, including weight quantization and pruning techniques [17], [18], CNNs still suffer from limitations in terms of irregular network structure, increased training complexity, and decreased inference accuracy [19]. A block-circulant matrix-based CNN, namely CirCNN, which utilizes Fast Fourier Transform (FFT)-based fast multiplication, was proposed to overcome those limitations [15], [19]. The forward propagation process in the inference phase is expressed by,
where $Y_i$ represents a column vector. $W_{ui}$ is a vector of the weight parameters, where
After that, the calculation of $W_{ui} X_i$ can be performed as $\mathrm{IFFT}(\mathrm{FFT}(w_{ui}) \otimes \mathrm{FFT}(X_i))$, according to the circulant convolution theorem [20]. The detailed algorithm to implement the inference and back-propagation processes can be found in [19]. In the calculation, $\otimes$ denotes the element-wise multiplication, which can be accelerated by the transpose memory to improve the opportunity for data reuse [3]. During the transformation for data reuse, the 128 × 1 bit-serial FFT data are written horizontally into the memory array and then read out vertically from the same array to accelerate the computation. In this paper, we employ the SOT-MRAM to replace the conventional SRAM-based T-memory, exploiting the advantages of the SOT-MRAM in terms of high density, high read/write speed, and low power consumption. Implementing the full function of the CirCNN is beyond the scope of this paper and is left for future work.
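The FFT-based product mentioned above can be checked numerically. The following is a minimal sketch, not the CirCNN implementation: it assumes the circulant matrix is defined by its first column `w`, uses a naive DFT in place of a real FFT to stay dependency-free, and all function names and sample values are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stands in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def circulant_matvec_direct(w, x):
    """Multiply the circulant matrix whose first column is w by the vector x."""
    n = len(w)
    return [sum(w[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def circulant_matvec_fft(w, x):
    """Same product computed as IFFT(FFT(w) * FFT(x)), element-wise multiply."""
    fw, fx = dft(w), dft(x)
    prod = [a * b for a, b in zip(fw, fx)]
    return [v.real for v in dft(prod, inverse=True)]

# Illustrative values only (assumption, not data from the paper).
w = [1.0, 2.0, 3.0, 4.0]
x = [0.5, -1.0, 2.0, 0.0]
direct = circulant_matvec_direct(w, x)
via_fft = circulant_matvec_fft(w, x)
```

Both routes give the same vector up to rounding error, which is why only the defining vector of each circulant block needs to be stored and transformed.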
C. SOT-MRAM
SOT-MRAM features high speed and energy efficiency. Its storage element consists of a magnetic tunnel junction (MTJ) above a heavy metal (HM) strip (SOT-MTJ, as shown in Fig. 1(b)). The key structure of the MTJ is a tunnel barrier sandwiched between two ferromagnetic layers, named the pinned and free layers, which have fixed and switchable magnetizations, respectively. The resistance of the MTJ can be set to a low or high value by switching the magnetization of the free layer to be parallel (P) or anti-parallel (AP) to that of the pinned layer. The magnetization switching is achieved by the SOT, which is induced by applying an in-plane current to the heavy metal layer [21]-[25]. Recent works demonstrated that both the spin Hall effect and the Rashba effect could contribute to the SOT. Combined with a magnetic field or exchange bias [26], [27], the SOT can result in deterministic switching in the perpendicular MTJ at high speed and low energy dissipation. The bit cell of the SOT-MRAM requires two transistors rather than the one transistor of the STT-MRAM bit cell, as shown in Fig. 1(a) and (b). Despite the additional transistor, the bit cell of the SOT-MRAM induces only a slight area overhead, since the required write current is much lower than that of the STT-MTJ. In addition, the read and write paths are separated in the SOT-MTJ, which is suitable for developing multi-port MRAM [28], [29]. Based on the transient analyses shown in Fig. 1(a) and (b), the STT-MTJ requires a long incubation delay to overcome the thermal barrier. The SOT-MTJ can eliminate the incubation delay thanks to the high efficiency of the SOT. The critical parameters of both the STT-MTJ and the SOT-MTJ are listed in Fig. 1(d). Fig. 1(c) shows the sense amplifier circuit of the SOT-MRAM, which includes two branches that connect to the memory cell and the reference cell.
The reference cell is used for comparison with the memory cell, where the resistance of the reference cell is normally $(R_P + R_{AP})/2$ or $\sqrt{R_P \times R_{AP}}$. Here, the reference cell can be a resistor, and $R_P$ and $R_{AP}$ are the resistances of the MTJ in the P and AP states, respectively. In our work, we propose a high-speed and power-efficient T-SOT-MRAM structure to support the transpose and 1R1W operations.
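The two reference choices can be compared with a short calculation. This is a minimal sketch under stated assumptions: the resistance values are hypothetical (they are not the parameters of Fig. 1(d)), and `sense` is an idealized comparison that ignores the sense amplifier dynamics.

```python
import math

def reference_resistance(r_p, r_ap, mode="arithmetic"):
    """Reference resistance placed between the P and AP resistances."""
    if mode == "arithmetic":
        return (r_p + r_ap) / 2.0
    if mode == "geometric":
        return math.sqrt(r_p * r_ap)
    raise ValueError("mode must be 'arithmetic' or 'geometric'")

def sense(r_cell, r_ref):
    """Idealized sensing: a cell above the reference is read as AP, else P."""
    return "AP" if r_cell > r_ref else "P"

# Hypothetical MTJ resistances in ohms (assumption for illustration).
R_P, R_AP = 5000.0, 10000.0
ref_arith = reference_resistance(R_P, R_AP, "arithmetic")
ref_geo = reference_resistance(R_P, R_AP, "geometric")
states = (sense(R_P, ref_arith), sense(R_AP, ref_arith))
```

The geometric mean is never larger than the arithmetic mean, so it sits closer to the P resistance; which choice gives the better sensing margin depends on the sensing circuit.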
III. TRANSPOSE SOT-MRAM ARCHITECTURE
This section gives an overview of the proposed transpose SOT-MRAM (T-SOT-MRAM) based on hierarchical BL and SL switching. The detailed design of the transpose memory cell (Trans-cell) is then introduced together with its operations. Furthermore, we discuss a multi-port 1-read-1-write (1R1W) scheme to improve the access efficiency of the proposed T-SOT-MRAM.
A. T-SOT-MRAM ARRAY
Fig. 3 shows the conventional STT-MRAM and SOT-MRAM architectures, which consist of control logic and a memory array. The control logic contains the memory controller, write driver, X-decoder, Y-decoder, SA, and output driver. In the memory array, the BL, SL, RWL, and WWL (or the WL in the STT-MRAM) connect the memory cells. Compared to the STT-MRAM shown in Fig. 3(a), where the WL controls both the read and write operations, the SOT-MRAM shown in Fig. 3(b) requires the RWL and WWL to control the read and write operations, respectively. Typically, a large memory array is divided into several sub-arrays to improve stability, and global BLs are employed to communicate with the local BLs. Besides, an SA can be shared by numerous columns, with a multiplexer selecting the activated columns. These structures only support the horizontal read operation, in which the data are read out row by row. Beyond these structures, we design the T-SOT-MRAM array by adding transistors to the sub-array to support the vertical read operation. Fig. 4(a) shows the architecture of the proposed T-SOT-MRAM array, which contains the horizontal I/O (HI/HO), the additional vertical I/O (VI/VO), fine-grained sub-arrays, and vertical and horizontal RLs (VRL and HRL). The VI/VO supports the transpose operation by activating the VRL and the corresponding RLs. The HRL and HI/HO are used for the normal access operation. With this T-SOT-MRAM, both the normal and transpose access operations are supported, which accelerates the computation for transpose and circulant matrices. In the following, we introduce the T-SOT-MRAM in more detail.
It is worth mentioning that the T-STT-MRAM could be designed with a similar structure.
B. DESIGN AND OPERATION OF THE SUB-ARRAY
Fig. 4(b) shows the sub-array structure of the T-SOT-MRAM, which contains several memory cells and a Trans-cell. The Trans-cell provides the normal and transpose access operations, controlled by the HRWL and VRWL. The HGBL and VGBL are connected to the HSA and VSA, respectively.
1) NORMAL ACCESS MODE
In the normal access mode, the data are accessed row by row. One RWL or WWL is activated to select one row. The HRWL and VRWL corresponding to the selected row are also activated for the read and write operations. During the read operation, the RWL is activated and the WWL is deactivated, so that a discharge current of the HSA flows from the HGBL to the local BL and through the SOT-MTJ (red dashed line in Fig. 4). During the write operation, the WWL is activated and the RWL is deactivated. The HGBL is disconnected from the HSA and is driven to a positive or negative voltage, depending on the polarity of the written data. Meanwhile, the VGBL is connected to ground, so that a current passes through the heavy metal layer from the local BL to the SL or from the SL to the local BL. The binary data are thereby written into the memory cell.
2) TRANSPOSE ACCESS MODE
In the transpose access mode, the data are accessed column by column. One VRWL is activated to select one column. The RL, WWL, and HRWL corresponding to the selected column are also activated. Note that a cell can be accessed only if both its RL (or WWL) and its VRWL are activated. In other words, within the selected column, only one cell in each sub-array is activated. During the read operation, the RL is activated and the WWL is deactivated, so that a discharge current of the VSA flows from the VGBL to the SL and through the SOT-MTJ (blue dashed line in Fig. 4). During the write operation, the VGBL is disconnected from the VSA and is driven to a positive or negative voltage. The write current flows through the heavy metal layer and programs the binary data into the memory cell, based on the same principle as in the normal access mode.
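The two access modes can be summarized with a purely behavioral model. This is a minimal sketch that abstracts away the sub-arrays, Trans-cells, and all circuit-level signals; the class and method names are hypothetical and are not part of the proposed design.

```python
class TransposeMemory:
    """Behavioral model of a bit array with row (normal) and column (transpose) access."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]

    def write_row(self, r, bits):
        """Normal access mode: write one whole row."""
        if len(bits) != self.cols:
            raise ValueError("row width mismatch")
        self.cells[r] = list(bits)

    def read_row(self, r):
        """Normal access mode: read one whole row."""
        return list(self.cells[r])

    def write_col(self, c, bits):
        """Transpose access mode: write one whole column."""
        if len(bits) != self.rows:
            raise ValueError("column height mismatch")
        for r, b in enumerate(bits):
            self.cells[r][c] = b

    def read_col(self, c):
        """Transpose access mode: read one whole column."""
        return [self.cells[r][c] for r in range(self.rows)]

# Write a matrix row by row, then read it column by column to obtain its transpose.
matrix = [[1, 0, 0, 1],
          [0, 1, 1, 0],
          [1, 1, 0, 0],
          [0, 0, 1, 1]]
mem = TransposeMemory(4, 4)
for r, row in enumerate(matrix):
    mem.write_row(r, row)
transposed = [mem.read_col(c) for c in range(4)]
```

Writing in one mode and reading in the other yields the transposed matrix without any data movement between the two steps.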
C. MULTI-PORT 1R1W SCHEME
During the normal and transpose access modes of the T-SOT-MRAM, only 1R or 1W can be performed owing to the shared BL structure shown in Fig. 3 (b) . As SOT-MTJ enables independent write and read operations, an array structure with separated BLs can be developed for the 1R1W operation within one clock cycle, as shown in Fig. 4 (c) .
1) ARRAY MODIFICATION
The original BL is split into a WBL and an RBL for the write and read operations, respectively. We tag the columns as odd and even to distinguish the selected and unselected columns, as indicated in Fig. 4(c).
2) CONTROL LOGIC MODIFICATION
For supporting the 1R1W operation, the control logic of the memory array should be modified to provide fast switching; meanwhile, an additional mask logic is required. Firstly, decoders are modified to allow multi-row and multi-column activations within the same clock cycle. Moreover, the mask logic is designed to ensure that read and write operations cannot simultaneously occur in the same selected row or column. Then, the SA and write driver are respectively connected to RBL and WBL, so that read and write operations could be independently controlled.
3) PRINCIPLE OF OPERATION
To simplify the control logic, the read and write operations are performed in the odd and even columns, respectively, or vice versa. Thanks to the SOT-induced ultra-fast switching, the read and write operations can be completed within one clock cycle, as shown in Fig. 5(c). Initially, the negative signal precharges the SAs. Then a short read-enable pulse is triggered at the positive edge of the clock signal to sense the stored data of the selected odd column. Afterward, a write-enable pulse is applied to the WWL and activates one row. As a result, a current flows through the heavy metal layer between the WBL and SL of the selected even column, generating the SOT that programs the memory cell. The programmed data depend on the direction of the current, and hence on the voltages applied to the WBL and SL. The same operations also apply to the transpose access mode.
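The odd/even column split and the mask logic can be illustrated with a behavioral sketch of one 1R1W clock cycle. This is an assumption-laden model rather than the actual control logic: columns are indexed from 0 (so parity 1 stands for "odd"), the mask is reduced to forbidding the same row for read and write, and the class name, method name, and parameters are hypothetical.

```python
class OneReadOneWriteArray:
    """Behavioral model: one read and one write completed in the same clock cycle."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]

    def cycle(self, read_row, write_row, write_bits, read_parity=1):
        """Read the columns of one parity in read_row, write the other parity in write_row."""
        # Mask logic (simplified assumption): read and write may not share a row.
        if read_row == write_row:
            raise ValueError("masked: read and write target the same row")
        write_parity = 1 - read_parity
        read_cols = [c for c in range(self.cols) if c % 2 == read_parity]
        write_cols = [c for c in range(self.cols) if c % 2 == write_parity]
        if len(write_bits) != len(write_cols):
            raise ValueError("write data width mismatch")
        # Read phase comes first, mirroring the read-enable pulse at the clock edge.
        data = [self.cells[read_row][c] for c in read_cols]
        # Write phase follows within the same cycle.
        for c, b in zip(write_cols, write_bits):
            self.cells[write_row][c] = b
        return data

arr = OneReadOneWriteArray(4, 4)
arr.cells[0] = [1, 0, 1, 1]          # preload row 0 for the demonstration
read_out = arr.cycle(read_row=0, write_row=1, write_bits=[1, 1])
```

Here the odd columns of row 0 are read while the even columns of row 1 are written, so the two operations never contend for the same bit-line.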
IV. EVALUATION RESULTS AND ANALYSES
This section provides the evaluation setup and the functional validation results of the proposed T-MRAM architectures. The results and analyses of the performance comparisons are then discussed.
A. EVALUATION SETUP
In our evaluation, both STT-MRAM and SOT-MRAM are employed to build the T-MRAM based on a device-to-circuit framework. At the device level, we employ Verilog-A models of the STT-MTJ and the SOT-MTJ to validate the write and read operations [25], [30]. At the circuit level, we develop the memory with 1-transistor-1-STT-MTJ and 2-transistor-1-SOT-MTJ bit cells to validate the normal and transpose access operations under the STM CMOS 32 nm technology. Also, the multi-port 1R1W operation is validated with a 4 × 4 memory array. The precharge sense amplifier (PCSA) is chosen as the HSA and VSA to validate the 1R1W function [31].
To evaluate the performance of the T-MRAM architecture, we employ a modified NVSim simulator to estimate the area, latency, and energy of both the T-STT-MRAM and the T-SOT-MRAM [32]. In the NVSim simulator, we modify the original MRAM device-description file by adding the evaluated results of a single memory cell for the STT-MTJ and SOT-MTJ. We add the necessary control logic and the additional VSA as described in Fig. 4. The structure and array files are modified to add each component. The area, latency, and power of the additional components are evaluated with the Synopsys Design Compiler under a standard industrial library. The area optimization option is used as the design target. The 8T and 7T cell-wise and the 6T block-wise transpose SRAMs are chosen as our baselines [1], [3], [15]. For a fair comparison, the capacity of the memory array is set to 32 Kbit.
B. FUNCTIONAL VALIDATION OF TRANSPOSE AND 1R1W OPERATIONS
Fig. 5(a) and (b) show the simulation results of the transpose operation based on the sub-array shown in Fig. 4(b). CLK, $m_z$, HRWL, and VRWL are the clock signal, the perpendicular component of the unit magnetization in the free layer of the MTJ (1 and −1 correspond to the P and AP states, respectively), and the horizontal and vertical read line signals, respectively. Hout and Vout are the outputs of the horizontal and vertical SAs, respectively. We also monitor the currents flowing through the SOT-MTJ and the reference cells, represented by the terminal memory cell (TC), terminal reference vertical cell (TRV), and terminal reference horizontal cell (TRH). The read operation is performed on the AP and P states, respectively. In Fig. 5(a), the state of the SOT-MTJ is AP ($m_z = -1$). The normal read mode is performed, and Hout gives the correct result. Then the same result is obtained at Vout in the transpose read mode. Similarly, the read results of the P state are validated in Fig. 5(b). These validation results indicate that the proposed transpose operation can be successfully performed in the sub-array.
In addition, the simulation results demonstrate that the read operation is completed at ultra-fast speed. After the precharge process of the PCSA, only 0.3 ns is required to sense the state of the memory cell. Fig. 5(c) shows the validation results of the multi-port 1R1W operation, where RLo, WBLe, and WWLe are the signals applied to the RWL of the selected odd row, the WBL of the selected even column, and the WWL of the selected even row, respectively. In the simulation, the read operation is followed by a write operation. Thanks to the high-efficiency SOT, the write latency can be as short as 0.7 ns, consistent with the reported experimental results [33]. For the read operation, a memory cell of the odd column in the AP state is correctly read out at Hout. For the write operation, a memory cell of the even column is switched from the P to the AP state, as indicated by the waveform of m USW. The simulation results demonstrate that the multi-port 1R1W operation can be completed within one clock cycle of 2 ns, supporting a frequency of 500 MHz. These results are expected to improve in the future with progress in SOT technologies.
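The timing figures reported for the 1R1W validation (0.3 ns sensing, 0.7 ns write, 2 ns clock cycle) can be cross-checked with a short calculation. This is a minimal sketch: the precharge duration is not stated in the text, so the 1.0 ns value below (half of the clock period) is an assumption for illustration, and the function names are hypothetical.

```python
READ_SENSE_NS = 0.3      # sensing time after precharge, from the simulation results
WRITE_NS = 0.7           # SOT write latency, from the simulation results
CLOCK_PERIOD_NS = 2.0    # clock cycle used for the 1R1W operation
PRECHARGE_NS = 1.0       # ASSUMPTION: precharge occupies half of the clock period

def max_frequency_mhz(period_ns):
    """Clock frequency in MHz for a period given in nanoseconds."""
    return 1e3 / period_ns

def fits_in_cycle(precharge_ns, read_ns, write_ns, period_ns):
    """True if precharge, read, and write executed back to back fit in one cycle."""
    return precharge_ns + read_ns + write_ns <= period_ns + 1e-12

frequency = max_frequency_mhz(CLOCK_PERIOD_NS)
one_cycle_ok = fits_in_cycle(PRECHARGE_NS, READ_SENSE_NS, WRITE_NS, CLOCK_PERIOD_NS)
```

Under this assumption the read and write together take 1.0 ns, which leaves the other half of the 2 ns cycle for precharging and matches the stated 500 MHz.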
C. RESULTS AND ANALYSES
Based on the above circuit-level results, the performance of the proposed T-SOT-MRAM is evaluated with a modified NVSim simulator. Table 1 lists the evaluation results, including array size, delay, and energy dissipation. In terms of the number of cell transistors, both the T-STT-MRAM (T-STT) and the T-SOT-MRAM (T-SOT) outperform the T-SRAM solutions, since the SRAM memory cell is based on the 6T cell structure. Even in the block comparison, T-STT and T-SOT only use 10 and 18 transistors, respectively, which is far fewer than the state-of-the-art 6T T-SRAM counterpart. Note that the cell area depends on the transistors, since the nanoscale MTJ can be stacked above the CMOS circuits and does not induce additional area overhead. It is also important to mention that the 7T T-SRAM only supports the transpose read operation, whereas the proposed T-SOT-MRAM enables both the transpose read and write operations. Furthermore, among the technologies listed in Table 1, only the proposed T-SOT-MRAM allows the multi-port 1R1W operation, owing to its separated read and write paths, which significantly improves the speed and energy efficiency.
For the performance comparison, we chose the 8T T-SRAM as the baseline. All other counterparts are normalized to this baseline.
1) ARRAY SIZE
According to the comparison results, the 6T T-SRAM has the smallest array size owing to its simple control logic and the high proportion of area taken by the sub-array. In the 6T T-SRAM, more than 70% of the total area is occupied by the sub-array, while most of the area of the T-SOT-MRAM is occupied by the control logic, including the SAs and decoders.
2) DELAY
In the delay comparison, the 7T T-SRAM shows the longest delay for both the normal and transpose accesses. Both the 6T T-SRAM and the T-SOT-MRAM achieve a smaller delay than the 8T T-SRAM thanks to their simple sub-array structures. The read delay of the T-STT-MRAM is comparable to that of the T-SOT-MRAM or the 6T T-SRAM. However, the write delay of the T-STT-MRAM is much higher than those of the other technologies due to the intrinsic incubation delay of the STT. Remarkably, by introducing the multi-port 1R1W operation, the T-SOT-MRAM obtains 1.8× and 1.6× speedups compared with the 8T T-SRAM and the 6T T-SRAM, respectively.
3) ENERGY
Both 6T T-SRAM and T-SOT-MRAM can consume less energy than the baseline thanks to the simple sub-array structures and control logic. T-SOT-MRAM can save 28% and 7% energy dissipation by using the multi-port 1R1W operation, compared to the 8T and 6T T-SRAMs, respectively. Therefore, our proposed multi-port 1R1W operation scheme improves not only the speed but also the energy efficiency of the T-SOT-MRAM.
4) LEAKAGE POWER
In the comparison of the leakage power consumption, the 6T T-SRAM is used as the baseline. Compared to the 6T T-SRAM, both T-STT and T-SOT MRAMs consume less leakage power owing to the non-volatility of the MTJ. The leakage power of the T-MRAMs mainly originates from the control logic, and thus it could be further decreased if the capacity scales up, as demonstrated by the previous work [9] .
D. DISCUSSION
The above results are obtained with a small capacity (32 Kb) to provide a fair comparison with the previous works [1], [3], [15]. In the case of small capacity, the results indicate that the performance of the T-SOT-MRAM is comparable to that of the 6T T-SRAM. In addition, our proposed multi-port 1R1W operation scheme could bring considerable improvements in the speed and energy dissipation of the T-SOT-MRAM. In the case of large capacity, the T-SOT-MRAM is expected to significantly outperform the T-SRAMs. The reason is that the negative influence of the peripheral circuits can be offset by the high-density integrated memory array, thanks to the more compact bit-cell structure of the MRAM. In this paper, we provide the architecture of the proposed T-SOT-MRAM and the transpose operation. In our future work, we plan to use the T-SOT-MRAM to design a full accelerator that implements the CirCNN. We believe that the performance of the CirCNN can be further improved by using the proposed T-SOT-MRAM, thanks to the high efficiency of the transpose and fast multi-port operations.
Process variation is a vital factor that influences the reliability of the SOT-MRAM. It affects both the read and write operations [34], [35]. In the proposed T-SOT-MRAM, we add additional SAs to the vertical edge of the memory array. The read operation is more critical than the write operation, since the normal write operation (row access) is usually used in applications such as CirCNN. Fortunately, we employ a highly reliable read circuit, which has been validated by many previous works [30], [31], [36], [37]. In the future, we will investigate further reliability issues of the T-SOT-MRAM to enhance both the read and write operations in the normal and transpose access modes.
V. CONCLUSION
The emerging MRAM family has demonstrated great potential for implementing working memories such as cache and main memory, owing to its high speed and low power consumption. In this paper, we developed the multi-port 1R1W T-SOT-MRAM by taking advantage of hierarchical BL switching and the ultra-fast SOT-induced write operation in order to improve the read/write efficiency. The functionality of the proposed T-MRAM was validated through circuit-level simulation. The array-level evaluation results indicated that the proposed T-SOT-MRAM outperforms the baseline in terms of array size, delay, and energy consumption. Our proposed multi-port 1R1W T-SOT-MRAM is a promising candidate for circulant matrix-based CNNs, where high-speed and low-power transpose access is indispensable.
