Abstract-An array-level evaluation of magneto-electric random-access memory (MeRAM) is conducted by comparing its performance with that of other embedded technologies. We consider MeRAM cells with a one-transistor, one-magnetic-tunnel-junction (1T-1MTJ) structure, where writing of the two-terminal MTJ bit is performed by precessional reorientation of the magnetization via voltage control of the magnetic anisotropy. Because the access time and energy consumption are strongly affected by parasitics, we model an accurate resistance-capacitance load on the critical path by including capacitive and resistive effects on the bit lines, word lines, and source lines. We then estimate the write access time, read access time, write energy, and area of each memory technology based on 28 nm complementary metal-oxide-semiconductor model parameters under two different conditions: (i) fixed array capacity (512 × 512 bits = 256 Kbits) and (ii) fixed array area (200 µm × 200 µm). We discuss the tradeoffs and advantages of MeRAM compared to embedded SRAM, embedded DRAM (eDRAM), stand-alone DRAM, and embedded spin-transfer torque magnetic random-access memory.
I. INTRODUCTION
As many applications in computer vision, speech recognition, autonomous driving, and cyber-security have started to adopt machine learning algorithms, there is a growing demand for high throughput in computational systems [Pan 2010]. Over the past decades, the performance of processors and of each memory layer has improved independently through scaling, increasing the throughput within von Neumann architectures.
However, improvement through scaling has saturated due to the bandwidth limitation of the system bus between the working memory (e.g., DRAM) and the processor [Loh 2008], and thus cannot meet the demands of recent applications. This issue is exacerbated by the trend of moving workloads from the cloud to the edge of the network, which requires high-performance and low-power local solutions where offloading data to a server is not efficient. As a result, the semiconductor industry has begun to incorporate new system architectures for further performance improvements. High-bandwidth memory and hybrid memory cube techniques, as shown in Fig. 1(a), shorten the physical distance between working memory and processors and increase the number of channels by putting them together into a single package. Although this approach has achieved hundreds of GB/s bandwidth in stacked 8 GB DRAMs [Jeddeloh 2012; Lee 2014], the latency of signal transmissions via the interposer, ranging from 50 to 200 cycles depending on the network topology [Akgun 2016], is still longer than that of on-chip data transfer (4-60 cycles), as shown in Fig. 1(b).
Another way of improving the throughput of a system is increasing the capacity of the processor's on-chip cache memory to reduce the cache miss rate. When the processor does not find needed data in the cache, it allocates a memory space and fetches the required data from the main memory. This process typically takes hundreds of system clock cycles, thus acting as a bottleneck to increasing throughput. State-of-the-art processors have a few tens of MB L3 cache. However, the large area overhead of SRAM prohibits further increasing the cache capacity.
In addition to the demand for throughput, the need for ultralow power electronic systems has also skyrocketed with the emergence of IoT, wearable devices, and implantable medical devices. However, leakage current has made conventional volatile memory such as SRAM and DRAM extremely power-hungry components. This issue has become especially severe in advanced processes. Therefore, researchers have developed standby power reduction schemes (multiple power domains, reduced frequency, body biasing, etc.) and have used energy harvesting (vibration, temperature differentials, light, etc.) to alleviate power-related issues [Hulfang 2004, Tan 2011]. While these approaches are effective to some extent, they cannot completely eliminate leakage and often incur additional overheads.
Fig. 2. Memory hierarchy in a conventional computer architecture with required system-level and device-level memory performance.
Integrating a high-density nonvolatile memory into a processor has the potential to improve throughput and energy efficiency significantly, while at the same time reducing chip area and cost [Smullen 2011]. The improvement comes as a result of the following.
1) On-chip data transfers are faster by one order of magnitude and more energy efficient by a factor of 100, compared to off-chip data transfers with large capacitive loads (∼1 pF) of input/output (I/O) pads and off-chip wire connections.
2) It is physically more feasible to expand the number of channels between different memory layers by using on-chip metal lines in the embedded memory compared to off-chip wires.
3) Higher memory capacity decreases the cache miss rate, effectively increasing the throughput of the system.
4) Nonvolatility can reduce the total system power consumption via zero standby power.
Among several types of emerging memory technologies with the potential to be integrated into an embedded system, magneto-electric random-access memory (MeRAM) is the strongest candidate, due to its CMOS-process compatibility and device characteristics that fulfill the endurance and read/write time requirements of embedded memory, as shown in Fig. 2 [Wang 2015].
However, to date, there have been no studies evaluating the performance of MeRAM at the integrated array level, which may greatly differ from the single-device characteristics. A systematic comparison to other existing or emerging embedded memory is also needed. This letter addresses these questions as follows. Section II presents a brief overview of the voltage-controlled magnetic tunnel junction (VC-MTJ) device and its integration into MeRAM. Section III compares the array-level performance of MeRAM with that of SRAM, DRAM, eDRAM, and STT-MRAM based on the 28 nm node. The evaluation is conducted under two configurations: (i) same array capacity and (ii) same array area. Section IV concludes the letter. 
II. MAGNETO-ELECTRIC RANDOM-ACCESS MEMORY

A. Voltage-Controlled Magnetic Tunnel Junction
A VC-MTJ is the memory element in the one-transistor and one-MTJ cell structure of MeRAM shown in Fig. 3. The top and bottom electrodes of the VC-MTJ are connected to the bit line (BL) and the drain of the access transistor, respectively. Typically, a VC-MTJ consists of two ferromagnetic layers (e.g., CoFeB) separated by a tunneling barrier (e.g., MgO), where the magnetization of the pinned layer is fixed and that of the free layer can be switched via electrical or magnetic bias conditions. Two resistance states (R_P and R_AP) exist in the device, depending on the orientation of the free layer's magnetic moment with respect to that of the pinned layer.
Switching of the VC-MTJ can be achieved via precessional (also referred to as resonant) or thermally activated switching. Recent research efforts have demonstrated voltage-controlled magnetic anisotropy (VCMA) effect-based switching of perpendicularly magnetized VC-MTJs in both precessional and thermally activated regimes, achieving high-speed (<1 ns) and low-energy (<10 fJ) write operations [Alzate 2012, Kanai 2012, Shiota 2012].
The write error rate (WER), defined as WER = 1 − P_sw, is an important metric for memory devices, where P_sw is the switching probability. Typically, a WER on the order of 10⁻⁹ is required for working memory, depending on the application and the performance of the error correction code [Nowak 2016]. For memory with relatively high WER per write operation, multiple write operations are necessary to meet the desired WER, resulting in increased total write access time.
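The cost of repeated writes can be illustrated with a short sketch. It assumes that write attempts fail independently and that each failed attempt is detected and retried (a write-verify scheme, which is an assumption here and not described in this letter); the verify read time is ignored. The function names are hypothetical, and the example values (a per-write WER of 10⁻⁴, a target of 10⁻⁹, and device write times of 1 ns and 5 ns) are the figures quoted in this letter.

```python
import math


def attempts_for_target_wer(wer_per_write: float, target_wer: float) -> int:
    """Smallest n such that wer_per_write**n <= target_wer.

    Assumes independent attempts, so the residual error after n tries
    is the per-write WER raised to the n-th power.
    """
    if not (0.0 < wer_per_write < 1.0 and 0.0 < target_wer < 1.0):
        raise ValueError("WER values must lie strictly between 0 and 1")
    n = math.log(target_wer) / math.log(wer_per_write)
    # Small tolerance guards against floating-point noise at exact ratios.
    return max(1, math.ceil(n - 1e-9))


def worst_case_write_time_ns(t_write_ns: float,
                             wer_per_write: float,
                             target_wer: float) -> float:
    """Worst-case device write time if every allowed attempt is used."""
    return attempts_for_target_wer(wer_per_write, target_wer) * t_write_ns


if __name__ == "__main__":
    for name, t_ns in (("MeRAM", 1.0), ("STT-MRAM", 5.0)):
        n = attempts_for_target_wer(1e-4, 1e-9)
        t = worst_case_write_time_ns(t_ns, 1e-4, 1e-9)
        print(f"{name}: {n} attempts, worst case {t:.1f} ns")
```

Under these assumptions the average write time stays close to a single attempt, since retries are rare; the worst case is what a fixed-latency design would have to budget for.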
The WER of STT-MRAM can be exponentially reduced by increasing the write time and/or current density, as shown in Fig. 4(a). MeRAM, on the other hand, shows oscillatory behavior as a function of the write pulse width due to the precessional motion of magnetization, as shown in Fig. 4(b). The WERs in both cases also depend on device characteristics such as thermal stability, the damping factor, and the VCMA coefficient.
B. MeRAM Bank Architecture
Fig. 5(a) shows the structure of a memory bank, consisting of a crossbar array of memory cells arranged in columns (BLs) and rows (word lines, WLs), along with peripheral circuits such as drivers, decoders, multiplexers, and sense amplifiers. The storage capacity of a memory chip is often divided into several identical banks to shorten the critical paths [Yamauchi 1997], at the cost of area efficiency.
It is important to note that there is a tradeoff between the total size of the memory array (capacity) and the performance (e.g., latency, energy). This is because increasing the number of cells in an array raises the capacitive and resistive loading on the shared signal lines, which in turn requires more energy and latency during operation. Therefore, the size of the memory array should be carefully designed based on a targeted application.
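A rough sketch of this scaling is given below. It treats a shared signal line as a uniform distributed RC line and uses the Elmore estimate (one half of total resistance times total capacitance), which is a standard first-order approximation and not necessarily the model used in this letter. The per-cell resistance and capacitance values are hypothetical placeholders, not the parameters of Table 1.

```python
def line_elmore_delay_s(n_cells: int,
                        r_per_cell_ohm: float,
                        c_per_cell_f: float) -> float:
    """Elmore delay estimate of a uniform distributed RC line.

    Both total resistance and total capacitance grow linearly with the
    number of cells on the line, so the delay grows quadratically.
    """
    r_total = n_cells * r_per_cell_ohm
    c_total = n_cells * c_per_cell_f
    return 0.5 * r_total * c_total


if __name__ == "__main__":
    R_PER_CELL = 1.0       # ohm per cell pitch (hypothetical)
    C_PER_CELL = 0.2e-15   # farad per cell, wire plus device (hypothetical)
    for n in (128, 256, 512, 1024):
        delay_ps = line_elmore_delay_s(n, R_PER_CELL, C_PER_CELL) * 1e12
        print(f"N = {n:4d} cells: line delay ~ {delay_ps:.2f} ps")
```

Doubling the number of cells on a line quadruples this delay term, which is one reason large capacities are split into banks.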
Although MeRAM also follows the general memory bank architecture, requirements for the drivers and sensing circuitry are different, as shown in Fig. 5(b) . These requirements include sufficiently strong BL, source line, and word line drivers to maximize the slew rate (> 1 V/100 ps), and the write pulse width should be adjustable with a high resolution (100 ps), as these factors have a large impact on the WER. Also, the sense amplifier should be able to distinguish a small sensing margin (∼ 100 mV) due to the limited tunneling magnetoresistance ratio (TMR, ratio between the two resistance states) of MTJs.
III. PERFORMANCE COMPARISON AND DISCUSSION
In this section, the array-level performance of different memory technologies is compared under two conditions based on 28 nm node CMOS parameters: (i) fixed array capacity (512 × 512 bits), a typical bank size of embedded memory, and (ii) fixed array area (200 × 200 µm²), which is equal to the area of a 256 Kbit SRAM array. Table 1 provides the values of the parameters used in this estimation. Table 2 shows the typical cell size for each memory technology in terms of the minimum feature size F. Since an SRAM cell consists of six transistors and requires both NMOS and PMOS, its cell size reaches up to 190 F² [Haran 2008, Natarajan 2008]. An STT-MRAM typically has a 50 F² cell size due to the large access transistors needed to supply the critical current (> 10⁶ A/cm²) required to achieve sub-10 ns switching [DeBrosse 2015, Lu 2015]. The standalone trench-based DRAM technology (the only nonembedded memory listed in Table 2) allows its cell size to be as low as 4-8 F² [Cho 2012]. With a logic process, however, the eDRAM cell occupies > 40 F² because a larger area is required to maintain sufficient capacitance [Huang 2011, Pei 2014]. The cell area is approximately 20 F² for an MeRAM cell with a standard logic process, since a VC-MTJ does not require a large current for switching. However, in principle, the cell area can be reduced to 8 F² if MeRAM adopts a specialized process similar to that of DRAM.
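The link between cell size and the two comparison conditions can be checked with simple arithmetic. The sketch below converts the cell sizes quoted above into cell-array area at fixed capacity and into capacity at fixed area, taking F = 28 nm. It counts the cell array only and ignores peripheral circuits, which is a simplifying assumption; the eDRAM and DRAM entries use the quoted bounds (40 F² and 8 F²) as representative values.

```python
F_UM = 0.028  # minimum feature size for the 28 nm node, in micrometers

# Cell sizes in units of F^2, taken from the figures quoted in the text.
CELL_SIZE_F2 = {
    "SRAM": 190,
    "STT-MRAM": 50,
    "eDRAM": 40,   # quoted as "> 40 F^2"; lower bound used here
    "MeRAM": 20,
    "DRAM": 8,     # quoted as 4-8 F^2; upper bound used here
}


def cell_area_um2(cell_f2: float, f_um: float = F_UM) -> float:
    """Area of one cell in square micrometers."""
    return cell_f2 * f_um ** 2


def array_area_um2(cell_f2: float, bits: int = 512 * 512) -> float:
    """Cell-array area for a fixed capacity (Condition I)."""
    return bits * cell_area_um2(cell_f2)


def capacity_bits(cell_f2: float, area_um2: float = 200.0 * 200.0) -> int:
    """Number of cells that fit in a fixed area (Condition II)."""
    return int(area_um2 // cell_area_um2(cell_f2))


if __name__ == "__main__":
    for name, f2 in CELL_SIZE_F2.items():
        print(f"{name:9s} area @256 Kbit: {array_area_um2(f2):9.0f} um^2, "
              f"capacity @200x200 um^2: {capacity_bits(f2) / 1e6:.2f} Mbit")
```

With these numbers a 256 Kbit SRAM array comes out near 39 000 µm², consistent with the 200 µm × 200 µm reference area, and the MeRAM capacity in the same area is 190/20 = 9.5 times that of SRAM.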
The write access time (t^A_W) of the bank is estimated via two methods. In the case of MeRAM and STT-MRAM, the write access time is extracted by combining the peripheral circuit delay (t_d) and the device write time (t_w,cell). The device write times of MeRAM and STT-MRAM are chosen to be 1 ns and 5 ns, respectively, at which a WER of 10⁻⁴ is guaranteed, as shown in Fig. 4; these numbers are typical of many published works. For SRAM, DRAM, and eDRAM, the write access time is the sum of the peripheral circuit delay (t_d), the charging time of the BL (t^A_Drive) via a driver circuit, and the intrinsic RC delay of a single cell. Note that the delay of the chip I/O interface is excluded in our estimation, as we are assuming embedded applications.
The read access time (t^A_R) is obtained by adding the peripheral circuit delay (t_d) and the array delay required for generating a fixed margin (t^A_RC). In high-speed read operations, the BL is precharged to a certain potential before a read. The BL then discharges through the selected memory cell until the voltage difference between the selected cell and the reference is large enough for the sense amplifier to resolve. Thus, the read access time depends far more strongly than the write access time on the total RC load of the bit line and the resistance of the selected cell.
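The dependence of the margin-generation delay on the bit line load and cell resistance can be sketched with a single-pole RC discharge. This is a deliberate simplification: it assumes the reference stays at the precharge level and that the bit line behaves as one lumped capacitor discharging through one resistance, which is not the full array model used in this letter. All numerical values are hypothetical.

```python
import math


def margin_time_s(r_path_ohm: float,
                  c_bl_f: float,
                  v_precharge: float,
                  delta_v: float) -> float:
    """Time for a precharged bit line to drop by delta_v.

    Uses V(t) = V_pre * exp(-t / (R * C)) and solves
    V_pre - V(t) = delta_v for t.
    """
    if not 0.0 < delta_v < v_precharge:
        raise ValueError("delta_v must be between 0 and the precharge voltage")
    return r_path_ohm * c_bl_f * math.log(v_precharge / (v_precharge - delta_v))


if __name__ == "__main__":
    C_BL = 100e-15   # lumped bit line capacitance in farads (hypothetical)
    V_PRE = 1.0      # precharge voltage in volts (hypothetical)
    MARGIN = 0.1     # required sensing margin in volts (~100 mV)
    for r_ohm in (5e3, 20e3, 100e3):  # hypothetical cell-path resistances
        t_ns = margin_time_s(r_ohm, C_BL, V_PRE, MARGIN) * 1e9
        print(f"R = {r_ohm / 1e3:5.0f} kOhm: margin time ~ {t_ns:.3f} ns")
```

In this simplified picture the delay is linear in both the path resistance and the bit line capacitance, so a high-resistance cell on a long bit line pays twice.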
The write energy is divided into two parts: (i) energy dissipation via capacitive charging (E^A_C) and (ii) ohmic loss (E^A_O). The former depends on the sum of the BL and word line metal capacitances and the transistors' junction and gate capacitances, and is quadratically proportional to the amplitude of the write voltage. The latter is a function of the write voltage, the write time, and the total resistance through the current path. The formulas of these performance parameters are summarized in Table 3.
Table 3. Performance parameter formulas (surviving row labels: array metal line capacitance; write energy, ohmic loss). N is the number of cells that connect to a single bit line or word line. V_write and t_w,cell are the write voltage and intrinsic device write time, respectively. t^A_Drive is the charging or discharging time of the bit line via a driver circuit, and FO4 is the delay of an inverter.
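The two write-energy terms described above can be sketched with textbook expressions. These are generic forms consistent with the stated dependencies (quadratic in write voltage for the capacitive term; a function of voltage, time, and path resistance for the ohmic term), not the Table 3 formulas, which are not reproduced here. The capacitance and resistance values are hypothetical; only the 1 ns and 5 ns write times come from the text.

```python
def capacitive_energy_j(c_total_f: float, v_write: float) -> float:
    """Energy drawn from the supply to charge c_total to v_write (C * V^2).

    Some conventions count only the stored half, 0.5 * C * V^2; either
    way the term scales quadratically with the write voltage.
    """
    return c_total_f * v_write ** 2


def ohmic_energy_j(v_write: float, t_write_s: float, r_path_ohm: float) -> float:
    """Resistive dissipation while v_write is applied for t_write_s."""
    return v_write ** 2 * t_write_s / r_path_ohm


if __name__ == "__main__":
    C_TOTAL = 50e-15  # total switched capacitance in farads (hypothetical)
    V_WRITE = 1.0     # write voltage in volts (hypothetical)
    cases = {
        # name: (device write time in s, path resistance in ohm, hypothetical)
        "MeRAM": (1e-9, 100e3),
        "STT-MRAM": (5e-9, 5e3),
    }
    e_cap_fj = capacitive_energy_j(C_TOTAL, V_WRITE) * 1e15
    for name, (t_w, r_path) in cases.items():
        e_ohm_fj = ohmic_energy_j(V_WRITE, t_w, r_path) * 1e15
        print(f"{name:9s} capacitive ~ {e_cap_fj:.1f} fJ, ohmic ~ {e_ohm_fj:.1f} fJ")
```

The sketch shows why a high-resistance, short-pulse device keeps the ohmic term small, while a low-resistance, current-driven device lets it dominate.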
A. Condition I: Array Capacity (512 × 512 bits)
An array capacity of 512 × 512 bits is used for performance comparison. The physical length of the array (L_A) is extracted by using the array capacity and the unit cell dimension in each memory technology (A_C). The access time, write energy, and array area of each memory technology are shown in Fig. 6.
In terms of the write access time, logic-based SRAMs achieve the fastest write operation. While DRAM and eDRAM share similar structures, DRAM adopts a specialized process to allow larger cell capacitance at a relatively small area. This results in DRAM having slightly slower write time than its embedded counterpart but provides wider margins and greater retention times. MeRAM can achieve the same level of performance as volatile working memory (SRAM, DRAM, eDRAM), whereas there is still a gap (> 4 times) for STT-MRAM.
While the read access times of eDRAM/DRAM are subnanosecond, it should be noted that here we are excluding the effect of retention, which greatly decreases the read margin as the charge in a DRAM cell leaks. Magnetic memory suffers from smaller margins, which is reflected in a longer read operation time, but this remains in the acceptable regime (∼2 times) compared to that of SRAM. The write energy follows a similar trend as the write time, with MeRAM achieving the same level of energy efficiency as working memory and a 20 times improvement compared to STT-MRAM. In addition, the nonvolatility of MeRAM can potentially save orders of magnitude in standby energy compared to that of volatile memory.
B. Condition II: Array Area (200 µm × 200 µm)
A 256 Kbit SRAM array occupies an area of 40 000 µm², in which the other memory technologies can have up to megabits of capacity, as shown in Fig. 7(c). The performance of these memory devices thus degrades due to the increased RC loading resulting from the increased capacity. While MeRAM suffers from a slightly longer read access time (∼4 ns), as shown in Fig. 7(b), it is possible that the high capacity (∼10 times) can compensate for the increased read access time and even achieve higher system throughput by decreasing the cache miss rate. Since the write access time is strongly dependent on the device write time in the cases of MeRAM and STT-MRAM, there is little change in the write time despite a change in the array capacity. However, the size of WL drivers and BL drivers should be adjusted to fulfill the write condition. For the write energy, MeRAM consumes 28 fJ per bit, which is comparable to that of DRAM and higher than that of SRAM and eDRAM by 3.5 times and 2 times, respectively. STT-MRAM requires at least an order of magnitude higher energy compared to other memory technologies due to its high ohmic dissipation, as shown in Fig. 7(d).
IV. CONCLUSION
In conclusion, the write access time and dynamic write energy of MeRAM are comparable to those of conventional embedded memories. Moreover, MeRAM provides a large improvement in terms of density over embedded SRAM and STT-MRAM. As the memory array size increases, the read access time may limit the entire system throughput. Here, the higher bit density (smaller area) can favor a faster read, while the relatively high cell resistance of MeRAM can increase read delay. To alleviate this issue, high-speed sensing schemes need to be developed at the circuit level. At the device level, TMR can be improved further to enhance the sensing margin, which in turn reduces the read access time.
ACKNOWLEDGMENT
This work was supported in part by the National Science Foundation (NSF) Nanosystems Engineering Research Center for TANMS. The authors would like to acknowledge the collaboration of this research with KACST via the CEGN. The work at Inston was supported in part by an NSF Phase II SBIR award.
