Abstract: Asymmetry in read/write access latencies, which leads to issues of blocking writes and low throughput performance, is a common challenge impeding the integration of emerging nonvolatile memory technologies into high-performance computing systems. Approaches based on multiporting and virtually pipelined architectures, which have been studied in static random access memory (SRAM), dynamic random access memory (DRAM), and hybrid SRAM/DRAM memory systems for high throughput performance, provide an attractive means to overcome these challenges through innovations spanning the memory cell to architecture levels. This paper describes a virtually pipelined memory architecture built on a two-port phase change memory (PCM) substrate for high-performance networking applications. The proposed two-port PCM cell significantly reduces the probability of bank conflicts due to blocking writes. We comprehensively evaluate the two-port cell design in terms of programming current, voltage pumping for access transistors, and area overhead. We also propose a two-port PCM bank design to provide the large bandwidth and bank parallelism critical for high-performance networking applications. Furthermore, the virtually pipelined architecture is implemented at the memory controller level, eliminating the extra delay of read requests due to the asymmetrical access latency through separate reservation tables for read and write requests. We analyze the performance of this architecture using queuing theory and statistical information obtained by processing packet traces using PacketBench, and show that the proposed innovations can reduce the expected read (write) delay by 12-40× (up to 14%) over conventional single-port PCM for 110%-170% additional area.
I. INTRODUCTION

MODERN network devices, such as Internet routers, have become highly dependent on scalable memory architectures. A large amount of data needs to be moved and managed in such devices, requiring significant memory capacity and bandwidth that increases with the line rate [1]. Therefore, memory systems in network devices must be capable of supporting fast read and write accesses at line rates, while also offering the large memory capacity necessary to maintain large data structures. Dynamic random access memory (DRAM) has played a major role in supporting the demands on memory capacity and performance for network processing, largely in the form of hybrid SRAM/DRAM packet buffers [2], [3] and virtually pipelined memory architectures [1], [4]. The hybrid SRAM/DRAM packet buffer uses SRAM as the cache and multiple DRAMs in parallel for exclusive storage of packets. In contrast, in virtually pipelined memory, fixed-delay memory accesses are interleaved across the banks in the memory system to provide large access bandwidth, emulating an ideal SRAM with low-cost, long-latency memory such as DRAM. However, whether DRAM can scale below 22 nm is currently unknown [5], which makes DRAM less suitable for network processing in the big data era.
Phase change memory (PCM), which has shown its scaling advantage in 20-nm device prototypes [6], [7] and its scaling potential to 5 nm [8], [9], offers read latency close to that of DRAM and is a promising candidate to fill this scalability gap. Unfortunately, PCM is an asymmetrical read-write technology with a write latency that is much longer than that of DRAM. The long write latency, usually 5-10× the read latency [7], [10], significantly increases bank conflicts over DRAM. In an m-bank DRAM, the probability of a blocking write is $1/m$. However, in an m-bank PCM, where the write latency is k× the read latency, the probability of a blocking write is $1 - ((m - 1)/m)^k$, since it may block the following k accesses. For k ≈ 5-20, the probability of a blocking write in PCM increases linearly in k to the first order. For k = 10 and m = 32, as reported widely in literature, the probability of a blocking write in PCM is 27.2%, while that in DRAM is 3.1%. To make things worse, read requests are latency critical in networking applications and cannot be scheduled with buffers like write requests. Even in general purpose applications, an extensive study has already shown that the asymmetrical latency in PCM may result in as much as 61% performance degradation [11]. When PCM is used to implement virtually pipelined network memory, the long write latency of PCM requires a longer fixed pipeline delay for both reads and writes. Simply put, the fixed pipeline delay is a linear function of PCM write latency (at least 10× that of an equivalent DRAM pipelined memory). Thus, the asymmetrical write/read latency inherent to PCM remains the biggest challenge that has to be overcome to realize scalable PCM network memory.
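The blocking-write probabilities quoted above can be checked numerically. The following is a minimal sketch, assuming (as the formula implies) that each access selects one of the m banks uniformly at random; the function name `blocking_probability` is our own illustration and not part of the original design.

```python
def blocking_probability(num_banks: int, latency_ratio: int = 1) -> float:
    """Probability that a write blocks at least one of the next
    `latency_ratio` accesses, assuming uniformly random bank selection."""
    return 1.0 - ((num_banks - 1) / num_banks) ** latency_ratio


m = 32
print(f"DRAM (k = 1):  {blocking_probability(m, 1):.3f}")
print(f"PCM  (k = 10): {blocking_probability(m, 10):.3f}")
# First-order (linear in k) approximation mentioned in the text: k / m
print(f"k / m:         {10 / m:.3f}")
```

With k = 1 the expression reduces to 1/m, which matches the DRAM case; the linear approximation k/m slightly overestimates the exact value for k = 10.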
Existing solutions to mitigate the impact of blocking writes in PCM can be classified broadly into write scheduling methods [12]-[16] and architectural write improvement methods [17], [18]. When write scheduling methods developed for conventional single-port PCM [12]-[16] are applied to network memory, they primarily attempt to distribute writes among idle port time slots. However, a new packet may arrive every 8 ns on a 40-Gb/s OC-768 line [4]; as a result, there is little-to-no idle time between memory accesses, making these write scheduling methods less suitable for network memory. On the other hand, architectural write improvement methods, such as [17] and [18], reduce the write latency at the architecture level by relying on a specific feature of the PCM technology. Specifically, the write truncation technique can only be applied to multilevel cells [17], and the parallel chip approach works only when the PCM supports write division [18]; as a result, neither approach is general enough to address the limitation of blocking writes in PCM. Although writes also lead to other issues in PCM and in other nonvolatile memories, such as power consumption, cell endurance, and error correction, these issues are not directly considered in this paper. Note, however, that the virtually pipelined memory architecture described in this paper can integrate appropriate solutions from among those described in [19]-[26].
This paper makes the following contributions. We describe a virtually pipelined memory architecture built on a two-port PCM substrate for high-performance networking applications (illustrated from the cell level to the architecture level in Fig. 1). We propose a novel two-port 1-read 1-write (1R1W) PCM cell that significantly reduces the probability of write blocking at the bank and architecture levels.
Two porting as a solution to reduce blocking in general (and specifically, write blocking in PCM) has a fixed and quantifiable cost at the cell level, and is complementary to both the write scheduling and the architectural write improvement techniques discussed above. We evaluate pMOS/nMOS-based two-port cells in terms of transistor size, voltage pumping, and area overhead, and describe 1R1W PCM cells for both high-performance and low-power designs. The two-port PCM cells are integrated with memory banks that are also two-ported to ensure data coherence between incoming and ongoing read/write requests. The proposed virtually pipelined memory architecture integrates the two-port PCM banks, further minimizes the extra fixed delay of read requests due to the asymmetrical latency through separate read/write reservation tables, and eliminates redundant write requests using lookup tables. Based on queuing theory, we develop an analytical framework to evaluate the performance of memory architectures based on multiport cells with asymmetrical read/write latencies. Analysis of the performance of the proposed architecture shows that for networking applications, a 1R1W PCM substrate can substantially reduce: 1) the expected delay of a read access and 2) the number of waiting read/write requests at the bank level, leading to a smaller required buffer size with low overhead and high performance. In addition, through redundant write elimination, the pipelined memory architecture further reduces the expected delay of a write access. When the theoretical analysis is combined with statistical information obtained by processing network packet traces using PacketBench, results show that the proposed innovations can reduce the expected read delay (expected write delay) by 12-40× (up to 14%) over conventional single-port PCM for 110%-170% additional area.
We briefly present the background of PCM in Section II. Section III describes the two-port PCM cell. Section IV describes the two-port PCM bank and the PCM pipelined memory, followed by the queuing theory analysis in Section V. The simulation setup and results are presented in Section VI. Section VII is a conclusion.
II. BACKGROUND
PCM, an emerging nonvolatile memory, stores data by switching the chalcogenide material between two states: amorphous and polycrystalline. These two states are characterized by remarkably different resistance levels, where the amorphous chalcogenide material has the high resistance, usually in the Mega Ohms range, and the polycrystalline state chalcogenide material has the low resistance, usually in the Kilo Ohms range [27] .
There are three primary operations integral to the use of PCM in a modern memory system: read, SET, and RESET. The read operation loads the data from the memory to the processor or the cache hierarchy. The SET operation writes the bit 1 to the memory cell, i.e., the SET operation changes the state of the chalcogenide material in the cell to polycrystalline. In contrast, the RESET operation writes the bit 0 to the memory cell by changing the state of the chalcogenide material in the cell to amorphous. Each of these operations has its own associated latency, which is discussed in the following paragraphs.
A PCM cell can be read by simply sensing the current flow. Due to the large gap between the two resistance levels of the chalcogenide material, the sensing current flows of these two states differ by three or more orders of magnitude. The latency of the read operation in PCM cells is typically tens of nanoseconds.
In the write operation, the programming circuit of PCM applies different heat-time profiles to switch cells from one state to another. To RESET a PCM cell, a strong programming current pulse of short duration is required. The temperature of the chalcogenide material is raised by this programming pulse. After the chalcogenide material reaches the melting point, typically higher than 600°C, the programming pulse is quickly terminated. Subsequently, the small region of melted material cools quickly, programming the chalcogenide material into the amorphous state. Since the region of the melted chalcogenide material is smaller, the required duration of the RESET programming pulse is short, about tens of nanoseconds. Thus, the RESET latency of PCM is typically similar to its read latency [13] .
In contrast, to SET a PCM device, a long programming current pulse, which is weaker than the RESET programming current, is applied to program the cell from the amorphous state to the polycrystalline state. In the SET operation, the temperature of chalcogenide material should be raised above its crystallization temperature but below the melting point for a sufficient amount of time. As the crystallization rate is a function of temperature, given the variability of PCM cells within an array, reliable crystallization of PCM cells requires a programming pulse of hundreds of nanoseconds in duration [13] . Hence, the SET latency of PCM is longer than both its RESET latency and read latency.
From a system perspective, in a typical PCM write, hundreds of bits in a memory line need to be programmed in PCM cells. It is highly likely that both RESET and SET occur in such a write. Thus, the latency of the whole write is determined by the slower of the two operations, which is the SET operation. Therefore, the latency of the write operation in PCM is higher than the latency of the read operation, leading to the blocking write problem in PCM.
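The reasoning above, that a line write is as slow as its slowest cell operation, can be sketched as follows. The latency constants are illustrative placeholders chosen only to respect the orders of magnitude stated in this section (RESET in the tens of nanoseconds, SET in the hundreds); they are not measured values, and the function name is hypothetical.

```python
RESET_NS = 50   # placeholder: RESET is "tens of nanoseconds"
SET_NS = 200    # placeholder: SET is "hundreds of nanoseconds"


def line_write_latency_ns(new_bits, set_ns=SET_NS, reset_ns=RESET_NS):
    """Latency of writing one memory line when all cells are programmed
    in parallel: the slowest per-cell operation determines the total.
    Bit 1 is written with SET, bit 0 with RESET."""
    per_cell = (set_ns if bit == 1 else reset_ns for bit in new_bits)
    return max(per_cell, default=0)


# A line containing even a single 1 pays the full SET latency.
print(line_write_latency_ns([0, 0, 0, 0]))
print(line_write_latency_ns([0, 1, 0, 0]))
```

Because a line holds hundreds of bits, at least one SET is almost always present, which is why the write latency is effectively the SET latency.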
To mitigate the blocking write problem, write scheduling methods [12]-[16] and architectural write improvement methods [17], [18], [27], [28] have been studied. A write scheduling method based on write pausing and write cancellation was designed to pause an ongoing write when a read request enters the memory, with thresholding to avoid write starvation [12]. However, when the majority of memory requests are writes and the number of writes in the queue exceeds the threshold, read requests in the queue cannot preempt (i.e., cancel) write requests due to the threshold design and incur a large delay. Write scheduling based on the PreSET method [13] was proposed to issue the time-consuming SET operation before the write operation actually starts. However, because a new packet may arrive every 8 ns on a 40-Gb/s OC-768 line [4], there are very few idle cycles between memory accesses, and PreSET may not be suitable for network memory. Similarly, write status holding register scheduling, proposed to hide the resistance drift latency in PCM writes by scheduling memory accesses [14], write minimization using scheduling and recomputation [15], and optimizations for video applications running in a system with PCM-based main memory [16] may not be suitable for network applications. On the other hand, architectural write improvement methods, such as [17] and [18], reduce the write latency at the architecture level by relying on a specific feature of the PCM technology. Thus, neither approach is general enough to address the limitation of blocking writes. Latency-aware coding schemes were proposed to accelerate the PCM write [27], [28]. However, flipping the value written to memory incurs overhead, since it is necessary to record which bits have been flipped.
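To make the "few idle cycles" argument concrete, the sketch below computes the packet inter-arrival time and the number of arrivals that overlap one long write. The 40-byte packet size is our assumption (a minimum-size packet), chosen because it reproduces the 8-ns figure quoted from [4]; the 200-ns write latency is likewise only a placeholder consistent with "hundreds of nanoseconds".

```python
def packet_interarrival_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Time between back-to-back packets: bits divided by Gb/s gives ns."""
    return packet_bytes * 8 / line_rate_gbps


def arrivals_during_write(write_ns: float, interarrival_ns: float) -> float:
    """Number of packet arrivals that overlap a single write operation."""
    return write_ns / interarrival_ns


gap_ns = packet_interarrival_ns(packet_bytes=40, line_rate_gbps=40.0)
print(gap_ns)                                # 8.0
print(arrivals_during_write(200.0, gap_ns))  # 25.0
```

Under these assumptions, a single write spans tens of packet arrivals, so a scheduler has essentially no idle slots in which to hide it.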
Although this paper focuses on solutions to the blocking write problem in PCM, it is essential to note that the long-latency write operation also leads to problems in power consumption, cell endurance, and error correction. A number of works have addressed these write issues in nonvolatile memories. Dhiman et al. [19] proposed a hybrid PCM and DRAM main memory system, in which a bookkeeping circuit design and a page manager within the operating system were implemented to address power consumption and endurance problems in PCM writes. Another memory management design was proposed for hybrid PCM and DRAM main memory, with the focus on data migration between DRAM and PCM [20]. Mirhoseini et al. [21] presented an encoding scheme for PCM writes, where only the bits that change in the new word in comparison with the existing word in the memory location require overwriting, minimizing the power consumption in PCM writes. Joo et al. [22] presented an online flash memory paging scheme for energy and performance optimization. An energy minimization method was proposed for spin-torque-transfer RAM (STT-RAM) in [29]. A wear rate leveling mechanism was designed to protect PCM against endurance variations by targeting the wear rate [23]. Hsiao et al. [24] designed built-in self-repair schemes for flash memory error correction. A fault tolerance scheme for PCM against significant resistance drift was proposed in [25]. Su et al. [26] presented a test method to detect write disturbance faults in magnetic RAM. As mentioned earlier, the virtually pipelined memory architecture described in this paper can integrate appropriate solutions from the literature to address the issues of power consumption, cell endurance, and error correction.
III. TWO-PORT PCM CELL
Multiporting has been studied for on-chip memory technologies including SRAM [30] and embedded DRAM (eDRAM) [31] , [32] : in these scenarios, scalability is not as important a concern as performance. However, unlike on-chip memory, scalability is a major concern when PCM is considered as the candidate for main memory in the network processing domain. Thus, the multiporting techniques in SRAM and eDRAM designs may not be suitable when PCM is considered for network processing, due to the area overhead imposed by extra access transistors.
Multiporting nonvolatile memories is a promising approach to increase the performance of such memories. STT-RAM and PCM are two promising nonvolatile memory technologies. STT-RAM studies focus on on-chip memory applications such as caches, as STT-RAM has low access latency. On the other hand, PCM has been studied as a candidate for next generation main memory technology because of its scalability. A dual-port STT-RAM cell was proposed as an alternative to the multiport SRAM for on-chip memory [33]. In comparison to the conventional single-port STT-RAM cell, the dual-port STT-RAM uses two transistors to establish 2 read/write (2RW) ports. To reduce the cell area, the dual-port STT-RAM design utilizes the shared source-line structure. The area of the 2RW STT-RAM cell is $100F^2$, i.e., the area overhead of dual porting is about 39% in comparison with the single-port STT-RAM cell for similar write performance ($72F^2$). Due to the differences in storage mechanism [magnetic tunnel junction versus Ge$_2$Sb$_2$Te$_5$ (GST)] and access speed requirement between STT-RAM and PCM, the 2RW dual-port approach [33] cannot be applied to PCM. Moreover, it is essential to explicitly meet read/write operation requirements during multiport PCM design to offset the performance degradation due to multiporting. A 45-nm 1-Gb single-level cell PCM was designed with the read-while-write feature for embedded and wireless systems [34]. The fundamental difference between the work described in [34] and the work described here is that this paper implements two porting at the cell level, whereas the work in [34] proposed two porting at the region level (16 regions in the bank). Thus, in our proposed architecture, as long as the read and the write accesses are not issued to the same physical address, they can be served simultaneously. In contrast, in [34], when the read and the write accesses are issued to the same region, they cannot be served simultaneously.
The remainder of this section presents the proposed two-port PCM design and describes core methods based on voltage pumping and transistor size selection, which are necessary to maintain the programming/sensing current requirements in two porting the conventional PCM cell. We will also estimate the area overhead of the proposed two-port PCM cell, and discuss practical tradeoffs between area overhead and voltage pumping.
A conventional single-port PCM cell uses a GST storage material to store a bit [10] . The GST material can be programmed to amorphous or crystalline phase, each of which has significantly different resistivity, by applying current to heat the GST material. In a conventional PCM, an access device, which can be a transistor (typically nMOS), diode, or bipolar junction transistor (BJT), as shown in Fig. 2 (a) [10] , is connected to the GST. When the access device is selected by the wordline, the bitline directly provides the programming/sensing current to the GST.
Our basic two-port PCM memory cell, which is shown in Fig. 2(b) and (c), consists of two access transistors and a GST storage material. Two bitlines and two wordlines are connected to the two access transistors to compose a two-port (1R1W) design. One of these two ports supports only reads, while the other supports only writes. Transistors are located at the cross points of bitlines and wordlines. Conventionally, nMOS has stronger driving ability than pMOS in the single-port PCM cell design, which places the access transistor in the bottom half of the cell. However, as shown in Fig. 2(b) and (c), we have to place two access transistors in the top half of the cell to separately control the connectivity between the two bitlines [write bitline (WBL) and read bitline (RBL)] and the GST. Since we place the transistors in the top half, nMOS is no longer the obvious choice. Thus, we performed a thorough and fair comparison of pMOS and nMOS access transistors to validate the two-port PCM cell, and the findings are summarized later in this section.
Note that the diode-based PCM cell [35], as shown in Fig. 3(a), achieves better scalability than the transistor-based PCM. In the diode-based PCM, when a cell is accessed, the memory controller asserts WL at 0 and BL at 1 to turn ON the diode. However, we believe that the diode cannot be used to construct a two-port PCM cell. We illustrate and discuss this using a nonfunctional circuit schematic for the diode-based two-port PCM cell in Fig. 3(b), where we place two diodes between two wordlines [write wordline (WWL) and read wordline (RWL)] and the GST. In this design, the lack of a third terminal in diodes makes it impossible to control the read and the write ports independently. When we want to access the write port, we assert WWL at 0 and WBL at 1, while RBL is also pulled up to 1. While the write driver drives the programming current from WBL to GST to WWL, the read sense amplifier also senses the current from RBL to GST to WWL. Thus, the read port and the write port are both activated in this case, which rules out the diode as the access device. Although a high-level description of a multiport PCM cell is available in [36], the description does not consider a number of practical issues in multiport design, such as access transistor type selection for the read and write ports, voltage pumping on bitlines and wordlines, and the transistor size selection for ports. To the best of our knowledge, this paper is the first to comprehensively investigate these practical issues and to describe a complete two-port PCM cell.
We have used SPICE simulations to validate the two-port PCM cell with the predictive technology model (PTM) for the access transistors [37]. To evaluate the PCM cell, we model the I-V curve of the GST in Verilog-A with the data from [10] and [38]. The I-V curve is implemented using a lookup table approach. We set the Ovonic threshold switching (OTS) point as $I_{OTS} = 10$ μA and $V_{OTS} = 1.14$ V. We use a quadratic function to represent the curve when $I < I_{OTS}$, and a linear function when $I > I_{OTS}$, as shown in Fig. 4.
Since we place access transistors above the GST, we expect that the voltage of bitlines needs to be increased to get the equivalent current when the cell is accessed. As voltage pumping for write access is common in PCM [10], [39], increasing the voltage in the write port is a practical approach for the two-port PCM cell. We summarize our results in Table I, which indicates that the two-port PCM cell can achieve write performance equivalent to that of a conventional single-port PCM cell (700-μA SET current and 1000-μA RESET current [10]). We also set the W/L ratio of both pMOS and nMOS in our two-port cell to four, while the nMOS in the classical PCM has a W/L ratio of five. The advantage of W/L = 4 over W/L = 5 for the nMOS/pMOS access transistors is reduced cell size ($60F^2$ versus $72F^2$). The tradeoff is an increase in $V_{DD}$ (by 3.8%) to provide sufficient programming/sensing current using voltage pumping. We discuss the tradeoffs between the size of access transistors and voltage pumping later in this section.
In the write operation, a higher voltage at WBL, $V_{WBL}$ (5.16 V with 32-nm PTM), is needed with pMOS to provide the required programming current in comparison with nMOS (4.4 V with 32-nm PTM), which can be achieved by voltage pumping. We also observe that the $V_{WBL}$ necessary for the pMOS write access transistor decreases as the technology scales down, while the $V_{WBL}$ for the nMOS write access transistor increases. Thus, we anticipate that the required voltage for pMOS may be equal to or less than that of nMOS as the technology continues to scale down. Moreover, in the write operation, the voltages at both WBL and WWL of the nMOS access transistor need to be pumped to a certain level (4.4 and 2.9 V), while $V_{WWL}$ in pMOS stays at 1.5 V. Note, however, that after the write operation, the punch-through effect may occur in nMOS if $V_{WWL}$ drops before $V_{WBL}$. Thus, when using nMOS write access transistors, a voltage pumping control circuit is necessary to avoid the punch-through phenomenon. In summary, even though pMOS as the write access transistor requires 17% higher voltage than nMOS with the 32-nm technology, pMOS may require lower voltage than nMOS as the technology continues to scale down, and the control circuit for the pMOS-based design is simpler than that of the nMOS-based design. Therefore, we believe that using pMOS write access transistors is more practical than using nMOS write access transistors.
Meanwhile, for the PCM read operation, the read current should simultaneously be large enough to enable detection and small enough to avoid disturbance. Thus, in the conventional single-port PCM cell, the bitline voltage $V_{RBL}$ is set to 0.6 V in the read operation. We investigate the extra voltage needed for the read port of our two-port PCM cell to obtain performance equivalent to [10], which is a 5-μA read current in the amorphous state. We summarize the results in Table II, showing that the pMOS transistor needs lower $V_{RBL}$ and $V_{RWL}$ (0.97 and 0.3 V) than nMOS (1.52 and 1.22 V) to ensure the required read current. Since pMOS requires lower $V_{RBL}$ and $V_{RWL}$ (by 36% and 75%) than nMOS in the read port, we select pMOS as the access transistor for the read port. It is worth noting that since the required $V_{WBL}$ of nMOS in the write port is significantly lower than that of pMOS, nMOS can be used to design a low-power two-port PCM cell (though with the area overhead imposed by the voltage pumping control circuit). Finally, we estimate the cell size of our two-port PCM cell by following the cell area model [40], [41]. The actual size of the pMOS is $2F \times (W/L)F$, where the W/L ratio of pMOS in our two-port cell is four, and that of nMOS in the classical cell is five. Including the isolation area, the memory cell size in the two-port PCM cell configuration is $6 \times 2(W/L + 1) = 60F^2$ (0.486 μm$^2$) in 90-nm technology, shown in Fig. 1(b). For a fair comparison, we estimate the cell size of the design in [10] with the cell area model in [40]. The estimated cell size is $18F^2$ (0.18 μm$^2$) in 100-nm technology. Thus, the area overhead of our proposed two-port PCM cell is 170%, in comparison with the single-port PCM cell in [10]. We also compare the tradeoff between voltage pumping and the size of pMOS access transistors, as shown in Table III.
When access transistors have a W/L ratio of three, which means the cell size is $48F^2$ (0.38 μm$^2$, 110% additional area), the required voltage on the bitline is 5.97/0.93 V for the write/read port, compared with 5.63/0.92 V with access transistors of W/L = 4. Thus, if scalability is more important than power consumption, smaller access transistors should be selected; otherwise, larger access transistors can reduce power consumption.
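The area arithmetic above can be reproduced with a short sketch. It assumes only the cell area expression $6 \times 2(W/L + 1)$ in units of $F^2$ stated in the text and the single-port baseline of $18F^2$ at 100 nm; the function names are our own.

```python
def two_port_cell_area_f2(w_over_l: float) -> float:
    """Two-port cell area in units of F^2, using 6 x 2(W/L + 1)."""
    return 6 * 2 * (w_over_l + 1)


def area_um2(area_f2: float, feature_nm: float) -> float:
    """Convert an area in F^2 to square micrometers for feature size F."""
    feature_um = feature_nm / 1000.0
    return area_f2 * feature_um ** 2


def overhead_percent(cell_um2: float, baseline_um2: float) -> float:
    """Additional area relative to the baseline cell, in percent."""
    return (cell_um2 / baseline_um2 - 1.0) * 100.0


baseline = area_um2(18, feature_nm=100)   # single-port cell from [10]
for ratio in (3, 4, 5):
    f2 = two_port_cell_area_f2(ratio)
    um2 = area_um2(f2, feature_nm=90)
    print(ratio, f2, round(um2, 3), round(overhead_percent(um2, baseline)))
```

For W/L = 4 this gives $60F^2$, 0.486 μm$^2$, and 170% overhead, matching the text. For W/L = 3 the unrounded area is about 0.389 μm$^2$, so the computed overhead comes out a few points above the quoted 110%, which appears to be based on the rounded value of 0.38 μm$^2$.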
Advantages of Two Porting: With 170% additional area, every two-port PCM cell takes the area of almost three single-port PCM cells. However, tripling the number of conventional single-port PCM cells in each bank cannot improve performance as long as the number of access ports per bank remains unchanged, because the access port is the bottleneck. With only one port at the bank level, only one memory access to the bank can be served at a time. When the single access port of the conventional PCM array is reserved by an access request to a given cell, any other requests to the rest of the array are blocked. Tripling the number of cells per bank therefore cannot address this blocking issue, especially for PCM, which has long write latency. In contrast, with the 1R1W two-port PCM cell design, the 1R1W two-port PCM bank has two ports, which can serve one read and one write to the same bank simultaneously, reducing access delay in a way that adding single-port cells cannot. The additional cells could instead be used to increase the number of ports, but duplicating (or triplicating) single-port PCM cells yields only a 2R-or-1W (or 3R-or-1W) multiport PCM cell. As a result, only reads are accelerated, and the blocking write issue persists: the cells all share the same write port, so the resulting design can support three independent reads or a single write, but not a simultaneous read and write in any form. Thus, we believe that two porting to provide simultaneous read and write access ports will free up the banks to provide the throughput that is essential for PCM main memory applications.
Even though $5$-$6F^2$ DRAM has been proposed in [35], we believe our proposed two-port PCM cell design is a more suitable solution for future high-performance memory systems for the following reasons: 1) PCM has better scalability than DRAM, as PCM has shown its scaling potential to 5 nm [8], [9], while whether DRAM can scale beyond 22 nm is unknown [5], so that in the long term, the overhead of the two-port PCM can be compensated by the advantage in scaling; and 2) PCM also has other advantages over DRAM, such as low leakage energy, zero refresh energy, and nonvolatility, which make it an attractive candidate for next generation memory technology [11].
IV. TWO-PORT PCM AND PIPELINED MEMORY
Based on our proposed two-port PCM cell, we further two-port the PCM bank, as shown in Fig. 1(d), wherein two-port PCM cell arrays are organized in blocks. Provisioning a separate read port and write port for each bank can significantly reduce the delay of a read/write request, due to the reduction in the number of bank conflicts. The read/write buffers in the PCM banks ensure data coherence between incoming and ongoing read/write requests. Furthermore, we redesign the virtually pipelined memory to fully utilize the benefit of the two-port PCM, as shown in Fig. 1(e). With our dual-reservation-table design, the fixed delay of read requests can be significantly reduced. Meanwhile, we eliminate redundant write requests by tracking requests using a lookup table, and serving only the write request that carries the latest data for a given address. The rest of this section describes the rearchitected design from the cell array level to the pipelined memory level.
A. Two-Port PCM Cell Array and Bank
The two-port PCM cell arrays, as shown in Fig. 1(c), are organized in banks [10] such that a block consists of four cell arrays. The write current driver circuits with charge pumps are connected only to the write port bitlines, while the read sense amplifier circuits only serve the read ports [Fig. 1(d)]. Each cell array of 1 Mb is built with 2048 local bitlines and 2048 local wordlines. In a read access, the 4-kb cells connected to the selected local wordline are activated by the local wordline decoder and the global wordline decoder of the selected block. Thus, 8 bytes of data are latched at the sense amplifiers, which are connected to the local RBLs of the two-port PCM array. Current sense amplifiers are implemented for faster sensing, relying on current differences to create a differential voltage at the amplifier [10]. Although we adopt a similar wordline selection mechanism for write access, the major difference is that the local WBLs of the two-port PCM array are driven by the write driver and the charge pump. Random address remapping is realized at the virtually pipelined memory level [Fig. 1(e)], while pipelined memory accesses are served at the two-port PCM bank level [Fig. 1(d)].
At the bank level, we use a write buffer to queue write accesses. When a read access is issued to a PCM bank, the memory controller of this bank first checks the write buffer to see if there is a pending write to the same row in the write buffer, or if a write has been issued to the write port. In either case, data forwarding is implemented and the read access is serviced without accessing the cell array. Thus, as long as the write buffer does not overflow, write requests can be buffered and retired without blocking any read request. Meanwhile, when a write request is blocked by an ongoing read access to the same page, the write request remains in the write buffer until the read is completed. Since the read latency of PCM is 5-10× lower than the write latency of PCM, this blocking is insignificant to the performance of network memory.
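The bank-level behavior described above (queueing writes, forwarding data to reads that hit a pending or in-flight write, and holding back a write that conflicts with an ongoing read) can be sketched in software. This is only a behavioral illustration under our own assumptions: the class and method names are hypothetical, the cell array is modeled as a dictionary keyed by row, and timing is not modeled.

```python
class BankWriteBuffer:
    """Behavioral sketch of a per-bank write buffer with data forwarding."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.pending = []       # queued (row, data) pairs, oldest first
        self.in_flight = None   # (row, data) currently on the write port

    def enqueue_write(self, row, data):
        """Queue a write; returns False if the buffer would overflow."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append((row, data))
        return True

    def issue_next_write(self, rows_being_read=frozenset()):
        """Send the oldest write that does not conflict with an ongoing
        read to the write port; conflicting writes stay queued."""
        if self.in_flight is not None:
            return None
        for index, (row, _) in enumerate(self.pending):
            if row not in rows_being_read:
                self.in_flight = self.pending.pop(index)
                return self.in_flight
        return None

    def retire_write(self, cell_array):
        """Complete the in-flight write by updating the cell array."""
        if self.in_flight is not None:
            row, data = self.in_flight
            cell_array[row] = data
            self.in_flight = None

    def read(self, row, cell_array):
        """Return (data, forwarded). The newest pending write wins, then
        the in-flight write; otherwise the cell array is accessed."""
        for pending_row, data in reversed(self.pending):
            if pending_row == row:
                return data, True
        if self.in_flight is not None and self.in_flight[0] == row:
            return self.in_flight[1], True
        return cell_array[row], False
```

A read that hits the buffer never touches the cell array, so the read port stays free for other rows while the long write is outstanding.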
Each bank has 32 read and write buffer entries, similar to [42] . Since our primary focus was realizing high-throughput PCM-based memory for network processing, we defer to [42] and [43] for detailed studies of read/write buffer sizing for performance optimization.
B. Pipelined Memory Architecture
At the highest level, the virtually pipelined PCM memory is implemented, as shown in Fig. 1(e) . It consists of a random address remapping function, a pair of reservation tables, the read/write request buffers, and the read/write tracking lookup table.
1) Random Address Remapping: In the virtually pipelined PCM memory, a random memory address remapping function is realized to increase bank parallelism, as shown in Fig. 5. Due to the spatial locality exhibited by the memory access patterns of network applications, a relatively small region of the memory may be accessed intensively in a short period of time, resulting in severe bank conflicts. By randomly remapping the memory address, the spatial locality in memory accesses can be reduced, improving bank parallelism. In our architecture, we use a random invertible binary matrix for remapping, which incurs lower hardware overhead than conventional universal hashing [1]. A random invertible binary matrix can perform a one-to-one linear mapping from the original address space O to the remapped address space R [44]. Given an n-bit memory address, we can form an n × n matrix, M, by randomly selecting elements from {0, 1}, such that M is invertible. The remapped address R is generated by R = O × M, with arithmetic performed modulo 2. The matrix multiplication can therefore be realized using AND and XOR gates.
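As an illustration of the remapping step, the following Python sketch draws a random binary matrix, keeps it only if it is invertible over GF(2), and computes R = O × M with bitwise operations. The function names, the rejection-sampling strategy, and the encoding of each matrix row as an integer are choices made for this sketch, not details taken from the architecture.

```python
import random


def gf2_rank(rows, n):
    """Rank over GF(2) of a matrix stored as a list of n-bit integers (one per row)."""
    rows = list(rows)
    rank = 0
    for bit in reversed(range(n)):
        pivot = next((i for i in range(rank, len(rows)) if (rows[i] >> bit) & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and (rows[i] >> bit) & 1:
                rows[i] ^= rows[rank]   # eliminate this column from every other row
        rank += 1
    return rank


def random_invertible_matrix(n, rng):
    """Rejection-sample an n x n binary matrix until it has full rank over GF(2)."""
    while True:
        rows = [rng.getrandbits(n) for _ in range(n)]
        if gf2_rank(rows, n) == n:
            return rows


def remap(addr, rows):
    """R = O x M over GF(2): XOR the rows of M selected by the set bits of O."""
    out = 0
    for i, row in enumerate(rows):
        if (addr >> i) & 1:   # AND: bit i of the address selects row i
            out ^= row        # XOR: addition modulo 2
    return out


# Example with a hypothetical 8-bit address space and a fixed seed.
M = random_invertible_matrix(8, random.Random(1))
remapped = [remap(a, M) for a in range(256)]
```

Because M is invertible, `remapped` is a permutation of the 256 original addresses, so no two addresses collide after remapping.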
Note that random address remapping that considers the row buffer hit rate [45] and the translation between physical address and rank/channel address [46] can improve randomness, thereby enhancing bank-level parallelism in memory accesses. However, the main focus of this paper is to construct a pipelined memory architecture that fully exploits the 1R1W two-port PCM cells and provides low-delay read and write access for next-generation PCM-based network memory. Our proposed architecture is complementary to these existing remapping methods. A comprehensive evaluation that combines them, especially since random address remapping techniques are themselves an active area of research, is beyond the scope of this paper.
2) Dual Reservation Tables and Read/Write Request Buffers: In the conventional pipelined memory architecture, remapped memory accesses are stored in a reservation table with δ entries, where δ is a function of the DRAM access latency [1], [4]. Each entry includes a data field and an address field. For each access arriving at time t, an entry is created in the reservation table at location (t + δ) mod δ, so that it can be served with a fixed delay δ. However, since read and write requests are subject to an identical fixed delay, this scheme is unsuitable for PCM, whose write latency is several times its read latency.
Our architecture uses a dual reservation-table design to ensure that read requests are not penalized. One of the reservation tables is for read requests, while the other is for write requests. The write reservation table has δ_w entries, which is 5-10× the size of the read reservation table, δ_r. To reduce redundant memory accesses, we include two values in each entry of the reservation table: a status value and a pointer, which can point to any other entry in the dual reservation table storing requests to the same address. Details of the operation of pipelined memory will be discussed in the following section.
Even though requests are serviced with a fixed delay in the dual reservation tables, the actual memory access may happen at any time before the fixed delay expires. We use buffers to queue the requests to each PCM bank. A write request buffer and a read request buffer, both realized in SRAM, are associated with the corresponding ports of each PCM bank to provide the throughput needed for network processing. Each entry of these buffers includes only a pointer to the corresponding entry in the reservation table. We also implement the write tracking lookup table and the read tracking lookup table using content addressable memory to track the latest data update of a given memory address. Each entry of these lookup tables holds two values: the memory address and the pointer to the corresponding entry in the reservation tables.
3) Pipelined Memory Operations: When a write request enters the virtually pipelined memory at time t_1, it is stored in the write reservation table at the location (t_1 + δ_w) mod δ_w. We first set the status value of this entry to 0, i.e., not ready, and fill the data field with the written data. At the same time, if the write tracking lookup table has no entry for this address, a new entry is created whose pointer is set to the location (t_1 + δ_w) mod δ_w in the write reservation table. Otherwise, the existing entry in the write tracking lookup table updates its pointer to the location (t_1 + δ_w) mod δ_w. Furthermore, this request also enters the write request buffer of the destination bank. Whenever the write port of a bank is ready, it pops the top request of the write request buffer and checks whether the entry of the request buffer and the corresponding entry in the write lookup table store identical pointers. If there is a match, the write request is up-to-date, and it enters the write buffer of the bank; this entry in the request buffer is simultaneously dropped. If there is no match, there is another write request to the same address with updated data, and this old request is dropped. After the request is either pushed into the bank or dropped, we update the status value of the entry in the write reservation table to 1, flagging that it is completed at the level of pipelined memory.
In addition, when a read request enters the virtually pipelined memory at time t_2, an entry is created at the location (t_2 + δ_r) mod δ_r. The status value of this entry is set to 0, and the data field is left empty. Both the write and the read tracking lookup tables are then checked. If the tables have a corresponding entry, the pointer of the new entry in the reservation table points to the latest corresponding location indicated in the lookup tables. If the pointer of this new request points to an existing entry in the write reservation table, this read request in the read reservation table fills its data field by copying the data of the corresponding entry in the write reservation table, and sets the status value to 1. If the pointer points to an existing read request in the read reservation table, this new read entry copies the data and the status value from the existing request, regardless of whether its data field is empty. In the case where the new read request cannot find any previous request to the same address in the tracking lookup tables, the read request is pushed into the read request buffer of the destination bank and a corresponding entry in the read tracking lookup table is created. Once the read port of a bank is ready, it serves the top entry of the read request buffer. Unlike the write operation, the read port does not check the read tracking table; it instead drops requests with the status value of 1, and serves requests with the status value of 0. After the data are ready at the read port: 1) the data are updated in the data field of the corresponding entry in the read reservation table; 2) the status value is set to 1, meaning the data are ready; and 3) the data field and the status value of all read requests pointing to the currently served read request are updated accordingly.
At time t, the virtually pipelined memory checks the status value of the entry in the write reservation table at the location t mod δ_w and the entry in the read reservation table at the location t mod δ_r. If either of them has the status value of 0, the pipelined memory issues a throttling signal to the CPU to suspend memory access requests until the stall is resolved.
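The write, read, and throttling steps above can be sketched for a single bank as follows. This is a simplified model under several assumptions that the text does not specify: the class and method names are hypothetical, the bank is a Python dictionary, the due slots are checked (`retire`) before a new request is inserted at the same time step, a pending write takes precedence over a pending read when both tracking tables hold the address, and tracking entries are deleted when their slot is freed.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    addr: int
    data: Optional[int] = None
    status: int = 0             # 0 = not ready, 1 = completed / data ready
    ptr: Optional[int] = None   # read slot this entry waits on, if any


class DualReservationTables:
    def __init__(self, delta_r, delta_w):
        self.delta_r, self.delta_w = delta_r, delta_w
        self.rtab = [None] * delta_r         # read reservation table
        self.wtab = [None] * delta_w         # write reservation table
        self.rtrack, self.wtrack = {}, {}    # address -> slot (stand-ins for the CAMs)
        self.rq, self.wq = deque(), deque()  # request buffers hold slot indices only
        self.mem = {}                        # stand-in for the PCM bank contents

    def retire(self, t):
        """Check the slots due at time t; return (stall, data of the due read)."""
        ri, wi = t % self.delta_r, t % self.delta_w
        r, w = self.rtab[ri], self.wtab[wi]
        if any(e is not None and e.status == 0 for e in (r, w)):
            return True, None                # throttle: a due request is not ready
        # Assumption: tracking entries are dropped when their slot is freed.
        if r is not None and self.rtrack.get(r.addr) == ri:
            del self.rtrack[r.addr]
        if w is not None and self.wtrack.get(w.addr) == wi:
            del self.wtrack[w.addr]
        self.rtab[ri] = self.wtab[wi] = None
        return False, (r.data if r is not None else None)

    def write(self, t, addr, data):
        i = t % self.delta_w                 # equals (t + delta_w) mod delta_w
        self.wtab[i] = Entry(addr, data)
        self.wtrack[addr] = i                # create or update the tracking pointer
        self.wq.append(i)

    def serve_write(self):
        i = self.wq.popleft()
        e = self.wtab[i]
        if self.wtrack.get(e.addr) == i:     # pointers match: request is up to date
            self.mem[e.addr] = e.data
        e.status = 1                         # pushed to the bank or dropped as stale

    def read(self, t, addr):
        i = t % self.delta_r                 # equals (t + delta_r) mod delta_r
        e = self.rtab[i] = Entry(addr)
        if addr in self.wtrack:              # copy data from the tracked write
            e.data, e.status = self.wtab[self.wtrack[addr]].data, 1
        elif addr in self.rtrack:            # share the result of an earlier read
            e.ptr = self.rtrack[addr]
            src = self.rtab[e.ptr]
            e.data, e.status = src.data, src.status
        else:                                # first request to this address
            self.rtrack[addr] = i
            self.rq.append(i)

    def serve_read(self):
        i = self.rq.popleft()
        e = self.rtab[i]
        if e.status == 1:
            return                           # already satisfied: drop the request
        e.data, e.status = self.mem.get(e.addr), 1
        for other in self.rtab:              # update reads still waiting on this entry
            if other is not None and other.ptr == i and other.status == 0:
                other.data, other.status = e.data, 1
```

With, for example, `delta_r=2` and `delta_w=10`, a read that follows a write to the same address is completed from the write reservation table without touching the bank, while a read whose slot comes due before `serve_read` has run causes `retire` to report a stall.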
V. QUEUING THEORETICAL ANALYSIS
Although the performance of conventional single-port DRAM architectures has been studied within the framework of queuing theory [47], this approach cannot be directly applied to: 1) single-port PCM, due to its asymmetrical read/write latency or 2) the proposed two-port PCM, due to the differences in how a two-port memory is accessed. To the best of our knowledge, this paper provides the first framework for the modeling and analysis of the performance of PCM-based memory architectures.
A. Single-Port DRAM Analysis
The performance of conventional single-port DRAM architectures has been studied within the framework of queuing theory [47]. In the conventional case, an m-bank DRAM can be modeled with an open queuing model. We assume that the minimum cycle time of bank reservation time is t_M, which means that two consecutive memory accesses to the same bank cannot be served within t_M cycles. In addition, we assume that the average interval between two consecutive memory accesses issued by the CPU is t_A cycles. Since memory addresses are randomly remapped in pipelined memory, we assume that memory addresses are uniformly distributed over all m banks. Moreover, even though there are m banks serving accesses simultaneously, the bank address of each memory access is determined before it enters the memory system, which means it can only be served by that specific bank. Therefore, we model the system as m independent M/D/1 queues, with arrivals determined by a Poisson process, deterministic service time, and only one server per queue. For each independent queue, the access rate α and the memory utilization μ are given by

$$\alpha = \frac{1}{m \times t_A}, \qquad \mu = t_M \times \alpha.$$
The average number of waiting requests in the M/D/1 queue, W, is given by the following expression:

$$W = \frac{\mu^2}{2(1-\mu)}.$$

Furthermore, from Little's result, we obtain the average delay of the queue, D, as

$$D = \frac{W}{\alpha} = \frac{\mu \, t_M}{2(1-\mu)}.$$
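As a numerical illustration of the single-port DRAM model, the sketch below evaluates the per-bank M/D/1 quantities, assuming the standard results for the mean number of waiting requests, W = μ²/(2(1 − μ)), and the mean queuing delay from Little's result, D = W/α. The function name and the parameter values (4 banks, one access every 20 ns, a 40-ns reservation time) are hypothetical and chosen only to give round numbers.

```python
def md1_metrics(m, t_a, t_m):
    """Per-bank M/D/1 metrics for an m-bank memory (illustrative sketch).

    m   : number of banks
    t_a : mean interval between consecutive accesses issued by the CPU
    t_m : bank reservation (service) time, in the same time unit as t_a
    """
    alpha = 1.0 / (m * t_a)            # access rate seen by one bank
    mu = t_m * alpha                   # utilization of that bank
    if mu >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    w = mu ** 2 / (2.0 * (1.0 - mu))   # mean number of waiting requests
    d = w / alpha                      # mean queuing delay (Little's result)
    return alpha, mu, w, d


# Hypothetical operating point, not taken from the evaluation.
alpha, mu, w, d = md1_metrics(m=4, t_a=20.0, t_m=40.0)
print(f"alpha={alpha:.4f} per ns, mu={mu:.2f}, W={w:.3f}, D={d:.1f} ns")
```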
B. Single-Port PCM Analysis
As in the case of conventional DRAM, each bank is modeled as an independent queue. In single-port PCM, however, the service time of a memory access is a random variable, so each bank is modeled as an M/G/1 queue, where G denotes that the service times have a general distribution (a two-valued, Bernoulli-type distribution in this case). We assume that the minimum cycle time of bank reservation time of a write access is t_MW, while that of a read access is t_MR, and that the fraction of write accesses among all memory accesses is p, while the fraction of read accesses is q = 1 − p. Therefore, the average service time of a port, t_SP, and the second moment of the service time, t²_SP, are

$$t_{SP} = p\,t_{MW} + q\,t_{MR}, \qquad t^2_{SP} = p\,t_{MW}^2 + q\,t_{MR}^2.$$
As in the case of the conventional DRAM, the two performance metrics that we consider are W and D, but they are evaluated using the Pollaczek-Khinchin formula [48]

$$W = \frac{\alpha^2\, t^2_{SP}}{2(1 - \alpha\, t_{SP})}, \qquad D = \frac{W}{\alpha} = \frac{\alpha\, t^2_{SP}}{2(1 - \alpha\, t_{SP})}$$

where α = 1/(m × t_A) and t²_SP denotes the second moment of the service time. Unlike in DRAM, the fraction of write requests, p, has a dominant impact on both the number of waiting requests and the average delay. Both grow superlinearly as the fraction of write requests increases.
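The effect of the write fraction p on single-port PCM can be checked numerically with the sketch below. It assumes the standard Pollaczek-Khinchin expression for the mean number of waiting requests, W = α² E[S²] / (2(1 − α E[S])), with E[S] = p t_MW + q t_MR and E[S²] = p t_MW² + q t_MR², and D = W/α. The function name and the bank count and access interval are hypothetical; the 40-ns read and 200-ns write latencies mirror the 5× ratio used in the evaluation.

```python
def mg1_metrics(m, t_a, t_mr, t_mw, p):
    """Per-bank M/G/1 metrics for single-port PCM with write fraction p (illustrative sketch)."""
    q = 1.0 - p
    alpha = 1.0 / (m * t_a)                    # access rate seen by one bank
    t_sp = p * t_mw + q * t_mr                 # mean service time
    t_sp2 = p * t_mw ** 2 + q * t_mr ** 2      # second moment of the service time
    rho = alpha * t_sp                         # port utilization
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    w = alpha ** 2 * t_sp2 / (2.0 * (1.0 - rho))   # Pollaczek-Khinchin
    d = w / alpha                                  # Little's result
    return rho, w, d


# Hypothetical operating point: 4 banks, one access every 20 ns.
for p in (0.0, 0.05, 0.1):
    rho, w, d = mg1_metrics(m=4, t_a=20.0, t_mr=40.0, t_mw=200.0, p=p)
    print(f"p={p:.2f}: utilization={rho:.2f}, W={w:.3f}, D={d:.1f} ns")
```

With p = 0 the expressions reduce to the M/D/1 case, which is a useful sanity check on the model.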
C. Two-Port PCM Analysis
For our two-port (1R1W) PCM, each port can be modeled as an independent M/D/1 queue. In this model, the write port queue has a deterministic service time of t_MW, while that of the read port queue is t_MR. The access rate is α_w = p/(m × t_A) for a write port and α_R = q/(m × t_A) for a read port. Thus, the memory utilization is μ_w = t_MW × α_w for the write port and μ_R = t_MR × α_R for the read port. Applying the M/D/1 expressions and Little's result, we obtain the number of requests waiting in queue and the average delay, W_w and D_w for a write port and W_R and D_R for a read port

$$W_w = \frac{\mu_w^2}{2(1-\mu_w)}, \quad D_w = \frac{W_w}{\alpha_w}, \qquad W_R = \frac{\mu_R^2}{2(1-\mu_R)}, \quad D_R = \frac{W_R}{\alpha_R}.$$
From the expressions above, we observe that the number of waiting requests in the write/read port is superlinear in the fraction of the corresponding memory access. In addition, the average delay grows superlinearly with the fraction of the corresponding memory access. For convenience, we summarize the results of the analysis of the proposed architecture using queuing theory in Table IV. With this analysis, we can compare the different memory architectures as follows.
When accesses are only reads, i.e., p = 0, the average delay D and the number of waiting requests W of both single-port PCM and the read port of two-port PCM are identical to those of DRAM, assuming that the read latencies of PCM and DRAM are the same. However, as the fraction of write accesses, p, increases, the delay of single-port PCM increases quadratically, as shown in Fig. 6(a), while the delay of the read port in the two-port PCM slowly decreases, as shown in Fig. 6(b). Thus, the two-port PCM design can serve reads much faster than traditional PCM when a significant portion of accesses are writes. The delay of the write port in the two-port PCM also grows quadratically, which is attributed to the inherently long write latency of PCM. We also show the number of waiting requests as the fraction of write accesses grows in Fig. 7(a) and (b). The single-port PCM and the write port of the two-port design again show a quadratic increase in the number of waiting requests, while that of the read port of two-port PCM remains lower than that of single-port DRAM.
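The qualitative comparison above can be reproduced with a short sketch that evaluates the mean queuing delay of single-port PCM (M/G/1) and of each port of the two-port PCM (M/D/1) as p varies. It assumes the standard Pollaczek-Khinchin delay, D = α E[S²] / (2(1 − α E[S])), which covers the deterministic case when E[S²] = E[S]². The function names and the operating point (8 banks, one access every 20 ns, 40-ns reads, 200-ns writes) are hypothetical and are not the configuration behind Figs. 6 and 7.

```python
def queuing_delay(alpha, mean_s, second_moment_s):
    """Mean queuing delay from the Pollaczek-Khinchin formula (illustrative sketch)."""
    rho = alpha * mean_s
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    return alpha * second_moment_s / (2.0 * (1.0 - rho))


def compare(p, m=8, t_a=20.0, t_mr=40.0, t_mw=200.0):
    """Return (single-port delay, read-port delay, write-port delay) for write fraction p."""
    q = 1.0 - p
    alpha = 1.0 / (m * t_a)
    single = queuing_delay(alpha, p * t_mw + q * t_mr, p * t_mw ** 2 + q * t_mr ** 2)
    read_port = queuing_delay(q * alpha, t_mr, t_mr ** 2)    # alpha_R = q / (m * t_A)
    write_port = queuing_delay(p * alpha, t_mw, t_mw ** 2)   # alpha_w = p / (m * t_A)
    return single, read_port, write_port


for p in (0.0, 0.1, 0.3, 0.5):
    single, read_port, write_port = compare(p)
    print(f"p={p:.1f}: single={single:7.2f} ns, read port={read_port:5.2f} ns, "
          f"write port={write_port:7.2f} ns")
```

Under these assumptions the single-port and read-port delays coincide at p = 0, the read-port delay falls as p grows, and the single-port delay rises, matching the trends described in the text.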
VI. SIMULATION SETUP AND RESULTS
Our simulation framework considers three different applications from PacketBench [49]: IP Security (IPSec) protocol, Flow Classification (Flow-Class), and an IPv4 packet forwarding application (IPv4-radix). These three applications represent various network processing applications: IPSec reads and modifies the packet payload, Flow-Class is a classic network monitoring application, and IPv4-radix represents the most common application in network processing: packet forwarding. We simulate the network processor on the SimpleScalar simulator configured for an ARM core at 667 MHz [50] and a 256-bank memory. The write/read latency of DRAM and the read latency of PCM are set to 40 ns, and the write latency of PCM is set to 200 ns. Note that we do not implement write scheduling methods, such as write pausing and read-priority-over-write, in our evaluation, although these methods can improve the performance of PCM in general-purpose computing applications. The reason is that our proposed two-port PCM is designed as a candidate solution for networking applications, where memory accesses arrive at a high frequency. For example, a new packet may arrive every 8 ns on a 40-Gb/s OC-768 line; as a result, there is little-to-no idle time between memory accesses, making these write scheduling methods less suitable for network memory. Write pausing and read-priority-over-write are nevertheless complementary to our proposed two-port PCM architecture and can further improve performance in the general-purpose application domain. We use the traces collected from the Center for Applied Internet Data Analysis Equinix-Chicago monitor in 2003, 2008, and 2011. In our simulation, we compare the performance of three pipelined memory architectures: single-port DRAM, single-port PCM, and our proposed two-port PCM, by evaluating four different metrics: memory access rate, port utilization, average number of waiting requests, and average delay.
Fig. 8. Memory access rate of combinations of packet traces and network applications. This metric is determined by the memory access pattern of the application and the packet arrival rate of the trace. Thus, the access rates in both single-port DRAM and PCM are identical. The difference between the write port and the read port in the two-port PCM shows the fraction of write (read) accesses in the application.
A. Memory Access Rate
The memory access rate is the mean number of memory accesses to a port per unit time, which is 1 ns in this case. It is equal to the reciprocal of the mean interaccess time. The memory access rate is important in practice since it is a memory-architecture-independent metric, reflecting the memory access pattern of the given combination of a network application and a packet trace.
We observe various memory access rates in different combinations of packet traces and applications, as shown in Fig. 8. Among the network applications, IPSec reads and modifies not only the header but also the packet payload, resulting in the highest memory access rate, while Flow-Class only monitors the header of the packet, leading to the lowest memory access rate, up to only 9.1% of that of IPSec. The memory access rate of IPv4-radix is between those of IPSec and Flow-Class, about 47.1%-79% of that of IPSec. By comparing the memory access rates of the write port and the read port of the PCM, we observe that the ratio of write accesses to all memory accesses, p, is smaller in both IPv4-radix and Flow-Class, since IPv4-radix, as a routing algorithm, does not write to the packet buffer, and Flow-Class has few packet buffer writes. In addition, faster packet traces, such as the 2011 trace, exhibit higher access rates. In the 2008 trace, the memory access rate is about 50% lower than in the 2011 trace, while that of the 2003 trace is about 90% lower than in the 2011 trace.
B. Port Utilization
Port utilization is the fraction of time during which the port is busy, and is generally determined by the memory access rate and memory latency. It not only shows the occupancy of the memory port but also indicates how well the memory port can tolerate higher memory access rates.
We show the port utilization in different architectures in Fig. 9. Unlike the memory access rate, which is determined only by the timing of incoming packets and the network application processing the packets, the port utilization is also related to the speed of processing memory accesses. Even though single-port PCM and single-port DRAM have identical memory access rates, single-port PCM has the highest port utilization in all applications, since both write accesses and read accesses compete for the single port. The read port in our two-port PCM exhibits the lowest utilization, because it only processes read accesses to PCM. For example, for IPSec with the 2011 trace, the memory access rates of the write port and the read port in PCM are almost the same, and the write port utilization is 5× the read port utilization. This is consistent with the PCM parameters, i.e., the PCM write latency is 5× the PCM read latency. We also show the average port utilization of the two ports of the two-port PCM, which is significantly lower than that of single-port PCM.
C. Average Delay
This delay, also known as the queuing delay, is the time between the point at which a request enters the virtually pipelined memory architecture and the point at which the request leaves the request buffer and enters the memory bank. The average delay is the mean queuing delay over all memory accesses. It is an important metric of memory performance: when the queuing delay of memory accesses is reduced, the network processor spends less time waiting for load/store operations and can achieve higher performance.
We show the average delay in Fig. 10. The average delay of single-port PCM is up to 16.8× the average delay of single-port DRAM, when the memory access rate of both is identical. This large extra delay is mostly caused by the long write latency, which is illustrated by the comparison between the delay of single-port PCM and those of the write/read ports in two-port PCM. For example, for IPSec with the 2011 trace, single-port PCM has the longest expected read delay of 56.9 ns. In contrast, the expected read delay within the proposed two-port architecture is only 1.39 ns, which is about 40× lower than that of single-port PCM. As the packet arrival rate increases from the 2003 trace to the 2011 trace, the delay of single-port PCM increases, which shows that conventional PCM may not meet the requirements for network memory. Furthermore, the sum of the expected write and read delays in the two-port PCM is still less than the delay of single-port PCM, indicating that the overall memory performance of the proposed two-port PCM is better than that of conventional PCM.
Fig. 11. Average number of waiting requests. This metric is the average number of requests waiting in request buffers over the total processing time.
D. Average Number of Waiting Requests
Waiting requests are those requests that have entered the virtually pipelined memory, but that have not been served at the bank level. They are held in request buffers in the virtually pipelined memory architecture. The average number of waiting requests is the mean number of waiting requests per unit time. The average number of waiting requests is critical in network memory, since it determines the size of the request buffers in the network device design.
For the average number of waiting requests, the proposed two-port PCM architecture provides significant improvements, as shown in Fig. 11. The average number of waiting requests in the write port of the two-port PCM is only 27%-48% of that in single-port PCM, while the number of waiting requests in the read port is less than 6% of that in single-port PCM. Furthermore, the sum of the number of waiting requests in the write/read ports is only 33.1%-49.8% of that in single-port PCM, i.e., the proposed two-port PCM architecture can reduce the required buffer size by at least 50% in comparison with conventional PCM.
In the discussion above, we set the ratio of write latency to read latency of PCM to five. Since the long PCM latency is the major cause of the performance degradation in using PCM as network memory, we also conduct a sensitivity analysis with respect to the PCM write latency.
We first evaluate the average delay and the average number of waiting requests of our proposed two-port PCM and the conventional PCM with different packet traces. We use the IPSec application, and vary the ratio of write latency to read latency from 1 (similar to DRAM) to 10 (double the previous write latency setting). As shown in Figs. 12 and 13, single-port PCM with the fastest packet trace (the 2011 trace) is the most sensitive to the write latency. That is because single-port PCM is more likely to experience access congestion when the increasing write latency results in more blocking writes in a high-speed network device. Based on this observation, it is likely that any increase in the write latency due to process variation may cause severe problems in using conventional PCM in future network memory, where packets arrive at higher rates.
We also conduct the write latency sensitivity analysis with different network applications, using the 2011 trace, as shown in Figs. 14 and 15. Single-port PCM processing IPSec traces is the most sensitive to increasing write latency. This is to be expected since IPSec modifies the packet payload, resulting in the highest fraction of write accesses among all memory accesses. As the fraction of writes increases, single-port PCM is increasingly unable to serve these writes. As network security issues draw more attention, the increasing number of writes in network applications makes conventional PCM less suitable for future network memory, while our proposed two-port PCM is a better option for network memory.
VII. CONCLUSION
In this paper, we described a novel design of two-port PCM virtually pipelined memory for high-performance network processing. We proposed a novel two-port PCM cell to reduce write blocking at the bank and architecture levels. We comprehensively evaluated the two-port cell design in terms of programming current, necessary voltage pumping for access transistors, and area overhead. The architecture also uses two-port memory banks while ensuring data coherence between incoming and ongoing read/write requests. Finally, we proposed the virtually pipelined memory architecture to further eliminate the extra delay of read requests caused by the asymmetrical access latency, through separate reservation tables for read and write requests. Analysis of the performance of the architecture using queuing theory shows that a 1R1W PCM substrate can significantly reduce the expected delay of a read access for networking applications. Further, it also reduces the number of waiting requests at the bank level, leading to a smaller buffer size. When the theoretical analysis is combined with statistical information obtained by processing network packet traces using PacketBench, results show that the proposed innovations can reduce the expected read (write) delay by 12-40× (up to 14%) over conventional single-port PCM for 110%-170% additional area.
