Although flash memory solid state drives (FSSDs) outperform traditional hard disk drives (HDDs), their performance still fails to keep pace with the steadily increasing speed of microprocessors, regardless of the available high bandwidth. To alleviate this bottleneck, many semiconductor companies, such as Intel, Micron, Samsung, and Hynix, have recently manufactured faster and more scalable non-volatile memory (NVM) technology as main memory, but none so far has publicly announced the implementation or production of a full NVM phase change memory SSD (PCM-SSD). By implementing NVM-PCM as secondary memory, we can build a future PCM-SSD (PSSD) to replace the slower traditional FSSD. However, a careful design, especially of the controller, is essential to hide and manage PCM's endurance constraints, in-place-update ability, and bit-addressability, and to make the device appear to the host as a block device, as its predecessors (HDD and FSSD) do. In this paper, we propose ExTENDS, a hardware assumption of NVM-PCM instead of NVM flash memory as the future secondary/persistent memory in storage systems. We further present a PCM file translation layer (PhaseFTL) that efficiently manages address translations from a host file system to PCM while hiding PCM constraints and allowing the PCM blocks to wear evenly. Moreover, PhaseFTL can efficiently exploit the bit-addressability and in-place-update features of PCM. Our experimental results show that the proposed PSSD improves overall SSD throughput by an average of 69% compared to traditional FSSDs.
I. INTRODUCTION
In computer systems, microprocessor (i.e., CPU) technology continues to advance, enabling performance to improve consistently year after year. However, the speed of persistent storage (e.g., NAND solid state drives (SSDs)) lags behind, even though the flash memory SSD (FSSD) delivered a 10X performance improvement over hard disk drives (HDDs) [1], [2]. Efforts to eliminate this performance discrepancy are evidenced by many semiconductor companies, such as Intel, Hynix, Micron, Samsung, and Toshiba, that have already manufactured faster and more scalable non-volatile memory (NVM) technologies. These technologies include phase change memory (PCM) [3], magnetoresistive random-access memory (MRAM) [4], spin-torque transfer memory (STTRAM) [5], Micron's 3D XPoint Technology [6], resistive random-access memory (ReRAM) [7], and so on.
The associate editor coordinating the review of this manuscript and approving it for publication was Hao Luo.
Given that the above-mentioned technologies retain the non-volatile characteristics of conventional FSSDs and provide speeds close to those of DRAM [8], storage media built on them can fully leverage the high speeds and bandwidth of faster microprocessors [6], [9]. Non-volatile PCM is considered the closest candidate, with prototype chips and devices already developed [10]. Moreover, some studies have proposed using PCM as a non-volatile replacement for the DRAM cache in FSSDs [3], [11], [12], although these proposals have not yet been developed into real-world products.
Furthermore, Intel and Micron have been producing Optane SSDs, which have been on the market for the past two years and are backed by Intel's 3D XPoint technology [6]. They have argued that an SSD employing this technology is 1000X faster than current FSSDs, although the current Optane SSD is four to ten times faster than high-end NVMe NAND SSDs. Another limiting factor is the cost per gigabyte of storage: the Optane SSD costs around five times more than flash NVMe SSDs. Furthermore, because the internal structure of the Optane SSD and the memory cell types used in 3D XPoint technology have not been disclosed, we cannot claim that it is composed of any particular known NVM cell such as PCM, other than on the basis of assumptions [12].
Compared with an FSSD, a PCM-based SSD (PSSD), if implemented as a product, would offer several advantages in terms of performance, durability, and energy savings [13]. In addition to bit addressability [6], PCM offers a response time that is around 300 times faster (0.5µs, from Table 1) and supports page overwrites, also known as in-place updates, which FSSDs lack; this enables it to process a single write operation in under 1µs [14]. Table 1 compares DRAM, PCM, and flash memory cells. In addition, PCM uses approximately 6 Joule/Gigabyte (GB) of energy for a single program operation, which is one-third of that used by an FSSD [15]. Furthermore, PCM has a high-density scalability advantage.
To serve incoming I/O requests, SSDs have a controller that manages all activity between the host and the user data in storage. The controller includes a system software module called the Flash Translation Layer (FTL) [16], [17]. The FTL's major roles include (i) logical-to-physical address translation, (ii) wear-leveling, and (iii) garbage collection to maintain free SSD space, which applies only to FSSDs and not to PSSDs. Mimicking the characteristics of a traditional HDD block device and exposing an array of logical addresses to the upper-level file system requires an FTL. Furthermore, a PCM FTL should hide the underlying PCM constraints and operations, just as an FTL does for an FSSD. Unfortunately, FSSD FTLs cannot be adopted in PSSDs because these algorithms were designed to hide the out-of-place updates of flash memory; if applied directly to PCM-based systems, they would cause unnecessary, frequent write operations on the PCM mapping tables [18], [19]. Therefore, directly adopting FSSD FTLs in PSSDs is not feasible.
Apart from the above-mentioned advantages of non-volatile PCM over flash memory, more work is required before PCM can be adopted as a storage alternative to flash. Most importantly, wear-leveling techniques [10], [20] are essential for PSSDs to address the issue of extensive writes on the mapping tables. At the same time, hot cell regions must be balanced against cold regions to sustain a uniform cell lifespan. Moreover, the bit-alterability of PCM [6], [9], [10], [21], which is absent in FSSDs, should be considered. Furthermore, because PCM updates in place and needs no erase operations [22], merging and garbage-collection algorithms can be excluded from a PCM-based SSD controller, unlike in an FSSD.
Considering all these favorable characteristics of PCM, we propose a hardware assumption of non-volatile PCM instead of legacy flash memory as the future secondary/persistent memory in storage systems. This study covers PCM-based storage architectures and PCM-aware lifespan and data management schemes that can replace traditional flash memory-based SSDs. Our goal is to alleviate the decade-long speed disparity between storage media and microprocessors in computer systems. This paper makes the following contributions:
• We make a fundamental change from current flash memory to future PCM-based memory cells as an alternative non-volatile secondary memory, in order to achieve high bandwidth in the storage arrays of computer systems.
• We employ an on-demand selective-page-level mapping scheme on our proposed future PSSD, wherein PCM, besides storing all data, holds the cold mapping entries and DRAM holds the hot mapping entries.
• We explore in detail the possible architectures in which PCM can be implemented efficiently in SSDs. We propose an efficient PCM file translation layer (PhaseFTL) to conceal PCM constraints while protecting its cells from rapid wear, with I/O parallelism far beyond the capabilities of flash memory. PhaseFTL can also be applied to any NVM technology storage array.
• We used our own hardware assumptions and simulated a non-volatile PCM-based SSD platform that mimics the 3D XPoint technology architecture. We implemented the proposed approach (a PSSD employing PhaseFTL) on the EagleTree simulator while allowing for bit/byte addressability.
The remainder of this paper is organized as follows: Section II briefly discusses previous work, while Section III presents the background and motivation. Section IV describes the architectural design and implementation of our proposed PSSD. Section V presents our experimental evaluation and results. We conclude in Section VI and provide brief insight into our future studies.
II. RELATED WORK
Most recent FTL schemes [18], [19], [23]-[25] were designed for hybrid NAND+PCM SSDs and cannot effectively support our proposed PSSD. PCM lifespan (10^8 writes) needs careful consideration, as its endurance is measured by the number of programs per cell before the first bit error appears without the use of ECC [22].
To address PCM wear-out, a previous study [20] proposed a technique called Start-Gap, which uses an algebraic mapping between logical and physical addresses to avoid tracking write counts per page. Such tracking is used in table-based wear-leveling algorithms like [26] and incurs storage and latency overheads [12], since larger tables are required for tracking and for in-memory relocation of pages. Using two registers (start and gap), the authors argue that PCM lifetime can be extended by repeatedly moving each line to its neighboring location, irrespective of whether it lies in a cold or hot region. Even though this technique incurs lower hardware overhead, it does not consider the access patterns of the device or, most importantly, the update frequencies of the PCM cells.
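To make the two-register idea concrete, the following is a minimal sketch of a Start-Gap style remapper, assuming one spare physical line and a fixed number of writes between gap movements. The class name `StartGap`, the `gap_interval` parameter, and the in-memory list standing in for PCM lines are illustrative assumptions; the sketch follows the general description of the technique in [20] and is not taken from any reference implementation.

```python
class StartGap:
    """Start-Gap style remapping over n logical lines and n + 1 physical lines."""

    def __init__(self, n_lines, gap_interval):
        self.n = n_lines
        self.start = 0                # Start register
        self.gap = n_lines            # Gap register: index of the unused physical line
        self.interval = gap_interval  # writes between gap movements (assumed knob)
        self.writes = 0
        self.mem = [None] * (n_lines + 1)  # stands in for the PCM lines

    def translate(self, la):
        """Algebraic logical-to-physical mapping; no per-line table is kept."""
        pa = (la + self.start) % self.n
        return pa + 1 if pa >= self.gap else pa

    def _move_gap(self):
        if self.gap == 0:
            # The gap wraps around: the last physical line moves to line 0.
            self.mem[0] = self.mem[self.n]
            self.mem[self.n] = None
            self.gap = self.n
            self.start = (self.start + 1) % self.n
        else:
            # Shift the neighbouring line into the gap.
            self.mem[self.gap] = self.mem[self.gap - 1]
            self.mem[self.gap - 1] = None
            self.gap -= 1

    def write(self, la, value):
        self.mem[self.translate(la)] = value
        self.writes += 1
        if self.writes % self.interval == 0:
            self._move_gap()

    def read(self, la):
        return self.mem[self.translate(la)]
```

Note that the gap moves regardless of which line was written, which is why the scheme needs no write counters but also cannot distinguish hot lines from cold ones.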
To mitigate this, PTL [12], a strategy that considers the update frequencies of the chip, was proposed. PTL uses an application-aware strategy, AWL, to identify hot and cold data and tries to allocate them separately onto PCM chips. However, PTL is an improvement of PCM-FTL [23] designed for managing PCM cells alongside flash; because it is based on the underlying flash pages, it cannot fully serve SSDs that use PCM as the user data store (PSSDs).
Even though there is little research on implementing PCM as a secondary storage unit in SSDs, a few studies [1], [3], [18] have proposed such models. Their contributions are a big step toward promoting new ideas for this emerging technology. Moneta [1], a prototype emulating a PCM storage array, was proposed, though its basic design did not rely on the particular characteristics of PCM. Instead, the authors used a DDR2 memory array to emulate PCM, assuming that PCM's latency and bandwidth are close to those of DRAM. Even though Moneta's design did not conform to PCM characteristics, the authors' analysis suggested increasing parallelism (doubling the number of controllers) to hide the cell access latency of their prototype, thereby improving memory latency.
Chang et al. [15] proposed a hybrid PCM design that employs a pattern-aware write-back software strategy at the host application level to reduce PCM energy consumption. By selecting data pages according to their bit pattern, viz. data access locality and PCM characteristics, the authors proposed writing data from main memory to the PCM device with minimized energy consumption.
However, this approach differs from ours because it mainly focuses on saving energy and does not consider the performance degradation caused by implementing the FTL at the application level of the operating system rather than on the PCM device. The application level cannot fully utilize the high bandwidth provided by PCM, which creates a performance bottleneck. In contrast, we apply our FTL and PCM management schemes inside the PCM device to exploit the high bandwidth of PCM for better performance and efficient energy consumption.
Furthermore, a write-activity-aware FTL scheme called PCM-FTL [18] was proposed, which replaces DRAM with PCM. PCM-FTL tries to manage the intensive write activity of NAND flash memory by employing a two-level mapping scheme: a block-level mapping stored in a buffer handles hot sequential requests, and a page-level mapping table handles cold random requests. However, PCM-FTL's buffer becomes very hot owing to frequent updates and thus cannot be implemented efficiently on a full PCM SSD. Moreover, it relies on flash memory pages to update its mapping entries and therefore, unlike our PhaseFTL, cannot be adapted to an environment where PCM is the secondary memory.
III. BACKGROUND AND MOTIVATION
A. BACKGROUND
Improving the slow performance of storage in computer systems is important and has led to a variety of changes in persistent storage media, particularly in SSD memory cells. The most widely adopted design uses DRAM as a cache and flash memory as persistent storage (DRAM+Flash). Owing to the limited scalability and the volatility of DRAM, the entire page-level mapping table image cannot be loaded into the cache, so only a small portion is cached. This causes performance degradation from cache miss penalties when such a system is exposed to workloads dominated by small random reads/writes [27], [28]. Consequently, replacing DRAM with a more scalable non-volatile memory that has read/write speeds close to those of DRAM has been proposed. PCM technology is a promising future large-scale memory candidate because of its non-volatility, read/write speeds close to those of DRAM, in-place update ability, and better endurance compared to flash memory [1], [3], [6], [10], [11], [15].
The storage element of PCM is composed of chalcogenide glass [6] (i.e., a resistor and phase change material) that separates two electrodes, as shown in Figure 1(a). When heated by electrical pulses, the glass can switch between two material phases: high-resistance amorphous (RESET) and low-resistance crystalline (SET) [3], [29]. Figure 1(b) illustrates how the state of the phase change material can be switched reliably by switching temperature levels (t_x ... t_m). The crystalline (logical 1) and amorphous (logical 0) states thus provide the ability to store data bits in PCM devices. Consequently, as shown in Figure 1(b), data can be made persistent [29].
PCM Operations: A read operation is performed by applying a small current to the phase change material and measuring its resistance [11]. A write operation alters the phase and requires the controller to instruct the target chip to commit data to the non-volatile array and then poll the status register to detect the successful completion of the write [10], [25], [30]. Wear-out is still a concern for PCM, because its lifetime estimate (10^8) only indicates the number of programs per cell before the first bit error appears without the use of Error Correction Code (ECC) [22]. In contrast, the endurance rating of flash memory is measured by the number of program/erase cycles until the ECC scheme can no longer correct the errors. More work on PCM lifespan is therefore important, and better wear-leveling and cell organization are crucial [31], [32].
We propose replacing the widely adopted DRAM+Flash (FSSD) architecture with a DRAM+PCM (PSSD) architecture to alleviate unnecessary write amplification. Moreover, apart from the above-mentioned weakness, the traditional DRAM+Flash approach suffers from long garbage collection/merging processes, which induce further request delays and conflicts.
B. MOTIVATION
As previously mentioned, FSSDs suffer from performance drawbacks mainly because of (i) their inability to update requests in place, (ii) their inability to bit-address requests, and (iii) their slow read and write speeds. Because empty flash pages must constantly be available to accommodate writes, long and frequent erase and migration operations, known as garbage collection (GC), are customary in flash [33] to maintain a supply of fresh blocks. When fresh pages run out, read/write operations are delayed while the system reclaims victim blocks containing invalidated pages; the resulting long GC processes also pose a further threat to the limited lifespan (erase cycles) of flash memory. Thus, block erase and page write operations are major factors affecting the low performance of flash memory [16], [34]. Table 1 shows that the latency of flash memory writes is nearly 2000µs compared to the 1µs latency of PCM. To address this issue, most FSSDs use DRAM as temporary storage (DRAM+Flash) for loading and offloading a small portion of the logical-to-physical address mappings.
Given that DRAM is volatile (Table 1), expensive, and too limited in scalability to store the entire page-mapping image of large-scale SSDs, FSSDs (DRAM+Flash) employing selective-page-level mapping FTLs [16] always suffer from a cache miss penalty. The penalty arises because fetching the request location and its corresponding data from flash may take longer than expected for several reasons, including prolonged delays caused by ongoing processes such as GC or write operations (access conflicts). The request location must first be established via the translation pages and the data then fetched from the data pages (i.e., 2 reads), which may take up to 200µs without interference, compared to 3-4µs when fetched from PCM. As seen in Figure 2(a), FSSD performance is affected by slow read operations and by the inability to update in place; write speed suffers because free pages must always be available. This means that a regular cleaning procedure (GC) must constantly reclaim dirty pages and provide clean ones, which causes access conflicts with incoming read/write operations that have to wait until GC completes. The FSSD has another drawback in how it processes I/O requests: it serves a single request (4KB I/O) by sending it to a single die rather than spreading it across several dies, which limits internal parallelism.
On the other hand, the PSSD in Figure 2(b), even though it experiences cache misses because of limited DRAM space, has the following four advantages: (i) Faster reads, close to those of DRAM [15], [22], [35]. (ii) The ability to update in place [36], [37], resulting in faster writes. (iii) The absence of dirty blocks and hence of GC operations [10], which reduces the access conflicts for read/write operations that flash memory suffers from. (iv) In processing I/O requests, the PSSD splits a single request (4KB I/O) into smaller requests and sends them to multiple dies via multiple channels, extending parallelism and thereby improving overall request throughput. This can be achieved through DMA from the host straight to the data on PCM, without passing through the additional firmware that is the norm in FSSDs.
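As a rough illustration of advantage (iv), the sketch below stripes one 4KB request across channels and dies in round-robin order. The 512-byte sub-request size, the round-robin policy, and the names `SubRequest` and `split_request` are assumptions made purely for illustration; the 6 channels and 8 dies per channel (4 packages of 2 dies, assuming 4 packages on each channel) mirror the configuration used in our simulations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubRequest:
    channel: int
    die: int
    offset: int   # byte offset inside the original request
    length: int   # bytes carried by this sub-request


def split_request(size_bytes, chunk_bytes, n_channels, dies_per_channel):
    """Stripe one host request over dies, channel-first, in round-robin order."""
    subs = []
    for i, offset in enumerate(range(0, size_bytes, chunk_bytes)):
        channel = i % n_channels                    # fill every channel first
        die = (i // n_channels) % dies_per_channel  # then advance to the next die
        length = min(chunk_bytes, size_bytes - offset)
        subs.append(SubRequest(channel, die, offset, length))
    return subs


# A 4KB request cut into hypothetical 512-byte pieces.
subs = split_request(4096, 512, n_channels=6, dies_per_channel=8)
```

With these assumed values the request becomes eight sub-requests, the first six of which land on six different channels and can therefore be serviced concurrently.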
As previously discussed, PCM has shorter read/write latencies and can be overwritten at bit/unit granularity, whereas a flash memory write is performed in units of a page and the page cannot be overwritten without erasing the corresponding flash memory block. These properties explain why PCM outperforms flash memory, as illustrated in Figure 3. Systems that cache part of the mapping table in DRAM (DRAM+Flash or DRAM+PCM) inevitably experience some cache misses. When the missing entry is fetched from secondary memory, mapping and data reads from PCM are faster than those from flash memory. Apart from this, flash memory reads/writes are more likely than those of PCM to be affected by ongoing GC/write operations (channel conflicts), because PCM has no erase/GC operations.
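The effect of the miss penalty on average latency can be estimated with a simple weighted sum. The sketch below is only a back-of-the-envelope model: the 200µs and 4µs miss penalties are the figures quoted above for flash and PCM, whereas the 90% hit ratio and the 0.1µs DRAM lookup time are hypothetical values chosen for illustration, and the function name is our own.

```python
def mean_access_latency_us(hit_ratio, dram_us, miss_penalty_us):
    """Expected latency when a fraction hit_ratio of lookups hit the DRAM cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie in [0, 1]")
    return hit_ratio * dram_us + (1.0 - hit_ratio) * miss_penalty_us


# Hypothetical 90% hit ratio and 0.1 us DRAM lookup; miss penalties from the text.
flash_us = mean_access_latency_us(0.9, 0.1, 200.0)  # DRAM+Flash
pcm_us = mean_access_latency_us(0.9, 0.1, 4.0)      # DRAM+PCM
```

Under these assumptions the DRAM+Flash average is dominated by the rare misses (about 20µs), while the DRAM+PCM average stays below 1µs, which is why the miss penalty matters more than the hit ratio itself.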
Furthermore, previous FTLs (PCM-FTL and PTL) cannot fully serve SSDs that use PCM as the user data store (PSSDs), as they are based on the underlying flash pages. Our proposed PhaseFTL considers the fundamental changes in data management required for an efficient, reliable PSSD. Moreover, PhaseFTL efficiently migrates frequently accessed hot data to colder regions of PCM by keeping track of I/O write counts. When a predefined threshold is reached, the writes are sequentially allocated to PCM (hot and cold regions) and the mapping table is updated accordingly.
IV. DESIGN AND IMPLEMENTATION OF OUR PSSD
In this section, we present the design overview and functional goals of our proposed scheme and discuss how the drawbacks of PCM are concealed using PhaseFTL. The goal of our efficient data distribution and management approach for PCM-based storage devices (ExTENDS) is to improve the read/write performance of storage media (e.g., SSDs) using fast PCM cells as secondary memory. We achieve this by eliminating out-of-place updates, which are a flash memory feature. In addition to the absence of GC, we further leverage the read/write speed of PCM. As a result, the proposed PSSD experiences fewer access conflicts and shorter request wait times than FSSDs. PhaseFTL provides the interface with the host system for logical-to-physical address translation. Furthermore, it implements an efficient wear-leveling scheme and hides the energy-consuming writes and the endurance weaknesses of our proposed PSSD. Existing FTLs cannot efficiently manage an all-PCM SSD (PSSD) employing a DRAM+PCM or PCM+PCM hierarchy, which is why we propose PhaseFTL.
A. PROPOSED FUTURE PSSD ARCHITECTURE
Figure 4 shows the storage array architecture of our PSSD (DRAM+PCM), whose controller manages the entire multi-channel PCM array. It coordinates I/O data transfers between the host and the data blocks via a portion of the mappings cached in DRAM. Data and mapping tables are stored separately in the PCM packages. The controller comprises an I/O scheduler that manages requests in FIFO order in our proposed translation layer (PhaseFTL), and a wear-leveling manager that controls physical data block placement, allocation, and updates. Multiple PCM channels lead from the controller, and each channel branches to serve several packages. Each package comprises multiple dies. In Figure 5(b), each die has a register that keeps track of all the operations taking place within its individual planes. Every plane contains several blocks, which in turn consist of multiple pages, as presented in Figure 5(a).
B. PhaseFTL
PhaseFTL is a software module that we specifically designed to manage full PCM storage devices and to conceal the constraints of PCM, including slower writes, endurance, and power consumption, as shown in Figure 4. We designed PhaseFTL because previous and existing FTLs could not provide all the functions required by the well-managed all-PCM device we propose. Moreover, an FTL is needed that can manage the DRAM+PCM or PCM+PCM hierarchy while promoting both internal and external I/O parallelism and managing PCM wear efficiently and evenly.
PhaseFTL can utilize the high bandwidth and support the high-speed I/O response offered by PCM chips, which cannot be fully leveraged when the translation is handled at the application level as in [15]. Unlike flash memory FTLs, PhaseFTL does not employ GC, because PCM cells can be updated in place. Consequently, blocks do not need to be erased to free space for new requests; instead, data in hot regions is switched to cold regions. This feature is important for sustaining the lifespan of our PSSD. To hide the change in the physical location of pages after wear-leveling (reliability), a mapping table is kept in an allocated space on PCM.
1) PhaseFTL MAPPING TABLE
PhaseFTL manages read/write requests via a selective-page-level mapping scheme (PMS) [16], [34]. The entire mapping image is located in PCM, and frequently accessed translation entries are temporarily loaded into and offloaded from the cached mapping table located in DRAM for faster address translation. Each logical page can be mapped to any physical page in PCM, meaning that the mapping is fully associative. Figure 6 shows that the mapping table relates the logical page number (LPN) to the physical page number (PPN). When a read/write request is issued from the host, it arrives at the SSD with a logical sector number (LSN) that enables it to read from or write to a specific address on the PCM. Thus, an address translation is triggered (Algorithm 1 and Figure 6 demonstrate this). The logical sector size is 512 bytes, which is the typical size of a physical sector in flash memory. Assuming that a PCM page can hold y entries, the address mapping between LSN and LPN can be expressed as follows:
Algorithm 1 explains the mapping process flow when a cache miss/hit occurs in the cached mapping table in DRAM. On a miss, the mapping entry is fetched from the entire mapping table image on PCM to establish the actual data location. Furthermore, our proposed scheme allocates an Access/Write Counter (WC) for each page in the out-of-band (OOB) area. This allows PhaseFTL to separate hot pages from cold pages for an efficient wear-leveling process. To maintain consistency between the PPN and its corresponding LPN after wear-leveling, PhaseFTL also allocates an update bit in the OOB area of PCM pages, as shown in Algorithm 2 (lines 7-15).
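The hit/miss flow can be sketched as follows. This is a minimal illustration and not a transcription of Algorithm 1: the LRU eviction policy, the class name `CachedMappingTable`, the dictionary standing in for the translation pages on PCM, and the integer-division form of `lsn_to_lpn` (with `y` taken as the number of sectors grouped into one page) are all assumptions.

```python
from collections import OrderedDict


def lsn_to_lpn(lsn, y):
    """Assumed form of the LSN-to-LPN relation: y consecutive sectors share a page."""
    return lsn // y


class CachedMappingTable:
    """DRAM-resident LRU cache in front of a full LPN->PPN table kept on PCM."""

    def __init__(self, pcm_table, capacity):
        self.pcm_table = pcm_table   # stands in for the translation pages on PCM
        self.capacity = capacity     # number of entries that fit in DRAM
        self.cache = OrderedDict()   # lpn -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, lpn):
        if lpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(lpn)      # mark as most recently used
            return self.cache[lpn]
        # Cache miss: fetch the entry from the mapping image on PCM.
        self.misses += 1
        ppn = self.pcm_table[lpn]
        self.cache[lpn] = ppn
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the coldest entry
        return ppn

    def update(self, lpn, new_ppn):
        """Rewrite an entry in place, e.g. after wear-leveling moved the page."""
        self.pcm_table[lpn] = new_ppn
        if lpn in self.cache:
            self.cache[lpn] = new_ppn
```

Because PCM is updated in place, `update` simply overwrites the stored entry; a flash-based FTL would instead have to write a new translation page and invalidate the old one.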
Placing mapping entries in the data area of PCM pages rather than in the OOB area allows us to group a larger number of mappings into a single page. Given that only 4 bytes are required to represent a physical address [16], [34], grouping 512 mappings in the data region of a single PCM page takes up only 2 MB of a 1 GB PCM device. This is an insignificant space overhead for our PCM device. We call such pages translation pages, and the blocks in which they are stored translation blocks. The larger portion of the PCM holds actual user data on data pages that constitute data blocks, as seen in Figure 4 and Figure 6. PhaseFTL employs a selective cache-mapping scheme in DRAM whose entries are loaded from and offloaded to the entire mapping table image that persists on PCM. Figure 6 demonstrates how the LPN of a request is translated to its corresponding PPN using the mapping table stored on PCM.
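The 2 MB figure can be checked with the short calculation below. It assumes a 2 KB data area per page, which is what 512 entries of 4 bytes imply; the page size itself is an assumption of this worked example rather than a value stated in this section.

```latex
\begin{align*}
\text{translation page payload} &= 512 \times 4\,\text{B} = 2\,\text{KB} \\
\text{data pages in a 1 GB device} &= \frac{2^{30}\,\text{B}}{2^{11}\,\text{B}} = 2^{19} \\
\text{mapping table size} &= 2^{19} \times 4\,\text{B} = 2^{21}\,\text{B} = 2\,\text{MB} \\
\text{translation pages needed} &= \frac{2^{19}}{512} = 1024 \\
\text{space overhead} &= \frac{2\,\text{MB}}{1\,\text{GB}} = \frac{1}{512} \approx 0.2\%
\end{align*}
```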
2) READ/WRITE OPERATIONS
An incoming read request goes through address translation, which is performed by the cached mapping table to retrieve the corresponding PPN (line 4 of Algorithm 2) that points to the physical PCM location containing the requested data. The data is quickly retrieved and sent back to the controller, as seen in Figure 6. This process is described in detail by Algorithm 1 and Algorithm 2. A write request undergoes the same address translation process as a read; however, once the physical location has been found, the old entry is replaced with the new entry in place (lines 13-15 of Algorithm 2). If the write count (WC) of the targeted page has reached a predefined threshold, the contents of this page are moved to a colder location and the mapping table is updated in place with the new PPN, as described by lines 8 to 11 of Algorithm 2.
[Algorithm 2, closing lines 19-24: if the request is a read, the contents are sent back to the controller. Output: requested data.]
3) LIFESPAN MANAGEMENT
To prolong PCM lifetime, writes should be balanced across all PCM cells, even though PCM endurance is several orders of magnitude better than that of flash memory. PhaseFTL caches a small portion of its selective page mappings in DRAM while the entire image resides in PCM, and this has only a small effect on PCM endurance. The reason is that PCM writes in place and therefore propagates fewer updates to translation pages than flash memory does. Moreover, unlike with flash memory, the physical address of the data does not change after every write operation. Nevertheless, frequent writes to the same page/block pose a significant threat to the PCM and require careful consideration.
Considering the above, we employ a simple wear-leveling technique that compares the WC located in the OOB area of every PCM page against a predefined write threshold (TH). The WC also helps us separate hot and cold data for efficient wear-leveling. In Algorithm 2, whenever WC exceeds TH, wear-leveling is triggered for that particular hot page and it is swapped with the next one. That is, the target hot page is moved to the next cold region and the victim in that region is migrated to the former hot page location (lines 8-15 of Algorithm 2). Our algorithm first migrates pages to the next physical address within the same block, as seen in lines 8-11 of Algorithm 1. According to Figure 7, the PPN of a candidate page changes whenever it migrates to a cold region, while its LPN remains static. When the page reaches the bottom (N) at stage (d), it starts moving up the block again (c-a). This method not only keeps the same pages within a block but also facilitates a simple addressing scheme and keeps the mapping table in sync. Moreover, it reduces the latency spent on wear-leveling.
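The swap-with-neighbor behavior can be sketched for a single block as follows. This is an illustrative model rather than a transcription of Algorithm 2: the class name `Block`, the per-LPN travel direction, and the choice to restart the counter of a vacated page after leveling are assumptions.

```python
class Block:
    """One PCM block: page payloads plus per-page OOB metadata."""

    def __init__(self, n_pages, threshold):
        if n_pages < 2:
            raise ValueError("a block needs at least two pages to swap")
        self.data = [None] * n_pages
        self.lpn_of = list(range(n_pages))   # OOB: LPN stored in each physical page
        self.wc = [0] * n_pages              # OOB: write counter per physical page
        self.ppn_of = {lpn: lpn for lpn in range(n_pages)}   # mapping table
        self.direction = {lpn: 1 for lpn in range(n_pages)}  # +1 down, -1 up
        self.threshold = threshold           # TH

    def write(self, lpn, value):
        ppn = self.ppn_of[lpn]
        if self.wc[ppn] >= self.threshold:   # hot page: level before writing
            ppn = self._swap_with_neighbour(lpn, ppn)
        self.data[ppn] = value               # in-place update, no erase needed
        self.wc[ppn] += 1

    def _swap_with_neighbour(self, lpn, ppn):
        last = len(self.data) - 1
        step = self.direction[lpn]
        if not 0 <= ppn + step <= last:      # reached top or bottom: turn around
            step = -step
            self.direction[lpn] = step
        target = ppn + step
        victim = self.lpn_of[target]
        # Exchange payloads and owners; LPNs stay fixed, only PPNs change.
        self.data[ppn], self.data[target] = self.data[target], self.data[ppn]
        self.lpn_of[ppn], self.lpn_of[target] = victim, lpn
        self.ppn_of[lpn], self.ppn_of[victim] = target, ppn
        self.wc[ppn] = 0                     # assumed: counter restarts after leveling
        return target
```

Repeatedly writing one LPN makes its page walk down the block and back up again, so the writes of a single hot logical page are spread over every physical page of the block.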
V. EVALUATION
We carried out extensive experimental assessments and performance evaluations comparing DRAM+Flash (FSSD) [34], the Moneta model [1], and our proposed DRAM+PCM (PSSD1) and PCM+PCM (PSSD2) designs. Our aim is to investigate the best way to adopt PCM, either as secondary memory (PSSD1) or as both secondary and main storage memory (PSSD2), by carefully considering both options for future storage systems. We also compared two address mapping algorithms, a block mapping scheme (BMT) and our proposed page mapping scheme (PMT), both implemented on PSSD1 with the same system configuration. PSSD1 employs a selective-page-level mapping algorithm that allows it to adapt to the limited DRAM space while taking advantage of workload locality by caching only the frequently accessed entries in DRAM. As shown in Figure 8(a), the entire image of translation pages persists in the PCM secondary memory. In contrast, PSSD2 is an all-PCM SSD that keeps a full page mapping table on PCM and does not suffer from cache miss penalties, because all its page-mapping entries persist on PCM, as demonstrated in Figure 8(b). To evaluate the performance of our proposed PhaseFTL, we also ran experimental comparisons with existing FTL algorithms, i.e., Start-Gap [20] and PTL [12]. These two manage PCM cells on hybrid SSDs (PCM+Flash); we implemented them together with our proposed PhaseFTL on PSSD1 and ran simulations.
A. MAPPING ALGORITHMS
As discussed in Section IV, our PSSD stores the entire page-level mapping table image on PCM, separately from the data blocks. Frequently accessed mappings are temporarily loaded from PCM into a DRAM-cached mapping table and unloaded again, as illustrated in Figure 4. This approach is a selective-page-level mapping scheme, which we refer to as PMT; it includes the schemes in [34] and RFTL [16] and is affected by the cache miss penalty. Alternatively, the mapping table of our proposed PSSD can be cached and stored in block-level form, and we refer to this approach as BMT (block mapping scheme). Considering that in BMT we can store N pages in each PCM block, as depicted in Figure 7, the address mapping between Logical Page Number (LPN) and Logical Block Number (LBN) for this mapping table becomes:
Therefore, as the BMT is N times smaller than our PMT (Equation 2), the entire mapping table can be cached in DRAM, resulting in a system without cache misses. Because a single LPN must be mapped to a fixed page offset in any physical block, this can be regarded as a direct mapping. Unlike page mapping, block mapping requires extra operations to serve a request, which affects its performance. Consequently, one can choose whether PMT or BMT should be adopted; for our proposal, we opted for the former because of its better performance.
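A minimal sketch of block-level translation is given below, assuming the conventional integer-division form in which the LBN is the quotient of LPN by N and the fixed page offset is the remainder. The function names and the dictionary used as the block table are illustrative assumptions.

```python
def bmt_translate(lpn, pages_per_block, block_table):
    """Block-level translation: only the block is remapped, the offset is fixed."""
    lbn, offset = divmod(lpn, pages_per_block)
    pbn = block_table[lbn]                   # logical block -> physical block
    return pbn * pages_per_block + offset    # resulting physical page number


def table_entries(total_pages, pages_per_block):
    """Entry counts of a page-level and a block-level table for the same device."""
    pmt = total_pages
    bmt = -(-total_pages // pages_per_block)  # ceiling division
    return pmt, bmt
```

For instance, with a hypothetical N of 64 pages per block, a device with 1024 pages needs 1024 page-level entries but only 16 block-level entries, which illustrates the N-fold reduction described above.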
B. EXPERIMENTAL SETUP
1) WORKLOADS
In our experiments, we ran simulations with a broad range of carefully selected realistic workloads that represent large-scale, real-life HPC and big data applications, such as the Financial1 benchmark, as presented in Table 3. Our aim was to evaluate the I/O patterns of workload traces that include MSNFS [38], MSR [38], RADIUS [38], and Financial1 [39]. Some of these real-world traces were collected over a week of block I/O on enterprise servers, and each trace contains multiple read/write requests of various sizes and intensity patterns. MSNFS comprises sequential writes, while Financial1 has small random writes. In contrast, RADIUS and MSR are composed of a mixture of read and write patterns. Overall, the traces cover write-intensive patterns (MSR-ts 0, MSR-wdev 0, RADIUS, MSR-stg 0, MSR-src 2), read-intensive patterns (MSNFS, MSR-hm 1), large requests (MSR-stg 0, MSNFS, MSR-usr 0), and small requests (Financial1, RADIUS).
During simulation, PSSD operations start with large sequential writes to the entire logical address space in order to bring the PSSD to a well-defined state. Once the sequential writes are complete, two threads start. One performs random writes across the logical address space. The other performs random reads for synchronous workloads; for asynchronous workloads, reads and writes can be performed simultaneously.
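The warm-up and random phases described above can be sketched as a simple request generator. This is only an illustration under stated assumptions: the function names are hypothetical, the two concurrent threads are approximated by strictly alternating writes and reads, and the address space is counted in pages.

```python
import random


def precondition(num_pages):
    """Sequentially write every logical page once to reach a known state."""
    return [("W", lpn) for lpn in range(num_pages)]


def random_phase(num_pages, num_requests, seed=0):
    """Alternate random writes and reads (a stand-in for the two threads)."""
    rng = random.Random(seed)
    requests = []
    for i in range(num_requests):
        op = "W" if i % 2 == 0 else "R"
        requests.append((op, rng.randrange(num_pages)))
    return requests


trace = precondition(16) + random_phase(16, 8)
print(trace[:3], len(trace))
```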
2) CONFIGURATION SETUP
We ran our simulations on the EagleTree simulator [40] because it is open source and can simulate multi-channel SSDs. We used this simulator to mimic 3D XPoint [6]; because little internal information about this technology has been made public, we added some hardware assumptions to our proposed PSSD. Furthermore, we modified the simulator to allow for various configurations of our specific PCM chips and to support our proposed PhaseFTL and the two mapping schemes (PMT and BMT). We also changed the location of the mapping tables depending on the mapping scheme and added support for advanced commands such as wear-leveling and pipelining. The PSSD configurations were modeled on a 100GB multichannel PCM device that employs die-level parallelism, also called internal parallelism, wherein multiple dies can be accessed simultaneously for effective data allocation. This is facilitated by a PCIe link to 6 channels, each with 4 packages. Each package contains 2 dies, each of which holds several blocks and pages, as shown in Table 2. We also used standard read/write access latencies for our trace-driven simulation, according to Table 1. Our I/O scheduler uses FIFO, and the wear-leveling threshold ratio is set to 0.1% of the PCM cell lifespan (10^8 writes); it can be varied depending on the designer's needs.
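As a quick check of the arithmetic implied by this configuration, the sketch below derives the number of independently accessible dies and the absolute wear-leveling threshold. The class and field names are hypothetical and not taken from EagleTree. It assumes that the 4 packages are per channel and that the threshold is applied as a plain fraction of the cell lifespan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PssdConfig:
    channels: int = 6
    packages_per_channel: int = 4      # assumed to be per channel
    dies_per_package: int = 2
    cell_lifespan_writes: int = 10**8  # lifespan quoted for the threshold
    threshold_per_mille: int = 1       # 0.1% written as 1 per 1000

    @property
    def total_dies(self):
        """Dies that can be accessed in parallel."""
        return self.channels * self.packages_per_channel * self.dies_per_package

    @property
    def wear_level_threshold(self):
        """Write count at which wear-leveling would trigger."""
        return self.cell_lifespan_writes * self.threshold_per_mille // 1000


cfg = PssdConfig()
print(cfg.total_dies, cfg.wear_level_threshold)
```

Under these assumptions, the device exposes 48 dies, and wear-leveling triggers after 100000 writes.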
C. EXPERIMENTAL RESULTS
Our experiments focused on comparing the performance and lifespan of our proposed PSSD 1 (DRAM+PCM) and PSSD 2 (PCM+PCM). Both schemes employ PhaseFTL, which can adopt either a block-mapping scheme (BMT) or a page-mapping scheme (PMT); we investigated which of the two is the better approach to adopt. Furthermore, we evaluated the performance of these two PSSDs against MONETA and a legacy FSSD (DRAM+Flash) employing a selective-page-mapping FTL. To assess the efficiency of PhaseFTL, we also compared it with the PTL and Start-Gap wear-leveling techniques. The performance and lifespan observations are discussed below.
1) PERFORMANCE
We used the same test environment for the PSSD 1, PSSD 2, MONETA, and FSSD approaches, with the various realistic workloads presented in Table 3. Figure 9 shows the overall system write throughput: our proposed PSSDs (PSSD 1 and PSSD 2) show a performance improvement of 7% compared to MONETA and around 69% compared to the FSSD (DRAM+Flash) across the various workloads, as Table 4 also shows. FSSD suffers from slow writes (Table 1) caused by its out-of-place-update behavior and from a high miss penalty when exposed to workloads dominated by small random writes. Moreover, the efficient management of writes on PCM chips by PhaseFTL reduces the maximum number of writes to PCM cells and considerably speeds up I/O request processing.
We further observed that, of the two PSSDs, PSSD 1 (DRAM+PCM) shows better overall performance than PSSD 2 (PCM+PCM), even though PSSD 2 performed 4.3% better when exposed to the RADIUS workload (Table 4). This is because PSSD 2 keeps the entire mapping table on PCM and thus does not rely on workload locality. Write-dominant workloads with small random requests (e.g., RADIUS and Financial2) cause many cache misses on PSSD 1, because it only temporarily caches a small portion of the mappings in the restricted DRAM space, whereas the entire image of mappings stays on PCM. Otherwise, unlike PSSD 2, PSSD 1 can fetch the physical location of write requests faster via its DRAM cache. Figure 10 depicts the overall read throughput in IOPS of the same SSDs, where PSSD 1 outperforms the rest by an average of around 8.2% against MONETA and around 63% against FSSD, as Table 5 also shows. This is due to the mapping table cached in fast DRAM that PSSD 1 uses to process requests from the host. A portion of the mappings is loaded from and unloaded to PCM according to the workload pattern (to take advantage of locality), while PSSD 2 and MONETA process their mappings via slower PCM. Therefore, PSSD 1 can process write requests faster than PSSD 2, MONETA, and the legacy FSSD, which comes last in both read and write performance. Furthermore, we observed that read-dominant workloads with small random requests, such as Financial1, increase the number of cache misses in PSSD 1, so the request location must be fetched from the entire mapping image stored on the slower PCM secondary memory.
Observation 1:
ExTENDS breaks large requests into smaller chunks of 8KB page size and spreads them across cells for simultaneous, parallel execution, thereby speeding up overall request processing and fully exploiting the abundant system parallelism.
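The splitting step in Observation 1 can be illustrated with a short sketch. The 8KB page size comes from the text; the round-robin die assignment and the names `PAGE_SIZE` and `split_request` are assumptions made for illustration, since the exact allocation policy is not specified here.

```python
PAGE_SIZE = 8 * 1024  # 8KB page size stated in the text


def split_request(offset, length, num_dies):
    """Split a byte range into page-sized chunks; assign dies round-robin."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return [(page, page % num_dies) for page in range(first, last + 1)]


# A 64KB request becomes 8 page-sized chunks on 8 different dies.
chunks = split_request(offset=0, length=64 * 1024, num_dies=48)
print(len(chunks), chunks[:3])
```

Because consecutive pages land on different dies, the chunks of one large request can be serviced concurrently instead of queuing behind each other.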
As the read speeds of PCM and DRAM are similar relative to FSSD, the caching hierarchy that uses DRAM with PCM (PSSD 1) might yield performance significantly different from that of PCM-as-main-memory with PCM (PSSD 2). To evaluate this, we ran experiments on these two PSSDs (Figure 8) under various realistic workloads to expose them to varied miss ratios. As shown in Figure 11, PSSD 1 achieved an average write-throughput improvement of 18% up to a miss ratio of 0.4; its throughput then gradually dropped toward the PSSD 2 level, meeting it at a miss ratio of 0.85. Figure 12 (read throughput) depicts the same scenario: PSSD 1 remains around 20% better than PSSD 2 on average, but its advantage shrinks as the miss ratio increases. At a 0% miss rate, PSSD 1 processed close to 200000 I/Os compared with 1600000 I/Os from PSSD 2, but beyond an 85% miss rate, PSSD 1 is outperformed by PSSD 2. This shows that, in scenarios with miss rates above 85%, the PCM-as-main-memory approach (PSSD 2) can outperform the DRAM+PCM hierarchy (PSSD 1). Such a scenario is unlikely in practice, however, because it requires a very high miss rate.
Observation 2:
On average, the DRAM+PCM hierarchy approach (PSSD 1) outperforms PSSD 2, except in rare situations where the PSSD is exposed to workloads with an extremely high miss ratio.
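The dependence of PSSD 1 on workload locality can be made concrete with a small model of a DRAM-cached mapping table in front of the full mapping image. This is a sketch under assumptions, not the selective-page-level algorithm itself: it uses plain LRU replacement, and the class name `CachedMappingTable` is hypothetical.

```python
from collections import OrderedDict


class CachedMappingTable:
    """LRU cache of LPN -> PPN entries in front of a full table (on PCM)."""

    def __init__(self, full_table, capacity):
        self.full_table = full_table
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, lpn):
        if lpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(lpn)         # mark as most recently used
            return self.cache[lpn]
        self.misses += 1
        ppn = self.full_table[lpn]              # slow path: fetch from PCM image
        self.cache[lpn] = ppn
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return ppn

    def miss_ratio(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


full = {lpn: lpn + 100 for lpn in range(8)}

local = CachedMappingTable(full, capacity=2)
for lpn in [0, 0, 0, 1]:        # high locality
    local.lookup(lpn)

scattered = CachedMappingTable(full, capacity=2)
for lpn in [0, 1, 2, 3]:        # small random requests, no reuse
    scattered.lookup(lpn)

print(local.miss_ratio(), scattered.miss_ratio())
```

With reuse, half of the lookups are served from the cache; with no reuse, every lookup falls through to the full image, which corresponds to the high-miss-ratio regime in which the DRAM cache stops paying off.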
The mapping table is one of the most frequently updated structures in SSDs and therefore requires an efficient FTL that reduces such updates in order to prolong the life of the storage device and speed it up. Our proposed PhaseFTL manages these updates while also improving overall PSSD performance. To evaluate our FTL, we ran comparison experiments with the PTL and Start-Gap schemes and recorded the maximum number of writes propagated to the PCM cells of our PSSD. Our goal was to identify the FTL that most reduces writes to PCM cells, which in turn improves performance. As shown in Figure 13, PhaseFTL outperforms its counterparts by reducing the maximum number of writes by an average of 32% compared with PTL and by over 70% compared with the Start-Gap scheme. One reason is that Start-Gap, for example, cannot evenly and effectively cool down hot regions because it only moves empty lines. Moreover, the PTL and Start-Gap schemes were designed to manage PCM as main memory for FSSDs and not as secondary storage for PSSDs. While the BMT approach only maps a logical page to a fixed block offset, the PMT approach maps a logical page to any physical page on PCM. In Figure 14, BMT is outperformed by its counterpart PMT by an average of 21%. The reasons are as follows: (i) BMT needs two steps to obtain the requested page's PPN, i.e., first an LBN is mapped to its corresponding PBN, and a block offset is then retrieved to get the PPN pointing to the requested page's location. For page mapping, in contrast, a page's physical location is mapped directly from the request's LPN to its PPN. For example, in Figure 14, BMT processed around 1309889 IOPS while PMT had 1139876 IOPS for the MSR-hm 1 workload. (ii) Every logical page in BMT has a fixed block offset, so a hot page cannot be moved freely to a cold block. It has to be accompanied by all other pages of the same block, which adds an unnecessary write delay for the targeted pages during wear-leveling. Block mapping suffers from this delay, which in turn reduces write throughput.
2) LIFESPAN
Given the limited endurance of PCM cells, the number of writes to each cell is restricted to 10 million (Table 1). We therefore carefully examined existing FTLs and found that, under this constraint, they cannot be implemented on an all-PCM storage device. It was also imperative to consider the extent to which our proposed PhaseFTL can balance and reduce PCM cell wear. We therefore compared PhaseFTL with Start-Gap and PTL under write-dominant realistic workloads to assess the effects of writes on our PSSD cells. Figure 15 shows the number of writes propagated by each scheme to PCM cells; PhaseFTL causes more writes than the rest, with an average increase of 6.7% compared to PTL. This is because of the periodic movement of hot data to colder regions to improve PCM endurance. Consequently, PhaseFTL enables the PSSD to wear slowly and evenly. We also calculated the number of writes propagated by each workload for both the page-mapping and block-mapping approaches. In Figure 16, we observed that BMT incurs more extra writes, especially when wear-leveling is triggered. BMT reached up to 54389552 writes, compared with the 11389552 writes of PMT, for the MSR-src 2 workload. This is because the fixed-block-offset rule of BMT prevents hot data pages from migrating around PCM cells independently of their blocks. Thus, if certain pages in Block A, for example, become hot, all the pages in Block A are migrated to Block B, instead of only those particular pages being migrated to a cold Block B (as in PMT). This increases the overall number of writes and therefore shortens PCM lifetime more than the PMT approach does. From these observations, one can trade off between a smaller mapping table that frees more PCM space for data storage but suffers from lower performance and extra writes (BMT scheme), and a relatively bigger and faster page-mapping table that incurs no extra writes (PMT scheme).
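The Block A to Block B example can be reduced to a simple write count. The sketch below is illustrative only: the function name `migration_writes` and the block size of 64 pages are assumptions, and it counts data-page writes while ignoring mapping-table updates.

```python
def migration_writes(hot_pages, pages_per_block, scheme):
    """Data-page writes needed to move hot pages out of one block."""
    if scheme == "PMT":
        return hot_pages  # only the hot pages move
    if scheme == "BMT":
        # fixed block offsets force the whole block to move together
        return pages_per_block if hot_pages > 0 else 0
    raise ValueError(f"unknown scheme: {scheme}")


# Two hot pages in a block of 64 pages (assumed block size).
print(migration_writes(2, 64, "PMT"))
print(migration_writes(2, 64, "BMT"))
```

The gap between the two counts grows with the block size and shrinks as more pages of the block become hot, which matches the observation that BMT's extra writes appear mainly when wear-leveling is triggered.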
Observation 3:
PhaseFTL achieves better endurance by introducing extra wear-leveling writes, whose cost to the PCM lifespan is negligible compared with the endurance gained from wearing the PCM evenly.
VI. CONCLUDING REMARKS
In this paper, we studied the decade-long discrepancy between storage performance and microprocessor speed in computer systems, which has persisted while CPU technology keeps improving. There is a need to revisit and improve NVM technology in order to increase storage bandwidth, particularly that of the storage media. We investigated the performance issues of current FSSDs caused by the out-of-place-update feature of flash memory. Moreover, the limited access speed of flash memory rules out direct access to the mapping table on flash, so a small portion of the mapping table must be cached temporarily in DRAM for faster address translation.
To alleviate this, we presented the idea of completely replacing flash memory (FSSD) with PCM (PSSD) for faster in-place updates and designed an efficient PCM translation layer (PhaseFTL) that conceals PCM constraints and efficiently manages PCM wear-leveling and I/O scheduling. Our experimental results show a performance improvement of 69% for our proposed PSSDs compared to traditional FSSD approaches. We also found that the DRAM+PCM hierarchy (PSSD 1) performs better than the PCM+PCM hierarchy (PSSD 2); the latter outperforms the former only at a very high miss rate of above 85%, which is an unlikely scenario. PhaseFTL also proved effective at improving the lifetime of PCM cells compared to previous studies. We further evaluated the PMT and BMT mapping schemes for PSSDs, and our results show that PMT outperforms BMT by 21% on average and is less harmful to the PCM lifespan.
