Phase-change memory (PCM) devices have multiple banks to serve memory requests in parallel. Unfortunately, if two requests go to the same bank, they have to be served one after another, leading to lower system performance. We observe that a modern PCM bank is implemented as a collection of partitions that operate mostly independently while sharing a few global peripheral structures, which include the sense amplifiers (to read) and the write drivers (to write). Based on this observation, we propose PALP, a new mechanism that enables partition-level parallelism within each PCM bank, and exploits such parallelism by using the memory controller's access scheduling decisions. PALP consists of three new contributions. First, we introduce new PCM commands to enable parallelism in a bank's partitions in order to resolve the read-write bank conflicts, with no changes needed to PCM logic or its interface. Second, we propose simple circuit modifications that introduce a new operating mode for the write drivers, in addition to their default mode of serving write requests. When configured in this new mode, the write drivers can resolve the read-read bank conflicts, working jointly with the sense amplifiers. Finally, we propose a new access scheduling mechanism in PCM that improves performance by prioritizing those requests that exploit partition-level parallelism over other requests, including the long outstanding ones. While doing so, the memory controller also guarantees starvation-freedom and the PCM's running-average-power-limit (RAPL).
[27]. Second, the memory controller ensures that the power consumption of the active partitions within the bank is not too high, forcefully serializing requests to the bank otherwise, to guarantee the running average power limit (RAPL) [13].
We implement PALP for DRAM-PCM hybrid main memory systems [7, 14, 21, 34-36, 38, 39, 50, 64, 66, 70], which use DRAM as a cache to PCM. Given the still speculative state of the PCM technology, we describe PALP based on the architecture and memory timings of IBM's 20nm PCM prototype [37]. We evaluate PALP with workloads from the MiBench [20] and SPEC CPU2017 [8] benchmark suites. Our results show that PALP reduces average PCM access latency by 23%, and improves average system performance by 28% compared to the state-of-the-art approaches. As PALP exploits the PCM bank's partition-level parallelism using the memory controller's access scheduling decisions, it can be easily combined with orthogonal mechanisms: (1) those aiming to reduce the number of write accesses to PCM [7, 14, 34], and (2) those aiming to improve PCM's cell endurance [1, 54].
Although our work is inspired by the notion of subarray-level parallelism (SALP) in DRAM [28], exploiting parallelism within PCM banks is unique in the following two aspects. First, while many subarrays can be active simultaneously in a DRAM, PCM's peripheral structures allow only two partitions to be active simultaneously in a PCM bank [5]. Second, while subarray-level parallelism in DRAM can resolve any bank conflict, partition-level parallelism in PCM can resolve only the read-read or read-write bank conflicts. These unique properties of PALP lead to different performance trade-offs, which we present in Section 6.
The closest state-of-the-art PCM mechanisms, such as [68, 71], have only addressed the read-write bank conflicts in PCM, assuming a simple first-come-first-serve (FCFS) scheduling policy [51]. We demonstrate in Section 4.3 that the FCFS policy cannot efficiently exploit each PCM bank's partition-level parallelism. We not only resolve both the read-read and read-write bank conflicts in PCM, but also develop a new performance-oriented scheduling policy to better exploit such partition-level parallelism. Our PALP design is based on IBM's PCM chip interfaced with an ARMv8-A (aarch64) processor using the DDR4 protocol. In Table 1, we summarize the state-of-the-art mechanisms and highlight the contributions of this paper.
BACKGROUND ON PCM
This section provides a brief background on PCM organization and operation required to understand PALP. PCM, like DRAM, is organized hierarchically [37]. An example PCM memory of 128GB capacity has 4 channels, with 4 ranks per channel, and 8 banks per rank. A PCM bank has 8 partitions, each of which is an array of 4096 wordlines and 256K bitlines. A PCM bank has 128 peripheral structures, which include the sense amplifiers (to read) and the write drivers (to write). The peripheral structures in a PCM bank allow 128 PCM cells to be read or programmed in parallel. Therefore, the read and write granularity is 128 bits (the size of a memory line). A PCM cell is built with a chalcogenide alloy (e.g., Ge2Sb2Te5 [43]), and is connected to a bitline and a wordline using an access device. The amorphous phase (RESET) of this alloy has higher resistance than the crystalline phase (SET). To RESET a PCM cell, a high current pulse of short duration is applied and quickly terminated. To SET a PCM cell, the chalcogenide alloy is heated above its crystallization temperature, but below its melting point, for a long duration. Finally, to read the content of a PCM cell, a small electrical pulse is applied without inducing any phase change in the material. To serve a memory request that accesses data at a particular row and column address within a partition, a memory controller issues three commands to a PCM bank.
• ACTIVATE (A): activate the wordline and enable the access device for the PCM cells to be accessed.
• READ (R) / WRITE (W): drive read or write current through the PCM cell. After this command executes, the data stored in the PCM cell is available at the output terminal of the sense amplifier, or the write data is programmed to the PCM cell.
• PRECHARGE (P): deactivate the wordline and bitline, and prepare the bank for the next access.
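The bank organization given at the start of this section can be sanity-checked with a little arithmetic. The sketch below is only illustrative: it assumes that "256K" means 2^18 bitlines and that each cell stores one bit, neither of which the text states explicitly, and all constant and function names are hypothetical.

```python
# Organization parameters quoted in the text for the example 128GB PCM.
CHANNELS = 4
RANKS_PER_CHANNEL = 4
BANKS_PER_RANK = 8
PARTITIONS_PER_BANK = 8
WORDLINES_PER_PARTITION = 4096
BITLINES_PER_PARTITION = 256 * 1024  # assumption: "256K" means 2**18
BITS_PER_CELL = 1                    # assumption: single-level cells

GIB = 2 ** 30  # bytes in one GiB


def bank_capacity_bytes() -> int:
    """Capacity of one bank: partitions x wordlines x bitlines cells."""
    cells = PARTITIONS_PER_BANK * WORDLINES_PER_PARTITION * BITLINES_PER_PARTITION
    return cells * BITS_PER_CELL // 8


def total_banks() -> int:
    """Number of banks in the whole device."""
    return CHANNELS * RANKS_PER_CHANNEL * BANKS_PER_RANK


def device_capacity_bytes() -> int:
    """Capacity of the whole device."""
    return total_banks() * bank_capacity_bytes()


if __name__ == "__main__":
    print(bank_capacity_bytes() // GIB, "GiB per bank")
    print(total_banks(), "banks")
    print(device_capacity_bytes() // GIB, "GiB in total")
```

Under these assumptions each bank holds 1GB and the 128 banks add up to the stated 128GB capacity.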
The A-A interval (tRC) for the same bank is 47 cycles for a write request and 19 cycles for a read request; A-W/R (tRCD) is 1 cycle; the read latency (RL) is 10 cycles; the write latency (WL) is 3 cycles; and the write recovery time (tWR) is 35 cycles. These timing parameters are based on a 266MHz memory clock and a DDR2 interface [37] (see also Table 5 for our simulation parameters).
Figure 2 illustrates a PCM bank's peripheral structures in detail [17, 19, 37, 41, 59, 60]. There are 128 peripheral structures, which are shared across all bitlines in the bank. Each peripheral structure is connected to all the partitions in the bank. To simplify our discussion, we illustrate only one peripheral structure, which is connected to the two partitions i and j. The peripheral structure contains the sense amplifier (to read) and the write driver (to write), which are connected to the two partitions using the NMOS transistors M0, M1, M2, and M3. These transistors are turned ON or OFF, based on the partition that needs to be accessed. The bitline and wordline decoders are used to connect the PCM cell at a particular row and column address to the peripheral structure. Table 2 reports how the transistors M0, M1, M2, and M3 are configured to serve read and write requests. Observe that only one transistor is ON in the baseline PCM design (e.g., [2]) to serve a read or write request from the bank.
ENABLING PARTITION-LEVEL PARALLELISM IN PCM

Resolving Read-write Bank Conflicts in PCM
To investigate what is actually needed to resolve the read-write bank conflicts in PCM, we take a closer look at the connections of M0, M1, M2, and M3. Observe that if transistors M0 and M3 (or M1 and M2) are simultaneously enabled, the sense amplifier can read a PCM cell from partition j, while the write driver is programming a PCM cell in partition i (or vice versa). This is precisely the parallelism we seek in PCM banks, which can resolve a read-write bank conflict only when the two conflicting requests are to two different partitions within the bank. Table 3 summarizes our findings. Some transistor configurations can lead to data corruption. For example, if a configuration connects both partitions to the sense amplifier at the same time, then the data from two different partitions (i.e., two different PCM cells) reaches the sense amplifier together. This compromises the integrity of the data to be read at the output of the sense amplifier. We mark all such table entries as invalid. To enable the valid parallel configurations, we propose the following two architectural enhancements.
• First, the memory controller must issue two back-to-back ACTIVATE commands to the PCM bank to decode the wordline and bitline addresses in the two partitions.
• Second, the memory controller must configure the transistors to connect the two partitions, one to the write drivers, and the other to the sense amplifiers.
To accomplish these actions, we introduce a new PCM command:
• READ-WITH-WRITE (RWW): connect the PCM bank's sense amplifiers and write drivers to the two decoded partitions.
In the baseline PCM design, A-W-P takes 47 cycles and A-R-P takes 19 cycles (see Figure 3), for a total of 66 cycles to serve these two requests. In our PCM design, A-RWW-P takes 47 cycles, during which the write recovery time (tWR) partially overlaps with the read latency (RL). The second ACTIVATE command adds 1 extra cycle, so the total service latency is 48 cycles for serving a read and a write request mostly in parallel, a reduction of 18 cycles (27%) over the baseline.
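The read-write latency comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the cycle counts quoted in the text; the constant and function names are hypothetical.

```python
# Cycle counts quoted in the text.
T_WRITE_SEQUENCE = 47   # A-W-P in the baseline; also A-RWW-P in the PALP design
T_READ_SEQUENCE = 19    # A-R-P in the baseline
T_SECOND_ACTIVATE = 1   # extra cycle for the second ACTIVATE command


def baseline_rw_latency() -> int:
    """Serve the write and the read one after the other."""
    return T_WRITE_SEQUENCE + T_READ_SEQUENCE


def palp_rw_latency() -> int:
    """Serve both with one RWW sequence plus the second ACTIVATE."""
    return T_WRITE_SEQUENCE + T_SECOND_ACTIVATE


def reduction_percent(before: int, after: int) -> float:
    """Latency reduction of `after` relative to `before`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    before, after = baseline_rw_latency(), palp_rw_latency()
    print(before, after, before - after, round(reduction_percent(before, after)))
```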
Resolving Read-read Bank Conflicts in PCM
From the write driver's internal circuit diagram shown in Figure 2, we observe that the write driver can be viewed as a collection of two components: the write pulse shaper logic, which generates the current pulses necessary for the PCM cell's SET and RESET operations, and the verify logic, which verifies the correctness of these operations. These two circuit components together serve write requests from the bank using a PCM write scheme known as program-and-verify (P&V) [19, 42]. The verify logic essentially consists of two cross-coupled inverters, which can be configured as a sense amplifier, similar to the one that is already part of the peripheral circuit. Based on this observation, we propose simple circuit modifications to introduce the decoupling transistor M4 (see Figure 2), which can be configured when needed to decouple the verify logic from the write pulse shaper logic. As a result of this modification, we introduce two operating modes for the write driver. In the decoupled mode, the verify logic can serve read requests concurrently with those served by the sense amplifiers. In the write mode, the verify logic, together with the write pulse shaper logic, can serve P&V-based write operations. Table 4 summarizes our findings. This is precisely the read-read parallelism we seek in a PCM bank to resolve a read-read bank conflict, where the two conflicting requests are to two different partitions within the bank. To enable this parallelism, we introduce two new PCM commands:
• READ-WITH-READ (RWR): connect the sense amplifiers and verify logic of the write drivers to the two decoded partitions.
To facilitate arbitration of the data bus for transferring two sets of read data back to the memory controller, we introduce two more transistors M5 and M6. In the write mode, M5 = OFF and M6 = ON. In the decoupled mode, the memory controller first retrieves the sense amplifiers' data in 8 cycles. It then sets M5 = ON and M6 = OFF in the next cycle, which enables the verify logic's data to be retrieved in the following 8 cycles. The data transfer latency from PCM to the memory controller is therefore 8 + 1 + 8 = 17 cycles. For this purpose, we introduce our final new PCM command:
• TRANSFER (T): set M5 = ON and M6 = OFF.
In the baseline PCM design, each A-R-P takes 19 cycles. The total service latency is therefore 38 cycles to serve two read requests serially from partitions i and j. In our PCM design, the memory controller issues the following commands:
In our new design, RWR takes 10 cycles, during which the read latencies (RL) of the two read requests overlap. The total service latency is therefore 1 + 1 + 1 + 10 + 17 = 30 cycles, considering 8 + 1 + 8 = 17 cycles for the data transfer from PCM to the memory controller, as shown in Figure 4 (❷).
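The read-read latency sum can be written out term by term. The sketch below uses only the cycle counts quoted in the text. The three single-cycle terms are kept as an unlabeled group because the text does not itemize them here, and the 8-cycle saving it prints is derived from the two totals rather than quoted. All names are hypothetical.

```python
# Cycle counts quoted in the text.
T_READ_SEQUENCE = 19     # A-R-P for one read in the baseline
SINGLE_CYCLE_TERMS = 3   # the three 1-cycle terms in the text's sum (not itemized there)
T_RWR = 10               # RWR: the two read latencies overlap
T_BURST = 8              # cycles to move one set of read data to the controller
T_TRANSFER_CMD = 1       # TRANSFER (T): switches M5 and M6 between the two bursts


def transfer_latency() -> int:
    """Sense-amplifier data, then the TRANSFER cycle, then verify-logic data."""
    return T_BURST + T_TRANSFER_CMD + T_BURST


def palp_rr_latency() -> int:
    """Total latency for two reads served with RWR."""
    return SINGLE_CYCLE_TERMS * 1 + T_RWR + transfer_latency()


def baseline_rr_latency() -> int:
    """Two reads served serially from partitions i and j."""
    return 2 * T_READ_SEQUENCE


if __name__ == "__main__":
    print(transfer_latency(), palp_rr_latency(), baseline_rr_latency())
    print("cycles saved:", baseline_rr_latency() - palp_rr_latency())
```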
EXPLOITING PARTITION-LEVEL PARALLELISM IN PCM
This section describes our new memory access scheduling policy to exploit each PCM bank's partition-level parallelism, which we describe how to enable in Section 3.
High-level Overview
The key idea of our scheduling policy is to maximize partition-level parallelism as long as the running average power limit (RAPL) is not violated and no request is delayed for too long (i.e., starvation is avoided).
At a high level, the memory controller maintains a read-write queue (rwQ) to store PCM requests. We implement rwQ as a FIFO. After scheduling a request from the rwQ, the memory controller checks to see if there is any outstanding request that can be scheduled exploiting the PCM bank's partition-level parallelism. If so, the request is scheduled along with the ongoing request, and the memory timing parameters are set accordingly. Otherwise, the memory controller selects the oldest request in the rwQ to be scheduled after completing the ongoing one. Figure 5 summarizes the flowchart of our new memory access scheduling policy.
Detailed Design
To explain our new memory access scheduling policy, we introduce the following notation:
• N: total number of memory clock cycles elapsed
• P: running average power consumption
• RAPL: running average power limit [13] for the PCM device
We estimate the power consumed to resolve a read-read (R-R) and read-write (R-W) bank conflict as follows:
The timing parameters are as follows: 30 cycles for resolving the read-read (R-R) conflict, and 48 cycles for the read-write (R-W) conflict (see also Table 5 ). The pseudo-code of our new memory access scheduling policy is shown in Algorithm 1.
ALGORITHM 1: Our new memory access scheduling policy.
[...], the number of clock cycles for which it is outstanding in the rwQ is within the backlogging threshold th_b */
    next-request = select the oldest request in the rwQ that has bank conflict
end
Let next-request = a^p_l;
W^p = set of n_w write requests in rwQ to partition p;
R^p = set of n_r read requests in rwQ to partition p;
[...]; /* Find a request that can be concurrently scheduled with a [...]
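Since only part of Algorithm 1 is reproduced here, the following is a rough sketch of the policy as the surrounding prose describes it, not the authors' algorithm. Several pieces are assumptions made for illustration: the partner is chosen greedily by largest cycle saving, a per-request `bypassed` counter stands in for the backlogging threshold th_b, and a hypothetical `power_ok` hook stands in for the RAPL check, whose power estimate is not reproduced in this section. The latencies are the ones quoted in the text (19, 47, 30, and 48 cycles).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

SINGLE_LATENCY = {"R": 19, "W": 47}  # serial A-R-P and A-W-P, from the text
RR_LATENCY = 30                      # read-read conflict resolved with RWR
RW_LATENCY = 48                      # read-write conflict resolved with RWW


@dataclass
class Request:
    kind: str          # "R" or "W"
    partition: int     # partition index within the bank
    bypassed: int = 0  # times a younger request was issued ahead of this one


def pair_latency(a: Request, b: Request) -> Optional[int]:
    """Latency of serving a and b together, or None if they cannot be paired."""
    if a.partition == b.partition:
        return None  # same partition: must be serialized
    if a.kind == "W" and b.kind == "W":
        return None  # write-write conflicts are not resolved
    return RR_LATENCY if a.kind == b.kind else RW_LATENCY


def schedule(queue: List[Request], th_b: int = 8,
             power_ok: Callable[[Request, Request], bool] = lambda a, b: True
             ) -> List[Tuple[Request, Optional[Request], int]]:
    """Greedy sketch: pair the oldest request with the best eligible partner."""
    pending = list(queue)
    issued = []
    while pending:
        head = pending.pop(0)  # oldest request is always served next
        best_idx, best_latency, best_saving = None, None, 0
        for i, cand in enumerate(pending):
            latency = pair_latency(head, cand)
            if latency is not None and power_ok(head, cand):
                saving = SINGLE_LATENCY[head.kind] + SINGLE_LATENCY[cand.kind] - latency
                if saving > best_saving:
                    best_idx, best_latency, best_saving = i, latency, saving
            if cand.bypassed >= th_b:
                break  # do not look past a request that has waited too long
        if best_idx is None:
            issued.append((head, None, SINGLE_LATENCY[head.kind]))
        else:
            for skipped in pending[:best_idx]:
                skipped.bypassed += 1
            issued.append((head, pending.pop(best_idx), best_latency))
    return issued


def total_latency(issued) -> int:
    """Sum of service latencies over the issued schedule."""
    return sum(latency for _, _, latency in issued)


def example_queue() -> List[Request]:
    """Six requests to one bank; the arrival order is an assumption."""
    return [Request("R", 1), Request("W", 3), Request("R", 4),
            Request("R", 3), Request("W", 1), Request("R", 1)]
```

With `th_b=0` the sketch only ever pairs a request with the next-oldest one, which behaves like FCFS with partition-level parallelism. A `power_ok` hook that always returns `False` serializes every request.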
Significance of PALP's Scheduling Policy
We provide some intuition, via an example, as to why (1) enabling each PCM bank's partition-level parallelism is not sufficient to significantly improve performance unless there is a scheduling policy that explicitly exploits such parallelism, and (2) PALP's scheduling policy outperforms the standard FCFS policy [51], which is commonly used by many PCM memory controllers. Figure 6 illustrates an example showing how the memory controller schedules six PCM requests to the same bank. In (❶) we illustrate the FCFS policy of the Baseline [2], where no more than one partition is active at any time. Using the timing parameters listed at the bottom of this figure, the total PCM service latency is 170 cycles.
In (❷) we illustrate the FCFS policy with partition-level parallelism in PCM banks. The PCM request R^1_127 is scheduled with W^3_120 by issuing the new PCM command RWW using partitions 1 and 3. This resolves the read-write bank conflict. The PCM request R^4_12 is scheduled with R^3_7 by issuing the new PCM command RWR using partitions 4 and 3. This resolves the read-read bank conflict. We note that requests W^1_89 and R^1_22 are both to partition 1, and are therefore scheduled serially. In this example, more than one partition can be active, which brings the total PCM service latency down to 144 cycles, a reduction of 15.3% over the baseline.
Finally, in (❸) we illustrate one example schedule obtained using PALP's new memory access scheduling policy. The memory controller re-orders PCM requests in the read-write queue to maximize partition-level parallelism. We observe that 1) request R^1_127 is scheduled with W^3_120, and request R^4_12 is scheduled with W^1_89, by issuing two RWW PCM commands, and 2) request R^3_7 is scheduled with the remaining request R^1_22 by issuing an RWR PCM command. Overall, PALP improves performance by 25.8% over the Baseline (❶) in this example. In Section 6, we report PALP's application-level performance improvement for all our evaluated workloads.
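The totals for the three schedules in this example can be recomputed from the per-command latencies quoted earlier (19 cycles for a serial read, 47 for a serial write, 48 for RWW, 30 for RWR). The sketch below is a cross-check under one assumption: in schedule (❸) the last pair is served by a single RWR command, which gives a 126-cycle total that the text does not state directly. The names are hypothetical.

```python
# Service latency of each command sequence, in cycles (from the text).
LATENCY = {"R": 19, "W": 47, "RWW": 48, "RWR": 30}

# Command sequences for the six requests to the same bank.
BASELINE_FCFS = ["R", "W", "R", "R", "W", "R"]  # (1): one partition at a time
FCFS_PARALLEL = ["RWW", "RWR", "W", "R"]        # (2): FCFS with parallelism
PALP_SCHEDULE = ["RWW", "RWW", "RWR"]           # (3): assumed pairing, see lead-in


def total_cycles(schedule) -> int:
    """Sum the service latency of every command sequence in a schedule."""
    return sum(LATENCY[cmd] for cmd in schedule)


def reduction_percent(before: int, after: int) -> float:
    """Latency reduction of `after` relative to `before`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    base = total_cycles(BASELINE_FCFS)
    fcfs = total_cycles(FCFS_PARALLEL)
    palp = total_cycles(PALP_SCHEDULE)
    print(base, fcfs, palp)
    print(f"(2) vs (1): {reduction_percent(base, fcfs):.1f}%")
    print(f"(3) vs (1): {reduction_percent(base, palp):.1f}%")
    print(f"(3) vs (2): {reduction_percent(fcfs, palp):.1f}%")
```

Under this assumption the reduction of (❸) comes to about 25.9% relative to (❶) and 12.5% relative to (❷), so the quoted 25.8% appears to be measured against the Baseline (❶).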
EVALUATION METHODOLOGY
To evaluate PALP, we develop a full-system simulator with the following components.
• Gem5 [6] simulator frontend to simulate an ARMv8-A (aarch64) system [3] with 8 cores.
• We use a hybrid DRAM-PCM memory system. DRAMPower [10] is used to estimate its power consumption.
• In-house cycle-level PCM simulator for 8GB, 16GB, and 32GB PCM with DDR4 interface. This is based on Ramulator [29], a cycle-accurate main memory simulator. Power and latency parameters are based on IBM's 20nm PCM [37], with DDR4 interface parameters [4]. Our simulator is available for download at [57].
Table 5 shows our simulation parameters.
Evaluated Techniques
We evaluate the following techniques:
• Baseline: The Baseline technique [2] maximizes PCM performance by serving multiple requests in parallel across all banks. It uses the FCFS policy and does not exploit partition-level parallelism.
• MultiPartition: The MultiPartition technique [71] can resolve the read-write bank conflicts in PCM by exploiting the presence of partitions in a bank. The original design of [71] uses the FCFS policy, which provides very small benefit over the Baseline (according to our evaluations). For a fair comparison with PALP, we implemented out-of-order scheduling for this technique. This scheduling policy explicitly prioritizes requests that exploit read-write parallelism in PCM. Although [71] does not consider memory interface timings, we implement the MultiPartition technique with the DDR4 interface to fairly and accurately estimate its performance.
• PALP: Our mechanism enables each PCM bank's partition-level parallelism, and uses our new memory access scheduling policy to optimize performance by prioritizing requests that can exploit such partition-level parallelism. PALP can resolve both read-write and read-read bank conflicts.
Table 6. Evaluated workloads (partial).
[...]: cactusBSSN, bwaves, roms, parest, xz
Mixed (parallel) applications: AI-1 (4 copies each of deepsjeng, leela), AI-2 (4 copies each of mcf, exchange2), Visualization-1 (4 copies each of povray, blender), Visualization-2 (4 copies each of povray, imagick), Scientific (4 copies each of cactusBSSN, bwaves)
The address mapping for all our evaluated systems is based on Micron's DDR4 datasheet [40]. [...] [27]. Even though the exact numerical benefits differ, our mechanism works and improves performance for all evaluated address mappings.
Evaluated Workloads
Table 6 reports the evaluated workloads. These workloads were chosen because they have at least 1 memory access per 1000 instructions out of the 64MB on-chip eDRAM cache. Although SPEC CPU2017 workloads are commonly used for evaluating high performance systems, recent works suggest that these workloads are also representative of many emerging applications that users regularly run on their mobile phones [44].
Figures of Merit
We report the following figures of merit in this work to evaluate different mechanisms:
1. Execution Time: The time it takes to finish a workload.
2. Queuing Delay: The total number of memory cycles spent by a request in the rwQ, averaged over all PCM requests. The delay of each request is measured as the time difference between when the request is inserted in the queue and when it is scheduled to PCM.
3. Access Latency: The sum of queuing delay and the PCM service latency, averaged over all PCM requests.
4. PCM Power Consumption: The total power consumed for activating partitions and peripheral structures within PCM banks.
RESULTS AND DISCUSSION
Overall System Performance
Figure 7 reports the execution time of each of our workloads for each of our evaluated systems normalized to the Baseline system. The simulator is configured for the default settings of 4MB eDRAM cache and an 8GB PCM. We make the following two observations. First, MultiPartition has higher performance than the Baseline (32% lower average execution time). This improvement is because MultiPartition resolves read-write bank conflicts in PCM. Second, PALP has the highest performance among all the three evaluated systems (PALP has 51% lower average execution time than the Baseline, and 28% lower average execution time than MultiPartition). This performance improvement is because 1) PALP resolves both read-read and read-write bank conflicts, and 2) PALP's memory access scheduling policy is optimized to maximize both read-read and read-write partition-level parallelism to achieve higher performance.
Queuing Delay
Figure 8 reports the queuing delay of each of our workloads for each of our evaluated systems normalized to the Baseline system. The simulator is configured for the default settings of 4MB eDRAM cache and an 8GB PCM. We make the following two observations. First, the average queuing delay of MultiPartition is 34% lower than the Baseline. This is because MultiPartition reduces the average delay of the outstanding write requests by scheduling them concurrently with read requests that are to different partitions within PCM banks (exploiting read-write parallelism).
Second, the average queuing delay of PALP is the lowest (52% lower than the Baseline and 26% lower than MultiPartition). This reduction is because PALP also reduces the queuing delay of read requests by scheduling them concurrently with those that are to different partitions within PCM banks (exploiting read-read parallelism).
Access Latency
Figure 9 reports the access latency of each of our workloads for each of our evaluated techniques normalized to the Baseline, using the default settings of our simulator. We make the following two observations.
First, the average access latency of MultiPartition is 31% lower than the Baseline. This reduction is a result of the performance improvement due to exploiting read-write partition-level parallelism in PCM, as we have discussed in Section 3.1. Second, the average access latency of PALP is the lowest among all the three systems (47% lower than the Baseline, and 23% lower than MultiPartition). This reduction is due to the significant reduction of the PCM service latency achieved by exploiting both read-read and read-write partition-level parallelism in PCM banks.
PCM Power Consumption
Figure 10 reports the average and peak power consumption of PALP for each of our workloads. The simulator is configured for the default settings of 4MB eDRAM cache and an 8GB PCM. We also report the RAPL limit, which is 0.4pJ per access. This limit is what is specified in the PCM datasheet [37]. We make the following two observations. First, both the average and peak PCM power consumption of PALP are within the RAPL limit. The average PCM power is at least 0.08pJ/access lower than the RAPL limit, while the peak PCM power is at least 0.03pJ/access lower. Our memory access scheduler explicitly estimates the increase in power consumption when more than one partition is active to exploit partition-level parallelism, and serializes requests to PCM anytime the RAPL limit is estimated to be exceeded. None of the prior works takes power consumption into account while exploiting a PCM bank's partition-level parallelism. As a result, the RAPL limit cannot be guaranteed in these works. Second, our technique allows the exploration of performance and power trade-offs. We observe that reducing the RAPL limit to 0.35pJ/access will not hurt performance when our technique is employed. In fact, system designers can potentially use our technique to estimate the RAPL needed to achieve a desired performance target and distribute the surplus power budget to other system components.
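The improvements of PALP over MultiPartition quoted in this section can be cross-checked from each system's reduction against the common Baseline. The sketch below assumes that the averages compose as a simple ratio, which holds exactly only for ratios of averages. The reported per-workload averages need not match it to the last percent. The function name is hypothetical.

```python
def relative_reduction(reduction_a: float, reduction_b: float) -> float:
    """Reduction of system B relative to system A.

    Both arguments are fractional reductions versus the same baseline,
    e.g. 0.32 means "32% lower than the Baseline".
    """
    return 1.0 - (1.0 - reduction_b) / (1.0 - reduction_a)


if __name__ == "__main__":
    # (MultiPartition vs Baseline, PALP vs Baseline), as quoted in the text.
    quoted = {
        "execution time": (0.32, 0.51),
        "queuing delay": (0.34, 0.52),
        "access latency": (0.31, 0.47),
    }
    for metric, (multipartition, palp) in quoted.items():
        rel = relative_reduction(multipartition, palp)
        print(f"{metric}: PALP is about {100 * rel:.0f}% lower than MultiPartition")
```

For execution time and access latency this reproduces the quoted 28% and 23%. For queuing delay the ratio gives about 27% against the quoted 26%, a small gap that per-workload averaging or rounding could explain (an assumption, not something the text states).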
Latency, Power, and Area Overheads
We evaluate the latency and power overhead of PALP against the Baseline PCM design using SPICE simulations [16] with 20nm technology files from [56]. For the Baseline design, we use IBM's PCM architecture [37]. In evaluating our design overheads, we use the industry standard low standby power (LSTP) multi-gate transistors. We also use copper interconnect parasitics [53] in our SPICE simulation to model wire delays. Finally, we model process variation using the guidelines presented in [63]. In Table 7, we report the latency and power overheads of PALP. We observe that due to the new circuits that we introduce, the critical path delay has increased by 25.3%. However, the clock cycle time is 3.9ns for the 256MHz rated memory clock, which is much longer than the critical path delay of 1453.2ps. We believe that the introduced logic is unlikely to create setup or hold violations at this rated clock frequency.
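The timing-margin argument can be made concrete by converting the clock frequency to a period and comparing it with the critical path delay. The sketch below checks both the 256MHz clock quoted in this paragraph and the 266MHz clock quoted in the background section. It assumes that 1453.2ps is the critical path delay after the PALP modifications. The names are hypothetical.

```python
CRITICAL_PATH_NS = 1.4532  # 1453.2ps, assumed to be the delay with PALP's circuits


def clock_period_ns(frequency_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


def slack_ns(frequency_mhz: float, path_delay_ns: float = CRITICAL_PATH_NS) -> float:
    """Time left in one clock cycle after the critical path has settled."""
    return clock_period_ns(frequency_mhz) - path_delay_ns


if __name__ == "__main__":
    for mhz in (256.0, 266.0):
        print(f"{mhz:.0f}MHz: period {clock_period_ns(mhz):.2f}ns, "
              f"slack {slack_ns(mhz):.2f}ns")
```

At either frequency the clock period is more than twice the critical path delay, which is consistent with the claim that setup or hold violations are unlikely.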
Our design also increases the power consumption of a single peripheral structure, consisting of a sense amplifier and our new write driver, by 17% over the Baseline design. Finally, we observe that PALP has an area overhead of 1.15% compared to the area of each peripheral structure. This overhead is negligible compared to the area of a 1GB PCM bank, which is 9.43 × 6.30 mm^2 at the 20nm technology node.
Effect of PALP with Different PCM Capacities
Figure 11 reports the execution time of each of our workloads with PALP using PCM capacities of 16GB and 32GB, normalized to the default configuration using PALP with 8GB PCM. The eDRAM capacity is configured to 4MB for all these configurations. We make the following two observations from our study.
First, for most workloads, we observe very little performance improvement when the PCM capacity is increased from 8GB to 32GB. This is because these workloads have small working sets, for which an 8GB PCM is sufficient. Second, for xz we observe a significant performance improvement when we increase the PCM capacity to 16GB and 32GB due to xz's large working set. For this workload, the higher the number of PCM banks, the better the performance of PALP. This is because PALP can exploit more parallelism in more partitions in PCM that exist in more banks.
Effect of Different eDRAM Capacities
Figure 12 reports the execution time of each of our workloads with PALP, normalized to the default configuration using PALP with 4MB eDRAM cache. We report results for PALP with eDRAM capacities of 8MB, 16MB, and 32MB. The PCM capacity is configured to 8GB. We make the following two observations.
First, for most workloads, as we increase the capacity of the eDRAM cache we observe a significant improvement in performance (lower execution time). This is because with a larger eDRAM cache capacity, more write requests are absorbed in the eDRAM, leaving only the read requests to be queued in the rwQ. This impacts the execution time in the following two ways: 1) the latency to service long write requests is reduced, and 2) the memory controller now has more flexibility to exploit read-read parallelism from the outstanding read requests. Second, for susan_smoothing, we observe a marginal change in performance when we increase the eDRAM capacity to 32MB. This is because there are only a small number of write requests in this workload to begin with, and therefore the workload's performance is insensitive to the size of the eDRAM cache, which buffers only write requests.
Effect of Different Interface Timings
Figure 13 reports the execution time of each of our workloads for PALP with DDR2 and DDR4 interfaces, normalized to the Baseline using the default settings of our simulator. We make the following two observations. First, PALP performs better than the Baseline with both the DDR2 and DDR4 interfaces. The average execution time of PALP with the DDR2 interface is 33% lower than the Baseline, while that with the DDR4 interface is 51% lower than the Baseline. Second, we observe that the execution time of PALP decreases when the memory interface is changed from DDR2 to DDR4. The average performance improves by 27% when the DDR4 interface is used. This improvement is because the data transfer rate is doubled in DDR4 compared to DDR2. A complete exploration of all DRAM standards is a vast undertaking, and is beyond the scope of this work (see, for instance, [18]). We conclude that PALP improves performance for multiple DDRx interface standards.
Effect of Different Design Thresholds
6.9.1 RAPL Limit. Figure 14 summarizes the variation in PALP's execution time (normalized to the Baseline) and PCM power consumption for all workloads, when the RAPL limit changes between 0.2 and 0.4pJ/access. The bar heights represent the execution time and power consumption using the default RAPL limit of 0.3pJ/access. The error bars represent variation when the RAPL limit is varied between 0.2pJ/access and 0.4pJ/access. We make the following three observations. First, different RAPL limits lead to different performance and power consumption trade-offs in all workloads. In other words, the RAPL limit can be set to achieve the desired performance and power targets. Second, setting a stricter RAPL limit results in lower performance (i.e., higher normalized execution time), while reducing the average power. We observe that for bwaves, setting the RAPL limit to 0.2pJ/access results in a performance improvement of only 11% over the Baseline, compared to 33% when the RAPL limit is set to 0.4pJ/access. Third, beyond the RAPL limit of 0.4pJ/access, we see no significant variation in either execution time or power consumption, which means that the RAPL limit for PCM can be safely reduced from its rated value of 0.4pJ/access [37].
6.9.2 Backlogging Threshold th_b. Figure 15 summarizes the variation in PALP's execution time normalized to the Baseline for all workloads. Each bar height represents the execution time using the default backlogging threshold of 8 accesses. The error bars represent variation when the backlogging threshold is varied from 2 to 16. We make the following two observations. First, normalized execution time changes as we vary the backlogging threshold. This is because setting a lower threshold forces PALP to schedule outstanding requests sooner, prioritizing starvation freedom over performance (i.e., parallelism exploitation). This leads to a reduction in PALP's performance improvement. On the other hand, setting the backlogging threshold to a higher value offers more flexibility to PALP, allowing it to exploit partition-level parallelism more aggressively, thereby improving performance. Second, for workloads such as xz, there is no impact of varying the backlogging threshold, meaning that PALP can prioritize starvation freedom for these workloads.
Impact of PALP Design Decisions
6.10.1 Bank Conflicts and Scheduling. To estimate the impact of resolving both the read-read and read-write bank conflicts, and of the access scheduling policy, in Figure 16 we report the execution time of PALP normalized to the Baseline, with three configurations: (1) PALP resolving read-write conflicts only (PALP-RW-FCFS), (2) PALP resolving both conflicts (PALP-RR-RW-FCFS), and (3) PALP resolving both conflicts with the new access scheduling policy (PALP-ALL). We make the following two observations. First, by resolving the read-write bank conflicts, PALP improves performance by only 7% over the Baseline. When PALP resolves both read-write and read-read bank conflicts, performance improves by 32.2% over the Baseline. Second, by introducing our new scheduling policy, PALP's performance improves significantly (51.1% lower average execution time than the Baseline with PALP-ALL). We conclude that our choice of resolving both the read-read and read-write conflicts, and the new access scheduling policy are all essential to provide the highest performance benefits with PALP.
RELATED WORKS
To our knowledge, this is the first work that 1) enables and exploits partition-level parallelism in phase-change memory to resolve read-read bank conflicts, and 2) designs a new memory access scheduling mechanism to aggressively exploit PCM's partition-level parallelism.
Read-While-Write in PCM
A patent application [5] describes read-while-write for PCM, where a read request and a write request can be served simultaneously by a PCM bank using different partitions. However, it describes no architectural technique to leverage this feature for system performance. Some earlier works such as [71] address architectural aspects while assuming unrealistic system settings (such as infinite memory channel bandwidth). Our work not only addresses the limitations of these prior works in resolving read-write bank conflicts, but also resolves read-read bank conflicts for the first time. We also evaluate PALP against a realistic version of [71] and find that PALP improves average system performance by 28%.
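The distinction between the two kinds of bank conflicts can be sketched as a simple classification of two accesses. The `Access` type, the `classify_pair` function, and its return labels are hypothetical; treating same-partition accesses and write-write pairs as serialized is an assumption based on the shared write drivers, not a rule stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    bank: int
    partition: int
    is_write: bool


def classify_pair(a: Access, b: Access) -> str:
    """Classify how two accesses could be served relative to each other."""
    if a.bank != b.bank:
        return "no-conflict"          # different banks already run in parallel
    if a.partition == b.partition:
        return "serialize"            # assumption: one partition, one access
    if a.is_write and b.is_write:
        return "serialize"            # assumption: both need the write drivers
    if a.is_write != b.is_write:
        return "read-write-parallel"  # sense amplifiers read, write drivers write
    return "read-read-parallel"       # write drivers assist in their read mode


print(classify_pair(Access(0, 1, False), Access(0, 2, True)))
print(classify_pair(Access(0, 1, False), Access(0, 2, False)))
```

Read-while-write as described in [5] corresponds only to the `read-write-parallel` case; the `read-read-parallel` case is the one that requires the additional operating mode of the write drivers.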
Performance/energy/endurance Improvement of PCM
Many prior works optimize performance and energy of PCM [2, 12, 31, 32, 47, 50, 65]. Cho et al. propose Flip-N-Write [12] to improve PCM performance by first reading the memory content and then programming only the bits that need to be altered. Qureshi et al. propose PreSET [47], an architectural technique that SETs the PCM cells of a memory location in the background before programming them during write. This improves performance by converting a write operation to a RESET operation of the PCM cells, which is faster. There are also techniques to consolidate multiple write operations [62] to reduce the number of cells that need to be programmed, saving energy and improving performance. To mitigate PCM's cell-level endurance problem, several wear-leveling techniques are proposed [1, 54]. PALP can be combined with these and similar techniques.
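The read-before-write idea behind Flip-N-Write can be sketched as follows. This is a simplified illustration, not the implementation from [12]: the function names, the 8-bit word width, and the cost model (one unit per programmed bit, including the flip bit) are assumptions. The sketch compares the new data against the stored word and stores either the data or its complement, whichever requires fewer bits to be programmed.

```python
from typing import Tuple


def flip_n_write(stored: int, flip: int, new_data: int,
                 width: int = 8) -> Tuple[int, int, int]:
    """Return (word_to_store, flip_bit, bits_programmed) for one word."""
    mask = (1 << width) - 1

    def cost(cand_word: int, cand_flip: int) -> int:
        # Bits that differ from the stored word, plus the flip bit if it changes.
        return bin((stored ^ cand_word) & mask).count("1") + (flip ^ cand_flip)

    plain = (new_data & mask, 0)       # store the data as is
    inverted = (~new_data & mask, 1)   # store the complement and set the flip bit
    word, new_flip = min(plain, inverted, key=lambda c: cost(*c))
    return word, new_flip, cost(word, new_flip)


def decode(word: int, flip: int, width: int = 8) -> int:
    """Recover the logical data from a stored word and its flip bit."""
    mask = (1 << width) - 1
    return (~word & mask) if flip else word


word, flip_bit, programmed = flip_n_write(0b00000000, 0, 0b11111110)
print(word, flip_bit, programmed, decode(word, flip_bit))
```

In this example, writing the data directly would program seven bits, whereas storing the complement programs one data bit plus the flip bit.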
Writeback Optimization
Several prior works propose line-level writeback [30-32, 45, 46, 50], where for each evicted DRAM cache block, processor cache blocks that become dirty are tracked and selectively written back to PCM. Various works propose dynamic write consolidation [33, 55, 58, 61, 62], where PCM writes to the same row are consolidated into one write operation. Other works propose write activity reduction [24, 25], where registers are allocated on CPUs to reduce costly write operations in PCM. Still other works propose multi-stage write operations [67, 69], where a write request is served in several steps rather than in one shot to improve performance. Qureshi et al. propose a morphable PCM system [48], which dynamically adapts between high-density, high-latency MLC PCM and low-density, low-latency single-level cell PCM. Qureshi et al. also propose write cancellation and pausing [49], which allow PCM reads to be serviced faster by interrupting long PCM writes. Jiang et al. propose write truncation [26], where a write operation is truncated to allow read operations, compensating for the loss in data integrity with stronger ECC. PALP is complementary to all these approaches.
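The notion of write consolidation mentioned above, merging pending writes to the same row into one write operation, can be sketched in a few lines. The `consolidate_writes` function and the `(row, column, value)` tuple layout are hypothetical and do not reproduce any of the cited mechanisms.

```python
from collections import OrderedDict
from typing import Dict, Iterable, Tuple


def consolidate_writes(
    writes: Iterable[Tuple[int, int, int]]
) -> "OrderedDict[int, Dict[int, int]]":
    """Merge pending (row, column, value) writes into one write per row."""
    merged: "OrderedDict[int, Dict[int, int]]" = OrderedDict()
    for row, col, value in writes:
        # A later write to the same column of a row overwrites the earlier one.
        merged.setdefault(row, {})[col] = value
    return merged


pending = [(3, 0, 0xAA), (7, 1, 0x11), (3, 2, 0xBB), (3, 0, 0xCC)]
consolidated = consolidate_writes(pending)
print(len(pending), "pending writes ->", len(consolidated), "row writes")
```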
Multilevel Cell PCM Optimizations
PCM cells can be used to store multiple bits per cell (referred to as multilevel cell or MLC). MLC PCM offers greater capacity per cell at the cost of asymmetric energy and latency in accessing the bits in a cell. Yoon et al. propose an architectural technique for data placement in MLC PCM [65], exploiting energy-latency asymmetries. Such techniques are also complementary to PALP and can be combined with it.
CONCLUSION
We introduce PALP, a new mechanism that enables each PCM bank's partition-level parallelism, and exploits such parallelism using a new memory access scheduling mechanism. Previous architectural solutions to address parallelism in PCM banks can resolve only the read-write bank conflicts and assume an unrealistic memory interface with no timing constraints. We observe that (1) read-read bank conflicts far outnumber read-write bank conflicts, and (2) without designing a memory interface with realistic timing, the estimated performance improvements can be misleading. Based on our observations, we introduce PALP, which is built on three contributions. First, we introduce new PCM commands to enable parallelism in a bank's partitions in order to resolve read-write bank conflicts, with no changes needed to PCM logic or its interface. Second, we propose simple circuit modifications to resolve read-read bank conflicts. Third, we propose a new PCM access scheduling mechanism that improves performance by prioritizing those requests that exploit a PCM bank's partition-level parallelism over other requests. While doing so, our new scheduling mechanism also guarantees starvation-freedom and the running-average-power-limit (RAPL) of PCM.
We evaluate PALP with workloads from the MiBench and SPEC CPU2017 benchmark suites. Our results show that PALP reduces average PCM access latency by 23%, and improves average system performance by 28% compared to the state-of-the-art approaches.
We conclude that PALP is a simple yet powerful mechanism to improve PCM performance. We have open-sourced our infrastructure [57] to enable future work based on PALP.
