ABSTRACT
The explosion of digital data and the high computation demands of data analysis have made the memory system a major contributor to the performance and power consumption of modern computing systems. Both industry and academia have proposed innovations such as new memory architectures, interfaces, devices, and topologies that have led to a vast increase in the design space of memory systems. In this paper, we present CramSim, a flexible, extensible, and scalable simulation framework designed to help efficiently explore the vast design space of memory systems. CramSim is designed on top of SST (Structural Simulation Toolkit) to decouple basic functional blocks of the memory system into separate components.
INTRODUCTION
For decades, DRAM (Dynamic Random Access Memory) has been the sole storage technology for computer main memory. Advances in manufacturing technology have improved the capacity and power consumption of DRAM. DRAM vendors have also continuously increased DRAM bandwidth by using performance-enhancing technologies such as DDR (Double Data Rate) I/O signaling and by increasing I/O frequency. However, DRAM technology faces scaling challenges in density, power, and bandwidth due to fundamental limitations of the physical cell design. Meanwhile, memory-intensive applications increasingly require more memory capacity and bandwidth.
To overcome these challenges, both industry and academia are exploring new memory systems that utilize emerging memory technologies such as 3D-stacked memory (HMC and HBM) and Non-Volatile Memory (FLASH, PCRAM, and STT-MRAM). Hybrid or multi-level memory systems that combine emerging technologies with conventional DRAM have been proposed to meet the memory capacity, bandwidth, and power consumption requirements of future computing systems. Multi-level memory systems have a larger design space to consider, including the organization, controller, interface, and memory device mix. The community needs a simulation infrastructure that is flexible, extensible, and scalable to study and explore this design space.
MEMSYS 2017, October 2-5, 2017, Alexandria, VA, USA. © 2017 Copyright held by the owner/author(s). DOI: 10.1145/3132402.3132408
Several DRAM simulators have been developed to evaluate current and new DRAM standards. DRAMSim2 [8], the most popular simulator, provides a cycle-accurate DRAM model and a simple programming interface, but it only supports the DDR2/3 standards and a simple memory controller model. The USIMM simulator [1] provides multiple memory controller models, but it only supports the DDR3 standard. Recently, an extensible DRAM simulator called Ramulator [5] was developed. Even though it supports a variety of DRAM standards (e.g., DDR3/4, LPDDR3/4, GDDR5, and 3D-stacked DRAM) through its strong extensibility, it only offers a simple memory controller model. DRAMsys [4] also supports a wide range of standards and has a flexible controller architecture. However, it is not easy to integrate DRAMsys into other open-source full-system simulators because it is implemented in SystemC TLM. Most full-system simulators, such as gem5, have their own memory system model [3]. Even though these models are fast and provide reasonable accuracy, they are not portable to other full-system simulators. Also, none of the existing memory simulators is designed to simulate a large-scale memory system comprising a large number of controller and memory models.
In this paper, we present a new simulation infrastructure called CramSim (ContRoller And Memory SIMulator). CramSim is developed on top of Sandia's SST (Structural Simulation Toolkit) framework [7] to assist in the design space exploration of memory systems, in particular for hybrid or multi-level memory systems. Using SST's modular interface, CramSim decouples the primary functional blocks of the memory system into multiple component objects and enables configuration of the memory system by initializing and connecting the components with a Python file. CramSim can run with a request trace file in standalone mode, or it can run with other SST hardware components such as the processor core, on-chip cache, crossbar, and network router. We have also demonstrated CramSim working with the popular gem5 system simulator [2].
CRAMSIM ARCHITECTURE
CramSim is designed to run with the SST framework to achieve its three design goals: flexibility, extensibility, and scalability.
Flexibility. CramSim provides ample flexibility in configuring the memory system to facilitate efficient exploration of the memory system design space. CramSim consists of one or more independent lanes, which can have different configurations such as the number of channels, memory device type, and scheduling policy (Figure 1). In each lane, there are two components: controller and memory. The controller component models a general memory controller architecture, while the memory component models the physical characteristics of the corresponding memory device. A memory system is configured by initializing components with the relevant parameters and by defining connections between them with a Python script.
Extensibility. CramSim decouples key functional blocks of the memory system into separate components. Since the components are isolated from one another, each can be implemented separately, which enables easy extension of CramSim to support new memory standards, controller architectures, and memory devices.
Scalability. The SST framework supports parallel simulation with MPI (Message Passing Interface) to speed up simulation of large-scale systems. CramSim is thus able to simulate large-scale memory systems with a large number of memory controllers and device models.
In the following subsections, we describe each component of CramSim in more detail.
Transaction Dispatcher
The transaction dispatcher receives memory transactions from one or more processor cores or I/O devices. It arbitrates the transactions and allocates them to each lane depending on the specified allocation policy.
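The allocation step can be illustrated with a minimal Python sketch. The text does not name a specific allocation policy, so address interleaving is assumed here purely as an example; the function names and the 64-byte block size are hypothetical and not part of CramSim.

```python
def select_lane(address, num_lanes, interleave_bytes=64):
    """Assumed example policy: interleave fixed-size address blocks across lanes."""
    return (address // interleave_bytes) % num_lanes


def dispatch(addresses, num_lanes, interleave_bytes=64):
    """Distribute transaction addresses into per-lane lists, preserving arrival order."""
    lanes = [[] for _ in range(num_lanes)]
    for addr in addresses:
        lanes[select_lane(addr, num_lanes, interleave_bytes)].append(addr)
    return lanes


# Consecutive 64-byte blocks alternate between the two lanes.
per_lane = dispatch([0, 64, 128, 192, 256], num_lanes=2)
```

Other policies (for example, partitioning the address space into contiguous per-lane ranges) would only change `select_lane`.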
Controller
The memory controller manages data movement between memory devices and processor cores with five essential functions: prioritizing memory requests, converting memory requests to device commands, mapping physical addresses to device addresses, scheduling device commands, and issuing commands while meeting memory access protocol requirements. In CramSim, these functions are implemented in pluggable sub-component objects inside the controller component so that we can easily build and evaluate various memory controller models that use different management policies and algorithms.
Address Mapper. The major role of the address mapper is to convert the physical address of memory requests into device addresses. For DRAM devices, the physical address is mapped to a set of channel, rank, bank, row, and column indexes. Since the address mapping is a significant factor in determining system performance, CramSim provides a highly flexible address mapper that can specify the mapping policy at bit granularity.
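A bit-granularity mapping can be sketched in Python as follows. This is an illustration only: the list-of-field-names representation and the function name are assumptions, not CramSim's actual mapping syntax.

```python
def map_address(address, bit_fields):
    """Decode a physical address into device fields.

    bit_fields[i] names the field that receives address bit i (LSB first),
    so any bit can be assigned to any field (hypothetical representation).
    """
    decoded = {}
    next_pos = {}  # next free bit position inside each field
    for i, field in enumerate(bit_fields):
        bit = (address >> i) & 1
        pos = next_pos.get(field, 0)
        decoded[field] = decoded.get(field, 0) | (bit << pos)
        next_pos[field] = pos + 1
    return decoded


# Bits 0-1 -> column, bit 2 -> channel, bit 3 -> bank, bits 4-5 -> row.
layout = ["column", "column", "channel", "bank", "row", "row"]
fields = map_address(0b101101, layout)
```

Because each bit is assigned individually, the bits of one field need not be contiguous; for instance, placing the channel bit just above the column bits spreads consecutive blocks across channels.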
Transaction Scheduler. The transaction scheduler orders the candidate transactions (memory requests) for each channel for dispatching in a way that maximizes performance and fairness while maintaining QoS (Quality of Service). CramSim currently includes two transaction scheduler models with different transaction queue structures: one has a unified queue, while the other has split read and write queues. Both transaction scheduler models offer FCFS (First-Come, First-Served) and FR-FCFS (First-Ready, First-Come, First-Served) [6] scheduling policies.
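The difference between the two policies can be sketched in Python over a unified queue. This is a simplified illustration, not CramSim's implementation: "ready" is approximated as a row-buffer hit, and the `Txn` record and function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical transaction record: arrival order, target bank, target row.
Txn = namedtuple("Txn", ["arrival", "bank", "row"])


def pick_fcfs(queue):
    """FCFS: always take the oldest transaction."""
    return min(queue, key=lambda t: t.arrival) if queue else None


def pick_fr_fcfs(queue, open_rows):
    """FR-FCFS: prefer the oldest transaction that hits an open row,
    falling back to the oldest transaction overall.

    open_rows maps bank -> currently open row.
    """
    ready = [t for t in queue if open_rows.get(t.bank) == t.row]
    candidates = ready or queue
    return min(candidates, key=lambda t: t.arrival) if candidates else None


queue = [Txn(0, 0, 5), Txn(1, 0, 7), Txn(2, 1, 3)]
open_rows = {0: 7}  # bank 0 currently has row 7 open
```

With row 7 open in bank 0, FCFS selects the transaction that arrived first (row 5), whereas FR-FCFS selects the younger row-hit transaction (row 7).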
Transaction Converter. The transaction converter converts a transaction into memory commands and then stores the commands in a command queue. For example, a read request is converted to "ACTIVATE" and "READ" commands for DRAM devices. It can also be set up to generate "PRECHARGE" commands to model row-buffer management policies. CramSim currently provides a transaction converter model supporting close-page, open-page, and pseudo-open-page row-buffer policies for DRAM devices.
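A minimal Python sketch of the conversion for the open-page and close-page policies is given below. It is an illustration under stated assumptions: the function signature and tuple format are hypothetical, PRECHARGE is modeled as an explicit command, and the pseudo-open-page policy is omitted because the text does not define it.

```python
def convert(is_write, bank, row, open_rows, policy="open"):
    """Return the command list for one transaction and update open_rows.

    open_rows maps bank -> currently open row (bank absent when precharged).
    """
    if policy not in ("open", "close"):
        raise ValueError("unsupported policy in this sketch: " + policy)
    cmds = []
    current = open_rows.get(bank)
    if current != row:
        if current is not None:
            # Row conflict: close the currently open row first.
            cmds.append(("PRECHARGE", bank, current))
        cmds.append(("ACTIVATE", bank, row))
    cmds.append(("WRITE" if is_write else "READ", bank, row))
    if policy == "close":
        # Close-page: precharge immediately after the access.
        cmds.append(("PRECHARGE", bank, row))
        open_rows.pop(bank, None)
    else:
        # Open-page: leave the row open for possible later hits.
        open_rows[bank] = row
    return cmds


state = {}
first = convert(False, 0, 5, state)   # row miss on a precharged bank
second = convert(False, 0, 5, state)  # row hit
third = convert(True, 0, 9, state)    # row conflict
```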
Command Scheduler. The command scheduler picks memory commands from the command queue in a way that maximizes bank-, rank-, or channel-level parallelism while enforcing the timing constraints of memory devices. CramSim currently provides two command scheduling policies: bank round-robin and rank round-robin. The bank round-robin policy searches for issuable commands in the per-bank queues for a given rank, and then moves to the next rank. The rank round-robin policy searches through the per-bank queues of all ranks for a given bank.
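The two search orders can be sketched in Python as follows. The function names, the `(rank, bank)` queue keys, and the `is_issuable` callback are hypothetical; the sketch only illustrates the traversal order described above.

```python
def search_order(num_ranks, num_banks, policy):
    """Return the order in which per-bank queues are visited, as (rank, bank) pairs."""
    if policy == "bank_round_robin":
        # Visit every bank of one rank before moving to the next rank.
        return [(r, b) for r in range(num_ranks) for b in range(num_banks)]
    if policy == "rank_round_robin":
        # Visit the same bank index across all ranks before moving to the next bank.
        return [(r, b) for b in range(num_banks) for r in range(num_ranks)]
    raise ValueError("unknown policy: " + policy)


def pick_command(queues, order, start, is_issuable):
    """Scan the queues in round-robin order beginning at index start.

    Returns (next_start, command); command is None when nothing can issue.
    Only the head of each per-bank queue is considered.
    """
    n = len(order)
    for offset in range(n):
        idx = (start + offset) % n
        queue = queues.get(order[idx], [])
        if queue and is_issuable(queue[0]):
            return (idx + 1) % n, queue.pop(0)
    return start, None


order = search_order(2, 2, "rank_round_robin")
queues = {(0, 0): ["ACTIVATE"], (1, 0): ["READ"]}
# Pretend that ACTIVATE is currently blocked by a timing constraint.
next_start, cmd = pick_command(queues, order, 0, lambda c: c != "ACTIVATE")
```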
Device Driver. The device driver maintains state information for the memory devices and provides that information to the command scheduler. This state information enables the scheduler to identify commands that can be dispatched to the memory device in any cycle. The device driver also handles device-specific control mechanisms. For example, it can be configured to periodically generate "REFRESH" commands to model the refresh operation of DRAM devices.
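The kind of state tracking involved can be sketched in Python. This is a deliberately reduced model, assuming only two timing constraints (tRCD and tRP) and a fixed refresh interval (tREFI); the class and method names are hypothetical, and a real DRAM model tracks many more constraints.

```python
class DeviceDriverSketch:
    """Tracks per-bank open rows and the earliest cycle each bank accepts a command."""

    def __init__(self, num_banks, t_rcd, t_rp, t_refi):
        self.t_rcd = t_rcd
        self.t_rp = t_rp
        self.t_refi = t_refi
        self.open_row = [None] * num_banks
        self.ready_at = [0] * num_banks
        self.next_refresh = t_refi

    def can_issue(self, cmd, bank, row, cycle):
        """Report whether cmd is legal for the bank's current state at this cycle."""
        if cycle < self.ready_at[bank]:
            return False
        if cmd == "ACTIVATE":
            return self.open_row[bank] is None
        if cmd in ("READ", "WRITE"):
            return self.open_row[bank] == row
        if cmd == "PRECHARGE":
            return self.open_row[bank] is not None
        return False

    def issue(self, cmd, bank, row, cycle):
        """Update bank state after a command is sent to the device."""
        if cmd == "ACTIVATE":
            self.open_row[bank] = row
            self.ready_at[bank] = cycle + self.t_rcd
        elif cmd == "PRECHARGE":
            self.open_row[bank] = None
            self.ready_at[bank] = cycle + self.t_rp

    def refresh_due(self, cycle):
        """Return True once per refresh interval, when a REFRESH should be generated."""
        if cycle >= self.next_refresh:
            self.next_refresh += self.t_refi
            return True
        return False


driver = DeviceDriverSketch(num_banks=2, t_rcd=3, t_rp=3, t_refi=100)
driver.issue("ACTIVATE", 0, 5, cycle=0)
```

After the ACTIVATE at cycle 0, a READ to row 5 of bank 0 is rejected until cycle 3, which is how the command scheduler learns which commands are issuable in a given cycle.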
Memory
The memory component models the physical characteristics of memory devices and their internal functionalities that are hidden from the memory controller. It collects internal events of memory devices such as row activations, row buffer hits, and bank precharges. The collected data is used, along with the devices' power models, to calculate the power consumption of memory devices. CramSim currently provides a memory component model for DRAM (DDR3/4 and HBM). The component includes an equation-based power model similar to that used in DRAMSim2 and USIMM. The memory component also tracks completion of commands and sends responses to the controller component. The memory component can be configured to store data on write requests and to provide that data on read requests in order to work with full-system simulators that require memory storage.
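An equation-based power model of this kind typically multiplies event counts by a per-event energy derived from datasheet currents. The Python sketch below follows that common approach as an assumption; it is not taken from CramSim's source, and all function names and current, voltage, and timing values are hypothetical.

```python
def activation_energy_nj(idd0_ma, idd2n_ma, idd3n_ma, vdd, t_ras_ns, t_rc_ns):
    """Energy of one ACTIVATE/PRECHARGE pair above background, in nanojoules.

    Subtracts the standby current drawn while the row is open (tRAS) and
    while the bank is precharged (tRC - tRAS) from the IDD0 cycle current.
    """
    background = idd3n_ma * t_ras_ns + idd2n_ma * (t_rc_ns - t_ras_ns)
    # mA * ns * V gives picojoules; divide by 1000 for nanojoules.
    return (idd0_ma * t_rc_ns - background) * vdd / 1000.0


def burst_energy_nj(idd4_ma, idd3n_ma, vdd, burst_ns):
    """Energy of one read or write burst above active standby, in nanojoules."""
    return (idd4_ma - idd3n_ma) * vdd * burst_ns / 1000.0


def dynamic_energy_nj(event_counts, energy_per_event_nj):
    """Total dynamic energy: collected event counts times per-event energy."""
    return sum(count * energy_per_event_nj[name] for name, count in event_counts.items())


# Hypothetical datasheet values, for illustration only.
per_event = {
    "activate": activation_energy_nj(60, 30, 40, 1.5, 35, 50),
    "read": burst_energy_nj(150, 40, 1.5, 5),
}
total = dynamic_energy_nj({"activate": 2, "read": 4}, per_event)
```

Dividing the accumulated energy by the simulated time would give average power; background and refresh power are left out of this sketch.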
Configuration
CramSim configures the memory system organization, memory controller, and device types with a Python file.
The timing and power parameters of each component can be directly specified within the Python configuration file, included in a separate configuration file, or provided on the command line. Currently, CramSim provides configuration files for DRAM devices (i.e., DDR3/4 and HBM).
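One way to combine the three parameter sources is sketched below in plain Python. The precedence order (command line over separate file over inline values), the "name value" file format, and the parameter names are all assumptions made for illustration; the text does not specify them, and the sketch does not use the SST API.

```python
def parse_param_text(text):
    """Parse 'name value' lines; '#' starts a comment (assumed file format)."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, value = line.split(None, 1)
        params[name] = value.strip()
    return params


def resolve_params(inline, file_text="", overrides=()):
    """Merge parameters; later sources win (assumed precedence).

    inline    -- dict written directly in the Python configuration
    file_text -- contents of a separate configuration file
    overrides -- 'name=value' strings from the command line
    """
    params = dict(inline)
    params.update(parse_param_text(file_text))
    for item in overrides:
        name, value = item.split("=", 1)
        params[name] = value
    return params


# Hypothetical parameter names and values.
merged = resolve_params(
    {"tRCD": "11", "numChannels": "1"},
    "tRCD 14  # cycles\ntRP 14\n",
    ["numChannels=2"],
)
```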
VALIDATION AND EVALUATION
To validate the basic functionality of CramSim, we developed an equation-based analytical framework called VeriMem.
The VeriMem suite is a set of simple read and write request streams for which we can easily reason about the expected performance of the memory system. Each stream includes requests that hop among various structures within the DRAM (rows, columns, banks, etc.) with varying frequencies. Each stream is thus impacted by a discrete and known set of timing parameters. The VeriMem suite enabled us to work out many bugs in the early development stages of CramSim, which now passes all tests. The VeriMem suite is also used regularly to ensure that new code has not introduced any timing bugs. In addition, we compared the simulation results of CramSim with those of IBM's internal simulation model of DRAM systems for complex request traces.
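The style of reasoning behind such streams can be illustrated with a first-order Python sketch. These are not VeriMem's actual equations: the three stream patterns, the function name, and the timing values are assumptions, and the estimates ignore start-up latency, refresh, and read/write turnaround.

```python
def expected_cycles(num_requests, pattern, timing):
    """Steady-state estimate of cycles needed for a simple read stream.

    timing holds DRAM timing parameters in cycles (tCCD, tRAS, tRP, tRRD, tFAW).
    """
    if pattern == "same_row":
        # Row-buffer hits: limited by the column-to-column delay.
        interval = timing["tCCD"]
    elif pattern == "same_bank_new_row":
        # Every request reopens the same bank: limited by tRC = tRAS + tRP.
        interval = timing["tRAS"] + timing["tRP"]
    elif pattern == "rotate_banks":
        # A new row in a different bank each time, assuming enough banks to
        # hide tRC: limited by ACTIVATE spacing (tRRD, four per tFAW window)
        # or by the data bus (tCCD).
        interval = max(timing["tRRD"], timing["tFAW"] / 4, timing["tCCD"])
    else:
        raise ValueError("unknown pattern: " + pattern)
    return num_requests * interval


# Hypothetical timing values in cycles.
timing = {"tCCD": 4, "tRAS": 28, "tRP": 11, "tRRD": 5, "tFAW": 24}
hits = expected_cycles(100, "same_row", timing)
conflicts = expected_cycles(100, "same_bank_new_row", timing)
rotating = expected_cycles(100, "rotate_banks", timing)
```

Comparing such closed-form estimates against the simulated cycle count for the matching stream exposes errors in the specific timing parameters that the stream exercises.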
We also compared CramSim with Ramulator [5], one of the state-of-the-art memory simulators, to evaluate its simulation accuracy and scalability. Table 1 shows the simulated transaction throughput of three DRAM standards (DDR3, DDR4, and HBM) for 100M random and sequential traces. As the table shows, the two simulators estimate almost the same throughput for each DRAM standard. To evaluate scalability, we increased the number of channels and increased the number of transactions per cycle in proportion to the number of channels. For CramSim, we configured the multi-channel memory system to have multiple lanes, each simulating one channel, and then ran CramSim in parallel simulation mode with MPI. As shown in Table 2, CramSim consumes only 67% more time when simulating an 8-channel configuration with heavy traffic than when simulating a 1-channel configuration.
CONCLUSION
In this paper, we have introduced CramSim, a flexible, extensible, and scalable simulation framework for memory systems. Thanks to its modular design, each component of the memory system can be modeled separately and is easily pluggable into the simulation framework. CramSim enables efficient exploration of the large design space of future memory systems.
