Main Memory Scaling: Challenges and Solution Directions
<p>The memory system is a fundamental performance and energy bottleneck in almost all computing systems. Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck. At the same time, DRAM technology is experiencing difficult <em>technology scaling</em> challenges that make the maintenance and enhancement of its capacity, energy-efficiency, and reliability significantly more costly with conventional techniques.</p>
<p>In this chapter, after describing the demands and challenges faced by the memory system, we examine some promising research and design directions to overcome challenges posed by memory scaling. Specifically, we describe three major solution directions: (1) enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system (an approach we call <em>system-DRAM co-design</em>), (2) designing a memory system that employs emerging non-volatile memory technologies and takes advantage of multiple different technologies (i.e., <em>hybrid memory systems</em>), (3) providing predictable performance and QoS to applications sharing the memory system (i.e., <em>QoS-aware memory systems</em>). We also briefly describe our ongoing related work in combating scaling challenges of NAND flash memory.</p>
Some Ideas and Principles for Achieving Higher System Energy Efficiency: Slack-Based, Coordinated, Hardware/Software Cooperative Management of Heterogeneous Resources
<p>This whitepaper briefly describes some broad ideas, principles, and research directions that seem promising, in the author’s opinion, to achieve a system that is overall more energy efficient (and also higher performance) than today’s systems.</p>
Memory Scaling: A Systems Architecture Perspective
<p>The memory system is a fundamental performance and energy bottleneck in almost all computing systems. Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck. At the same time, DRAM technology is experiencing difficult technology scaling challenges that make the maintenance and enhancement of its capacity, energy-efficiency, and reliability significantly more costly with conventional techniques. In this paper, after describing the demands and challenges faced by the memory system, we examine some promising research and design directions to overcome challenges posed by memory scaling. Specifically, we survey three key solution directions: 1) enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system, 2) designing a memory system that employs emerging memory technologies and takes advantage of multiple different technologies, 3) providing predictable performance and QoS to applications sharing the memory system. We also briefly describe our ongoing related work in combating scaling challenges of NAND flash memory.</p>
Research Problems and Opportunities in Memory Systems
<p>The memory system is a fundamental performance and energy bottleneck in almost all computing systems. Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck. At the same time, DRAM technology is experiencing difficult technology scaling challenges that make the maintenance and enhancement of its capacity, energy-efficiency, and reliability significantly more costly with conventional techniques.</p>
<p>In this article, after describing the demands and challenges faced by the memory system, we examine some promising research and design directions to overcome challenges posed by memory scaling. Specifically, we describe three major new research challenges and solution directions: 1) enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system (an approach we call system-DRAM co-design), 2) designing a memory system that employs emerging non-volatile memory technologies and takes advantage of multiple different technologies (i.e., hybrid memory systems), 3) providing predictable performance and QoS to applications sharing the memory system (i.e., QoS-aware memory systems). We also briefly describe our ongoing related work in combating scaling challenges of NAND flash memory.</p>
Investigating the Viability of Bufferless NoCs in Modern Chip Multi-processor Systems
Chip Multi-Processors are quickly growing to dozens and potentially hundreds of cores, and as such the design of the interconnect for on-chip resources has become an important field of study. Of the available topologies, tiled mesh networks are an appealing approach in tiled CMPs, as they are relatively simple and scale fairly well. The area has seen recent focus on optimizing network-on-chip routers for performance as well as power and area efficiency. One major cost of initial designs has been their power and area consumption, and recent research into bufferless routing has attempted to counter this by entirely removing the buffers in the routers, showing substantial decreases in NoC energy consumption.
However, this research has shown that at high network loads, the energy benefits of bufferless schemes are vastly outweighed by performance degradation. When evaluated with pessimistic traffic patterns, the proposed router designs significantly increase network delays in last-level cache traffic, and can substantially lower system throughput.
We evaluate these router designs as one component of the entire memory hierarchy design. They are evaluated alongside simple cache mapping mechanisms designed to reduce the need for cross-chip network traffic, as well as packet prioritization mechanisms proposed for high performance.
We conclude, based on our evaluations, that with intelligent, locality-aware mapping of data to on-chip cache slices, bufferless network performance can get very close to buffered network performance. Locality-aware data mapping also significantly increases the network power advantage of bufferless routers over buffered ones.
Memory Systems
<p>As shown in Figure 1.1, a computing system consists of three fundamental units: (i) units of computation to perform operations on data (e.g., processors, as we have seen in a previous chapter), (ii) units of storage (or memory) that store data to be operated on or archived, (iii) units of communication that communicate data between computation units and storage units. The storage/memory units are usually categorized into two: (i) memory system, which acts as a working storage area, storing the data that is currently being operated on by the running programs, and (ii) the backup storage system, e.g., the hard disk, which acts as a backing store, storing data for a longer term in a persistent manner. This chapter will focus on the “working storage area” of the processor, i.e., the memory system.</p>
<p>The memory system is the repository of data from where data can be retrieved and updated by the processor (or processors). Throughout the operation of a computing system, the processor reads data from the memory system, performs computation on the data, and writes the modified data back into the memory system – continuously repeating this procedure until all the necessary computation has been performed on all the necessary data.</p>
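<p>The read, compute, write-back cycle described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: memory is modeled as a flat Python list, and the names <code>run</code> and <code>compute</code> are hypothetical and do not come from the chapter.</p>

```python
# Toy model (assumption): the memory system is a flat list of words and the
# "processor" is a loop that visits every address exactly once.
def run(memory, compute):
    for address in range(len(memory)):
        value = memory[address]      # read data from the memory system
        result = compute(value)      # perform computation on the data
        memory[address] = result     # write the modified data back
    return memory


memory = [1, 2, 3, 4]
run(memory, lambda word: word * 2)
print(memory)  # prints [2, 4, 6, 8]
```

<p>Real systems interpose caches, buses, and memory controllers between the processor and memory; the sketch omits all of them to show only the repeating procedure.</p>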
The Main Memory System: Challenges and Opportunities
<p>The memory system is a fundamental performance and energy bottleneck in almost all computing systems. Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck. At the same time, DRAM technology is experiencing difficult technology scaling challenges that make the maintenance and enhancement of its capacity, energy-efficiency, and reliability significantly more costly with conventional techniques.</p>
<p>In this article, after describing the demands and challenges faced by the memory system, we examine some promising research and design directions to overcome challenges posed by memory scaling. Specifically, we describe three major new research challenges and solution directions: 1) enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system (an approach we call system-DRAM co-design), 2) designing a memory system that employs emerging non-volatile memory technologies and takes advantage of multiple different technologies (i.e., hybrid memory systems), 3) providing predictable performance and QoS to applications sharing the memory system (i.e., QoS-aware memory systems). We also briefly describe our ongoing related work in combating scaling challenges of NAND flash memory.</p>
Ramulator: A Fast and Extensible DRAM Simulator
<p>Recently, both industry and academia have proposed many different roadmaps for the future of DRAM. Consequently, there is a growing need for an extensible DRAM simulator, which can be easily modified to judge the merits of today’s DRAM standards as well as those of tomorrow. In this paper, we present Ramulator, a fast and cycle-accurate DRAM simulator that is built from the ground up for extensibility. Unlike existing simulators, Ramulator is based on a generalized template for modeling a DRAM system, which is only later infused with the specific details of a DRAM standard. Thanks to such a decoupled and modular design, Ramulator is able to provide out-of-the-box support for a wide array of DRAM standards: DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, as well as some academic proposals (SALP, AL-DRAM, TLDRAM, RowClone, and SARP). Importantly, Ramulator does not sacrifice simulation speed to gain extensibility: according to our evaluations, Ramulator is 2.5× faster than the next fastest simulator. Ramulator is released under the permissive BSD license.</p>
Comparative Evaluation of FPGA and ASIC Implementations of Bufferless and Buffered Routing Algorithms for On-Chip Networks
<p>Most existing packet-based on-chip networks assume routers have buffers to buffer packets at times of contention. Recently, deflection-based bufferless routing algorithms have been proposed as an alternative design to reduce the area, power, and complexity disadvantages associated with buffering in routers. While bufferless routing shows significant promise at an algorithmic level, these algorithms have not been shown to be efficiently implementable in practice. Neither were they extensively compared to existing buffered routing algorithms in realistic designs. This paper presents our comparative evaluation of and experiences with realistic FPGA and ASIC designs of state-of-the-art (1) virtual-channel buffered, (2) deflection-based bufferless, and (3) deflection-based buffered routing algorithms using two different network topologies and network sizes. We show that bufferless routing algorithms are implementable without significant complexity, and compare their performance, area, frequency, and power consumption to their buffered counterparts. Our results indicate that bufferless routing can lead to significant area (38%), power consumption (30%), and router cycle time (8%) reductions over the best buffered router implementation on 65nm ASIC design, while operating at higher frequency.</p>
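<p>As a rough illustration of what deflection-based bufferless routing means, the sketch below models one arbitration cycle in a single router. It is not the design evaluated in the paper: the oldest-first priority rule, the four-port mesh router, and the names <code>Flit</code> and <code>route_cycle</code> are all assumptions made for the example.</p>

```python
from dataclasses import dataclass

PORTS = ("N", "E", "S", "W")  # assumed four-port mesh router


@dataclass
class Flit:
    ident: str
    age: int      # cycles since injection; older flits win arbitration (assumption)
    wanted: str   # productive output port toward the destination


def route_cycle(flits):
    """Assign every incoming flit to a distinct output port in one cycle."""
    if len(flits) > len(PORTS):
        raise ValueError("more incoming flits than output ports")
    free = list(PORTS)
    assignment = {}
    # Oldest flit picks first; a flit whose wanted port is already taken is
    # deflected to the first free port instead of being buffered.
    for flit in sorted(flits, key=lambda f: f.age, reverse=True):
        port = flit.wanted if flit.wanted in free else free[0]
        free.remove(port)
        assignment[flit.ident] = port
    return assignment


flits = [
    Flit("a", age=5, wanted="E"),
    Flit("b", age=9, wanted="E"),
    Flit("c", age=1, wanted="N"),
]
print(route_cycle(flits))  # prints {'b': 'E', 'a': 'N', 'c': 'S'}
```

<p>Because no flit waits in a buffer, every flit leaves on some port each cycle; here flit <code>a</code> loses port E to the older flit <code>b</code> and is deflected north, which in turn deflects flit <code>c</code>.</p>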
A Case for Small Row Buffers in Non-Volatile Main Memories
<p>DRAM-based main memories have read operations that destroy the read data, and as a result, must buffer large amounts of data on each array access to keep chip costs low. Unfortunately, system-level trends such as increased memory contention in multi-core architectures and data mapping schemes that improve memory parallelism lead to only a small amount of the buffered data being accessed. This makes buffering large amounts of data on every memory array access energy-inefficient; yet organizing DRAM chips to buffer small amounts of data is costly, as others have shown [11]. Emerging non-volatile memories (NVMs) such as PCM, STT-RAM, and RRAM, however, do not have destructive read operations, opening up opportunities for employing small row buffers without incurring additional area penalty and/or design complexity. In this work, we discuss and evaluate architectural changes to enable small row buffers at a low cost in NVMs. We find that on a multi-core system, reducing the row buffer size can greatly reduce main memory dynamic energy compared to a DRAM baseline with large row sizes, without greatly affecting endurance, and for some NVM technologies, leads to improved performance.</p>